Introduction to Bayesian inference

Learning Objectives
  • Introduce Bayesian inference
  • Learn two simple Bayesian models (Beta-binomial & normal-normal)
  • Discuss practical advantages and disadvantages of Bayesian approach

1. Introduction to Bayesian approach

The Bayesian approach to estimating parameters stems from Bayes’ theorem:

Let \(\theta\) be the parameter of interest and \(y\) be the observed data,

\[\begin{aligned} P(\theta \mid y) & = \frac{P(y \mid \theta)P(\theta)}{P(y)} \\ & = \frac{\text{likelihood of data given parameter} \times \text{prior}}{\text{marginal distribution of data free of the parameter}} \\ & \propto \text{likelihood}(y \mid \theta ) \times \text{prior}(\theta) \end{aligned}\]

  • \(P(y)\) is called a normalizing factor, it’s in place to ensure that \(\int P(\theta \mid y) d\theta = 1\), that is the posterior distribution of \(\theta\) is a proper probability distribution with area under the density curve equals to 1.

  • Its value is not of interest, unless we are comparing between data models.

  • The essence of Bayes theorem only concerns the terms involving the parameter, \(\theta\), hence \(P(\theta \mid y) \propto P(y\mid \theta)P(\theta)\).

Likelihood vs Probability distribution
  • Probability distribution: A function of the data given fixed parameters.
    • A probability distribution \(f_y(y\mid \theta)\) sum or integrate to 1 overall all possible values of the random variable Y. \(\theta\) is fixed.
    • It answers the question: “If the parameter were \(\theta\), how likely (probable) is each possible outcome y?”, “What data could I see if \(\theta\) is \(\theta^*\)
  • Likelihood: The same mathematical formula, but viewed as a function of the parameters given fixed observed data.
    • The likelihood function \(L(\theta \mid y_{obs})\) reuse the same math expression \(f_y(y\mid \theta)\), s.t., \(L(\theta \mid y_{obs}) = f_y(y\mid \theta)\).
    • It answers the questions: Given the observed data \(y_{obs}\), how plausible is each parameter value \(\theta\)?“,”Which \(\theta\) best explains the observed data?”
    • Likelihood does not require sum or integrate to 1.\(\theta\) can vary but \(y_{obs}\), data is fixed.

1.2 Working Example - Estimating a proportion

Frequenist approach

Suppose you have observed 20 patients in a Phase I RCT on a given dose of drug,

  • 0 out of 20 patients have had an adverse event (\(=0\%\))

  • decision to escalate to a higher dose if it’s unlikely that the current dosing results in a true proportion of adverse events above 20%

  • This is a classic phase I estimate, let \(\theta\) be the true risk of adverse event (parameter of interest).

  • Under frequentist test (One-sided Exact Binomial Test), we are interested to test \(H_0: \theta = 0.2\) vs \(H_a: \theta \leq 0.2\) (testing whether the risk of AE is trending below 0.2).

  • If the \(H_0\) is true, we would expect to see \(0.2\times20=4\) patients experiencing AE. We find our \(p-value\) for this test by considering the probability of seeing an outcome as, or more extreme than the observed \(0\) AE among the 20 patient samples.

\[P-value = P(Y=0) = {20 \choose 0}0.2^0\times 0.8^{20} = 0.01152922\]

  • If we have a significance level of \(5\%\), the p-value indicates that we have evidence that is significant enough to reject the null. Given this test results, we will escalate the dose.
Code
binom.test(x=0, n=20, p = 0.2,alternative = "less")

    Exact binomial test

data:  0 and 20
number of successes = 0, number of trials = 20, p-value = 0.01153
alternative hypothesis: true probability of success is less than 0.2
95 percent confidence interval:
 0.0000000 0.1391083
sample estimates:
probability of success 
                     0 
Code
#calculate P-value using binomial likelihood;
choose(20,0)*0.8^20
[1] 0.01152922
Code
#calculate P-value using binomial probability density function;
pbinom(q = 0, size = 20, p = 0.2, lower.tail = TRUE)
[1] 0.01152922
  • The observed proportion \(\hat{\theta}=0\) with 95% CI: 0 - 0.14.
Quiz - Interpreting P-value

Given the p-value = 0.012 from the binomial exact test, Which of the following are true?

  1. The probability that the null hypothesis is true is 1.2%.
  2. The probability that the alternative hypothesis is true is 1-1.2% = 98.8%.
  3. There’s 1.2% chance of making a mistake by rejecting the null hypothesis.
  4. Assuming the risk of AE is 0.2 (\(H_0\) is true), we would obtain sample estimated risk of AE as extreme or more than the observed 0%, in 1.2% of studies in the long run when repeating the trial.
Quiz - Interpreting Confidence Interval

Given the 95% CI of \(\theta\) is (0, 0.14), Which of the following are true?

  1. If we recruit another sample of 20 patients, there is 95% chance the sample proportional of AE would be between 0% and 14%.
  2. About 95% of patients in the sample will have risk of AE between 0 and 14%.
  3. We are 95% confidence that the interval (0,0.14) captured the true risk of AE.
  4. If we repeat this trial many times, then about 95% of the intervals produced will capture the true risk of AE.

What would a Bayesian do?

To make probability statements about \(\theta\) after observing data \(y\), we need a probability distribution for \(\theta\) given \(y\) (the posterior distribution).

1.First, we need to specify a prior distribution for \(\theta\), \(P(\theta)\).

  • Example 1: We might have no idea about \(\theta\) other than that it lies in the interval [0,1] and thus specify a unif(0,1). Let \(\theta \sim U(0,1)\), the prior probability distribution (p.d.f) is \[ P(\theta) = \frac{1}{1-0} = 1.\]
  1. We assume the \(P(y \mid \theta)\) follows a binomial distribution, thus the likelihood of the observed data given \(\theta\) is \[ P(y = 0 \mid \theta) = {20 \choose 0} \theta^0 (1-\theta)^{20} = (1-\theta)^{20}\]

  2. The posterior then becomes (given example prior 1)

\[\begin{align} P(\theta \mid y = 0) &= \frac{P(y = 0 \mid \theta) \times P(\theta)}{P(y=0)} \\ & = \frac{(1-\theta)^{20} \times 1}{P(y=0)} \\ & = \text{Constant} \times (1-\theta)^{20} \\ & \propto (1-\theta)^{20} \end{align}\]

Code
d <- tibble(p_grid = seq(from = 0, to = 1, length.out = 1000),
         y      = 0, 
         n      = 20) %>% 
    mutate(prior      = dunif(p_grid, 0, 1),
         likelihood = dbinom(y, n, p_grid)) %>% 
    mutate(posterior = likelihood * prior )

d %>% pivot_longer(prior:posterior) %>% 
  # this line allows us to dictate the order in which the panels will appear
    mutate(name = factor(name, levels = c("prior", "likelihood", "posterior"))) %>% 
    ggplot(aes(x = p_grid, y = value, fill = name)) +
    geom_area(show.legend = FALSE) +
    scale_fill_manual(values = c("grey", "red", "blue")) +
    facet_wrap(~ name, scales = "free")+
    theme_bw()

Approximate posterior distribution obtained using Bayes’ rule with UNIF(0,1) prior. In this example, the normalizaing term P(y=0) is not considered.

Posterior estimation with brms package in R

we will introduce MCMC in later sessions.

Code
dat1<-data.frame(y=rep(0,20))
# fit1 <- brm(y ~ 0+Intercept,
#             data = dat1,
#             family = brmsfamily(family="bernoulli", link="identity"),
#             init = 0,
#             prior = c(prior(uniform(0, 1), coef="Intercept")),
#             iter = 1000,
#             seed = 123,
#             sample_prior = TRUE)
# saveRDS(fit1, file = "brms_lecture2_binomail.rda")

fit1 <- readRDS("brms_lecture2_binomail.rda")
posterior_summary(fit1, variable = c("b_Intercept"), probs = c(0.025, 0.975) )
              Estimate  Est.Error         Q2.5     Q97.5
b_Intercept 0.04159862 0.03913757 0.0008174098 0.1442491
Code
hypothesis(fit1, 'Intercept < 0.2')
Hypothesis Tests for class b:
             Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (Intercept)-(0.2) < 0    -0.16      0.04     -0.2    -0.07       1999
  Post.Prob Star
1         1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
Code
# plotting prior and posterior in brms;
hyp1 <- hypothesis(fit1, 'Intercept = 0')
data.frame(prior = hyp1$prior_samples$H1, posterior = hyp1$samples$H1) %>%
    gather(type, value) %>%
    ggplot(aes(x = value) ) +
    geom_histogram(bins = 50, alpha = 0.8) +
    facet_wrap(~type, scales = "free") +
    xlab(expression(theta) ) +
    theme_bw(base_size = 20)

Bayesian estimation of the AE example using brms package in R

2. The Beta-binomial model

  • With a single sample of \(n\) binary outcomes, we have one unknown parameter: \(\theta\) and \(0 \leq \theta \leq 1\).
  • We need a prior distribution for \(\theta\) (e.g., proportion, risk, probability of event etc): \(P(\theta)\)
    • To express indifference between all values of \(\theta\), we can use a uniform distribution on \(\theta\), as we did in the previous example
    • To express belief (e.g. based on external evidence) that some values of \(\theta\) are more likely that others, it is convenient to use a beta distribution
    • This has two parameters, often labelled as \(\alpha\) (also written as a) and \(\beta\) (also written as b), which we can choose to represent the strength of the external evidence
    • If a parameter has a Beta(a,b) distribution, then the prior mean is \[\frac{a}{a+b}\]
    • The beta distribution prior for the binomial is useful for illustrating how the Bayesian approach combines prior information and new data
Beta-binomial model
  • Recall the likelihood for a binomial outcome of x successes in n trials,

\[ P(x \mid \theta) \propto \theta^x (1-\theta)^{n-x}\]

  • The \(Beta(\alpha,\beta)\) prior has the same functional form for \(\theta\),

\[ P_0( \theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}\]

  • We find the posterior as

\[\begin{align} P(\theta \mid x) & \propto P(x \mid \theta) \times P_0( \theta) \\ & \propto \theta^x (1-\theta)^{n-x} \times \theta^{\alpha-1} (1-\theta)^{\beta-1} \\ & \propto \theta^{x+\alpha-1} (1-\theta)^{n-x+\beta-1} \end{align}\]

  • Thus, comparing to the \(Beta(\alpha,\beta)\) prior, the posterior just changes the exponents by adding x and n-x, respectively.

    • Comparing to the prior, make two changes to get the posterior:
      1. \(a \rightarrow a+x\), [a + number of events]
      2. \(b \rightarrow b+(n-x)\), [b + number of non-events]
    • Quite simply, when \(x\) events have been observed in n subjects, the prior

    \[ \theta \sim Beta(\alpha, \beta) \]

    • gives the posterior

    \[ \theta \mid x \sim Beta(\alpha+x, \beta+n-x) \]

  • The prior and posterior are both beta distributions!

Interpretation of Beta Prior

  • Suppose we start with a beta prior with small parameters

\[ \theta \sim Beta(0.001, 0.001) \]

  • Observe x events in n trials, the posterior

\[ \theta \mid x \sim Beta(0.001+x, 0.001+n-x) \approx Beta(x,n-x)\]

  • Posterior mean of \(\theta \approx \frac{x}{n}\), the equivalent to the MLE based only on the data

  • Interpretation of the \(Beta(\alpha,\beta)\) prior.

    • Like having seen \(\alpha\) events and \(\beta\) non events in a sample size of \(\alpha + \beta\)
    • Strength of prior information equivalent to prior “sample size” \(\alpha + \beta\)
    • Prior mean = \(\frac{\alpha}{\alpha + \beta}\)
Working example of Beta-binomail model

Consider Beta(3,7)and Beta(12,28) priors

  • Gold line: prior belief that assumes approximately 3 events in 10 subjects
  • Blue line: prior belief that assumes approximately 12 events in 40(=12+28) subjects

Summarizing Posterior Distribution

  • Since we know the form of the posterior distribution, we can easily calculate functions such as:
    1. Posterior mean \(E(\theta) = \frac{\alpha+x}{\alpha + \beta+n}\)
    2. 95% Credible intervals (we will talk more about them next week)
    3. \(P(\theta < 0.2)\), \(P(\theta < 0.5)\), \(P(0.4< \theta < 0.6)\), etc, which can be directly used to make probabilistic statement about the \(\theta\). E.g., the probability of the posterior adverse event rate < 0.2 is about 0.95.
  • Generate informative plots for assessing priors and posteriors. All this can easily be done using R.

Data overwhelming the prior

  • The posterior for the beta-binomial model after seeing x events in n trials is \(\theta \mid x ~ Beta(\alpha + x, \beta + n - x)\) with posterior mean as

\[E(\theta \mid x) = \frac{\alpha + x}{\alpha + \beta + n}\]

  • In \(n \gg \alpha + \beta\) and \(x \gg \alpha\) (sample size and number of events is large), recall when using prior \(Beta(0.001, 0.001)\), \[ E(\theta \mid x) = \frac{\alpha + x}{\alpha + \beta + n} \approx \frac{x}{n}\]

  • Here, prior is of little importance!

A Beta(1,1) is Unif(0,1)

  • When \(\alpha = \beta = 1\), \(P(\theta) \propto \theta^{1-1}(1-\theta)^{1-1} = 1\), which is the probability density of a unif(0,1).

Posterior mean as a weighted average

  • The posterior mean is

\[\begin{align} E(\theta \mid x) & = \frac{\alpha + x}{\alpha + \beta + n} \\ & = \frac{\alpha }{\alpha + \beta + n} + \frac{x }{\alpha + \beta + n}\\ & = \Big( \frac{\alpha+ \beta }{\alpha + \beta + n} \Big) \times \Big( \frac{\alpha }{\alpha + \beta} \Big) + \Big( \frac{n }{\alpha + \beta + n} \Big) \times \Big( \frac{x }{n} \Big) \\ & = w \times \Big( \frac{\alpha }{\alpha + \beta} \Big) + (1-w) \times \Big( \frac{x }{n} \Big) \\ & = w \times (\text{prior mean}) + (1-w) \times (\text{sample estimate}) \end{align}\]

  • Where \(w\) is the ratio of the “prior sample size” to the “total sample size” \[ w = \frac{\alpha+ \beta }{\alpha + \beta + n}.\]

  • This is a common theme in Bayesian models with actual prior information

  • The posterior distribution will lie between the prior and likelihood

  • The posterior mean is a weighted average of the prior mean and data-based estimate

  • As the sample size increases, the contribution of the prior diminishes

3. The Normal-normal model

Parameters with normal priors and likelihoods

  • Continuing with a single parameter model

  • The likelihood of the parameter given some data is that of a normal distribution with known variance \(\sigma^2\)

\[ y \mid \theta \sim N(\theta, \sigma^2) \]

  • suppose the prior for \(\theta\), the mean, follows a normal distribution

\[ \theta \sim N(\mu_0, V_0)\]

  • We want to make inference about \(\theta\) given the data y, what is the posterior distribution \(P(\theta \mid y)\)?
Normal-normal model

1. The normal model (likelihood) of one data point

  • Consider a single observation \(y\) from a normal distribution parameterized by a mean \(\theta\) and variance \(\sigma^2\), we assume that \(\sigma^2\) is known.

  • The normal likelihood for \(y\) given \(\theta\) is \[ P(y\mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta)^2}{2\sigma^2}}, y \in (-\infty, \infty)\]

  • The \(E(y) = \theta\) and \(V(y) = \sigma^2\) and \(SD(y) = \sigma\).

  • Given \(y \mid \theta \sim N(\theta,\sigma^2 )\), \(z = \frac{y-\theta}{\sigma} \sim N(0,1)\). Thus, if we rescale \(y\) by \(\sigma\), we have

    • 95%CI of y: \(\theta \pm 1.96 \times \sigma\)

2. Prior

We can often use a normal prior to represent \(\theta \sim N(\mu_0, V_0)\)

  • \(\mu_0\) is the prior mean

  • \(V_0\) is the prior variance, which expresses the uncertainty around the mean \(\mu_0\)

  • \(\tau_0 = \frac{1}{V_0}\) is called precision, it can also be used to express the uncertainty around the mean, e.g. \(\theta \sim N(\mu_0, \tau_0)\)

  • High precision = low variance, in other words, the distribution of the normal random variable \(\theta\) is closely centered around its mean.

3. Posterior

  • With a normal prior and a normal likelihood of the data given parameter \(\theta\), (here we use precision to express the normal distributions)

\[ \theta \sim N(\mu_0,\tau_0 ) \text{ and } y \mid \theta \sim N(\theta,\tau_y )\]

  • where \(\tau_y = \frac{1}{\sigma^2}\) is the precision of the observed data

  • It can be proven mathematically using the Bayes’ rule that the posterior distribution given a normal prior and normal likelihood is also normal,

\[ \theta \mid y \sim N(\mu_1, \tau_1)\]

  • The posterior mean \(\mu_1\) is

\[ \frac{\tau_0 \times \mu_0 + \tau_y \times y}{\tau_0+ \tau_y} \]

  • The posterior precision is

\[ \tau_1 = \tau_0 + \tau_y\]

  • Again, the prior and posterior are both normal distributions!

Posterior mean as a weighted average of the prior mean and data-based estimate

  • Using the precision parameters

\[ \tau_y = \frac{1}{\sigma^2_y} \text{ and } \tau_0 = \frac{1}{V_0}\]

  • The posterior distribution of \(\theta\) is then

\[\theta \mid y \sim N(\mu_1, \tau_1)\]

  • where the mean and precision are

\[\begin{align} & \mu_1 = \Big(\frac{\tau_0}{\tau_0 + \tau_y}\Big) \times \mu_0 + \Big(\frac{\tau_y}{\tau_0 + \tau_y}\Big) \times y \\ & \tau_1 = \tau_0 + \tau_y \end{align}\]

  • The posterior mean is a weighted sum of the prior mean and the observed value.
  • The weights are the relative precisions of the prior and the likelihood \[\mu_1 = w_0 \times \mu_0 + w_y \times y\] \[w_0 = \frac{\tau_0}{\tau_0 + \tau_y}\] and

\[w_y = 1-w_0 = \frac{\tau_y}{\tau_0 + \tau_y}\]

Data overwhelming the prior and vice versa

  • With relatively low prior precision (imprecise prior and substantial data), i.e., \(\tau_0 \ll \tau_y\) and \(0 \approx w_0 \ll w_y \approx 1\),

\[ \mu_1 = w_0 \times \mu_0 + w_y \times y \approx y\]

  • With relative little data and aprior that is not too diffuse, i.e., \(\tau_y \ll \tau_0\) and \(0 \approx w_y \ll w_0 \approx 1\),

\[ \mu_1 = w_0 \times \mu_0 + w_y \times y \approx \mu_0\]

  • In most cases, the posterior mean is a compromise between what we believed before and what we observed in the data.

Normal model with multiple observations

  • Consider a set of \(n\) observations \(y = (y_1, \ldots, y_n)\) sample from a \(N(\theta, \sigma^2)\), where \(y_1,\ldots, y_n\) are identically, independently distributed and \(y_i \sim N(\theta, \sigma^2)\), \(i = 1, \ldots, n\).

  • The normal likelihood of the joint distribution \(y = (y_1, \ldots, y_n)\) is

\[\begin{align} P(y_1, \ldots, y_n \mid \theta) &= \prod_{i=1}^n P(y_i \mid \theta) \\ & = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i-\theta)^2}{2\sigma^2}}, y \in (-\infty, \infty) \end{align}\]

  • Given prior \(\theta \sim N(\mu_0, \tau_0)\), the posterior of \(\theta \mid y\) is

\[P(\theta \mid y_1, \ldots, y_n) = N(\mu_n, \tau_n)\]

  • where the posterior mean \(\mu_n\) is

\[\mu_n = \frac{\tau_0 \times \mu_0 + n\tau_y \times \bar{y}}{\tau_0+ n\tau_y} = \Big( \frac{\tau_0}{\tau_0+ n\tau_y} \Big) \times \mu_0 + \Big( \frac{n\tau_y}{\tau_0+ n\tau_y} \Big) \times \bar{y}\]

  • and the posterior precision is

\[ \tau_n = \tau_0 + n\tau_y\]

  • \(\bar{y} = \frac{\sum_{i=1}^2y_i}{n}\) is the sample mean and \(\tau_y = \frac{1}{\sigma^2}\)

  • Again, the posterior mean is a weighted sum between prior and data. As sample size \(n \rightarrow \infty\), \(\Big( \frac{\tau_0}{\tau_0+ n\tau_y} \Big) \rightarrow 0\) and \(\Big( \frac{n\tau_y}{\tau_0+ n\tau_y} \Big) \rightarrow 1\), data will overwhelm the prior and dominate the posterior distribution.

The use of normal-normal model in clinical applications

  • Consider estimating a parameter \(\theta\), e.g. mean, log-hazard ratio, log-odds-ratio, log-relative-risk

  • Suppose we have an estimate of \(\theta\); We will call it \(y\) and let \(\hat{\sigma}^2_y\) be the estimated variance

  • This estimate \(y\) can be viewed as a single data observation of \(\theta\)!

  • In large samples, it is approximately true that \(y \sim N(\theta, \sigma^2_y)\), where \(\theta\) is the true value and \(\sigma^2_y\) is the true variance (sample size dependent) and the 95%CI of \(y\) is \(\theta \pm 1.96 \times \sigma\).

  • We can construct a normal likelihood of data for the parameter \(\theta\) as \(y \mid N(\theta, \hat{\sigma}^2_y)\)

Working exmaple - Treatment effect estimate on the mean
  • For this example, we will simulate data for a small study of n=100 subjects and randomized 50% to treat=0 and 50% to treat=1 (mimicking an RCT). We will then generate the outcome Y from a normal distribution depending on the treatment indicator.

  • Suppose we have the following table of results from a linear regression estimating treatment effect \(\theta\) on the outcome

Code
n <- 100
set.seed(1234)
treat <- 1*(runif(n, min = 0, max = 1)<0.5)
y <- 5*treat+rnorm(n, mean = 0, sd =10)
fit<-lm(y~treat)
summary(fit)

Call:
lm(formula = y ~ treat)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8662  -6.7095  -0.5731   5.7503  24.1866 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    1.303      1.423   0.916   0.3618  
treat          4.080      1.918   2.127   0.0359 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.543 on 98 degrees of freedom
Multiple R-squared:  0.04414,   Adjusted R-squared:  0.03438 
F-statistic: 4.525 on 1 and 98 DF,  p-value: 0.03591
Code
round(confint.default(fit, 'treat', level=0.95),3) # based on asymptotic normality
      2.5 % 97.5 %
treat 0.321   7.84
  • We are interested in using this published results to make Bayesian inference on \(\theta\)

  • Our observed data (estimate) of \(\theta\), denoted as \(y\) is 4.08 with standard error 1.918. Here, \(y\) plays the role of data! We call \(y\) the observed datum.

  • Relying on the large sample approximation, we assume that the estimate y arose from a normal distribution with true mean \(\theta\), We treat the observed standard error \(\hat{\sigma}_y = 1.918\) as the true standard deviation of \(y\)

    • Recall, the standard deviation of the sampling distribution of a parameter estimate is called its standard error
  • This gives us the likelihood \[y \mid \theta \sim N(\theta, 1.918^2)\]

  • The width of the 95% CI is approximately equal to \(2 \times 1.96 \times \sigma_y\)

  • Suppose we are not provided with \(\sigma_y\) from the table of results, but know the 95% CI of y is (0.321,7.84), we can calculate \(\sigma_y\) with

\[\sigma_y = \frac{7.84- 0.321}{2\times 1.96}=1.918\]

  • The normal likelihood is this example is \(y \sim N(4.08, 1.918^2)\), if we assume a prior \(\theta \sim N(0, 2 \times 1.918^2)\), we have a posterior

\[\begin{align} & \theta \mid y \sim N(\mu_1, \tau_1) \\ \mu_1 &= \frac{ \frac{1}{prior.sd^2} }{\frac{1}{prior.sd^2} + \frac{1}{se.obs^2} } \times prior.mean + \frac{\frac{1}{se.obs^2}}{\frac{1}{prior.sd^2} + \frac{1}{se.obs^2}} \times y.obs \\ & = \frac{ \frac{1}{2 \times 1.918^2} }{\frac{1}{2 \times 1.918^2} + \frac{1}{1.918^2} } \times 0 + \frac{\frac{1}{1.918^2}}{\frac{1}{2 \times 1.918^2} + \frac{1}{1.918^2}} \times 4.08 = 2.72 \\ \tau_1 & = \tau_0 + \tau_y \\ & = \frac{1}{prior.sd^2} + \frac{1}{se.obs^2} \\ & = \frac{1}{2 \times 1.918^2} + \frac{1}{1.918^2} = 0.408 \end{align}\]

Code
d <- tibble(theta = seq(from = -10, to = 15, length.out = 200),
         y.obs=4.08,
         se.obs=1.918,
         prior.mean=0,
         prior.sd=sqrt(2)*1.918) %>% 
  mutate(priorDensity = dnorm(theta, prior.mean, prior.sd)) %>%
  mutate(likelihood.0 = dnorm(theta, y.obs, se.obs)) %>%
  mutate(w = (1/se.obs^2)/(1/se.obs^2 + 1/prior.sd^2)) %>%
  mutate(post.prec = 1/se.obs^2 + 1/prior.sd^2) %>%
  mutate(posteriorDensity = dnorm(theta, w*y.obs+(1-w)*prior.mean, sqrt(1/post.prec))) %>%
  mutate(likelihood = likelihood.0/sum(diff(theta)*likelihood.0[-1])) %>%
  pivot_longer(cols=c(priorDensity,likelihood,posteriorDensity),names_to="Function",values_to="p")
  
d$Function <- factor(d$Function, 
                     levels=c("priorDensity","likelihood","posteriorDensity"),
                     labels=c("Prior","Likelihood","Posterior"))

p<-ggplot(d, aes(theta, p,colour=Function))+
  geom_line(size=1)+
  xlab(expression(theta))+
  ylab(expression(p(theta)))+
  theme(legend.position = "bottom")+
  scale_colour_manual(values=c("goldenrod","steelblue","black")) +theme_bw()

p

Working example - Bayesian reanalysis of RCT - Odds Ratio

\[\hat{OR} = \frac{74/138}{92/120} = 0.699\] \[SE\{ln(\hat{OR})\} = \sqrt{\frac{1}{74}+\frac{1}{138}+\frac{1}{92}+\frac{1}{120}} = 0.2\]

\[95\% CI = exp(log(\hat{OR} - 1.96\times SE\{ln(\hat{OR})\}) \text{ to } exp(log(\hat{OR} + 1.96\times SE\{ln(\hat{OR})\}) = (0.473, 1.035)\]

Intervention Death within 28d Alive by 28d Total
Peripheral Perfusion-Targeted Resuscitation 74 138 212
Lactate Level-Targeted Resuscitation 92 120 212
Total 166 258 424
  • We treat this as a data evidence!

1. The normal likelihood on log(OR)

  • The estimated OR on 28-day mortality, 0.699 (0.473 to 1.035), is called the observed datum (viewed as a single data observation on \(\theta\), denoted as \(y\))

  • It’s known that log of the parameter estimate, such as HR, RR and OR, follow a normal distribution around its true log value.

  • Let \(\theta\) denote the random variable representing log(OR)

  • Let \(y\) denote an observed data point of \(\theta\), we have \(y \sim N(\theta, \sigma_y^2)\), where \(\theta\) is log(OR) and \(\sigma_y\) is the standard error of the log(OR). \(\sigma_y=0.2\) from the result table.

  • If we don’t have \(\sigma_y^2\) but only the reported 95% CI of the log of OR, we can take the width of the 95% CI and divided by \(2\times 1.96\) to obtain \(\sigma_y^2\). For example, we have

\[\hat{\sigma}_y = \frac{log(1.035) - log(0.473)}{2\times 1.96} = 0.2\]

  • Given the requentist analysis results, we have a data likelihood (evidence) \(y \mid \theta \sim N(\theta,0.2^2)\) where we have observed \(y=log(0.699)=-0.358\)

2. The prior for log(OR)

  • From (Zampieri et al. 2020), “a neutral prior in which the log[OR] of mean of 0 and an SD of 0.5, corresponding to an OR of 1 (95% credible interval, 0.37-2.75)”

Prior OR beliefs on the effectiveness of peripheral perfusion on 28 days mortality from the Bayesian reanalysis paper

Suppose we consider a neutral prior for log(OR) from the reanalysis paper (Zampieri et al. 2020)

  • The neutral prior is \(log(OR) \sim N(log(1), 0.5^2)\)

3. The posterior of log(OR)

  • Given prior \(\theta \sim N(0, 0.5^2)\) and likelihood \(y \mid \theta \sim (\theta,0.2^2)\) where \(y=log(0.699)=-0.358\)

  • Prior precision is \(\frac{1}{0.5^2} = 4\) and the likelihood precision is \(\frac{1}{0.2^2} = 25\), we can see that the likelihood has more information than the prior!

  • The posterior precision on log(OR) is \(\tau_1 = 4 + 25=29\)

  • The posterior mean is

\[ \frac{4}{29} \times log(1) + \frac{25}{29} \times log(0.699)=-0.309\]

  • The posterior variance of \(\log(OR) = \frac{1}{29}\) and the posterior standard deviation is \(\sqrt{\frac{1}{29}} = 0.186\), thus

\[\theta \mid y \sim N(\mu_1 = -0.309, \sigma_1^2 = 0.186^2)\]

  • The posterior mean OR = exp(-0.309) = 0.734 and the posterior 95% Credible Interval for OR is \[\exp(-0.309 \pm 1.96\times 0.186) \rightarrow (0.51, 1.057) \]
Code
#prior on log(OR);
mu0<- 0
sd0<- 0.5
prec0 <- 1/sd0^2

#likelihood on log(OR);
ybar<- -0.358
sd.ybar<- 0.2 
prec.y <- 1/sd.ybar^2

#posterior on log(OR)
prec1 <- prec0 + prec.y
mu1 <- (prec0 * mu0 + prec.y*ybar)/prec1
sigma1 <- 1/sqrt(prec1)


mu.plot <- seq(ybar-1.5,ybar+2,
               length.out=200)

lik <- dnorm(ybar, mu.plot, sd.ybar)
prior <- dnorm(mu.plot, mu0, sd0)
posterior<- dnorm(mu.plot, mu1,sigma1)

d<-data.frame(mu.plot=rep(mu.plot, 3),
              p =c(lik,prior,posterior),
              Function=rep(c("likelihood","prior","posterior"),each=length(mu.plot)))

p1<- ggplot(d, aes(mu.plot, p,colour=Function))+
    geom_line(size = 1)+
     xlab("log(OR)")+
     ylab("Probability Density")+
    scale_colour_manual(values=c("goldenrod","steelblue","black"))+
     theme_bw()+
     ggtitle("Neural prior on log(OR): plot on log(OR) scale")



p2<- ggplot(d, aes(exp(mu.plot), p,colour=Function))+
    geom_line(size = 1)+
     xlab("OR")+
     ylab("Probability Density")+
    scale_colour_manual(values=c("goldenrod","steelblue","black"))+
     theme_bw()+
     ggtitle("Neural prior on log(OR): plot on OR scale")

ggarrange(p1,p2, nrow=2)

Posterior summary statistics from R

Code
thresholds<-c(1,0.8)
res <-exp( qnorm(c(0.5, 0.025, 0.975), mu1, sigma1))
  res <- c(res, pnorm(log(thresholds), mu1, sigma1))
  names(res) <- c("median","q2.5","q97.5",paste0("Pr(OR < ",thresholds,")"))
  res
      median         q2.5        q97.5   Pr(OR < 1) Pr(OR < 0.8) 
   0.7344593    0.5103917    1.0568951    0.9517409    0.6773526 

Summary of Conjugate priors & models

  • There are a small number of prior-likelihood pairs where the prior and the posterior are in the same family (e.g., both beta, both normal): these are called conjugate models

  • These posterior distributions can be computed without specialized software

  • These examples are useful for illustrating how Bayesian methods combine prior information with data

  • They have only limited practical usefulness - limits on types priors, limits on number of parameters

  • They are useful teaching tools, soon we will see how we can extend beyond these simple models

4. Why Bayesian: Statistical Estimation Revisit

4.1 Classic Frequentist Inference

  • Many common intuitive data-based estimators are maximum likelihood estimators
    • sample mean, \(\bar{x}\)
    • sample proportion, \(\hat{p}\)
    • rate of an event, \(\hat{\lambda} = \frac{\text{Number Events}}{\text{Follow-up Time}}\)
    • least square estimators from linear regression model
    • OR & RR
    • HR, cox regression uses partial-likelihood approach
  • Likelihood-based inference does two things
    • Finds the MLE, the parameter value that is most likely, given the observed data.
    • Finds how much curvature there is near this maximum. This leads to estimates of standard errors and CIs.
  • The 95% CI refers to the frequency with which hypothetical future random intervals from repeated sampling constructed in the same way will include the true value
    • A particular interval will either include the value or not.
    • It does not mean that the interval contains the true value with 0.95 probability, i.e. we do not have \(P(\theta \mid data)\)

Hypothesis Testing in a nutshell

Hypothesis testing is a procedure for determining whether a hypothesis is valid or not, based on sample evidence

  1. Set up two hypotheses: Null (\(H_0\)) vs Alternative (\(H_A\))

    • \(H_0\), Specifies population parameter of interest & proposes baseline value for it. Null presumed true unless disproved
    • \(H_A\), Contains value of parameter different from H0; typically what you want to show
  2. Specify appropriate model (find test statistic)

    • Z test, T test, chi-square test etc
    • under the assumption that the model is correct
  3. Mechanics (calculate p-value)

    • We want to compare data to what we would expect to see if \(H_0\) was true and ask how likely it is to get these data, if \(H_0\) was true
    • P-value is used to quantify evidence against \(H_0\); The probability of getting equally or more extreme data (i.e. test statistic) under repeated sampling if \(H_0\) is true

    \[ P_{-value} = P\Bigg( \begin{matrix} \text{observed or more} \\ \text{extreme test statistics} \end{matrix} \mid H_0 \ is \ true \Bigg) \]

    • P-value tells how “surprising” are data you observed under \(H_0\)
  4. Arrive at conclusion: Reject \(H_0\) or not

  • Decisions about rejecting H0 or not are usually done by computing a p-value and comparing to \(\alpha\), significance level
  • significance level, an arbitrary threshold for P-value to define “rare/extreme event”. If P-value falls below this level, we conclude results are statistically significance and we reject the null

Hypothesis Tests Errors

  • Type I error (\(\alpha\)): \(H_0\) is true, but we reject it \(\alpha = P(Reject \ H_0 | H_0 \ is \ true)\)
    • We commonly control for Type I error - “innocent until proven guilty”
  • Type II error (\(\beta\)): \(H_0\) is false, but we fail reject it \(\beta = P(Fail \ to \ Reject \ H_0 | H_0 \ is \ false)\)
    • A test’s ability to detect a false null hypothesis is called the power of the test, \(1-\beta\)
  • For any test, there is trade-off between \(\alpha\) and \(\beta\) (e.g., for fixed n, decreasing\(\alpha\) increases \(\beta\), vice-versa )

More about p-value (Wasserstein and Lazar 2016)

  • By its definition, P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

  • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

    • “A conclusion does not immediately become true on one side of the divide and false on the other.”
  • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

  • “By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”

    • “For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.”
    • absence of evidence is not evidence of absence
  • Inference is performed based on data that have not been observed, under an assumption that is almost certain not to be true

    • Would you be willing to bet that the true probability a coin comes up heads is exactly P(head)=0.5, or that the exact true adverse rate is fixed at 0.2?

4.2 Bayesian Inference

  • In Bayesian statistics, inference is done using the posterior distribution

  • Given a posterior distribution, we can calculate any summary of it we like

    • \(P(0.008 < \theta < 0.012)\), \(P( \theta < 0.1)\), interval with 95% probability of containing \(\theta\)
    • We can use quantities to summarize the posterior distribution (mean, median, sd, IQR)
    • We can plot the density function and shade areas of interest

Posterior uncertainty

  • Posterior variance is used to characterize the posterior uncertainty
    • In most cases, it is smaller than the prior variance
    • In some cases, for some models, the opposite is true, when prior is in direct conflict with the data
  • Credible intervals play the role of the confidence intervals
    • For the 95% credible interval \([\alpha, \beta]\), \(P(\alpha< \theta <\beta \mid y)\) = 0.95

Posterior predictive distributions

  • Often we are interested in extending our (Bayesian) analysis using priors and data, and predicting outcomes in data observed in the future
    • e.g., predicting number of successes in a future trial for another 25 new observations
  • Essentially we want the probability density \(P(y^* \mid y)\), where with \(y^*\) we denote the new data
    • This probability is not always possible to compute directly, it does not correspond to some known distribution
    • We will see that these kinds of questions can be answered using Markov Chain Monte Carlo method and brms (coming up in a couple of weeks)

Posterior predictive distribution

Let \(y^*\) denote a new data, the posterior predictive distribution for \(y^*\) is

\[P(y^* \mid y) = \int P(y^* \mid \theta) P(\theta \mid y) \ d \theta\]

  • The posterior predictive distribution as an average of conditional predictions over the posterior distribution of \(\theta\)

  • We assume conditional independence of \(y^*\) and \(y\) given \(\theta\)

In the beta-binomial adverse event example, let \(y^*\) denote the result of a new trial of 6 patients given posterior \(\theta \mid y \sim Beta(1,7)\),

\[P(y^* \mid y) = \int_0^1 {6 \choose y^*} \theta^{y^*} (1- \theta)^{6-y^*} P(\theta \mid y) \ d \theta, y^* \in \{0,1,2,3,4,5,6\}\]

\[P(\theta \mid y) = \frac{\Gamma(8)}{\Gamma(1)\Gamma(7)} \theta^{0}(1-theta)^{6}\]

Code
likelihood<-length(7)
for (i in 1:7) {

integrand_pred <- function(theta) {
  (choose(6, i-1)*theta^(i-1)*(1-theta)^(6-i+1))*((gamma(8)/gamma(1)/gamma(7))*(1-theta)^6)}

likelihood[i] <- integrate(integrand_pred,lower = 0, upper = 1)$value   
    }
  
posterior_pred<-data.frame(y_star=0:6, likelihood = likelihood)
knitr::kable(posterior_pred, caption = "Posterior predictive distribution for a new trial of 6 patients")
Posterior predictive distribution for a new trial of 6 patients
y_star likelihood
0 0.5384615
1 0.2692308
2 0.1223776
3 0.0489510
4 0.0163170
5 0.0040793
6 0.0005828
Code
sum(posterior_pred$likelihood)
[1] 1

Why Bayesian? Practical Reasons

  • Apart from making more sense to use then P-value, the modern Bayesian approach offers some practical advantages to data analysis
    • Prior information
    • Direct probability statements
    • Subjectivity acknowledged and formally represented

Prior information

  • Science rarely proceeds in the absence of any prior information or opinion
  • Bayesian reasoning allows the formal incorporation of this opinion
  • The classical approach introduces prior information and opinion in study planning and, informally, at the discussion stage of results - high risk of inconsistency
  • In Bayesian approach:
    • For small sample sizes, a prior will “rein in” imprecise data-based estimates
    • For large sample sizes, most priors are overwhelmed by the likelihood (data)

Probability Statements

  • A result of a Bayesian analysis is a posterior distribution, \(P(\theta \mid data)\), for an unknown parameter (e.g. an odds ratio \(\theta\)) given the data

  • We can compute probabilities directly from the posterior distribution

\[P(\theta < 1 \mid data) = \int_0^1 P(\theta \mid data) d\theta\]

  • this is the area under the posterior density \(P(\theta \mid data)\) between the value of 0 and 1 for \(\theta\)

  • this is what we prefer p-value to be!

  • we can define \(\theta < 0.9\) as a clinically important effect, then we calculate the probability of \(P(\theta < 0.9 \mid data )\). No need to choose a “significance” level, we can directly use this probably to make clinical decisions!

Summary

Practical advantages of Bayesian approach

  • Direct probability statements

  • Incorporation of external evidence

  • Complex models can be built step-by-step

  • Makes you understand the analysis better

Disadvantages including

  • Learning how to code

  • Need to know more about distributions

  • Require good probability and statistical model knowledge

  • More time consuming (computationally more expensive with complex model and multiple parameters)

References

Hernández, Glenn, Gustavo A Ospina-Tascón, Lucas Petri Damiani, Elisa Estenssoro, Arnaldo Dubin, Javier Hurtado, Gilberto Friedman, et al. 2019. “Effect of a Resuscitation Strategy Targeting Peripheral Perfusion Status Vs Serum Lactate Levels on 28-Day Mortality Among Patients with Septic Shock: The ANDROMEDA-SHOCK Randomized Clinical Trial.” Jama 321 (7): 654–64.
Wasserstein, Ronald L, and Nicole A Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician. Taylor & Francis.
Zampieri, Fernando G, Lucas P Damiani, Jan Bakker, Gustavo A Ospina-Tascón, Ricardo Castro, Alexandre B Cavalcanti, and Glenn Hernandez. 2020. “Effects of a Resuscitation Strategy Targeting Peripheral Perfusion Status Versus Serum Lactate Levels Among Patients with Septic Shock. A Bayesian Reanalysis of the ANDROMEDA-SHOCK Trial.” American Journal of Respiratory and Critical Care Medicine 201 (4): 423–29.