Introduction to Bayesian inference

Learning Objectives

Introduce Bayesian inference
Learn two simple Bayesian models (Beta-binomial & normal-normal)
Discuss practical advantages and disadvantages of Bayesian approach

1. Introduction to Bayesian approach

The Bayesian approach to estimating parameters stems from Bayes’ theorem:

Let \(\theta\) be the parameter of interest and \(y\) be the observed data,

\[\begin{aligned} P(\theta \mid y) & = \frac{P(y \mid \theta)P(\theta)}{P(y)} \\ & = \frac{\text{likelihood of data given parameter} \times \text{prior}}{\text{marginal distribution of data free of the parameter}} \\ & \propto \text{likelihood}(y \mid \theta ) \times \text{prior}(\theta) \end{aligned}\]

\(P(y)\) is called a normalizing factor, it’s in place to ensure that \(\int P(\theta \mid y) d\theta = 1\), that is the posterior distribution of \(\theta\) is a proper probability distribution with area under the density curve equals to 1.
Its value is not of interest, unless we are comparing between data models.
The essence of Bayes theorem only concerns the terms involving the parameter, \(\theta\), hence \(P(\theta \mid y) \propto P(y\mid \theta)P(\theta)\).

Likelihood vs Probability distribution

Probability distribution: A function of the data given fixed parameters.
- A probability distribution \(f_y(y\mid \theta)\) sum or integrate to 1 overall all possible values of the random variable Y. \(\theta\) is fixed.
- It answers the question: “If the parameter were \(\theta\), how likely (probable) is each possible outcome y?”, “What data could I see if \(\theta\) is \(\theta^*\)”
Likelihood: The same mathematical formula, but viewed as a function of the parameters given fixed observed data.
- The likelihood function \(L(\theta \mid y_{obs})\) reuse the same math expression \(f_y(y\mid \theta)\), s.t., \(L(\theta \mid y_{obs}) = f_y(y\mid \theta)\).
- It answers the questions: Given the observed data \(y_{obs}\), how plausible is each parameter value \(\theta\)?“,”Which \(\theta\) best explains the observed data?”
- Likelihood does not require sum or integrate to 1.\(\theta\) can vary but \(y_{obs}\), data is fixed.

1.2 Working Example - Estimating a proportion

Frequenist approach

Suppose you have observed 20 patients in a Phase I RCT on a given dose of drug,

0 out of 20 patients have had an adverse event (\(=0\%\))
decision to escalate to a higher dose if it’s unlikely that the current dosing results in a true proportion of adverse events above 20%
This is a classic phase I estimate, let \(\theta\) be the true risk of adverse event (parameter of interest).
Under frequentist test (One-sided Exact Binomial Test), we are interested to test \(H_0: \theta = 0.2\) vs \(H_a: \theta \leq 0.2\) (testing whether the risk of AE is trending below 0.2).
If the \(H_0\) is true, we would expect to see \(0.2\times20=4\) patients experiencing AE. We find our \(p-value\) for this test by considering the probability of seeing an outcome as, or more extreme than the observed \(0\) AE among the 20 patient samples.

\[P-value = P(Y=0) = {20 \choose 0}0.2^0\times 0.8^{20} = 0.01152922\]

If we have a significance level of \(5\%\), the p-value indicates that we have evidence that is significant enough to reject the null. Given this test results, we will escalate the dose.

Code

binom.test(x=0, n=20, p = 0.2,alternative = "less")


    Exact binomial test

data:  0 and 20
number of successes = 0, number of trials = 20, p-value = 0.01153
alternative hypothesis: true probability of success is less than 0.2
95 percent confidence interval:
 0.0000000 0.1391083
sample estimates:
probability of success 
                     0

Code

#calculate P-value using binomial likelihood;
choose(20,0)*0.8^20

[1] 0.01152922

Code

#calculate P-value using binomial probability density function;
pbinom(q = 0, size = 20, p = 0.2, lower.tail = TRUE)

[1] 0.01152922

The observed proportion \(\hat{\theta}=0\) with 95% CI: 0 - 0.14.

Quiz - Interpreting P-value

Given the p-value = 0.012 from the binomial exact test, Which of the following are true?

The probability that the null hypothesis is true is 1.2%.
The probability that the alternative hypothesis is true is 1-1.2% = 98.8%.
There’s 1.2% chance of making a mistake by rejecting the null hypothesis.
Assuming the risk of AE is 0.2 (\(H_0\) is true), we would obtain sample estimated risk of AE as extreme or more than the observed 0%, in 1.2% of studies in the long run when repeating the trial.

Quiz - Interpreting Confidence Interval

Given the 95% CI of \(\theta\) is (0, 0.14), Which of the following are true?

If we recruit another sample of 20 patients, there is 95% chance the sample proportional of AE would be between 0% and 14%.
About 95% of patients in the sample will have risk of AE between 0 and 14%.
We are 95% confidence that the interval (0,0.14) captured the true risk of AE.
If we repeat this trial many times, then about 95% of the intervals produced will capture the true risk of AE.

What would a Bayesian do?

To make probability statements about \(\theta\) after observing data \(y\), we need a probability distribution for \(\theta\) given \(y\) (the posterior distribution).

1.First, we need to specify a prior distribution for \(\theta\), \(P(\theta)\).

Example 1: We might have no idea about \(\theta\) other than that it lies in the interval [0,1] and thus specify a unif(0,1). Let \(\theta \sim U(0,1)\), the prior probability distribution (p.d.f) is \[ P(\theta) = \frac{1}{1-0} = 1.\]

We assume the \(P(y \mid \theta)\) follows a binomial distribution, thus the likelihood of the observed data given \(\theta\) is \[ P(y = 0 \mid \theta) = {20 \choose 0} \theta^0 (1-\theta)^{20} = (1-\theta)^{20}\]
The posterior then becomes (given example prior 1)

\[\begin{align} P(\theta \mid y = 0) &= \frac{P(y = 0 \mid \theta) \times P(\theta)}{P(y=0)} \\ & = \frac{(1-\theta)^{20} \times 1}{P(y=0)} \\ & = \text{Constant} \times (1-\theta)^{20} \\ & \propto (1-\theta)^{20} \end{align}\]

Code

d <- tibble(p_grid = seq(from = 0, to = 1, length.out = 1000),
         y      = 0, 
         n      = 20) %>% 
    mutate(prior      = dunif(p_grid, 0, 1),
         likelihood = dbinom(y, n, p_grid)) %>% 
    mutate(posterior = likelihood * prior )

d %>% pivot_longer(prior:posterior) %>% 
  # this line allows us to dictate the order in which the panels will appear
    mutate(name = factor(name, levels = c("prior", "likelihood", "posterior"))) %>% 
    ggplot(aes(x = p_grid, y = value, fill = name)) +
    geom_area(show.legend = FALSE) +
    scale_fill_manual(values = c("grey", "red", "blue")) +
    facet_wrap(~ name, scales = "free")+
    theme_bw()

Approximate posterior distribution obtained using Bayes’ rule with UNIF(0,1) prior. In this example, the normalizaing term P(y=0) is not considered.

Posterior estimation with brms package in R

we will introduce MCMC in later sessions.

Code

dat1<-data.frame(y=rep(0,20))
# fit1 <- brm(y ~ 0+Intercept,
#             data = dat1,
#             family = brmsfamily(family="bernoulli", link="identity"),
#             init = 0,
#             prior = c(prior(uniform(0, 1), coef="Intercept")),
#             iter = 1000,
#             seed = 123,
#             sample_prior = TRUE)
# saveRDS(fit1, file = "brms_lecture2_binomail.rda")

fit1 <- readRDS("brms_lecture2_binomail.rda")
posterior_summary(fit1, variable = c("b_Intercept"), probs = c(0.025, 0.975) )

              Estimate  Est.Error         Q2.5     Q97.5
b_Intercept 0.04159862 0.03913757 0.0008174098 0.1442491

Code

hypothesis(fit1, 'Intercept < 0.2')

Hypothesis Tests for class b:
             Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (Intercept)-(0.2) < 0    -0.16      0.04     -0.2    -0.07       1999
  Post.Prob Star
1         1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Code

# plotting prior and posterior in brms;
hyp1 <- hypothesis(fit1, 'Intercept = 0')
data.frame(prior = hyp1$prior_samples$H1, posterior = hyp1$samples$H1) %>%
    gather(type, value) %>%
    ggplot(aes(x = value) ) +
    geom_histogram(bins = 50, alpha = 0.8) +
    facet_wrap(~type, scales = "free") +
    xlab(expression(theta) ) +
    theme_bw(base_size = 20)

Bayesian estimation of the AE example using brms package in R

2. The Beta-binomial model

With a single sample of \(n\) binary outcomes, we have one unknown parameter: \(\theta\) and \(0 \leq \theta \leq 1\).
We need a prior distribution for \(\theta\) (e.g., proportion, risk, probability of event etc): \(P(\theta)\)
- To express indifference between all values of \(\theta\), we can use a uniform distribution on \(\theta\), as we did in the previous example
- To express belief (e.g. based on external evidence) that some values of \(\theta\) are more likely that others, it is convenient to use a beta distribution
- This has two parameters, often labelled as \(\alpha\) (also written as a) and \(\beta\) (also written as b), which we can choose to represent the strength of the external evidence
- If a parameter has a Beta(a,b) distribution, then the prior mean is \[\frac{a}{a+b}\]
- The beta distribution prior for the binomial is useful for illustrating how the Bayesian approach combines prior information and new data

Beta-binomial model

Recall the likelihood for a binomial outcome of x successes in n trials,

\[ P(x \mid \theta) \propto \theta^x (1-\theta)^{n-x}\]

The \(Beta(\alpha,\beta)\) prior has the same functional form for \(\theta\),

\[ P_0( \theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}\]

We find the posterior as

\[\begin{align} P(\theta \mid x) & \propto P(x \mid \theta) \times P_0( \theta) \\ & \propto \theta^x (1-\theta)^{n-x} \times \theta^{\alpha-1} (1-\theta)^{\beta-1} \\ & \propto \theta^{x+\alpha-1} (1-\theta)^{n-x+\beta-1} \end{align}\]

Thus, comparing to the \(Beta(\alpha,\beta)\) prior, the posterior just changes the exponents by adding x and n-x, respectively.
- Comparing to the prior, make two changes to get the posterior:
  1. \(a \rightarrow a+x\), [a + number of events]
  2. \(b \rightarrow b+(n-x)\), [b + number of non-events]
- Quite simply, when \(x\) events have been observed in n subjects, the prior
\[ \theta \sim Beta(\alpha, \beta) \]
- gives the posterior
\[ \theta \mid x \sim Beta(\alpha+x, \beta+n-x) \]
The prior and posterior are both beta distributions!

Interpretation of Beta Prior

Suppose we start with a beta prior with small parameters

\[ \theta \sim Beta(0.001, 0.001) \]

Observe x events in n trials, the posterior

\[ \theta \mid x \sim Beta(0.001+x, 0.001+n-x) \approx Beta(x,n-x)\]

Posterior mean of \(\theta \approx \frac{x}{n}\), the equivalent to the MLE based only on the data
Interpretation of the \(Beta(\alpha,\beta)\) prior.
- Like having seen \(\alpha\) events and \(\beta\) non events in a sample size of \(\alpha + \beta\)
- Strength of prior information equivalent to prior “sample size” \(\alpha + \beta\)
- Prior mean = \(\frac{\alpha}{\alpha + \beta}\)

Working example of Beta-binomail model

Consider Beta(3,7)and Beta(12,28) priors

Gold line: prior belief that assumes approximately 3 events in 10 subjects
Blue line: prior belief that assumes approximately 12 events in 40(=12+28) subjects

Summarizing Posterior Distribution

Since we know the form of the posterior distribution, we can easily calculate functions such as:
1. Posterior mean \(E(\theta) = \frac{\alpha+x}{\alpha + \beta+n}\)
2. 95% Credible intervals (we will talk more about them next week)
3. \(P(\theta < 0.2)\), \(P(\theta < 0.5)\), \(P(0.4< \theta < 0.6)\), etc, which can be directly used to make probabilistic statement about the \(\theta\). E.g., the probability of the posterior adverse event rate < 0.2 is about 0.95.
Generate informative plots for assessing priors and posteriors. All this can easily be done using R.

Data overwhelming the prior

The posterior for the beta-binomial model after seeing x events in n trials is \(\theta \mid x ~ Beta(\alpha + x, \beta + n - x)\) with posterior mean as

\[E(\theta \mid x) = \frac{\alpha + x}{\alpha + \beta + n}\]

In \(n \gg \alpha + \beta\) and \(x \gg \alpha\) (sample size and number of events is large), recall when using prior \(Beta(0.001, 0.001)\), \[ E(\theta \mid x) = \frac{\alpha + x}{\alpha + \beta + n} \approx \frac{x}{n}\]
Here, prior is of little importance!

A Beta(1,1) is Unif(0,1)

When \(\alpha = \beta = 1\), \(P(\theta) \propto \theta^{1-1}(1-\theta)^{1-1} = 1\), which is the probability density of a unif(0,1).

Posterior mean as a weighted average

The posterior mean is

\[\begin{align} E(\theta \mid x) & = \frac{\alpha + x}{\alpha + \beta + n} \\ & = \frac{\alpha }{\alpha + \beta + n} + \frac{x }{\alpha + \beta + n}\\ & = \Big( \frac{\alpha+ \beta }{\alpha + \beta + n} \Big) \times \Big( \frac{\alpha }{\alpha + \beta} \Big) + \Big( \frac{n }{\alpha + \beta + n} \Big) \times \Big( \frac{x }{n} \Big) \\ & = w \times \Big( \frac{\alpha }{\alpha + \beta} \Big) + (1-w) \times \Big( \frac{x }{n} \Big) \\ & = w \times (\text{prior mean}) + (1-w) \times (\text{sample estimate}) \end{align}\]

Where \(w\) is the ratio of the “prior sample size” to the “total sample size” \[ w = \frac{\alpha+ \beta }{\alpha + \beta + n}.\]
This is a common theme in Bayesian models with actual prior information
The posterior distribution will lie between the prior and likelihood
The posterior mean is a weighted average of the prior mean and data-based estimate
As the sample size increases, the contribution of the prior diminishes

3. The Normal-normal model

Parameters with normal priors and likelihoods

Continuing with a single parameter model
The likelihood of the parameter given some data is that of a normal distribution with known variance \(\sigma^2\)

\[ y \mid \theta \sim N(\theta, \sigma^2) \]

suppose the prior for \(\theta\), the mean, follows a normal distribution

\[ \theta \sim N(\mu_0, V_0)\]

We want to make inference about \(\theta\) given the data y, what is the posterior distribution \(P(\theta \mid y)\)?

Normal-normal model

1. The normal model (likelihood) of one data point

Consider a single observation \(y\) from a normal distribution parameterized by a mean \(\theta\) and variance \(\sigma^2\), we assume that \(\sigma^2\) is known.
The normal likelihood for \(y\) given \(\theta\) is \[ P(y\mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta)^2}{2\sigma^2}}, y \in (-\infty, \infty)\]
The \(E(y) = \theta\) and \(V(y) = \sigma^2\) and \(SD(y) = \sigma\).
Given \(y \mid \theta \sim N(\theta,\sigma^2 )\), \(z = \frac{y-\theta}{\sigma} \sim N(0,1)\). Thus, if we rescale \(y\) by \(\sigma\), we have
- 95%CI of y: \(\theta \pm 1.96 \times \sigma\)

2. Prior

We can often use a normal prior to represent \(\theta \sim N(\mu_0, V_0)\)

\(\mu_0\) is the prior mean
\(V_0\) is the prior variance, which expresses the uncertainty around the mean \(\mu_0\)
\(\tau_0 = \frac{1}{V_0}\) is called precision, it can also be used to express the uncertainty around the mean, e.g. \(\theta \sim N(\mu_0, \tau_0)\)
High precision = low variance, in other words, the distribution of the normal random variable \(\theta\) is closely centered around its mean.

3. Posterior

With a normal prior and a normal likelihood of the data given parameter \(\theta\), (here we use precision to express the normal distributions)

\[ \theta \sim N(\mu_0,\tau_0 ) \text{ and } y \mid \theta \sim N(\theta,\tau_y )\]

where \(\tau_y = \frac{1}{\sigma^2}\) is the precision of the observed data
It can be proven mathematically using the Bayes’ rule that the posterior distribution given a normal prior and normal likelihood is also normal,

\[ \theta \mid y \sim N(\mu_1, \tau_1)\]

The posterior mean \(\mu_1\) is

\[ \frac{\tau_0 \times \mu_0 + \tau_y \times y}{\tau_0+ \tau_y} \]

The posterior precision is

\[ \tau_1 = \tau_0 + \tau_y\]

Again, the prior and posterior are both normal distributions!

Posterior mean as a weighted average of the prior mean and data-based estimate

Using the precision parameters

\[ \tau_y = \frac{1}{\sigma^2_y} \text{ and } \tau_0 = \frac{1}{V_0}\]

The posterior distribution of \(\theta\) is then

\[\theta \mid y \sim N(\mu_1, \tau_1)\]

where the mean and precision are

\[\begin{align} & \mu_1 = \Big(\frac{\tau_0}{\tau_0 + \tau_y}\Big) \times \mu_0 + \Big(\frac{\tau_y}{\tau_0 + \tau_y}\Big) \times y \\ & \tau_1 = \tau_0 + \tau_y \end{align}\]

The posterior mean is a weighted sum of the prior mean and the observed value.
The weights are the relative precisions of the prior and the likelihood \[\mu_1 = w_0 \times \mu_0 + w_y \times y\] \[w_0 = \frac{\tau_0}{\tau_0 + \tau_y}\] and

\[w_y = 1-w_0 = \frac{\tau_y}{\tau_0 + \tau_y}\]

Data overwhelming the prior and vice versa

With relatively low prior precision (imprecise prior and substantial data), i.e., \(\tau_0 \ll \tau_y\) and \(0 \approx w_0 \ll w_y \approx 1\),

\[ \mu_1 = w_0 \times \mu_0 + w_y \times y \approx y\]

With relative little data and aprior that is not too diffuse, i.e., \(\tau_y \ll \tau_0\) and \(0 \approx w_y \ll w_0 \approx 1\),

\[ \mu_1 = w_0 \times \mu_0 + w_y \times y \approx \mu_0\]

In most cases, the posterior mean is a compromise between what we believed before and what we observed in the data.

Normal model with multiple observations

Consider a set of \(n\) observations \(y = (y_1, \ldots, y_n)\) sample from a \(N(\theta, \sigma^2)\), where \(y_1,\ldots, y_n\) are identically, independently distributed and \(y_i \sim N(\theta, \sigma^2)\), \(i = 1, \ldots, n\).
The normal likelihood of the joint distribution \(y = (y_1, \ldots, y_n)\) is

\[\begin{align} P(y_1, \ldots, y_n \mid \theta) &= \prod_{i=1}^n P(y_i \mid \theta) \\ & = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i-\theta)^2}{2\sigma^2}}, y \in (-\infty, \infty) \end{align}\]

Given prior \(\theta \sim N(\mu_0, \tau_0)\), the posterior of \(\theta \mid y\) is

\[P(\theta \mid y_1, \ldots, y_n) = N(\mu_n, \tau_n)\]

where the posterior mean \(\mu_n\) is

\[\mu_n = \frac{\tau_0 \times \mu_0 + n\tau_y \times \bar{y}}{\tau_0+ n\tau_y} = \Big( \frac{\tau_0}{\tau_0+ n\tau_y} \Big) \times \mu_0 + \Big( \frac{n\tau_y}{\tau_0+ n\tau_y} \Big) \times \bar{y}\]

and the posterior precision is

\[ \tau_n = \tau_0 + n\tau_y\]

\(\bar{y} = \frac{\sum_{i=1}^2y_i}{n}\) is the sample mean and \(\tau_y = \frac{1}{\sigma^2}\)
Again, the posterior mean is a weighted sum between prior and data. As sample size \(n \rightarrow \infty\), \(\Big( \frac{\tau_0}{\tau_0+ n\tau_y} \Big) \rightarrow 0\) and \(\Big( \frac{n\tau_y}{\tau_0+ n\tau_y} \Big) \rightarrow 1\), data will overwhelm the prior and dominate the posterior distribution.

The use of normal-normal model in clinical applications

Consider estimating a parameter \(\theta\), e.g. mean, log-hazard ratio, log-odds-ratio, log-relative-risk
Suppose we have an estimate of \(\theta\); We will call it \(y\) and let \(\hat{\sigma}^2_y\) be the estimated variance
This estimate \(y\) can be viewed as a single data observation of \(\theta\)!
In large samples, it is approximately true that \(y \sim N(\theta, \sigma^2_y)\), where \(\theta\) is the true value and \(\sigma^2_y\) is the true variance (sample size dependent) and the 95%CI of \(y\) is \(\theta \pm 1.96 \times \sigma\).
We can construct a normal likelihood of data for the parameter \(\theta\) as \(y \mid N(\theta, \hat{\sigma}^2_y)\)

Working exmaple - Treatment effect estimate on the mean

For this example, we will simulate data for a small study of n=100 subjects and randomized 50% to treat=0 and 50% to treat=1 (mimicking an RCT). We will then generate the outcome Y from a normal distribution depending on the treatment indicator.
Suppose we have the following table of results from a linear regression estimating treatment effect \(\theta\) on the outcome

Code

n <- 100
set.seed(1234)
treat <- 1*(runif(n, min = 0, max = 1)<0.5)
y <- 5*treat+rnorm(n, mean = 0, sd =10)
fit<-lm(y~treat)
summary(fit)


Call:
lm(formula = y ~ treat)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8662  -6.7095  -0.5731   5.7503  24.1866 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    1.303      1.423   0.916   0.3618  
treat          4.080      1.918   2.127   0.0359 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.543 on 98 degrees of freedom
Multiple R-squared:  0.04414,   Adjusted R-squared:  0.03438 
F-statistic: 4.525 on 1 and 98 DF,  p-value: 0.03591

Code

round(confint.default(fit, 'treat', level=0.95),3) # based on asymptotic normality

      2.5 % 97.5 %
treat 0.321   7.84

We are interested in using this published results to make Bayesian inference on \(\theta\)
Our observed data (estimate) of \(\theta\), denoted as \(y\) is 4.08 with standard error 1.918. Here, \(y\) plays the role of data! We call \(y\) the observed datum.
Relying on the large sample approximation, we assume that the estimate y arose from a normal distribution with true mean \(\theta\), We treat the observed standard error \(\hat{\sigma}_y = 1.918\) as the true standard deviation of \(y\)
- Recall, the standard deviation of the sampling distribution of a parameter estimate is called its standard error
This gives us the likelihood \[y \mid \theta \sim N(\theta, 1.918^2)\]
The width of the 95% CI is approximately equal to \(2 \times 1.96 \times \sigma_y\)
Suppose we are not provided with \(\sigma_y\) from the table of results, but know the 95% CI of y is (0.321,7.84), we can calculate \(\sigma_y\) with

\[\sigma_y = \frac{7.84- 0.321}{2\times 1.96}=1.918\]

The normal likelihood is this example is \(y \sim N(4.08, 1.918^2)\), if we assume a prior \(\theta \sim N(0, 2 \times 1.918^2)\), we have a posterior

\[\begin{align} & \theta \mid y \sim N(\mu_1, \tau_1) \\ \mu_1 &= \frac{ \frac{1}{prior.sd^2} }{\frac{1}{prior.sd^2} + \frac{1}{se.obs^2} } \times prior.mean + \frac{\frac{1}{se.obs^2}}{\frac{1}{prior.sd^2} + \frac{1}{se.obs^2}} \times y.obs \\ & = \frac{ \frac{1}{2 \times 1.918^2} }{\frac{1}{2 \times 1.918^2} + \frac{1}{1.918^2} } \times 0 + \frac{\frac{1}{1.918^2}}{\frac{1}{2 \times 1.918^2} + \frac{1}{1.918^2}} \times 4.08 = 2.72 \\ \tau_1 & = \tau_0 + \tau_y \\ & = \frac{1}{prior.sd^2} + \frac{1}{se.obs^2} \\ & = \frac{1}{2 \times 1.918^2} + \frac{1}{1.918^2} = 0.408 \end{align}\]

Code

d <- tibble(theta = seq(from = -10, to = 15, length.out = 200),
         y.obs=4.08,
         se.obs=1.918,
         prior.mean=0,
         prior.sd=sqrt(2)*1.918) %>% 
  mutate(priorDensity = dnorm(theta, prior.mean, prior.sd)) %>%
  mutate(likelihood.0 = dnorm(theta, y.obs, se.obs)) %>%
  mutate(w = (1/se.obs^2)/(1/se.obs^2 + 1/prior.sd^2)) %>%
  mutate(post.prec = 1/se.obs^2 + 1/prior.sd^2) %>%
  mutate(posteriorDensity = dnorm(theta, w*y.obs+(1-w)*prior.mean, sqrt(1/post.prec))) %>%
  mutate(likelihood = likelihood.0/sum(diff(theta)*likelihood.0[-1])) %>%
  pivot_longer(cols=c(priorDensity,likelihood,posteriorDensity),names_to="Function",values_to="p")
  
d$Function <- factor(d$Function, 
                     levels=c("priorDensity","likelihood","posteriorDensity"),
                     labels=c("Prior","Likelihood","Posterior"))

p<-ggplot(d, aes(theta, p,colour=Function))+
  geom_line(size=1)+
  xlab(expression(theta))+
  ylab(expression(p(theta)))+
  theme(legend.position = "bottom")+
  scale_colour_manual(values=c("goldenrod","steelblue","black")) +theme_bw()

p

Working example - Bayesian reanalysis of RCT - Odds Ratio

(frequentist) Effect of a Resuscitation Strategy Targeting Peripheral Perfusion Status vs Serum Lactate Levels on 28-Day Mortality Among Patients With Septic Shock The ANDROMEDA-SHOCK Randomized Clinical Trial (Hernández et al. 2019)
Open-access link: https://jamanetwork.com/journals/jama/fullarticle/2724361
(Bayesian reanalysis) Effects of a Resuscitation Strategy Targeting Peripheral Perfusion Status versus Serum Lactate Levels among Patients with Septic Shock. A Bayesian Reanalysis of the ANDROMEDA-SHOCK Trial (Zampieri et al. 2020)
Open-access link: https://www.atsjournals.org/doi/full/10.1164/rccm.201905-0968OC
We are interested to make inference on the odds ratio of peripheral perfusion versus Lactate level on 28-Day mortality among patients with septic shock.
We obtain a point estimate of OR and it’s 95%CI from (Hernández et al. 2019) give the following 2x2 table.

\[\hat{OR} = \frac{74/138}{92/120} = 0.699\] \[SE\{ln(\hat{OR})\} = \sqrt{\frac{1}{74}+\frac{1}{138}+\frac{1}{92}+\frac{1}{120}} = 0.2\]

\[95\% CI = exp(log(\hat{OR} - 1.96\times SE\{ln(\hat{OR})\}) \text{ to } exp(log(\hat{OR} + 1.96\times SE\{ln(\hat{OR})\}) = (0.473, 1.035)\]

Intervention	Death within 28d	Alive by 28d	Total
Peripheral Perfusion-Targeted Resuscitation	74	138	212
Lactate Level-Targeted Resuscitation	92	120	212
Total	166	258	424

We treat this as a data evidence!

1. The normal likelihood on log(OR)

The estimated OR on 28-day mortality, 0.699 (0.473 to 1.035), is called the observed datum (viewed as a single data observation on \(\theta\), denoted as \(y\))
It’s known that log of the parameter estimate, such as HR, RR and OR, follow a normal distribution around its true log value.
Let \(\theta\) denote the random variable representing log(OR)
Let \(y\) denote an observed data point of \(\theta\), we have \(y \sim N(\theta, \sigma_y^2)\), where \(\theta\) is log(OR) and \(\sigma_y\) is the standard error of the log(OR). \(\sigma_y=0.2\) from the result table.
If we don’t have \(\sigma_y^2\) but only the reported 95% CI of the log of OR, we can take the width of the 95% CI and divided by \(2\times 1.96\) to obtain \(\sigma_y^2\). For example, we have

\[\hat{\sigma}_y = \frac{log(1.035) - log(0.473)}{2\times 1.96} = 0.2\]

Given the requentist analysis results, we have a data likelihood (evidence) \(y \mid \theta \sim N(\theta,0.2^2)\) where we have observed \(y=log(0.699)=-0.358\)

2. The prior for log(OR)

From (Zampieri et al. 2020), “a neutral prior in which the log[OR] of mean of 0 and an SD of 0.5, corresponding to an OR of 1 (95% credible interval, 0.37-2.75)”

Prior OR beliefs on the effectiveness of peripheral perfusion on 28 days mortality from the Bayesian reanalysis paper

Suppose we consider a neutral prior for log(OR) from the reanalysis paper (Zampieri et al. 2020)

The neutral prior is \(log(OR) \sim N(log(1), 0.5^2)\)

3. The posterior of log(OR)

Given prior \(\theta \sim N(0, 0.5^2)\) and likelihood \(y \mid \theta \sim (\theta,0.2^2)\) where \(y=log(0.699)=-0.358\)
Prior precision is \(\frac{1}{0.5^2} = 4\) and the likelihood precision is \(\frac{1}{0.2^2} = 25\), we can see that the likelihood has more information than the prior!
The posterior precision on log(OR) is \(\tau_1 = 4 + 25=29\)
The posterior mean is

\[ \frac{4}{29} \times log(1) + \frac{25}{29} \times log(0.699)=-0.309\]

The posterior variance of \(\log(OR) = \frac{1}{29}\) and the posterior standard deviation is \(\sqrt{\frac{1}{29}} = 0.186\), thus

\[\theta \mid y \sim N(\mu_1 = -0.309, \sigma_1^2 = 0.186^2)\]

The posterior mean OR = exp(-0.309) = 0.734 and the posterior 95% Credible Interval for OR is \[\exp(-0.309 \pm 1.96\times 0.186) \rightarrow (0.51, 1.057) \]

Code

#prior on log(OR);
mu0<- 0
sd0<- 0.5
prec0 <- 1/sd0^2

#likelihood on log(OR);
ybar<- -0.358
sd.ybar<- 0.2 
prec.y <- 1/sd.ybar^2

#posterior on log(OR)
prec1 <- prec0 + prec.y
mu1 <- (prec0 * mu0 + prec.y*ybar)/prec1
sigma1 <- 1/sqrt(prec1)


mu.plot <- seq(ybar-1.5,ybar+2,
               length.out=200)

lik <- dnorm(ybar, mu.plot, sd.ybar)
prior <- dnorm(mu.plot, mu0, sd0)
posterior<- dnorm(mu.plot, mu1,sigma1)

d<-data.frame(mu.plot=rep(mu.plot, 3),
              p =c(lik,prior,posterior),
              Function=rep(c("likelihood","prior","posterior"),each=length(mu.plot)))

p1<- ggplot(d, aes(mu.plot, p,colour=Function))+
    geom_line(size = 1)+
     xlab("log(OR)")+
     ylab("Probability Density")+
    scale_colour_manual(values=c("goldenrod","steelblue","black"))+
     theme_bw()+
     ggtitle("Neural prior on log(OR): plot on log(OR) scale")



p2<- ggplot(d, aes(exp(mu.plot), p,colour=Function))+
    geom_line(size = 1)+
     xlab("OR")+
     ylab("Probability Density")+
    scale_colour_manual(values=c("goldenrod","steelblue","black"))+
     theme_bw()+
     ggtitle("Neural prior on log(OR): plot on OR scale")

ggarrange(p1,p2, nrow=2)

Posterior summary statistics from R

Code

thresholds<-c(1,0.8)
res <-exp( qnorm(c(0.5, 0.025, 0.975), mu1, sigma1))
  res <- c(res, pnorm(log(thresholds), mu1, sigma1))
  names(res) <- c("median","q2.5","q97.5",paste0("Pr(OR < ",thresholds,")"))
  res

      median         q2.5        q97.5   Pr(OR < 1) Pr(OR < 0.8) 
   0.7344593    0.5103917    1.0568951    0.9517409    0.6773526

Summary of Conjugate priors & models

There are a small number of prior-likelihood pairs where the prior and the posterior are in the same family (e.g., both beta, both normal): these are called conjugate models
These posterior distributions can be computed without specialized software
These examples are useful for illustrating how Bayesian methods combine prior information with data
They have only limited practical usefulness - limits on types priors, limits on number of parameters
They are useful teaching tools, soon we will see how we can extend beyond these simple models

4. Why Bayesian: Statistical Estimation Revisit

4.1 Classic Frequentist Inference

Many common intuitive data-based estimators are maximum likelihood estimators
- sample mean, \(\bar{x}\)
- sample proportion, \(\hat{p}\)
- rate of an event, \(\hat{\lambda} = \frac{\text{Number Events}}{\text{Follow-up Time}}\)
- least square estimators from linear regression model
- OR & RR
- HR, cox regression uses partial-likelihood approach
Likelihood-based inference does two things
- Finds the MLE, the parameter value that is most likely, given the observed data.
- Finds how much curvature there is near this maximum. This leads to estimates of standard errors and CIs.
The 95% CI refers to the frequency with which hypothetical future random intervals from repeated sampling constructed in the same way will include the true value
- A particular interval will either include the value or not.
- It does not mean that the interval contains the true value with 0.95 probability, i.e. we do not have \(P(\theta \mid data)\)

Hypothesis Testing in a nutshell

Hypothesis testing is a procedure for determining whether a hypothesis is valid or not, based on sample evidence

Set up two hypotheses: Null (\(H_0\)) vs Alternative (\(H_A\))
- \(H_0\), Specifies population parameter of interest & proposes baseline value for it. Null presumed true unless disproved
- \(H_A\), Contains value of parameter different from H0; typically what you want to show
Specify appropriate model (find test statistic)
- Z test, T test, chi-square test etc
- under the assumption that the model is correct
Mechanics (calculate p-value)
- We want to compare data to what we would expect to see if \(H_0\) was true and ask how likely it is to get these data, if \(H_0\) was true
- P-value is used to quantify evidence against \(H_0\); The probability of getting equally or more extreme data (i.e. test statistic) under repeated sampling if \(H_0\) is true
\[ P_{-value} = P\Bigg( \begin{matrix} \text{observed or more} \\ \text{extreme test statistics} \end{matrix} \mid H_0 \ is \ true \Bigg) \]
- P-value tells how “surprising” are data you observed under \(H_0\)
Arrive at conclusion: Reject \(H_0\) or not

Decisions about rejecting H0 or not are usually done by computing a p-value and comparing to \(\alpha\), significance level
significance level, an arbitrary threshold for P-value to define “rare/extreme event”. If P-value falls below this level, we conclude results are statistically significance and we reject the null

Hypothesis Tests Errors

Type I error (\(\alpha\)): \(H_0\) is true, but we reject it \(\alpha = P(Reject \ H_0 | H_0 \ is \ true)\)
- We commonly control for Type I error - “innocent until proven guilty”
Type II error (\(\beta\)): \(H_0\) is false, but we fail reject it \(\beta = P(Fail \ to \ Reject \ H_0 | H_0 \ is \ false)\)
- A test’s ability to detect a false null hypothesis is called the power of the test, \(1-\beta\)
For any test, there is trade-off between \(\alpha\) and \(\beta\) (e.g., for fixed n, decreasing\(\alpha\) increases \(\beta\), vice-versa )

More about p-value (Wasserstein and Lazar 2016)

By its definition, P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- “A conclusion does not immediately become true on one side of the divide and false on the other.”
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
“By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”
- “For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.”
- absence of evidence is not evidence of absence
Inference is performed based on data that have not been observed, under an assumption that is almost certain not to be true
- Would you be willing to bet that the true probability a coin comes up heads is exactly P(head)=0.5, or that the exact true adverse rate is fixed at 0.2?

4.2 Bayesian Inference

In Bayesian statistics, inference is done using the posterior distribution
Given a posterior distribution, we can calculate any summary of it we like
- \(P(0.008 < \theta < 0.012)\), \(P( \theta < 0.1)\), interval with 95% probability of containing \(\theta\)
- We can use quantities to summarize the posterior distribution (mean, median, sd, IQR)
- We can plot the density function and shade areas of interest

Posterior uncertainty

Posterior variance is used to characterize the posterior uncertainty
- In most cases, it is smaller than the prior variance
- In some cases, for some models, the opposite is true, when prior is in direct conflict with the data
Credible intervals play the role of the confidence intervals
- For the 95% credible interval \([\alpha, \beta]\), \(P(\alpha< \theta <\beta \mid y)\) = 0.95

Posterior predictive distributions

Often we are interested in extending our (Bayesian) analysis using priors and data, and predicting outcomes in data observed in the future
- e.g., predicting number of successes in a future trial for another 25 new observations
Essentially we want the probability density \(P(y^* \mid y)\), where with \(y^*\) we denote the new data
- This probability is not always possible to compute directly, it does not correspond to some known distribution
- We will see that these kinds of questions can be answered using Markov Chain Monte Carlo method and brms (coming up in a couple of weeks)

Posterior predictive distribution

Let \(y^*\) denote a new data, the posterior predictive distribution for \(y^*\) is

\[P(y^* \mid y) = \int P(y^* \mid \theta) P(\theta \mid y) \ d \theta\]

The posterior predictive distribution as an average of conditional predictions over the posterior distribution of \(\theta\)
We assume conditional independence of \(y^*\) and \(y\) given \(\theta\)

In the beta-binomial adverse event example, let \(y^*\) denote the result of a new trial of 6 patients given posterior \(\theta \mid y \sim Beta(1,7)\),

\[P(y^* \mid y) = \int_0^1 {6 \choose y^*} \theta^{y^*} (1- \theta)^{6-y^*} P(\theta \mid y) \ d \theta, y^* \in \{0,1,2,3,4,5,6\}\]

\[P(\theta \mid y) = \frac{\Gamma(8)}{\Gamma(1)\Gamma(7)} \theta^{0}(1-theta)^{6}\]

Code

likelihood<-length(7)
for (i in 1:7) {

integrand_pred <- function(theta) {
  (choose(6, i-1)*theta^(i-1)*(1-theta)^(6-i+1))*((gamma(8)/gamma(1)/gamma(7))*(1-theta)^6)}

likelihood[i] <- integrate(integrand_pred,lower = 0, upper = 1)$value   
    }
  
posterior_pred<-data.frame(y_star=0:6, likelihood = likelihood)
knitr::kable(posterior_pred, caption = "Posterior predictive distribution for a new trial of 6 patients")

Posterior predictive distribution for a new trial of 6 patients
y_star	likelihood
0	0.5384615
1	0.2692308
2	0.1223776
3	0.0489510
4	0.0163170
5	0.0040793
6	0.0005828

Code

sum(posterior_pred$likelihood)

[1] 1

Why Bayesian? Practical Reasons

Apart from making more sense to use then P-value, the modern Bayesian approach offers some practical advantages to data analysis
- Prior information
- Direct probability statements
- Subjectivity acknowledged and formally represented

Prior information

Science rarely proceeds in the absence of any prior information or opinion
Bayesian reasoning allows the formal incorporation of this opinion
The classical approach introduces prior information and opinion in study planning and, informally, at the discussion stage of results - high risk of inconsistency
In Bayesian approach:
- For small sample sizes, a prior will “rein in” imprecise data-based estimates
- For large sample sizes, most priors are overwhelmed by the likelihood (data)

Probability Statements

A result of a Bayesian analysis is a posterior distribution, \(P(\theta \mid data)\), for an unknown parameter (e.g. an odds ratio \(\theta\)) given the data
We can compute probabilities directly from the posterior distribution

\[P(\theta < 1 \mid data) = \int_0^1 P(\theta \mid data) d\theta\]

this is the area under the posterior density \(P(\theta \mid data)\) between the value of 0 and 1 for \(\theta\)
this is what we prefer p-value to be!
we can define \(\theta < 0.9\) as a clinically important effect, then we calculate the probability of \(P(\theta < 0.9 \mid data )\). No need to choose a “significance” level, we can directly use this probably to make clinical decisions!

Summary

Practical advantages of Bayesian approach

Direct probability statements
Incorporation of external evidence
Complex models can be built step-by-step
Makes you understand the analysis better

Disadvantages including

Learning how to code
Need to know more about distributions
Require good probability and statistical model knowledge
More time consuming (computationally more expensive with complex model and multiple parameters)

References

Hernández, Glenn, Gustavo A Ospina-Tascón, Lucas Petri Damiani, Elisa Estenssoro, Arnaldo Dubin, Javier Hurtado, Gilberto Friedman, et al. 2019. “Effect of a Resuscitation Strategy Targeting Peripheral Perfusion Status Vs Serum Lactate Levels on 28-Day Mortality Among Patients with Septic Shock: The ANDROMEDA-SHOCK Randomized Clinical Trial.” Jama 321 (7): 654–64.

Wasserstein, Ronald L, and Nicole A Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician. Taylor & Francis.

Zampieri, Fernando G, Lucas P Damiani, Jan Bakker, Gustavo A Ospina-Tascón, Ricardo Castro, Alexandre B Cavalcanti, and Glenn Hernandez. 2020. “Effects of a Resuscitation Strategy Targeting Peripheral Perfusion Status Versus Serum Lactate Levels Among Patients with Septic Shock. A Bayesian Reanalysis of the ANDROMEDA-SHOCK Trial.” American Journal of Respiratory and Critical Care Medicine 201 (4): 423–29.