Learn two simple Bayesian models (Beta-binomial & normal-normal)
Discuss practical advantages and disadvantages of Bayesian approach
1. Introduction to Bayesian approach
The Bayesian approach to estimating parameters stems from Bayes’ theorem:
Let \(\theta\) be the parameter of interest and \(y\) be the observed data,
\[\begin{aligned}
P(\theta \mid y) & = \frac{P(y \mid \theta)P(\theta)}{P(y)} \\
& = \frac{\text{likelihood of data given parameter} \times \text{prior}}{\text{marginal distribution of data free of the parameter}} \\
& \propto \text{likelihood}(y \mid \theta ) \times \text{prior}(\theta)
\end{aligned}\]
\(P(y)\) is called a normalizing factor, it’s in place to ensure that \(\int P(\theta \mid y) d\theta = 1\), that is the posterior distribution of \(\theta\) is a proper probability distribution with area under the density curve equals to 1.
Its value is not of interest, unless we are comparing between data models.
The essence of Bayes theorem only concerns the terms involving the parameter, \(\theta\), hence \(P(\theta \mid y) \propto P(y\mid \theta)P(\theta)\).
Likelihood vs Probability distribution
Probability distribution: A function of the data given fixed parameters.
A probability distribution \(f_y(y\mid \theta)\) sum or integrate to 1 overall all possible values of the random variable Y. \(\theta\) is fixed.
It answers the question: “If the parameter were \(\theta\), how likely (probable) is each possible outcome y?”, “What data could I see if \(\theta\) is \(\theta^*\)”
Likelihood: The same mathematical formula, but viewed as a function of the parameters given fixed observed data.
The likelihood function \(L(\theta \mid y_{obs})\) reuse the same math expression \(f_y(y\mid \theta)\), s.t., \(L(\theta \mid y_{obs}) = f_y(y\mid \theta)\).
It answers the questions: Given the observed data \(y_{obs}\), how plausible is each parameter value \(\theta\)?“,”Which \(\theta\) best explains the observed data?”
Likelihood does not require sum or integrate to 1.\(\theta\) can vary but \(y_{obs}\), data is fixed.
1.2 Working Example - Estimating a proportion
Frequenist approach
Suppose you have observed 20 patients in a Phase I RCT on a given dose of drug,
0 out of 20 patients have had an adverse event (\(=0\%\))
decision to escalate to a higher dose if it’s unlikely that the current dosing results in a true proportion of adverse events above 20%
This is a classic phase I estimate, let \(\theta\) be the true risk of adverse event (parameter of interest).
Under frequentist test (One-sided Exact Binomial Test), we are interested to test \(H_0: \theta = 0.2\) vs \(H_a: \theta \leq 0.2\) (testing whether the risk of AE is trending below 0.2).
If the \(H_0\) is true, we would expect to see \(0.2\times20=4\) patients experiencing AE. We find our \(p-value\) for this test by considering the probability of seeing an outcome as, or more extreme than the observed \(0\) AE among the 20 patient samples.
If we have a significance level of \(5\%\), the p-value indicates that we have evidence that is significant enough to reject the null. Given this test results, we will escalate the dose.
Code
binom.test(x=0, n=20, p =0.2,alternative ="less")
Exact binomial test
data: 0 and 20
number of successes = 0, number of trials = 20, p-value = 0.01153
alternative hypothesis: true probability of success is less than 0.2
95 percent confidence interval:
0.0000000 0.1391083
sample estimates:
probability of success
0
Code
#calculate P-value using binomial likelihood;choose(20,0)*0.8^20
[1] 0.01152922
Code
#calculate P-value using binomial probability density function;pbinom(q =0, size =20, p =0.2, lower.tail =TRUE)
[1] 0.01152922
The observed proportion \(\hat{\theta}=0\) with 95% CI: 0 - 0.14.
Quiz - Interpreting P-value
Given the p-value = 0.012 from the binomial exact test, Which of the following are true?
The probability that the null hypothesis is true is 1.2%.
The probability that the alternative hypothesis is true is 1-1.2% = 98.8%.
There’s 1.2% chance of making a mistake by rejecting the null hypothesis.
Assuming the risk of AE is 0.2 (\(H_0\) is true), we would obtain sample estimated risk of AE as extreme or more than the observed 0%, in 1.2% of studies in the long run when repeating the trial.
Quiz - Interpreting Confidence Interval
Given the 95% CI of \(\theta\) is (0, 0.14), Which of the following are true?
If we recruit another sample of 20 patients, there is 95% chance the sample proportional of AE would be between 0% and 14%.
About 95% of patients in the sample will have risk of AE between 0 and 14%.
We are 95% confidence that the interval (0,0.14) captured the true risk of AE.
If we repeat this trial many times, then about 95% of the intervals produced will capture the true risk of AE.
What would a Bayesian do?
To make probability statements about \(\theta\) after observing data \(y\), we need a probability distribution for \(\theta\) given \(y\) (the posterior distribution).
1.First, we need to specify a prior distribution for \(\theta\), \(P(\theta)\).
Example 1: We might have no idea about \(\theta\) other than that it lies in the interval [0,1] and thus specify a unif(0,1). Let \(\theta \sim U(0,1)\), the prior probability distribution (p.d.f) is \[ P(\theta) = \frac{1}{1-0} = 1.\]
We assume the \(P(y \mid \theta)\) follows a binomial distribution, thus the likelihood of the observed data given \(\theta\) is \[ P(y = 0 \mid \theta) = {20 \choose 0} \theta^0 (1-\theta)^{20} = (1-\theta)^{20}\]
The posterior then becomes (given example prior 1)
d <-tibble(p_grid =seq(from =0, to =1, length.out =1000),y =0, n =20) %>%mutate(prior =dunif(p_grid, 0, 1),likelihood =dbinom(y, n, p_grid)) %>%mutate(posterior = likelihood * prior )d %>%pivot_longer(prior:posterior) %>%# this line allows us to dictate the order in which the panels will appearmutate(name =factor(name, levels =c("prior", "likelihood", "posterior"))) %>%ggplot(aes(x = p_grid, y = value, fill = name)) +geom_area(show.legend =FALSE) +scale_fill_manual(values =c("grey", "red", "blue")) +facet_wrap(~ name, scales ="free")+theme_bw()
Approximate posterior distribution obtained using Bayes’ rule with UNIF(0,1) prior. In this example, the normalizaing term P(y=0) is not considered.
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (Intercept)-(0.2) < 0 -0.16 0.04 -0.2 -0.07 1999
Post.Prob Star
1 1 *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
Bayesian estimation of the AE example using brms package in R
2. The Beta-binomial model
With a single sample of \(n\) binary outcomes, we have one unknown parameter: \(\theta\) and \(0 \leq \theta \leq 1\).
We need a prior distribution for \(\theta\) (e.g., proportion, risk, probability of event etc): \(P(\theta)\)
To express indifference between all values of \(\theta\), we can use a uniform distribution on \(\theta\), as we did in the previous example
To express belief (e.g. based on external evidence) that some values of \(\theta\) are more likely that others, it is convenient to use a beta distribution
This has two parameters, often labelled as \(\alpha\) (also written as a) and \(\beta\) (also written as b), which we can choose to represent the strength of the external evidence
If a parameter has a Beta(a,b) distribution, then the prior mean is \[\frac{a}{a+b}\]
The beta distribution prior for the binomial is useful for illustrating how the Bayesian approach combines prior information and new data
Beta-binomial model
Recall the likelihood for a binomial outcome of x successes in n trials,
Thus, comparing to the \(Beta(\alpha,\beta)\) prior, the posterior just changes the exponents by adding x and n-x, respectively.
Comparing to the prior, make two changes to get the posterior:
\(a \rightarrow a+x\), [a + number of events]
\(b \rightarrow b+(n-x)\), [b + number of non-events]
Quite simply, when \(x\) events have been observed in n subjects, the prior
\[ \theta \sim Beta(\alpha, \beta) \]
gives the posterior
\[ \theta \mid x \sim Beta(\alpha+x, \beta+n-x) \]
The prior and posterior are both beta distributions!
Interpretation of Beta Prior
Suppose we start with a beta prior with small parameters
\[ \theta \sim Beta(0.001, 0.001) \]
Observe x events in n trials, the posterior
\[ \theta \mid x \sim Beta(0.001+x, 0.001+n-x) \approx Beta(x,n-x)\]
Posterior mean of \(\theta \approx \frac{x}{n}\), the equivalent to the MLE based only on the data
Interpretation of the \(Beta(\alpha,\beta)\) prior.
Like having seen \(\alpha\) events and \(\beta\) non events in a sample size of \(\alpha + \beta\)
Strength of prior information equivalent to prior “sample size” \(\alpha + \beta\)
Prior mean = \(\frac{\alpha}{\alpha + \beta}\)
Working example of Beta-binomail model
Consider Beta(3,7)and Beta(12,28) priors
Gold line: prior belief that assumes approximately 3 events in 10 subjects
Blue line: prior belief that assumes approximately 12 events in 40(=12+28) subjects
Summarizing Posterior Distribution
Since we know the form of the posterior distribution, we can easily calculate functions such as:
Posterior mean \(E(\theta) = \frac{\alpha+x}{\alpha + \beta+n}\)
95% Credible intervals (we will talk more about them next week)
\(P(\theta < 0.2)\), \(P(\theta < 0.5)\), \(P(0.4< \theta < 0.6)\), etc, which can be directly used to make probabilistic statement about the \(\theta\). E.g., the probability of the posterior adverse event rate < 0.2 is about 0.95.
Generate informative plots for assessing priors and posteriors. All this can easily be done using R.
Data overwhelming the prior
The posterior for the beta-binomial model after seeing x events in n trials is \(\theta \mid x ~ Beta(\alpha + x, \beta + n - x)\) with posterior mean as
In \(n \gg \alpha + \beta\) and \(x \gg \alpha\) (sample size and number of events is large), recall when using prior \(Beta(0.001, 0.001)\), \[ E(\theta \mid x) = \frac{\alpha + x}{\alpha + \beta + n} \approx \frac{x}{n}\]
Here, prior is of little importance!
A Beta(1,1) is Unif(0,1)
When \(\alpha = \beta = 1\), \(P(\theta) \propto \theta^{1-1}(1-\theta)^{1-1} = 1\), which is the probability density of a unif(0,1).
Where \(w\) is the ratio of the “prior sample size” to the “total sample size” \[ w = \frac{\alpha+ \beta }{\alpha + \beta + n}.\]
This is a common theme in Bayesian models with actual prior information
The posterior distribution will lie between the prior and likelihood
The posterior mean is a weighted average of the prior mean and data-based estimate
As the sample size increases, the contribution of the prior diminishes
3. The Normal-normal model
Parameters with normal priors and likelihoods
Continuing with a single parameter model
The likelihood of the parameter given some data is that of a normal distribution with known variance \(\sigma^2\)
\[ y \mid \theta \sim N(\theta, \sigma^2) \]
suppose the prior for \(\theta\), the mean, follows a normal distribution
\[ \theta \sim N(\mu_0, V_0)\]
We want to make inference about \(\theta\) given the data y, what is the posterior distribution \(P(\theta \mid y)\)?
Normal-normal model
1. The normal model (likelihood) of one data point
Consider a single observation \(y\) from a normal distribution parameterized by a mean \(\theta\) and variance \(\sigma^2\), we assume that \(\sigma^2\) is known.
The normal likelihood for \(y\) given \(\theta\) is \[ P(y\mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta)^2}{2\sigma^2}}, y \in (-\infty, \infty)\]
The \(E(y) = \theta\) and \(V(y) = \sigma^2\) and \(SD(y) = \sigma\).
Given \(y \mid \theta \sim N(\theta,\sigma^2 )\), \(z = \frac{y-\theta}{\sigma} \sim N(0,1)\). Thus, if we rescale \(y\) by \(\sigma\), we have
95%CI of y: \(\theta \pm 1.96 \times \sigma\)
2. Prior
We can often use a normal prior to represent \(\theta \sim N(\mu_0, V_0)\)
\(\mu_0\) is the prior mean
\(V_0\) is the prior variance, which expresses the uncertainty around the mean \(\mu_0\)
\(\tau_0 = \frac{1}{V_0}\) is called precision, it can also be used to express the uncertainty around the mean, e.g. \(\theta \sim N(\mu_0, \tau_0)\)
High precision = low variance, in other words, the distribution of the normal random variable \(\theta\) is closely centered around its mean.
3. Posterior
With a normal prior and a normal likelihood of the data given parameter \(\theta\), (here we use precision to express the normal distributions)
\[ \theta \sim N(\mu_0,\tau_0 ) \text{ and } y \mid \theta \sim N(\theta,\tau_y )\]
where \(\tau_y = \frac{1}{\sigma^2}\) is the precision of the observed data
It can be proven mathematically using the Bayes’ rule that the posterior distribution given a normal prior and normal likelihood is also normal,
The posterior mean is a weighted sum of the prior mean and the observed value.
The weights are the relative precisions of the prior and the likelihood \[\mu_1 = w_0 \times \mu_0 + w_y \times y\]\[w_0 = \frac{\tau_0}{\tau_0 + \tau_y}\] and
\[w_y = 1-w_0 = \frac{\tau_y}{\tau_0 + \tau_y}\]
Data overwhelming the prior and vice versa
With relatively low prior precision (imprecise prior and substantial data), i.e., \(\tau_0 \ll \tau_y\) and \(0 \approx w_0 \ll w_y \approx 1\),
In most cases, the posterior mean is a compromise between what we believed before and what we observed in the data.
Normal model with multiple observations
Consider a set of \(n\) observations \(y = (y_1, \ldots, y_n)\) sample from a \(N(\theta, \sigma^2)\), where \(y_1,\ldots, y_n\) are identically, independently distributed and \(y_i \sim N(\theta, \sigma^2)\), \(i = 1, \ldots, n\).
The normal likelihood of the joint distribution \(y = (y_1, \ldots, y_n)\) is
\(\bar{y} = \frac{\sum_{i=1}^2y_i}{n}\) is the sample mean and \(\tau_y = \frac{1}{\sigma^2}\)
Again, the posterior mean is a weighted sum between prior and data. As sample size \(n \rightarrow \infty\), \(\Big( \frac{\tau_0}{\tau_0+ n\tau_y} \Big) \rightarrow 0\) and \(\Big( \frac{n\tau_y}{\tau_0+ n\tau_y} \Big) \rightarrow 1\), data will overwhelm the prior and dominate the posterior distribution.
The use of normal-normal model in clinical applications
Consider estimating a parameter \(\theta\), e.g. mean, log-hazard ratio, log-odds-ratio, log-relative-risk
Suppose we have an estimate of \(\theta\); We will call it \(y\) and let \(\hat{\sigma}^2_y\) be the estimated variance
This estimate \(y\) can be viewed as a single data observation of \(\theta\)!
In large samples, it is approximately true that \(y \sim N(\theta, \sigma^2_y)\), where \(\theta\) is the true value and \(\sigma^2_y\) is the true variance (sample size dependent) and the 95%CI of \(y\) is \(\theta \pm 1.96 \times \sigma\).
We can construct a normal likelihood of data for the parameter \(\theta\) as \(y \mid N(\theta, \hat{\sigma}^2_y)\)
Working exmaple - Treatment effect estimate on the mean
For this example, we will simulate data for a small study of n=100 subjects and randomized 50% to treat=0 and 50% to treat=1 (mimicking an RCT). We will then generate the outcome Y from a normal distribution depending on the treatment indicator.
Suppose we have the following table of results from a linear regression estimating treatment effect \(\theta\) on the outcome
Code
n <-100set.seed(1234)treat <-1*(runif(n, min =0, max =1)<0.5)y <-5*treat+rnorm(n, mean =0, sd =10)fit<-lm(y~treat)summary(fit)
Call:
lm(formula = y ~ treat)
Residuals:
Min 1Q Median 3Q Max
-19.8662 -6.7095 -0.5731 5.7503 24.1866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.303 1.423 0.916 0.3618
treat 4.080 1.918 2.127 0.0359 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.543 on 98 degrees of freedom
Multiple R-squared: 0.04414, Adjusted R-squared: 0.03438
F-statistic: 4.525 on 1 and 98 DF, p-value: 0.03591
Code
round(confint.default(fit, 'treat', level=0.95),3) # based on asymptotic normality
2.5 % 97.5 %
treat 0.321 7.84
We are interested in using this published results to make Bayesian inference on \(\theta\)
Our observed data (estimate) of \(\theta\), denoted as \(y\) is 4.08 with standard error 1.918. Here, \(y\) plays the role of data! We call \(y\) the observed datum.
Relying on the large sample approximation, we assume that the estimate y arose from a normal distribution with true mean \(\theta\), We treat the observed standard error\(\hat{\sigma}_y = 1.918\) as the true standard deviation of \(y\)
Recall, the standard deviation of the sampling distribution of a parameter estimate is called its standard error
This gives us the likelihood \[y \mid \theta \sim N(\theta, 1.918^2)\]
The width of the 95% CI is approximately equal to \(2 \times 1.96 \times \sigma_y\)
Suppose we are not provided with \(\sigma_y\) from the table of results, but know the 95% CI of y is (0.321,7.84), we can calculate \(\sigma_y\) with
Working example - Bayesian reanalysis of RCT - Odds Ratio
(frequentist) Effect of a Resuscitation Strategy Targeting Peripheral Perfusion Status vs Serum Lactate Levels on 28-Day Mortality Among Patients With Septic Shock The ANDROMEDA-SHOCK Randomized Clinical Trial (Hernández et al. 2019)
(Bayesian reanalysis) Effects of a Resuscitation Strategy Targeting Peripheral Perfusion Status versus Serum Lactate Levels among Patients with Septic Shock. A Bayesian Reanalysis of the ANDROMEDA-SHOCK Trial (Zampieri et al. 2020)
We are interested to make inference on the odds ratio of peripheral perfusion versus Lactate level on 28-Day mortality among patients with septic shock.
We obtain a point estimate of OR and it’s 95%CI from (Hernández et al. 2019) give the following 2x2 table.
\[95\% CI = exp(log(\hat{OR} - 1.96\times SE\{ln(\hat{OR})\}) \text{ to } exp(log(\hat{OR} + 1.96\times SE\{ln(\hat{OR})\}) = (0.473, 1.035)\]
Intervention
Death within 28d
Alive by 28d
Total
Peripheral Perfusion-Targeted Resuscitation
74
138
212
Lactate Level-Targeted Resuscitation
92
120
212
Total
166
258
424
We treat this as a data evidence!
1. The normal likelihood on log(OR)
The estimated OR on 28-day mortality, 0.699 (0.473 to 1.035), is called the observed datum (viewed as a single data observation on \(\theta\), denoted as \(y\))
It’s known that log of the parameter estimate, such as HR, RR and OR, follow a normal distribution around its true log value.
Let \(\theta\) denote the random variable representing log(OR)
Let \(y\) denote an observed data point of \(\theta\), we have \(y \sim N(\theta, \sigma_y^2)\), where \(\theta\) is log(OR) and \(\sigma_y\) is the standard error of the log(OR). \(\sigma_y=0.2\) from the result table.
If we don’t have \(\sigma_y^2\) but only the reported 95% CI of the log of OR, we can take the width of the 95% CI and divided by \(2\times 1.96\) to obtain \(\sigma_y^2\). For example, we have
Given the requentist analysis results, we have a data likelihood (evidence) \(y \mid \theta \sim N(\theta,0.2^2)\) where we have observed \(y=log(0.699)=-0.358\)
2. The prior for log(OR)
From (Zampieri et al. 2020), “a neutral prior in which the log[OR] of mean of 0 and an SD of 0.5, corresponding to an OR of 1 (95% credible interval, 0.37-2.75)”
Prior OR beliefs on the effectiveness of peripheral perfusion on 28 days mortality from the Bayesian reanalysis paper
Suppose we consider a neutral prior for log(OR) from the reanalysis paper (Zampieri et al. 2020)
The neutral prior is \(log(OR) \sim N(log(1), 0.5^2)\)
3. The posterior of log(OR)
Given prior \(\theta \sim N(0, 0.5^2)\) and likelihood \(y \mid \theta \sim (\theta,0.2^2)\) where \(y=log(0.699)=-0.358\)
Prior precision is \(\frac{1}{0.5^2} = 4\) and the likelihood precision is \(\frac{1}{0.2^2} = 25\), we can see that the likelihood has more information than the prior!
The posterior precision on log(OR) is \(\tau_1 = 4 + 25=29\)
The posterior variance of \(\log(OR) = \frac{1}{29}\) and the posterior standard deviation is \(\sqrt{\frac{1}{29}} = 0.186\), thus
\[\theta \mid y \sim N(\mu_1 = -0.309, \sigma_1^2 = 0.186^2)\]
The posterior mean OR = exp(-0.309) = 0.734 and the posterior 95% Credible Interval for OR is \[\exp(-0.309 \pm 1.96\times 0.186) \rightarrow (0.51, 1.057) \]
Code
#prior on log(OR);mu0<-0sd0<-0.5prec0 <-1/sd0^2#likelihood on log(OR);ybar<--0.358sd.ybar<-0.2prec.y <-1/sd.ybar^2#posterior on log(OR)prec1 <- prec0 + prec.ymu1 <- (prec0 * mu0 + prec.y*ybar)/prec1sigma1 <-1/sqrt(prec1)mu.plot <-seq(ybar-1.5,ybar+2,length.out=200)lik <-dnorm(ybar, mu.plot, sd.ybar)prior <-dnorm(mu.plot, mu0, sd0)posterior<-dnorm(mu.plot, mu1,sigma1)d<-data.frame(mu.plot=rep(mu.plot, 3),p =c(lik,prior,posterior),Function=rep(c("likelihood","prior","posterior"),each=length(mu.plot)))p1<-ggplot(d, aes(mu.plot, p,colour=Function))+geom_line(size =1)+xlab("log(OR)")+ylab("Probability Density")+scale_colour_manual(values=c("goldenrod","steelblue","black"))+theme_bw()+ggtitle("Neural prior on log(OR): plot on log(OR) scale")p2<-ggplot(d, aes(exp(mu.plot), p,colour=Function))+geom_line(size =1)+xlab("OR")+ylab("Probability Density")+scale_colour_manual(values=c("goldenrod","steelblue","black"))+theme_bw()+ggtitle("Neural prior on log(OR): plot on OR scale")ggarrange(p1,p2, nrow=2)
Posterior summary statistics from R
Code
thresholds<-c(1,0.8)res <-exp( qnorm(c(0.5, 0.025, 0.975), mu1, sigma1)) res <-c(res, pnorm(log(thresholds), mu1, sigma1))names(res) <-c("median","q2.5","q97.5",paste0("Pr(OR < ",thresholds,")")) res
There are a small number of prior-likelihood pairs where the prior and the posterior are in the same family (e.g., both beta, both normal): these are called conjugate models
These posterior distributions can be computed without specialized software
These examples are useful for illustrating how Bayesian methods combine prior information with data
They have only limited practical usefulness - limits on types priors, limits on number of parameters
They are useful teaching tools, soon we will see how we can extend beyond these simple models
4. Why Bayesian: Statistical Estimation Revisit
4.1 Classic Frequentist Inference
Many common intuitive data-based estimators are maximum likelihood estimators
sample mean, \(\bar{x}\)
sample proportion, \(\hat{p}\)
rate of an event, \(\hat{\lambda} = \frac{\text{Number Events}}{\text{Follow-up Time}}\)
least square estimators from linear regression model
Finds the MLE, the parameter value that is most likely, given the observed data.
Finds how much curvature there is near this maximum. This leads to estimates of standard errors and CIs.
The 95% CI refers to the frequency with which hypothetical future random intervals from repeated sampling constructed in the same way will include the true value
A particular interval will either include the value or not.
It does not mean that the interval contains the true value with 0.95 probability, i.e. we do not have \(P(\theta \mid data)\)
Hypothesis Testing in a nutshell
Hypothesis testing is a procedure for determining whether a hypothesis is valid or not, based on sample evidence
Set up two hypotheses: Null (\(H_0\)) vs Alternative (\(H_A\))
\(H_0\), Specifies population parameter of interest & proposes baseline value for it. Null presumed true unless disproved
\(H_A\), Contains value of parameter different from H0; typically what you want to show
Specify appropriate model (find test statistic)
Z test, T test, chi-square test etc
under the assumption that the model is correct
Mechanics (calculate p-value)
We want to compare data to what we would expect to see if \(H_0\) was true and ask how likely it is to get these data, if \(H_0\) was true
P-value is used to quantify evidence against \(H_0\); The probability of getting equally or more extreme data (i.e. test statistic) under repeated sampling if \(H_0\) is true
\[ P_{-value} = P\Bigg( \begin{matrix}
\text{observed or more} \\
\text{extreme test statistics}
\end{matrix} \mid H_0 \ is \ true \Bigg) \]
P-value tells how “surprising” are data you observed under \(H_0\)
Arrive at conclusion: Reject \(H_0\) or not
Decisions about rejecting H0 or not are usually done by computing a p-value and comparing to \(\alpha\), significance level
significance level, an arbitrary threshold for P-value to define “rare/extreme event”. If P-value falls below this level, we conclude results are statistically significance and we reject the null
Hypothesis Tests Errors
Type I error (\(\alpha\)): \(H_0\) is true, but we reject it \(\alpha = P(Reject \ H_0 | H_0 \ is \ true)\)
We commonly control for Type I error - “innocent until proven guilty”
Type II error (\(\beta\)): \(H_0\) is false, but we fail reject it \(\beta = P(Fail \ to \ Reject \ H_0 | H_0 \ is \ false)\)
A test’s ability to detect a false null hypothesis is called the power of the test, \(1-\beta\)
For any test, there is trade-off between \(\alpha\) and \(\beta\) (e.g., for fixed n, decreasing\(\alpha\) increases \(\beta\), vice-versa )
By its definition, P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
“A conclusion does not immediately become true on one side of the divide and false on the other.”
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
“By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”
“For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.”
absence of evidence is not evidence of absence
Inference is performed based on data that have not been observed, under an assumption that is almost certain not to be true
Would you be willing to bet that the true probability a coin comes up heads is exactly P(head)=0.5, or that the exact true adverse rate is fixed at 0.2?
4.2 Bayesian Inference
In Bayesian statistics, inference is done using the posterior distribution
Given a posterior distribution, we can calculate any summary of it we like
\(P(0.008 < \theta < 0.012)\), \(P( \theta < 0.1)\), interval with 95% probability of containing \(\theta\)
We can use quantities to summarize the posterior distribution (mean, median, sd, IQR)
We can plot the density function and shade areas of interest
Posterior uncertainty
Posterior variance is used to characterize the posterior uncertainty
In most cases, it is smaller than the prior variance
In some cases, for some models, the opposite is true, when prior is in direct conflict with the data
Credible intervals play the role of the confidence intervals
For the 95% credible interval \([\alpha, \beta]\), \(P(\alpha< \theta <\beta \mid y)\) = 0.95
Posterior predictive distributions
Often we are interested in extending our (Bayesian) analysis using priors and data, and predicting outcomes in data observed in the future
e.g., predicting number of successes in a future trial for another 25 new observations
Essentially we want the probability density \(P(y^* \mid y)\), where with \(y^*\) we denote the new data
This probability is not always possible to compute directly, it does not correspond to some known distribution
We will see that these kinds of questions can be answered using Markov Chain Monte Carlo method and brms (coming up in a couple of weeks)
Posterior predictive distribution
Let \(y^*\) denote a new data, the posterior predictive distribution for \(y^*\) is
The posterior predictive distribution as an average of conditional predictions over the posterior distribution of \(\theta\)
We assume conditional independence of \(y^*\) and \(y\) given \(\theta\)
In the beta-binomial adverse event example, let \(y^*\) denote the result of a new trial of 6 patients given posterior \(\theta \mid y \sim Beta(1,7)\),
likelihood<-length(7)for (i in1:7) {integrand_pred <-function(theta) { (choose(6, i-1)*theta^(i-1)*(1-theta)^(6-i+1))*((gamma(8)/gamma(1)/gamma(7))*(1-theta)^6)}likelihood[i] <-integrate(integrand_pred,lower =0, upper =1)$value }posterior_pred<-data.frame(y_star=0:6, likelihood = likelihood)knitr::kable(posterior_pred, caption ="Posterior predictive distribution for a new trial of 6 patients")
Posterior predictive distribution for a new trial of 6 patients
y_star
likelihood
0
0.5384615
1
0.2692308
2
0.1223776
3
0.0489510
4
0.0163170
5
0.0040793
6
0.0005828
Code
sum(posterior_pred$likelihood)
[1] 1
Why Bayesian? Practical Reasons
Apart from making more sense to use then P-value, the modern Bayesian approach offers some practical advantages to data analysis
Prior information
Direct probability statements
Subjectivity acknowledged and formally represented
Prior information
Science rarely proceeds in the absence of any prior information or opinion
Bayesian reasoning allows the formal incorporation of this opinion
The classical approach introduces prior information and opinion in study planning and, informally, at the discussion stage of results - high risk of inconsistency
In Bayesian approach:
For small sample sizes, a prior will “rein in” imprecise data-based estimates
For large sample sizes, most priors are overwhelmed by the likelihood (data)
Probability Statements
A result of a Bayesian analysis is a posterior distribution, \(P(\theta \mid data)\), for an unknown parameter (e.g. an odds ratio \(\theta\)) given the data
We can compute probabilities directly from the posterior distribution
this is the area under the posterior density \(P(\theta \mid data)\) between the value of 0 and 1 for \(\theta\)
this is what we prefer p-value to be!
we can define \(\theta < 0.9\) as a clinically important effect, then we calculate the probability of \(P(\theta < 0.9 \mid data )\). No need to choose a “significance” level, we can directly use this probably to make clinical decisions!
Summary
Practical advantages of Bayesian approach
Direct probability statements
Incorporation of external evidence
Complex models can be built step-by-step
Makes you understand the analysis better
Disadvantages including
Learning how to code
Need to know more about distributions
Require good probability and statistical model knowledge
More time consuming (computationally more expensive with complex model and multiple parameters)
References
Hernández, Glenn, Gustavo A Ospina-Tascón, Lucas Petri Damiani, Elisa Estenssoro, Arnaldo Dubin, Javier Hurtado, Gilberto Friedman, et al. 2019. “Effect of a Resuscitation Strategy Targeting Peripheral Perfusion Status Vs Serum Lactate Levels on 28-Day Mortality Among Patients with Septic Shock: The ANDROMEDA-SHOCK Randomized Clinical Trial.”Jama 321 (7): 654–64.
Wasserstein, Ronald L, and Nicole A Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.”The American Statistician. Taylor & Francis.
Zampieri, Fernando G, Lucas P Damiani, Jan Bakker, Gustavo A Ospina-Tascón, Ricardo Castro, Alexandre B Cavalcanti, and Glenn Hernandez. 2020. “Effects of a Resuscitation Strategy Targeting Peripheral Perfusion Status Versus Serum Lactate Levels Among Patients with Septic Shock. A Bayesian Reanalysis of the ANDROMEDA-SHOCK Trial.”American Journal of Respiratory and Critical Care Medicine 201 (4): 423–29.