using one or more predictor variables to model the distribution of one or more outcome variables
For a continuous outcome, we typically fit a linear regression model.
Of course, the relationship between outcome and predictors can be non-linear; in that case, we would consider fitting polynomial regression models or splines.
For a categorical outcome, we fit a generalized linear model. We will cover this topic in future sessions.
For a repeatedly measured outcome, we can fit a linear mixed-effect model (continuous outcome) or a generalized linear mixed-effect model (categorical outcome).
1.1 Normal Models and Linear Regression
Given its mean and variance, each observation has a normal distribution: \(Y_i \mid \mu_i, \sigma^2 \sim N(\mu_i, \sigma^2)\)
We do not assume the collection of \(Y_i, i=1, \ldots, n\) have a normal distribution
Instead we assume the error term is normally distributed - a weaker assumption!
In the case of multiple predictors, \(\mu_i\) becomes a weighted sum of the predictors \(x_{ij}\), with the regression coefficients \(\beta_j\) as weights. For example, for two covariates we have \[\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\]
Linear relationship We can check this assumption by examining marginal plots comparing the model predicted relationship between outcome and each continuous predictor and by examining the residual plot.
Normality of the residuals The residuals should be approximately normally distributed, which we can check with a Q-Q plot or a density plot of the residuals.
Homoscedasticity Homoscedasticity means that the residual variance is constant across the fitted values of the dependent variable.
Multicollinearity Multicollinearity is the phenomenon in which two or more of the explanatory variables are strongly correlated.
Correctly specified regression model This means that all relevant predictors of the response variable have been included in the model. This is often difficult to verify, but we can use the posterior predictive distribution to check the regression fit, as sketched below.
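A sketch of checking some of these assumptions in brms, assuming a fitted model object named fit (a hypothetical placeholder; any brmsfit works):
Code
# posterior predictive check: draws from the fitted model vs. observed data
pp_check(fit, ndraws = 100)
# residuals vs. fitted values: look for non-linearity and non-constant spread
res <- residuals(fit)[, "Estimate"]
fitted_vals <- fitted(fit)[, "Estimate"]
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# rough check of residual normality
qqnorm(res); qqline(res)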
1.2 One-sample and two-sample normal regression
One-sample
If all observations have the same mean and variance, the likelihood is
\[ Y_i \mid \mu, \sigma^2 \sim N(\mu, \sigma^2)\]
Good candidate priors for \(\mu\) and \(\sigma\) are
\(\mu \sim N(\text{sample mean}, \text{sample sd})\) or Student-T distributions (to handle heavy tails)
We specify our model with the following likelihood and priors: \[\text{height}_i \sim N(\mu, \sigma^2)\] \[\mu \sim N(138, 28)\] \[\sigma \sim \text{half-t}(0, 28, \mathrm{df} = 3)\]
Prior distributions
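As a rough sketch, we can plot these prior densities in base R; the half-t density is written as twice the positive half of a Student-t, using dstudent_t() from brms:
Code
# prior for mu: Normal(mean = 138, sd = 28)
curve(dnorm(x, mean = 138, sd = 28), from = 40, to = 240,
      xlab = expression(mu), ylab = "density")
# prior for sigma: half-t(0, 28, df = 3), a t density folded at zero
curve(2 * brms::dstudent_t(x, df = 3, mu = 0, sigma = 28), from = 0, to = 120,
      xlab = expression(sigma), ylab = "density")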
We now fit this model in brms
Code
fit0 <- brm(
  data = dat,
  family = gaussian,
  height ~ 1,
  prior = c(prior(normal(138, 28), class = Intercept),
            prior(student_t(3, 0, 28), class = sigma)),
  iter = 10000, warmup = 5000, chains = 4, cores = 4,  # instructions on MCMC
  seed = 123,   # random number seed to make results reproducible
  silent = 2
)
# saveRDS(fit0, "data/chp4_onesample_example")
Summarizing posterior results
summary(fit0)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: height ~ 1
Data: dat (Number of observations: 544)
Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup draws = 20000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 138.26 1.18 135.97 140.57 1.00 16100 12949
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 27.65 0.85 26.03 29.40 1.00 16274 12070
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Geweke (1992) proposed a convergence diagnostic for Markov chains based on a test for equality of the means of the first and last part of a Markov chain
by default the first 10% and the last 50%
the test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error
evaluate the test statistic against 1.96 (often rounded up to 2 in application).
We use the Geweke diagnostic (Geweke et al. 1991) and R-hat to check for MCMC convergence
Rhat equals 1 for both the intercept and sigma, indicating good convergence
similarly, the Geweke statistics have absolute values below 2, indicating good convergence
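As a sketch, the Geweke statistics can be computed per chain with the coda package (an assumption; it is not loaded by brms), together with the posterior package for extracting draws:
Code
library(coda)
library(posterior)
# post-warmup draws of chain 1, as an iterations x variables matrix
chain1 <- as_draws_matrix(subset_draws(as_draws_array(fit0), chain = 1))
# Geweke Z-scores, using the defaults: first 10% vs. last 50% of the chain;
# |Z| < 2 for every parameter suggests no evidence against convergence
geweke.diag(mcmc(as.matrix(chain1)), frac1 = 0.1, frac2 = 0.5)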
We can use the hypothesis() function to test, for example, whether the mean height exceeds 140:
hypothesis(fit0, "Intercept > 140")
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (Intercept)-(140) > 0 -1.74 1.18 -4.03 0.57 0.08
Post.Prob Star
1 0.07
---
'CI': 95%-CI for one-sided and 97.5%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 97.5%;
for two-sided hypotheses, the value tested against lies outside the 97.5%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
# proportion of posterior samples of the mean (intercept)
# with value > 140, combining 4 chains
modelposterior2 <- posterior_samples(fit0)
sum(modelposterior2[, 1] > 140) / length(modelposterior2[, 1])
[1] 0.0709
Two-sample
Simple extension of the one-sample problem
Observations within each group have the same mean and variance, yielding the following likelihood \[ Y_{ji} \mid \mu_j, \sigma^2_j \sim N(\mu_j, \sigma^2_j)\]
Similar candidate priors as in the one-sample model
using normal or Student-t distributions and observed summary statistics
within-group variance should be smaller than the variance when we ignore groups
We can assume equal variances \(\sigma_1^2 = \sigma_2^2\) or allow them to be estimated separately
These two approaches are analogous to the equal-variance and unequal-variance t-tests
In brms, this is easy to specify
Modelling hours of sleep, comparing two drugs
Data showing the effect of two medications (within-patient increase in hours of sleep) on groups of 10 patients each.
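For reference, the classical counterparts can be run on the same data (sleep is the built-in R dataset used here); this is only a frequentist sanity check, not part of the Bayesian workflow:
Code
data(sleep)
# pooled (equal-variance) t-test, analogous to the equal-variance model
t.test(extra ~ group, data = sleep, var.equal = TRUE)
# Welch (unequal-variance) t-test, analogous to the unequal-variance model
t.test(extra ~ group, data = sleep, var.equal = FALSE)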
Fitting two-sample model in brms using default brms prior
student_t(3, 1, 2.5) for mean (same for each group)
student_t(3, 0, 2.5) for sigma (same for each group for unequal variance model)
fit_eq <- brm(
  data = sleep,
  family = gaussian,
  extra ~ group,  # mean depends on group; one sigma
  prior = c(prior(normal(0, 10), class = Intercept),
            prior(normal(0, 10), class = b),
            prior(student_t(3, 0, 2), class = sigma)),
  iter = 10000, warmup = 5000, chains = 4, cores = 4,  # instructions on MCMC
  seed = 123,   # random number seed to make results reproducible
  silent = 2, refresh = 0
)

fit_uneq <- brm(
  data = sleep,
  family = gaussian,
  bf(extra ~ group,   # mean depends on group
     sigma ~ group),  # sigma differs by group
  prior = c(prior(normal(0, 10), class = Intercept),  # intercept of the extra ~ group model
            prior(normal(0, 10), class = b),          # coefficients of the extra ~ group model
            prior(student_t(3, 0, 2), class = Intercept, dpar = "sigma"),  # intercept of the sigma ~ group model
            prior(student_t(3, 0, 2), class = b, dpar = "sigma")),         # coefficients of the sigma ~ group model
  iter = 10000, warmup = 5000, chains = 4, cores = 4,  # instructions on MCMC
  seed = 123,   # random number seed to make results reproducible
  silent = 2, refresh = 0
)

# saveRDS(fit_eq, "data/chp4_twosample_eq")
# saveRDS(fit_uneq, "data/chp4_twosample_uneq")
Code
prior_summary(fit_eq)
 prior class coef group resp dpar nlpar lb ub source
normal(0, 10) b user
normal(0, 10) b group (vectorized)
normal(0, 10) Intercept user
student_t(3, 0, 2) sigma 0 user
Code
prior_summary(fit_uneq)
prior class coef group resp dpar nlpar lb ub source
normal(0, 10) b user
normal(0, 10) b group (vectorized)
student_t(3, 0, 2) b sigma user
student_t(3, 0, 2) b group sigma (vectorized)
normal(0, 10) Intercept user
student_t(3, 0, 2) Intercept sigma user
Summarizing posterior results from the equal variance model
summary(fit_eq)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: extra ~ group
Data: sleep (Number of observations: 20)
Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup draws = 20000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.83 1.42 -3.61 1.99 1.00 16038 13105
group 1.58 0.90 -0.21 3.34 1.00 15983 12465
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.98 0.34 1.44 2.78 1.00 14393 12802
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Question: What is the probability that there is a greater increase in sleep with drug A (under the equal-variance model)?
samples <- posterior_samples(fit_eq)
difference <- samples[, "b_group"]
# summarize the difference
c(Prob = mean(difference > 0),
  Mean = mean(difference),
  lowCrI = quantile(difference, 0.025),
  highCrI = quantile(difference, 1 - 0.025))
Prob Mean lowCrI.2.5% highCrI.97.5%
0.9605000 1.5764300 -0.2065725 3.3406688
# we can also use the hypothesis function to evaluate this
hypothesis(fit_eq, "group > 0", alpha = 0.025)
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (group) > 0 1.58 0.9 -0.21 3.34 24.32 0.96
---
'CI': 95%-CI for one-sided and 97.5%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 97.5%;
for two-sided hypotheses, the value tested against lies outside the 97.5%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
Summarizing posterior results from the unequal variance model and evaluating the same hypothesis
summary(fit_uneq)
Family: gaussian
Links: mu = identity; sigma = log
Formula: extra ~ group
sigma ~ group
Data: sleep (Number of observations: 20)
Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup draws = 20000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.80 1.45 -3.60 2.12 1.00 15762 12489
sigma_Intercept 0.52 0.54 -0.49 1.64 1.00 15749 12768
group 1.56 0.95 -0.33 3.41 1.00 15114 12462
sigma_group 0.11 0.34 -0.55 0.79 1.00 15262 13948
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
hypothesis(fit_uneq, "group > 0", alpha = 0.025)
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (group) > 0 1.56 0.95 -0.33 3.41 19.2 0.95
---
'CI': 95%-CI for one-sided and 97.5%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 97.5%;
for two-sided hypotheses, the value tested against lies outside the 97.5%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
How do we determine whether to assume equal variances or not?
We are essentially testing model fit here.
Approach 1: visually examine the conditional means by group.
Approach 2: compare models using information criteria (WAIC) or leave-one-out cross-validation (LOO) (Vehtari, Gelman, and Gabry 2017)
Both approaches favor the equal-variance model!
Code
plot(conditional_effects(fit_uneq), points = TRUE)
Code
waic1 <- waic(fit_eq)
waic2 <- waic(fit_uneq)
compare_ic(waic1, waic2)  # equal-variance model preferred
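The same comparison can be done with LOO; a sketch using loo_compare(), the interface in current brms, whose output contains the elpd_diff and se_diff columns discussed next:
Code
loo1 <- loo(fit_eq)
loo2 <- loo(fit_uneq)
loo_compare(loo1, loo2)  # first row = model with the largest ELPD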
elpd_loo is an estimate of the expected log predictive density (ELPD), i.e., an estimate of the model's predictive performance on new data; we can use it to compare models even from different model families.
elpd_diff and se_diff (standard error of elpd_diff) columns are computed by making pairwise comparisons between each model and the model with the largest theoretical ELPD - the model in the first row
In general, larger ELPD values indicate better estimated predictive performance.
elpd_diff column will always have the value 0 in the first row (i.e., the difference between the preferred model and itself) and negative values in subsequent rows for the remaining models.
The se_diff calculation uses a normal approximation. Because cross-validation folds are not independent, the SE tends to be underestimated, especially if the number of observations is small or the models are badly misspecified.
When the number of observations is larger than 100 and the model is not badly misspecified, the normal approximation and SE are a quite reliable description of the uncertainty in the difference.
If the elpd difference is less than 4, the difference between the two models is small: they have very similar predictive performance, and it does not matter much if the normal approximation fails or the SE is underestimated.
If the elpd difference is larger than 4, compare it to the standard error of elpd_diff. If elpd_diff is more than about 3 times se_diff, the model with the larger ELPD is the better one.
In this example fit_eq has the larger ELPD. However, the elpd difference is less than 4 and is small relative to se_diff, indicating no major difference between the two models under LOO and ELPD.
1.3 Simple linear regression
The two-sample model is a special case of simple linear regression
\[ \mu_i = \beta_0 + \beta_1 x_i \]
Where \(x_i = 0\) for group 1 and \(x_i = 1\) for group 2
For group 1, \(\mu_i = \beta_0\), the intercept
For group 2, \(\mu_i = \beta_0 + \beta_1\)
brms takes care of creating dummy variables!
but be aware of what the reference group is and what the coefficients mean
levels will be by default in alphabetical order - use the factor() statement to make a variable with the desired order
where the \(x_{ij}\) can be continuous variables or “dummy” variables; the usual assumption on the variance is \(\sigma_i^2 = \sigma^2\) for all subjects (homoscedasticity, the equal-variance assumption)
We have seen that it is technically easy to relax this assumption in a Bayesian model (recall the two-sample unequal-variance example)!
Predictions
We can get predictions for the mean value at a given value of the predictors. For example, in the simple linear regression model we have
\[E(Y\mid X=x) = \beta_0 + \beta_1 x\]
but recall that both \(\beta_0\) and \(\beta_1\) have posterior distributions
For each posterior draw \((\beta_0^s, \beta_1^s)\), \(s = 1, \ldots, S\), we have a different predicted mean
\[E^{(s)}(Y\mid X=x) = \beta_0^s + \beta_1^s x\]
The set of these predictions across all posterior draws gives the posterior distribution of the predicted mean
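A sketch in brms, assuming a fitted simple regression named fit with a single predictor x (both hypothetical names): posterior_epred() returns one predicted mean per posterior draw.
Code
# posterior draws of E(Y | X = x) at chosen values of the predictor
newdata <- data.frame(x = c(10, 20, 30))
epred <- posterior_epred(fit, newdata = newdata)  # S draws x 3 columns
# posterior summary of the predicted mean at x = 10
quantile(epred[, 1], probs = c(0.025, 0.5, 0.975))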
Modelling height - the Howell data revisited
Let’s descriptively examine the height distribution by weight, age, and sex
We can see that the relationship between age and height is not linear
We consider the following prior distributions (diffuse normal priors on the regression parameters and a Gamma(1, 0.01) prior, equivalently an exponential(0.01) prior, for \(\sigma\)): \[\beta_0 \sim N(0, 100)\] \[\beta_j \sim N(0, 10), \quad j = 1, 2, 3, 4\] \[\sigma \sim \text{Gamma}(1, 0.01)\]
Prior distributions
We now fit this model in brms
dat <- dat %>%
  mutate(male = factor(male, labels = c("female", "male")))

fit1 <- brm(
  data = dat,
  family = gaussian,
  height ~ weight + age + I(age^2) + male,
  prior = c(prior(normal(0, 100), class = Intercept),
            # set for all "b" coefficients
            prior(normal(0, 10), class = b),
            prior(gamma(1, 0.01), class = sigma)),
  iter = 20000, warmup = 18000, chains = 4, cores = 4,  # instructions on MCMC
  seed = 123,   # random number seed to make results reproducible
  silent = 2, refresh = 0
)
# saveRDS(fit1, "data/chp4_reg_example")
Summarizing posterior results
Family: gaussian
Links: mu = identity; sigma = identity
Formula: height ~ weight + age + I(age^2) + male
Data: dat (Number of observations: 544)
Draws: 4 chains, each with iter = 20000; warmup = 18000; thin = 1;
total post-warmup draws = 8000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 75.51 0.98 73.56 77.42 1.00 9141 6972
weight 1.26 0.06 1.15 1.38 1.00 2966 4106
age 1.06 0.12 0.83 1.28 1.00 2765 3747
IageE2 -0.01 0.00 -0.01 -0.01 1.00 2902 3909
malemale 1.85 0.79 0.28 3.37 1.00 5456 5016
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 8.70 0.27 8.20 9.24 1.00 6769 5267
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Using the posterior mean, we have the following regression line
Comparing males and females of the same weight and age, the expected difference in height is estimated at 1.85 cm with 95% CI [0.28, 3.37].
What is the posterior evidence of a positive association between height and male sex?
Code
hypothesis(fit1, 'malemale > 0', alpha = 0.025)
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (malemale) > 0 1.85 0.79 0.28 3.37 90.95 0.99 *
---
'CI': 95%-CI for one-sided and 97.5%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 97.5%;
for two-sided hypotheses, the value tested against lies outside the 97.5%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
We can visualize the conditional effect of each continuous covariate on the response variable
Coefficient of determination \(R^2\) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
The estimated \(R^2\) is about 0.9. Really good! About 90% of the variance in height is explained by our model.
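In brms, the posterior distribution of \(R^2\) is available through the bayes_R2() function; a quick sketch:
Code
bayes_R2(fit1)  # posterior mean, SE, and 95% CrI of R^2
# full posterior draws of R^2, e.g. for a histogram
r2_draws <- bayes_R2(fit1, summary = FALSE)
hist(r2_draws, main = expression(R^2), xlab = "")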
If some coefficients are particularly strongly correlated, you may need to think about using a stronger prior or combining some predictors.
Code
pairs(fit1,
      off_diag_args = list(size = 0.5,    # point size
                           alpha = 0.25)) # transparency
1.5 Model comparison
Watanabe-Akaike Information Criteria (WAIC)
WAIC (Watanabe and Opper 2010) is an alternative to DIC for estimating the expected log pointwise predictive density
The smaller the better!
WAIC incorporates prior information, and the use of point-wise likelihood makes it more robust when the posterior distributions deviate from normality.
In general, WAIC is a better estimate of the out-of-sample deviance than AIC and DIC.
In R, under the brms package, we can use the waic() function to obtain the WAIC.
Bayesian Leave-one-out cross-validation
The idea of cross-validation is to split the sample so that it imitates the scenario of estimating the parameters in part of the data and predicting the remaining part.
the data used for estimation are called the training set, and the data used for prediction are called the validation set.
Leave-one-out information criteria (LOO-IC) based on posterior predictive distribution
means that one uses \(n-1\) observations as the training set and 1 observation as the validation set, repeating the process \(n\) times so that a different observation is predicted each time
Because exact LOO-CV requires fitting the model \(n\) times, it is generally very computationally intensive; the loo() implementation avoids refitting by using the Pareto-smoothed importance-sampling approximation of Vehtari, Gelman, and Gabry (2017)
WAIC can be treated as a fast approximation to LOO-IC, although LOO-IC is more robust and is a better estimate of the out-of-sample deviance
In R, under the brms package, we can use the loo() function to obtain the LOO-IC.
Comparing models with \(weight^2\)
fitting the new model on height, adjusting for age, \(age^2\), sex, weight, and \(weight^2\)
comparing this model to the regression model adjusted for age, \(age^2\), sex, and weight using WAIC and LOO, as sketched after the code below.
Code
fit2 <- brm(
  data = dat,
  family = gaussian,
  height ~ weight + I(weight^2) + age + I(age^2) + male,
  prior = c(prior(normal(0, 100), class = Intercept),
            prior(normal(0, 10), class = b),
            prior(gamma(1, 0.01), class = sigma)),
  iter = 20000, warmup = 18000, chains = 4, cores = 4,  # instructions on MCMC
  seed = 123,   # random number seed to make results reproducible
  silent = 2, refresh = 0
)
saveRDS(fit2, "data/chp4_height2")
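And the comparison itself, a sketch assuming fit1 from the earlier section is still in the workspace:
Code
waic(fit1); waic(fit2)             # smaller WAIC is better
loo_compare(loo(fit1), loo(fit2))  # first row = larger ELPD (preferred)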
Geweke, John F., et al. 1991. “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments.” Federal Reserve Bank of Minneapolis.
McElreath, Richard. 2018. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman & Hall/CRC.
Student. 1908. “The Probable Error of a Mean.” Biometrika, 1–25.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.
Watanabe, Sumio, and Manfred Opper. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (12).