Tutorial 5

require(tidyverse)
library(brms)
library(bayesplot)
library(DT)
library(gtsummary)
library(knitr)

Hands-on exercise fitting regression models

1. How scientific are various disciplines

1.1 Study background

These data are collected from a survey study. Doctors across Canada were asked in a survey how many hours per week they spent on scientific research.

dat <- read.table("data/HowScientific.txt",header = T)

dat %>% datatable(
  rownames = FALSE,
  options = list(
 pageLength = 35,
    columnDefs = list(list(className = 'dt-center', 
                      targets = 0:3))))

We can easily rank the disciplines from highest to lowest based on the observed means:

dat <- dat[order(dat$Hours,decreasing = F),]
dat$Specialty <- factor(dat$Specialty, 
                      levels=dat$Specialty)
dat %>% ggplot(aes(y=Specialty,x=Hours))+
  geom_point()+
  geom_errorbar(aes(xmin=Hours-SD, xmax=Hours+SD), width=.2, colour = "red", position=position_dodge(0.05))+
  labs(title = paste("Research hour by specialty", "(Mean hours +/- SD)", sep = " "))+
  theme_bw()

1.2 Modelling ranking and data uncertainty

How sure are we of these rankings - there is uncertainty about the true value of the mean in each group - we know the group mean - although we do not have all the data in each group, we have the SD and n (this resembles a meta analysis!)

Objective of this analysis is to quantify the probability that a particular discipline is ``more scientific’’ than another discipline.

1.3 Model building

We can specify a model for each discipline, index by i.
The observed mean hours are sampled from a normal distribution with an unknown mean \(\mu_i\) and and known variance of the mean. Under the central limit theorem, the variance of mean is

\[var(\mu) = SE(\mu)^2 = SD^2/n\]

Thus, we have the following model/likelihood.

\[ Hours_i \sim N(\mu_i, sd^2_{i}/n_i)) \]

We will put a random effects prior on the mu_i. Can you see that \(\mu_0\) and \(\sigma^2\) as two hyper priors

\[ \mu_i \sim N(\mu_0, \sigma^2) \]

We are interest in estimating the values of \(\mu_i\) as these are the true unknown mean hours for each specialty, which we want to compare (rather than the observed hours).
This can be easily done in brms. The function has notation that allows us to supply a known SD.

1.4 Fitting a random intercept model (partical pooling)

We can fit a random intercept model, here we assume there is a common underlying research hours but there exists random variation across disciplines.
We can use the brms default priors.

dat$SE_mean <- dat$SD/sqrt(dat$n)
fit1 <- brm(data = dat, 
            family = gaussian,
            Hours | se(SE_mean) ~ 1 + (1 | Specialty),
            iter = 5000, 
            warmup = 4000, 
            cores = 4, 
            chains = 4,
            seed = 15)
# saveRDS(fit1, "data/tut4_fit1")

fit1 <- readRDS("data/tut4_fit1")
prior_summary(fit1)

                  prior     class      coef     group resp dpar nlpar lb ub
 student_t(3, 8.6, 3.9) Intercept                                          
   student_t(3, 0, 3.9)        sd                                      0   
   student_t(3, 0, 3.9)        sd           Specialty                  0   
   student_t(3, 0, 3.9)        sd Intercept Specialty                  0   
       source
      default
      default
 (vectorized)
 (vectorized)

print(fit1)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Hours | se(SE_mean) ~ 1 + (1 | Specialty) 
   Data: dat (Number of observations: 35) 
  Draws: 4 chains, each with iter = 5000; warmup = 4000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~Specialty (Number of levels: 35) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     2.84      0.37     2.20     3.65 1.01      418     1093

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     8.65      0.49     7.66     9.60 1.01      354      543

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.00      0.00     0.00     0.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The estimated mean research hours across discipline is 8.65 (7.66, 9.60).
There is substantial between specialty variation estimated at 2.84 (2.20, 3.65).
We can produce Bayesian model estimated research hours by discipline and rank them!

x <- posterior_samples(fit1)
names(x)

 [1] "b_Intercept"                                  
 [2] "sd_Specialty__Intercept"                      
 [3] "sigma"                                        
 [4] "r_Specialty[FamilyMedicine,Intercept]"        
 [5] "r_Specialty[DiagnosticRadiology,Intercept]"   
 [6] "r_Specialty[GeneralPathology,Intercept]"      
 [7] "r_Specialty[Ophthalmology,Intercept]"         
 [8] "r_Specialty[Anesthesia,Intercept]"            
 [9] "r_Specialty[Otolaryngology,Intercept]"        
[10] "r_Specialty[OrthopedicSurgery,Intercept]"     
[11] "r_Specialty[PlasticSurgery,Intercept]"        
[12] "r_Specialty[GeneralSurgery,Intercept]"        
[13] "r_Specialty[Obstetrics/Gynecology,Intercept]" 
[14] "r_Specialty[Dermatology,Intercept]"           
[15] "r_Specialty[Psychiatry,Intercept]"            
[16] "r_Specialty[Physiatry,Intercept]"             
[17] "r_Specialty[Urology,Intercept]"               
[18] "r_Specialty[Pediatrics,Intercept]"            
[19] "r_Specialty[NuclearMedicine,Intercept]"       
[20] "r_Specialty[GenInternalMedicine,Intercept]"   
[21] "r_Specialty[AnatomicPathology,Intercept]"     
[22] "r_Specialty[Rheumatology,Intercept]"          
[23] "r_Specialty[Immunology/Allergy,Intercept]"    
[24] "r_Specialty[Gastroenterology,Intercept]"      
[25] "r_Specialty[EmergencyMedicine,Intercept]"     
[26] "r_Specialty[Cardio-thoracicSurgery,Intercept]"
[27] "r_Specialty[Cardiology,Intercept]"            
[28] "r_Specialty[MedicalOncology,Intercept]"       
[29] "r_Specialty[Nephrology,Intercept]"            
[30] "r_Specialty[RadiationOncology,Intercept]"     
[31] "r_Specialty[Neurology,Intercept]"             
[32] "r_Specialty[Geriatrics,Intercept]"            
[33] "r_Specialty[Respirology,Intercept]"           
[34] "r_Specialty[CommunityMedicine,Intercept]"     
[35] "r_Specialty[Neurosurgery,Intercept]"          
[36] "r_Specialty[Hematology,Intercept]"            
[37] "r_Specialty[Microbiology/ID,Intercept]"       
[38] "r_Specialty[Endocrine/Metabolism,Intercept]"  
[39] "lprior"                                       
[40] "lp__"

names(x) <- gsub("r_Specialty","",names(x))
names(x) <- gsub(",Intercept","",names(x))
# x <- x %>% mutate(across(c(-sd_Specialty__Intercept,-sigma)))
x <- x[,-c(2,3,39,40)] #remove sd and likelihood estimates;
x <- x[,-1] + x[,1] #remove the overall mean estimate and add that value to each remaining columns to obtain specialty specific hours;

x %>% datatable(
  rownames = FALSE,
  options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 10))

Now creating table to show posterior summary of the estimated hours by specialty

ranks <- apply(x, MARGIN = 1, FUN = rank)
qranks <- round(apply(ranks, MARGIN = 1, FUN = quantile, c(0.5, 0.025, 0.975)),0)
kable(t(qranks))

	50%	2.5%	97.5%
[FamilyMedicine]	1	1	2
[DiagnosticRadiology]	3	2	5
[GeneralPathology]	3	1	7
[Ophthalmology]	4	2	7
[Anesthesia]	5	3	8
[Otolaryngology]	6	3	10
[OrthopedicSurgery]	7	4	10
[PlasticSurgery]	8	3	12
[GeneralSurgery]	8	5	11
[Obstetrics/Gynecology]	11	8	14
[Dermatology]	11	7	15
[Psychiatry]	12	10	14
[Physiatry]	13	8	17
[Urology]	14	8	18
[Pediatrics]	16	14	19
[NuclearMedicine]	17	10	24
[GenInternalMedicine]	18	15	21
[AnatomicPathology]	19	14	23
[Rheumatology]	19	14	24
[Immunology/Allergy]	19	9	28
[Gastroenterology]	20	15	25
[EmergencyMedicine]	23	18	28
[Cardio-thoracicSurgery]	22	16	28
[Cardiology]	24	20	28
[MedicalOncology]	26	20	33
[Nephrology]	26	20	33
[RadiationOncology]	29	23	34
[Neurology]	30	24	34
[Geriatrics]	29	21	35
[Respirology]	31	25	35
[CommunityMedicine]	30	24	35
[Neurosurgery]	30	21	35
[Hematology]	31	22	35
[Microbiology/ID]	33	28	35
[Endocrine/Metabolism]	34	27	35

For any pair of specialties, we can calculate the probability that one is more scientific than another

mean(x[,"[Microbiology/ID]"] < x[,"[Endocrine/Metabolism]"])

[1] 0.55675

mean(x[,"[Nephrology]"] < x[,"[Endocrine/Metabolism]"])

[1] 0.9425