Lecture 1 - Introduction to the course

Learning Objectives
  • Students and instructor introduction
  • Going over course syllabus
  • Reviewing of common observational study designs and reporting of observational study
  • Reviewing survival analysis (Biostatistics II)

About me

I am an Assistant Professor in Health Services Research (outcomes and evaluation method emphasis) at IHPME. I also hold a cross-appointment at the Division of Biostatistics. I received my PhD in Biostatistics from U of T under the supervision of Dr. Eleanor Pullenayegum.

My primary research focuses on developing methodology for statistical inference with complex longitudinal data in comparative effectiveness research. My areas of methodological interest include causal inference, Bayesian statistics, longitudinal data analysis, bias analysis, and mixture and joint modelling. If you are interested in working with me for your research/thesis projects, free feel to reach me.

Besides academic work and creating fun data visualizations in R (Covid Dashboard), I love watercolour painting, taking contemporary dance classes in Toronto, and playing piano as an adult beginner.

Introducing Yourself

You can share your program, research interests, learning goals, and hobby.

Syllabus

Detailed course syllabus is posted on Quercus.

Important Notes
  • Course schedule
  • In-person office hour
    • Tuesday 1:00pm to 2:00pm
    • Begins on Sept 19, 2022, ends on Nov 28, 2022
  • Course materials
    • The lecture notes will be the main resource for this course.
      • All course notes (or link to notes) will be posted on Quercus. Notes for certain sessions will be open-acess and hosted on this website, https://kuan-liu.github.io/biostat3/.
    • Additional reading materials (e.g., articles) will be posted on Quercus
    • Example R scripts/ R markdown files and data will be uploaded to Quercus.
  • R and RStudio
    • We will be primarily using R via RStudio for this course!
    • You are encouraged to write your individual assignment using Quarto or Rmarkdown.
    • Please bring your laptop to sessions on the course calendar that flagged with R in description.
  • ‘Grading’
    • Participation: 1% participation mark per week; Students can accumulate full 10% from any 10 out of 12 lecture weeks throughout the term.
    • Three group presentations for each lecture block (10% x 3 = 30%).
    • A team of 2 to 3 students for give a 10 mins presentation followed by 3 mins Q&A. I suggest collaborating with different peers for each group presentation. I am happy to facility random grouping if preferred.
    • Three individual assignments for each lecture block (15% x 2 + 30% = 60%). Assignment instruction and data will be handed out and submitted on Quercus.
    • Can I work with peers for individual assignments?
  • Auditing the course
    • Welcome but with caveats (past experience from other courses)
    • seriously committed (try not to miss classes)
    • expected to have the same level of participation in in-class activities
  • Quercus discussion board (Encourage!)
    • Free feel to answer your peer questions, help each other to learn!

Course overview

1. Lecture blocks

2. Motivation of Biostatistics III

  • Learn advance statistical modelling approaches for observational study and apply to your own research
  • writing publications that uses these methods, content knowledge to answer reviewer’s questions, as well as reviewing others work/project
  • building teamwork and collaborative experience
  • Understanding the importance of statistics in clinical and public health research
    • which analysis to apply when
    • how to address data limitations
    • feasibility and generalization
    • validity, accuracy, and fit
  • The aim of the course is to help you
    • design a complex observational study
    • develop statistical analysis plan
    • execute more advanced analyses
    • interpret outcome and report results

Observational study

1. BASIC points for designing research study (from Victor, Biostat II)

  1. Identify the primary hypothesis: Start simple: if multiple, choose one get it done, and repeat
  2. Determine the form of the primary outcome variable
  3. Identify any potential adjustments that need to made to the analysis
    • Be honest with yourself
    • Is there natural or forced clustering of sampled units
    • Are there important multi-level interactions
  4. Determine primary study variable (predictor/independent variable) and its form
  5. Perform simple analyses
    • Note that simple does not necessarily mean easy
    • If clustering needs to be taken into account in the regression, it needs to be taken into account in the basic statistics as well
  6. Determine if you want to use study sample, analytic sample, or both
  7. Perform regressions
    • Verify that assumptions hold
    • Where assumptions do not hold, determine if violation is severe enough to warrant adjustment
  8. Interpret results

2. The pyramid of scientif evidence, Varoni et al, 2014

  • RCT is not always the answer!
    • costly and virtually unfeasible to assess long-term exposure/intervention effect
    • require extensive research planning, infrastructures, and money
    • ethical reasons (e.g., can not randomize patients to harmful exposure)
  • Observational study is a viable design
    • can use readily available data
    • however, no free lunch
      • selection bias
      • confounding bias
      • information bias
      • missing data (this can also happen in RCT)
      • other data complexity (e.g., correlated data)
    • luckily, we can use appropriate statistical methods to mitigate these data challenges.

3. Common study designs from Setia, 2016

  • Cohort study is the most common design
Retrospective cohort
  • Research begins “at the time” of events

Prospective cohort
  • Research begins “at the time” of exposure

In-class activity for Thursday, Sept 7th
  • A team of 2 or 3 students work together to draft a study protocol of a mock clincial survival study
    • 2 slides max including the following:
      • Study objective
      • Primary endpoint
      • Secondary endpoint(s)
      • Design, observation window
      • Power
      • Data
      • Statistical analysis (with the knowledge you have now - if no idea we can discuss and revisit)
        • sex- and gender- based analysis considerations
      • Comment on feasibility and problems foreseen
  • We will spend roughly 30 mins going through study protocols.
  • Last but not least, reporting of observational studies in epidemiology, the STROEB guideline and checklist, https://www.strobe-statement.org/. link to pdf

Survival analysis review

  • Survival analysis is the study of survival times and of the factors that influence them.
  • Survival studies can involve
    • estimation of the survival distribution,
    • comparisons of the survival distributions of various treatments or interventions,
    • identification of the factors that influence survival times (inference)
    • prediction of survival probability and time (prediction)

1. Survival data and censoring

Survival Data
  • response variable is a non-negative discrete or continuous random variable, and represents the time from a well-defined origin to a well-defined event
    • (time, event), also called time-to-event data
Censoring classification
  • censoring arises when the starting or ending events are not precisely observed
    • most common type of censoring is right censoring where final endpoint is only known to exceed a particular value.
    • less common type of censoring is left censoring where events are known to have occurred before a certain time
    • interval censoring, where the failure time is only known to have occurred within in a specified interval of time
  • Type I
    • censoring times are pre-specified, e.g., a pre-specified ending time
    • right-censored
  • Type II
    • experimental objects are followed until a pre-specified fraction have failed, e.g., until 20% of the study participant experienced the event. less common in medical and public health research
    • right-censored
  • Random (or non-informative)
    • each subject has a censoring time that is statistically independent of their failure time
    • the observed time value is the minimum of the censoring and failure times
    • subjects whose failure time is greater than their censoring time are right-censored.
  • informative censoring can lead to biased survival estimates
    • patient dropout for non-disease related reasons vs disease related reason
    • can you assume independence between the event of interest and competing events?
  • Example survival data, Survival in a randomised trial comparing two treatments for ovarian cancer

  • Source: Edmunson, J.H., Fleming, T.R., Decker, D.G., Malkasian, G.D., Jefferies, J.A., Webb, M.J., and Kvols, L.K., Different Chemotherapeutic Sensitivities and Host Factors Affecting Prognosis in Advanced Ovarian Carcinoma vs. Minimal Residual Disease. Cancer Treatment Reports, 63:241-47, 1979.

  • futime: survival or censoring time

  • fustat: event status (1=event, 0=censoring)

  • age: in years

  • resid.ds: residual disease present (1=no,2=yes)

  • rx: treatment group

  • ecog.ps: ECOG performance status (1 is better, see reference)

#loading packages;
library(survival)
library(tidyverse)
#display the first 6 rows of data;
head(ovarian)
  futime fustat     age resid.ds rx ecog.ps
1     59      1 72.3315        2  1       1
2    115      1 74.4932        2  1       1
3    156      1 66.4658        2  1       2
4    421      0 53.3644        2  2       1
5    431      1 50.3397        2  1       1
6    448      0 56.4301        1  1       2
# number of events is 12;
table(ovarian$fustat)

 0  1 
14 12 
# plot with shapes to indicate censoring or event
ovarian2 <- ovarian %>% mutate(Subject = 1:length(ovarian$fustat))

ggplot(ovarian2, aes(Subject, futime, color = as.factor(fustat))) +
    geom_bar(stat = "identity", width = 0.1) + 
    geom_point(data = ovarian2, 
               aes(Subject, futime, color = as.factor(fustat),
                   shape = as.factor(fustat)),  size = 3) +
    scale_x_continuous(breaks = 1:length(ovarian$fustat)) +
    ylab("Follow-up time") +
    coord_flip() +
    theme_minimal() + 
    theme(legend.position = "bottom")

Excercises from Dirk F. Moore, 2016
  • Consider a simple example of five cancer patients who enter a clinical trial as illustrated in the following diagram:
  • Create a simple survival data based on this diagram
    • data to include, patient ID, cohort entry year, survival time and censoring indicator
    • Identify total number of patients who experienced the event within the study window.
    • Calculate total person-years and the death rate per person-year

2. Basic Concepts and Principles of Survival Analysis

The Hazard and Survival Functions
  • Survival function: probability of survival beyond time t.
  • Failure function: probability of event by time t

\[ S(t) = Pr(T>t), 0 < t < \infty \]

\[ F(t) = 1 - S(t) \] - Median Survival Time, \(S(t) = 1/2\).

  • Hazard function: instantaneous failure rate; the probability that, given that a subject has survived up to time t, they fail in the next small interval of time, divide by the length of that interval.

  • Cumulative hazard function: the integral of the hazard, or the area under the hazard function between times 0 and t; the number of events that would be expected for each individual by time t if the event were a repeatable process

\[ h(t) = \lim_{\delta \rightarrow 0} \frac{Pr(t<T<t+\delta \mid T>t)}{\delta} \]

\[ H(t) = \int_0^t h(s) ds \]

\[ S(t) = \exp(-H(t)) = \exp(- \int_0^t h(s) ds) \]

3. Nonparametric Survival Curve Estimation and comparison of Survival Distributions

Kaplan-Meier estimator for survival function

\[ \hat{S}(t) = \prod_{i:t_i \leq t}\Big( 1-\hat{q}_i\Big) = \prod_{i:t_i \leq t}\Big( 1- \frac{d_i}{n_i}\Big) \] where \(n_i\) is the number of subjects at risk at time \(t_i\), and \(d_i\) is the number of individuals who fail at that time.

  • Kaplan-Meier estimator has a variance expression known as the Greenwood formula (obtained using delta method), which can be used to construct 95% confidence interval of the estimated survival function.
Nelson-Aalen estimator for cumulative hazard function

\[ \hat{H}(t) = \sum_{i:t_i\leq t}\frac{d_i}{n_i} \] where \(n_i\) is the number of subjects at risk at time \(t_i\), and \(d_i\) is the number of individuals who fail at that time.

  • less informative than the survival curve
  • can be used for example for visually checking how constant the hazard rate is over time, since a constant hazard rate corresponds to linear cumulative hazard
Comparing survival functions
  • We can plot two or more KM-curves along with their respective confidence bands in the same figure, and see whether the intervals are overlapping at any given point in time.

  • For testing equivalence of two or more survival functions, we can use the non-parametric log-rank test.

  • log-rank test statistic aggregates the observed and expected event counts in each group and follows a \(\chi^2\) distribution with 1 degree of freedom (2 groups).

#overall survival;
fit0 <- survfit(Surv(futime, fustat) ~ 1, data=ovarian)
print(fit0)
Call: survfit(formula = Surv(futime, fustat) ~ 1, data = ovarian)

      n events median 0.95LCL 0.95UCL
[1,] 26     12    638     464      NA
summary(fit0)
Call: survfit(formula = Surv(futime, fustat) ~ 1, data = ovarian)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     26       1    0.962  0.0377        0.890        1.000
  115     25       1    0.923  0.0523        0.826        1.000
  156     24       1    0.885  0.0627        0.770        1.000
  268     23       1    0.846  0.0708        0.718        0.997
  329     22       1    0.808  0.0773        0.670        0.974
  353     21       1    0.769  0.0826        0.623        0.949
  365     20       1    0.731  0.0870        0.579        0.923
  431     17       1    0.688  0.0919        0.529        0.894
  464     15       1    0.642  0.0965        0.478        0.862
  475     14       1    0.596  0.0999        0.429        0.828
  563     12       1    0.546  0.1032        0.377        0.791
  638     11       1    0.497  0.1051        0.328        0.752
#survival by treatment group;
fit1 <- survfit(Surv(futime, fustat) ~ rx, data=ovarian)
print(fit1)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)

      n events median 0.95LCL 0.95UCL
rx=1 13      7    638     268      NA
rx=2 13      5     NA     475      NA
summary(fit1)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)

                rx=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     13       1    0.923  0.0739        0.789        1.000
  115     12       1    0.846  0.1001        0.671        1.000
  156     11       1    0.769  0.1169        0.571        1.000
  268     10       1    0.692  0.1280        0.482        0.995
  329      9       1    0.615  0.1349        0.400        0.946
  431      8       1    0.538  0.1383        0.326        0.891
  638      5       1    0.431  0.1467        0.221        0.840

                rx=2 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  353     13       1    0.923  0.0739        0.789        1.000
  365     12       1    0.846  0.1001        0.671        1.000
  464      9       1    0.752  0.1256        0.542        1.000
  475      8       1    0.658  0.1407        0.433        1.000
  563      7       1    0.564  0.1488        0.336        0.946
#survival plot using base R;
plot(fit1, col=1:2, xscale=365, lwd=2, 
     mark.time=TRUE, conf.int = TRUE,
     xlab="Days since study entry", ylab="Survival")
legend(0, .2, c("rx=1", "rx=2"), col=1:2, lwd=2, bty='n')

#survival plot using ggsurvplot;
library(survminer)
ggsurvplot(
    fit = fit1,
    ylab = "Overall survival probability",
    break.time.by = 180,
    conf.int = TRUE,
    risk.table = TRUE,
    pval = TRUE,
    surv.median.line = "hv")

# cumulative hazard plot using ggsurvplot;
ggsurvplot(fit1, 
           break.time.by = 180,
           conf.int = TRUE,
           risk.table = TRUE,
           fun = "cumhaz")

# Log-rank test;
# rounded P-value, vs extracting chi-square statistics and calculate P-value;
survdiff(Surv(futime, fustat) ~ rx, data=ovarian)
Call:
survdiff(formula = Surv(futime, fustat) ~ rx, data = ovarian)

      N Observed Expected (O-E)^2/E (O-E)^2/V
rx=1 13        7     5.23     0.596      1.06
rx=2 13        5     6.77     0.461      1.06

 Chisq= 1.1  on 1 degrees of freedom, p= 0.3 
sd<-survdiff(Surv(futime, fustat) ~ rx, data=ovarian)
1 - pchisq(sd$chisq, length(sd$n) - 1)
[1] 0.3025911
  • We obtain a p-value of 0.3, which is not statistically significant at the 5 % level. There is not enough evidence to conclude difference in survival between the two treatment groups.