Additional Topics under Survival Analysis

Introduction to causal survival analysis

Overview

This tutorial shows how to estimate population-level causal effect for time-to-event outcomes with propensity score weighting (PSW):

  • Common causal estimand of interest, the Average Treatment Effect (ATE) as
    • average survival probabilities difference between treated and untreated at a specific time using PS-weighted Kaplan-Meier* estimator
    • causal Hazard ratios, obtained from weighted marginal Cox model.

Why the hazard ratio is not a causal estimand

The Hazards of Hazard Ratios (Hernan)

Important

The hazard ratio compares instantaneous risks among those who have already survived to time \(t\). Because treatment can change who survives up to \(t\), the sets being compared are post-treatment selected, so the hazard ratio generally lacks a simple causal interpretation even with no unmeasured confounding.

Let \(T^{(a)}\) denote the potential event time under treatment \(a\in\{0,1\}\). The (cause-specific) hazard under \(a\) is \[ \lambda_a(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr\{t \le T^{(a)} < t+\Delta t \mid T^{(a)} \ge t\}}{\Delta t}, \] and the (time-specific) hazard ratio is \[ \mathrm{HR}(t) \;=\; \frac{\lambda_1(t)}{\lambda_0(t)}. \]

Two issues make \(\mathrm{HR}(t)\) non-causal:

  1. Conditioning on a post-treatment variable. The risk set at time \(t\) is those with \(T^{(a)}\!\ge t\). If treatment affects early survival, then \[ \{T^{(1)} \ge t\} \;\text{and}\; \{T^{(0)} \ge t\} \] include different compositions of individuals because of treatment itself. Thus, even if \((T^{(0)},T^{(1)}) \perp A \mid L\) (no unmeasured confounding given \(L\)), the event \(\{T^{(a)}\ge t\}\) is not independent of \(A\). Comparing \(\lambda_1(t)\) vs \(\lambda_0(t)\) is therefore a comparison of selected subpopulations at time \(t\), not a clean “effect of treatment on outcome.”

  2. Non-collapsibility. Even without confounding, marginal and conditional hazard ratios differ: \[ \frac{\lambda_1(t)}{\lambda_0(t)} \;\neq\; \sum_{\ell} w_\ell \,\frac{\lambda_1(t\mid L=\ell)}{\lambda_0(t\mid L=\ell)} , \] so the hazard ratio does not aggregate to a population-average causal contrast. One can observe \(\mathrm{HR}(t)\!\approx\!1\) while survival probabilities differ meaningfully across arms.

Survival Probability Causal Effect

A more transparent target is the survival probability causal effect (often called SPCE): \[ \Delta_S(t) \;=\; S_1(t) - S_0(t), \qquad S_a(t) \;=\; \Pr\{T^{(a)} > t\}. \]

\(\Delta_S(t)\) is a collapsible - the difference in the probability of being event-free at time \(t\) if everyone were treated vs if everyone were untreated. - We can compute \(\Delta_S(t)\) using propensity score weighted Kaplan-Meier estimation. - Extensions is possible to incorporate informative censoring and competing risks.

1. Hands on tutorial 1 on fitting PS-weighted marginal Cox and PS-weighted KM.

Mathematically, we will aim at estimating SPCE. - the survival probability causal effect (SPCE) at a chosen time \(t\), i.e. \[ \Delta_S(t) \equiv S_1(t) - S_0(t) \] where \(S_a(t)=Pr\{T^{(a)}>t\}\) is the counterfactual survival probability under treatment \(a, a\in \{0,1\}\).

  • Causal assumptions:
    • Stable unit treatmnet assumption (no multiple versions of treatment and no interference)
    • Consistency (i.e., observed outcome is consistent with their potential outcome under the treatment they actually received),
    • Positivity (i.e., non-zero probability of receiving any of the treatment levels),
    • Conditional exchangeability (also known as no unmeasured confounding).

1.1 Reading data

  • We will revisit the veteran dataset

  • Veterans’ Administration Lung Cancer study

    • trt: 1=standard 2=test
    • celltype: 1=squamous, 2=smallcell, 3=adeno, 4=large
    • time: survival time
    • status: censoring status
    • karno: Karnofsky performance score (100=good)
    • diagtime: months from diagnosis to randomization
    • age: in years
    • prior: prior therapy 0=no, 10=yes
  • Randomised trial of two treatment regimens for lung cancer. This is a standard survival analysis data set.

    • D Kalbfleisch and RL Prentice (1980), The Statistical Analysis of Failure Time Data. Wiley, New York.
library(survival)
library(dplyr)
library(ggplot2)
library(cobalt)
library(tableone)

data(veteran, package = "survival")
dat1 <- veteran %>%
  mutate(
    A     = as.integer(trt == 2),           # 1 = test regimen, 0 = standard
    time  = time,
    event = as.integer(status == 1),        # 1 = death, 0 = censored
    celltype = factor(celltype)
  ) %>%
  select(time, event, A, age, karno, diagtime, prior, celltype) %>%
  na.omit()


head(dat1)
  time event A age karno diagtime prior celltype
1   72     1 0  69    60        7     0 squamous
2  411     1 0  64    70        5    10 squamous
3  228     1 0  38    60        3     0 squamous
4  126     1 0  63    60        9    10 squamous
5  118     1 0  65    70       11    10 squamous
6   10     1 0  49    20        5     0 squamous

1.2 Checking baseline variable distribution by treatment

baselines <- c("age", "karno", "diagtime", "prior", "celltype")
tab0 <- CreateTableOne(vars = baselines,
                       data = dat1, 
                       strata = "A", 
                       test = FALSE, #mute P-value calculation;
                       smd = TRUE,
                       addOverall = TRUE)
print(tab0, smd = TRUE, showAllLevels = FALSE)
                      Stratified by A
                       Overall       0             1             SMD   
  n                      137            69            68               
  age (mean (SD))      58.31 (10.54) 57.51 (10.81) 59.12 (10.28)  0.153
  karno (mean (SD))    58.57 (20.04) 59.20 (18.74) 57.93 (21.40)  0.063
  diagtime (mean (SD))  8.77 (10.61)  8.65 (8.76)   8.90 (12.27)  0.023
  prior (mean (SD))     2.92 (4.56)   3.04 (4.64)   2.79 (4.52)   0.054
  celltype (%)                                                    0.465
     squamous             35 (25.5)     15 (21.7)     20 (29.4)        
     smallcell            48 (35.0)     30 (43.5)     18 (26.5)        
     adeno                27 (19.7)      9 (13.0)     18 (26.5)        
     large                27 (19.7)     15 (21.7)     12 (17.6)        

1.3 Propensity score weighting

library(WeightIt)
IPTW <- weightit(A ~ age + karno + diagtime + prior + celltype,
               data = dat1, 
               method = "glm", #using the default logistic regression;
               stabilize = TRUE,
               estimand = "ATE")
IPTW
A weightit object
 - method: "glm" (propensity score weighting with GLM)
 - number of obs.: 137
 - sampling weights: none
 - treatment: 2-category
 - estimand: ATE
 - covariates: age, karno, diagtime, prior, celltype
summary(IPTW)
                 Summary of weights

- Weight ranges:

           Min                                  Max
treated 0.6226 |---------------------------| 1.9439
control 0.6763  |---------------------|      1.7527

- Units with the 5 most extreme weights by group:
                                           
            101     96     95     98     91
 treated 1.4196 1.4639 1.4762 1.5009 1.9439
             11     52     47     46     48
 control 1.5396 1.5713 1.5865 1.7263 1.7527

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       0.258 0.205   0.031       0
control       0.277 0.223   0.035       0

- Effective Sample Sizes:

           Control Treated
Unweighted   69.      68. 
Weighted     64.15    63.8
# Diagnostics: covariate balance and positivity

# Positivity: PS overlap
bal.plot(IPTW,
         which="both",
         type = "density",
         colors = c("red","blue"))

bal.tab(IPTW, un=TRUE, thresholds = c(m=0.1))
Balance Measures
                       Type Diff.Un Diff.Adj    M.Threshold
prop.score         Distance  0.5154   0.0053 Balanced, <0.1
age                 Contin.  0.1527   0.0023 Balanced, <0.1
karno               Contin. -0.0635  -0.0065 Balanced, <0.1
diagtime            Contin.  0.0230  -0.0060 Balanced, <0.1
prior_10             Binary -0.0249  -0.0019 Balanced, <0.1
celltype_squamous    Binary  0.0767  -0.0048 Balanced, <0.1
celltype_smallcell   Binary -0.1701  -0.0009 Balanced, <0.1
celltype_adeno       Binary  0.1343   0.0058 Balanced, <0.1
celltype_large       Binary -0.0409  -0.0001 Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         9
Not Balanced, >0.1     0

Variable with the greatest mean difference
 Variable Diff.Adj    M.Threshold
    karno  -0.0065 Balanced, <0.1

Effective sample sizes
           Control Treated
Unadjusted   69.      68. 
Adjusted     64.15    63.8
love.plot(IPTW, 
          binary = "std", 
          grid = TRUE,
          thresholds = c(m = .1),
          colors = c("red","blue"))  

1.4 PS-weighted Kaplan Meier Estimator

dat1$w<-IPTW$weights # adding PS weights back to the data;
t_star <- 120
fitA1 <- survfit(Surv(time, event) ~ 1, data = subset(dat1, A==1), weights = dat1$w[dat1$A==1])
fitA0 <- survfit(Surv(time, event) ~ 1, data = subset(dat1, A==0), weights = dat1$w[dat1$A==0])

S1_t  <- as.numeric(summary(fitA1, times = t_star, extend = TRUE)$surv)
S0_t  <- as.numeric(summary(fitA0, times = t_star, extend = TRUE)$surv)
SPCE  <- S1_t - S0_t
SPCE
[1] -0.1170191
# Plot PS-weighted survival curves
plot(fitA0, mark.time = FALSE, xlab = "Days", ylab = "Survival",
     main = "PS-weighted KM (veteran)", lty = 2, conf.int = FALSE)
lines(fitA1, lty = 1, conf.int = FALSE)
abline(v = t_star, lty = 3)
legend("topright", c("A=0 (control)", "A=1 (treated)"), lty = c(2,1), bty = "n")

1.5 Bootstrap confidence interval for causal estimand

set.seed(123)
n <- nrow(dat1) #sample size;
B <- 1000 #bootstrap iterations;
boot_SPCE  <- numeric(B)

for (b in 1:B) {
  # Resample subjects with replacement
  idx <- sample.int(n, size = n, replace = TRUE)
  bdat <- dat1[idx, ]

  # fit PS and recompute stabilized IPTW (ATE)
  IPTWb <- weightit(A ~ age + karno + diagtime + prior + celltype,
               data = bdat, 
               method = "glm", #using the default logistic regression;
               stabilize = TRUE,
               estimand = "ATE")
  bdat$w <- IPTWb$weights
 
  fitA1_b <- survfit(Surv(time, event) ~ 1, data = subset(bdat, A==1),
                     weights = bdat$w[bdat$A==1])
  fitA0_b <- survfit(Surv(time, event) ~ 1, data = subset(bdat, A==0),
                     weights = bdat$w[bdat$A==0])

  S1_b <- as.numeric(summary(fitA1_b, times = t_star, extend = TRUE)$surv)
  S0_b <- as.numeric(summary(fitA0_b, times = t_star, extend = TRUE)$surv)
  boot_SPCE[b] <- S1_b - S0_b

}

# Percentile 95% CIs
quantile(boot_SPCE, c(0.025, 0.975))
       2.5%       97.5% 
-0.26620470  0.03855422 

1.6 “Causal” Hazard ratio (not causally interpretable!)

cox1 <- coxph(Surv(time, event) ~ A, data = dat1, weights = w, robust = TRUE)
summary(cox1)

2. Hands on tutorial on causal survival analysis with a competing event.

We now consider competing risks where the event of interest is death and the competing event is liver transplant. Let \(A\in\{0,1\}\) and in a competing-risk setting let the outcome of interest be death and the competing event be transplant.

For treatment option, the cumulative incidence function (CIF) for death is \[ F_a^{\text{death}}(t)=\Pr\{T_a \le t,\ \text{cause}=\text{death}\}. \] We target the absolute risk difference between treated and untreated as our causal esitmand of interest. \[ \Delta_F(t)=F_1^{\text{death}}(t)-F_0^{\text{death}}(t). \]

There are two weighting strategies for causal competing risk analysis

  • we have a competing event and censoring is considered non-informative)
    • Use IPTW to reweight the sample to the ATE target and then compute a weighted Aalen-Johansen estimator of the CIF. If administrative censoring is non-informative and independent of covariates, we do not need to estimate the censoring process or weight by censoring under this approach.
  • Alternatively, we can consider applying censoring weights and combine censoring weights with treatmet weights (e.g., IPTW + IPCW; continuous-time censoring model under informative censoring process).
    • Construct time-specific risk estimators that require IPCW and propose a continuous-time censoring model (e.g., Cox) to estimate time-specific plug-in estimator for the CIF of each event (primary and competing) under treated and control.

Take away

  • If you believe administrative censoring is non-informative and independent of covariates, use IPTW weighted AJ estimator.
  • If censoring may depend on \(A\) and/or \(L\), use IPTW x IPCW (continuous time censoring model). We will leave this out for this tutorial.

For reference, please see Austin, P. C., & Fine, J. P. (2025). Inverse probability of treatment weighting using the propensity score with competing risks in survival analysis. Statistics in Medicine, 44(5), e70009.

Causal assumptions

  • SUVTA
  • Conditional exchangeability (no unmeasured confounding)
  • Positivity on treatment and censoring where \(\Pr\{C^*_a > t\} > 0\) over the time.
  • Independent censoring given the model used for IPCW

2.1 Reaching data - Revisit the Mayo Clinic Primary Biliary Cirrhosis

  • An analysis based on the data can be found in Murtagh, et. al. (1994)

  • Murtaugh PA. Dickson ER. Van Dam GM. Malinchoc M. Grambsch PM. Langworthy AL. Gips CH. “Primary biliary cirrhosis: prediction of short-term survival based on repeated patient visits.” Hepatology. 20(1.1):126-34, 1994.

  • id: case number

  • age: in years

  • sex: m/f

  • trt: 1/2/NA for D-penicillmain, placebo, not randomized

  • time: number of days between registration and the earlier of death,

  • transplantation, or study analysis in July, 1986

  • status: status at endpoint, 0/1/2 for censored, transplant, dead

  • day: number of days between enrolment and this visit date

  • all measurements below refer to this date

    • albumin: serum albumin (mg/dl)
    • alk.phos: alkaline phosphotase (U/liter)
    • ascites: presence of ascites
    • ast: aspartate aminotransferase, once called SGOT (U/ml)
    • bili: serum bilirunbin (mg/dl)
    • chol: serum cholesterol (mg/dl)
    • copper: urine copper (ug/day)
    • edema: 0 no edema, 0.5 untreated or successfully treated 1 edema despite diuretic therapy
    • hepato: presence of hepatomegaly or enlarged liver
    • platelet: platelet count
    • protime: standardised blood clotting time
    • spiders: blood vessel malformations in the skin
    • stage: histologic stage of disease (needs biopsy)
    • trig: triglycerides (mg/dl)
  • In this data, patients are censored at the time of transplantation. The data do not follow patients who received transplant.

# Prepare analysis dataset
data(pbc, package = "survival")
dat2 <- pbc %>%
  filter(!is.na(trt)) %>%
  transmute(
    id    = row_number(),
    time  = time,                         # days
    A     = as.integer(trt == 1),         # 1 = D-penicillamine, 0 = placebo
    status = status,                      # 0=censor, 1=transplant, 2=death
    # baseline L:
    age, albumin, bili, edema, protime
  ) %>%
  na.omit()
head(dat2)
  id time A status      age albumin bili edema protime
1  1  400 1      2 58.76523    2.60 14.5   1.0    12.2
2  2 4500 1      0 56.44627    4.14  1.1   0.0    10.6
3  3 1012 1      2 70.07255    3.48  1.4   0.5    12.0
4  4 1925 1      2 54.74059    2.54  1.8   0.5    10.3
5  5 1504 0      1 38.10541    3.53  3.4   0.0    10.9
6  6 2503 0      2 66.25873    3.98  0.8   0.0    11.0

2.2 Checking baseline variable distribution by treatment

baselines <- c("age", "albumin", "bili", "edema", "protime")
tab0 <- CreateTableOne(vars = baselines,
                       data = dat2, 
                       strata = "A", 
                       test = FALSE, #mute P-value calculation;
                       smd = TRUE,
                       addOverall = TRUE)
print(tab0, smd = TRUE, showAllLevels = FALSE)
                     Stratified by A
                      Overall       0            1             SMD   
  n                     312           154          158               
  age (mean (SD))     50.02 (10.58) 48.58 (9.96) 51.42 (11.01)  0.270
  albumin (mean (SD))  3.52 (0.42)   3.52 (0.40)  3.52 (0.44)   0.018
  bili (mean (SD))     3.26 (4.53)   3.65 (5.28)  2.87 (3.63)   0.171
  edema (mean (SD))    0.11 (0.27)   0.11 (0.27)  0.11 (0.28)   0.025
  protime (mean (SD)) 10.73 (1.00)  10.80 (1.14) 10.65 (0.85)   0.146

2.3 Propensity score estimation

library(WeightIt)
IPTW <- weightit(A ~ age + albumin + bili + edema + protime,
               data = dat2, 
               method = "glm",
               stabilize = TRUE,
               estimand = "ATE")
IPTW
A weightit object
 - method: "glm" (propensity score weighting with GLM)
 - number of obs.: 312
 - sampling weights: none
 - treatment: 2-category
 - estimand: ATE
 - covariates: age, albumin, bili, edema, protime
summary(IPTW)
                 Summary of weights

- Weight ranges:

           Min                                  Max
treated 0.6873   |-------------------------| 1.9578
control 0.5936 |----------------------|      1.7189

- Units with the 5 most extreme weights by group:
                                           
             75    195     41     69    128
 treated  1.459 1.6043 1.6091 1.6419 1.9578
            283     14     97    214    238
 control 1.4078  1.415 1.4917 1.5694 1.7189

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       0.193 0.145   0.017       0
control       0.187 0.148   0.017       0

- Effective Sample Sizes:

           Control Treated
Unweighted  154.    158.  
Weighted    148.81  152.34
# Diagnostics: covariate balance and positivity

# Positivity: PS overlap
bal.plot(IPTW,
         which="both",
         type = "density",
         colors = c("red","blue"))

bal.tab(IPTW, un=TRUE, thresholds = c(m=0.1))
Balance Measures
               Type Diff.Un Diff.Adj    M.Threshold
prop.score Distance  0.3696   0.0048 Balanced, <0.1
age         Contin.  0.2703  -0.0009 Balanced, <0.1
albumin     Contin. -0.0180  -0.0025 Balanced, <0.1
bili        Contin. -0.1711  -0.0100 Balanced, <0.1
edema       Contin.  0.0247  -0.0014 Balanced, <0.1
protime     Contin. -0.1461  -0.0073 Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         6
Not Balanced, >0.1     0

Variable with the greatest mean difference
 Variable Diff.Adj    M.Threshold
     bili    -0.01 Balanced, <0.1

Effective sample sizes
           Control Treated
Unadjusted  154.    158.  
Adjusted    148.81  152.34
love.plot(IPTW, 
          binary = "std", 
          grid = TRUE,
          thresholds = c(m = .1),
          colors = c("red","blue"))  

dat2$w<-IPTW$weights # adding PS weights back to the data;

2.4 Weighted AJ estimator with transplant as the competing event under administrative, non-informative censoring

# we do not consider censoring weights under the assumption of non-informative censoring;
library(prodlim)
fit_cif_AJ <- prodlim(
  Hist(time, status) ~ A,
  data = dat2,
  caseweights = dat2$w
)

# Time horizon
t_star <- 365*2  # 2 years
F1_AJ <- predict(fit_cif_AJ, cause = 2, times = t_star,
                                       newdata = data.frame(A = 1))
F0_AJ <- predict(fit_cif_AJ, cause = 2, times = t_star,
                                       newdata = data.frame(A = 0))
RD_AJ <- F1_AJ - F0_AJ
RD_AJ
[1] -0.03139891
# cause-specific CIF
ts <- seq(180, 180*8, by=30)
F1_curve <- predict(fit_cif_AJ, cause = 2, times = ts, newdata = data.frame(A=1))
F0_curve <- predict(fit_cif_AJ, cause = 2, times = ts, newdata = data.frame(A=0))
df_AJ <- data.frame(t=ts, F1=F1_curve, F0=F0_curve, dF=F1_curve-F0_curve)

ggplot(df_AJ, aes(x = t)) +
  geom_step(aes(y = F0, color = "A=0 (control)", linetype = "A=0 (control)")) +
  geom_step(aes(y = F1, color = "A=1 (treated)", linetype = "A=1 (treated)")) +
  scale_color_manual(
    name = "",
    values = c("A=0 (control)" = "black", "A=1 (treated)" = "black")
  ) +
  scale_linetype_manual(
    name = "",
    values = c("A=0 (control)" = "dashed", "A=1 (treated)" = "solid")
  ) +
  labs(
    x = "Days",
    y = "CIF of death",
    title = "Cole: IPTW-weighted Aalen-Johansen (no IPCW)"
  ) +
  theme(plot.title = element_text(size = 11),
        legend.position = "top")

2.5 Adding Bootstrap confidence interval

set.seed(123)
n <- nrow(dat2) #sample size;
B <- 1000 #bootstrap iterations;
boot_wCIF  <- numeric(B)

for (b in 1:B) {
  # Resample subjects with replacement
  idx <- sample.int(n, size = n, replace = TRUE)
  bdat <- dat2[idx, ]

  # fit PS and recompute stabilized IPTW (ATE)
  IPTWb <- weightit(A ~ age + albumin + bili + edema + protime,
               data = bdat, 
               method = "glm",
               stabilize = TRUE,
               estimand = "ATE")
  bdat$w <- IPTWb$weights
 
  fit_cif_AJ <- prodlim(Hist(time, status) ~ A, data = bdat, caseweights = bdat$w)

  t_star <- 365*2  # 2 years
  F1_AJ <- predict(fit_cif_AJ, cause = 2, times = t_star,
                                         newdata = data.frame(A = 1))
  F0_AJ <- predict(fit_cif_AJ, cause = 2, times = t_star,
                                         newdata = data.frame(A = 0))
  RD_AJ <- F1_AJ - F0_AJ

  boot_wCIF[b] <- F1_AJ - F0_AJ

}

# Percentile 95% CIs
quantile(boot_wCIF, c(0.025, 0.975))
#        2.5%       97.5% 
# -0.08819970  0.02251391