Tutorial on Causal Inference Using Machine Learning Methods

AI4PH Summer Institute, July 18, 2023

Author

Kuan Liu

Published

2024-10-12

Welcome

  • Welcome to the causal inference machine learning tutorial!
  • Workshop materials are available in the GitHub repository AI4PH2023_CausalWorkshop

Learning objectives

Causal inference methods, such as propensity score analysis, are well established for drawing causal conclusions from observational data. In recent years, a growing number of studies have explored the use of machine learning techniques for causal modelling of complex health data subject to high-dimensional confounding and complex causal structures.

The objective of this tutorial is to introduce and demonstrate key machine learning methods used in causal inference for cross-sectional data with examples and ready-to-use code in the R programming language.

By the end of this session, participants should be able to perform causal analysis in R using several machine learning approaches, such as gradient boosting, regression trees, and SuperLearner.

Tutorial outline

In preparation for the Tutorial

Participants should complete the following steps before the day of the workshop:

  1. Install R and RStudio

  2. Verify access to the course page, https://kuan-liu.github.io/AI4PH2023_CausalWorkshop/

  3. Clone or download the workshop repository: https://github.com/Kuan-Liu/AI4PH2023_CausalWorkshop

  4. Install the following R packages:

    • data import, processing, and descriptive analysis: tidyverse, tableone, naniar
    • causal analysis: MatchIt, cobalt, boot, survey, gfoRmula, EValue
    • machine learning: SuperLearner, xgboost, bartCause, caret, glmnet
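
The package list in step 4 can be installed in one pass. The following is a minimal sketch, assuming every package is available from CRAN under the name shown (the E-value package is published as `EValue`). DT is added here only because it is loaded later on this page. If a package is not found for your R version, consult that package's own installation instructions.

```r
# Packages named in step 4, plus DT (used later on this page).
pkgs <- c("tidyverse", "tableone", "naniar",
          "MatchIt", "cobalt", "boot", "survey", "gfoRmula", "EValue",
          "SuperLearner", "xgboost", "bartCause", "caret", "glmnet",
          "DT")

# Identify packages that are not yet installed.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install only what is missing. The interactive() guard prevents
# downloads when this script is sourced non-interactively.
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Printing `missing_pkgs` after the block runs shows which packages were absent when the check was made.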

Dataset - The Right Heart Catheterization

For this tutorial, we will use the same right heart catheterization (RHC) dataset you saw this morning. The original JAMA paper (Connors et al. 1996) and the data CSV file can be found in the tutorial repo.

  • We follow Brice’s morning session and the tutorial paper by Smith et al. (2022), both of which use the same RHC dataset, to guide our data processing and causal analysis.

Data import and processing

library(tidyverse)
data <- read.csv("data/rhc.csv", header = TRUE)

# define the exposure variable: A = 0 for "No RHC", A = 1 otherwise
data$A <- ifelse(data$swang1 == "No RHC", 0, 1)

# define the outcome from dth30, the binary survival status at day 30:
# Y = 0 when dth30 is "No" (alive), Y = 1 otherwise (death by day 30)
data$Y <- ifelse(data$dth30 == "No", 0, 1)
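
Because `ifelse()` assigns 1 to every label other than the one tested, it can be worth cross-tabulating the original labels against the new indicators. The sketch below uses a small toy data frame with hypothetical values. The label "RHC" for treated patients is an assumption made for illustration, and the missing value is included only to show that `ifelse()` propagates `NA`.

```r
# Toy data mimicking the labels of swang1 and dth30 (hypothetical values;
# "RHC" as the treated label is an assumption).
toy <- data.frame(
  swang1 = c("No RHC", "RHC", "RHC", "No RHC", NA),
  dth30  = c("No", "Yes", "No", "No", "Yes")
)

# Same recoding as in the tutorial code above.
toy$A <- ifelse(toy$swang1 == "No RHC", 0, 1)
toy$Y <- ifelse(toy$dth30 == "No", 0, 1)

# Cross-tabulate original labels against the new indicators;
# useNA = "ifany" shows whether missing labels stayed missing.
table(toy$swang1, toy$A, useNA = "ifany")
table(toy$dth30, toy$Y, useNA = "ifany")
```

Each original label should map to exactly one indicator value. Any unexpected label (for example, a misspelling) would show up as an extra row coded 1.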

Data visualization on missing values

library(naniar)
gg_miss_var(data, facet = A, show_pct = TRUE)

# try changing facet to dth30 to examine missingness by outcome
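
The plot can be complemented by a numeric summary. The following is a base R sketch, not part of the original workshop code, that computes the percentage of missing values per variable within each exposure group. The toy data frame and its values are hypothetical; `alb1` and `ph1` are borrowed from the RHC variable names purely for illustration.

```r
# Toy data with hypothetical values; A is the exposure indicator.
toy <- data.frame(
  A    = c(0, 0, 0, 1, 1, 1),
  alb1 = c(3.5, NA, 2.6, NA, NA, 3.1),
  ph1  = c(7.36, 7.33, NA, 7.46, 7.23, 7.40)
)

# Percentage of missing values per variable within each exposure group.
covars <- setdiff(names(toy), "A")
pct_missing <- sapply(split(toy[covars], toy$A),
                      function(d) 100 * colMeans(is.na(d)))

# Rows are variables, columns are exposure groups ("0" and "1").
round(pct_missing, 1)
```

Applied to the real data, the same pattern would replace `toy` with `data` and could split on `Y` instead of `A` to examine missingness by outcome.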

Finalizing dataset for causal analysis

# create the analysis dataset by removing variables with a large proportion
# of missing values and variables not used in the analysis
data2 <- select(data, -c(cat2, adld3p, urin1, swang1,
                         sadmdte, dschdte, dthdte, lstctdte, death, dth30,
                         surv2md1, das2d3pc, t3d30, ptid)) 
data2 <- rename(data2, id = X)

# display data on Quarto page;
library(DT)
data2 %>% datatable(
  rownames = FALSE,
  options = list(
    columnDefs = list(list(className = 'dt-center', 
                      targets = 0:4))))
# verify data structure;
str(data2)
'data.frame':   5735 obs. of  51 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ cat1    : chr  "COPD" "MOSF w/Sepsis" "MOSF w/Malignancy" "ARF" ...
 $ ca      : chr  "Yes" "No" "Yes" "No" ...
 $ cardiohx: int  0 1 0 0 0 0 0 0 0 0 ...
 $ chfhx   : int  0 1 0 0 0 1 0 0 0 0 ...
 $ dementhx: int  0 0 0 0 0 0 0 0 0 0 ...
 $ psychhx : int  0 0 0 0 0 0 0 0 0 0 ...
 $ chrpulhx: int  1 0 0 0 0 1 0 0 0 0 ...
 $ renalhx : int  0 0 0 0 0 0 0 0 0 0 ...
 $ liverhx : int  0 0 0 0 0 0 0 0 0 0 ...
 $ gibledhx: int  0 0 0 0 0 0 0 0 0 0 ...
 $ malighx : int  1 0 1 0 0 0 1 0 0 1 ...
 $ immunhx : int  0 1 1 1 0 0 0 0 0 0 ...
 $ transhx : int  0 1 0 0 0 0 0 1 0 0 ...
 $ amihx   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age     : num  70.3 78.2 46.1 75.3 67.9 ...
 $ sex     : chr  "Male" "Female" "Female" "Female" ...
 $ edu     : num  12 12 14.07 9 9.95 ...
 $ aps1    : int  46 50 82 48 72 38 29 25 47 48 ...
 $ scoma1  : int  0 0 0 0 41 0 26 100 0 0 ...
 $ meanbp1 : num  41 63 57 55 65 115 67 128 53 73 ...
 $ wblc1   : num  22.1 28.9 0.05 23.3 29.7 ...
 $ hrt1    : int  124 137 130 58 125 134 135 102 118 141 ...
 $ resp1   : num  10 38 40 26 27 36 10 34 30 40 ...
 $ temp1   : num  38.7 38.9 36.4 35.8 34.8 ...
 $ pafi1   : num  68 218 276 157 478 ...
 $ alb1    : num  3.5 2.6 3.5 3.5 3.5 ...
 $ hema1   : num  58 32.5 21.1 26.3 24 ...
 $ bili1   : num  1.01 0.7 1.01 0.4 1.01 ...
 $ crea1   : num  1.2 0.6 2.6 1.7 3.6 ...
 $ sod1    : int  145 137 146 117 126 138 136 136 136 146 ...
 $ pot1    : num  4 3.3 2.9 5.8 5.8 ...
 $ paco21  : num  40 34 16 30 17 68 45 26 40 30 ...
 $ ph1     : num  7.36 7.33 7.36 7.46 7.23 ...
 $ wtkilo1 : num  64.7 45.7 0 54.6 78.4 ...
 $ dnr1    : chr  "No" "No" "No" "No" ...
 $ ninsclas: chr  "Medicare" "Private & Medicare" "Private" "Private & Medicare" ...
 $ resp    : chr  "Yes" "No" "No" "Yes" ...
 $ card    : chr  "Yes" "No" "Yes" "No" ...
 $ neuro   : chr  "No" "No" "No" "No" ...
 $ gastr   : chr  "No" "No" "No" "No" ...
 $ renal   : chr  "No" "No" "No" "No" ...
 $ meta    : chr  "No" "No" "No" "No" ...
 $ hema    : chr  "No" "No" "No" "No" ...
 $ seps    : chr  "No" "Yes" "No" "No" ...
 $ trauma  : chr  "No" "No" "No" "No" ...
 $ ortho   : chr  "No" "No" "No" "No" ...
 $ race    : chr  "white" "white" "white" "white" ...
 $ income  : chr  "Under $11k" "Under $11k" "$25-$50k" "$11-$25k" ...
 $ A       : num  0 1 1 0 1 0 0 0 0 1 ...
 $ Y       : num  0 0 0 0 1 0 0 0 0 0 ...
# save the analysis dataset
saveRDS(data2, file = "data/data2")
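
A dataset written with `saveRDS()` is read back with `readRDS()` using the same path. The sketch below shows the round trip on a toy data frame with hypothetical values, writing to a temporary file so that nothing in the workshop folder is touched. For the file saved above, the path `"data/data2"` would be passed to `readRDS()` instead.

```r
# Toy data frame standing in for data2 (hypothetical values).
toy <- data.frame(id = 1:3, A = c(0, 1, 1), Y = c(0, 0, 1))

# Write to a temporary file so the sketch leaves no files behind.
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)

# Read the object back and compare it with the original.
toy_restored <- readRDS(path)
all.equal(toy, toy_restored)
```

Unlike a CSV export, the RDS format keeps column types and attributes, so the restored object needs no re-processing.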

References

Connors, Alfred F, Theodore Speroff, Neal V Dawson, Charles Thomas, Frank E Harrell, Douglas Wagner, Norman Desbiens, et al. 1996. “The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients.” JAMA 276 (11): 889–97.
Smith, Matthew J, Mohammad A Mansournia, Camille Maringe, Paul N Zivich, Stephen R Cole, Clémence Leyrat, Aurélien Belot, Bernard Rachet, and Miguel A Luque-Fernandez. 2022. “Introduction to Computational Causal Inference Using Reproducible Stata, R, and Python Code: A Tutorial.” Statistics in Medicine 41 (2): 407–32.