library(tidyverse)
data <- read.csv("data/rhc.csv", header = TRUE)
# define exposure variable
data$A <- ifelse(data$swang1 == "No RHC", 0, 1)
# outcome is dth30, a binary outcome measuring survival status at day 30;
data$Y <- ifelse(data$dth30 == "No", 0, 1)
Tutorial on Causal Inference Using Machine Learning Methods
AI4PH Summer Institute, July 18, 2023
Welcome
- Welcome to the causal inference machine learning tutorial!
- Workshop materials are available in the GitHub repository AI4PH2023_CausalWorkshop
Learning objectives
Causal inference methods, such as propensity score analysis, have been established to permit causal inference from observational data. In recent years, a growing number of studies have explored the use of machine learning techniques in the causal modelling of complex health data subject to high-dimensional confounding and complex causal structures.
The objective of this tutorial is to introduce and demonstrate key machine learning methods used in causal inference for cross-sectional data with examples and ready-to-use code in the R programming language.
By the end of this session, participants should be able to perform causal analysis in R using several machine learning approaches, such as gradient boosting, regression trees, and SuperLearner.
Tutorial outline
- Introduction
- Section 1: Conventional causal approaches
- Section 2: Machine learning causal approaches
- Hands-on practice replicating tutorial examples (15-20 mins)
In preparation for the Tutorial
Participants are required to complete the following steps before the day of the workshop:
Install R and RStudio
- Windows operating system
  - install R, https://cran.r-project.org/bin/windows/base/
  - install RStudio, https://posit.co/download/rstudio-desktop/#download
- macOS operating system
  - install R, https://cran.r-project.org/bin/macosx/
  - install RStudio, https://posit.co/download/rstudio-desktop/#download
Verify access to the course page, https://kuan-liu.github.io/AI4PH2023_CausalWorkshop/
Clone or download the workshop repository: https://github.com/Kuan-Liu/AI4PH2023_CausalWorkshop
Install the following R packages
- data import, processing, and descriptive analysis: tidyverse, tableone, naniar
- causal analysis: MatchIt, cobalt, boot, survey, gfoRmula, EValue
- machine learning: SuperLearner, xgboost, bartCause, caret, glmnet
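The package list above can be checked against your local R library before the workshop. The following is a minimal sketch, assuming all packages are available from CRAN and that the E-value package is the one published on CRAN as `EValue`. The helper name `missing_packages` is hypothetical and not part of the workshop materials.

```r
# Packages named in the list above
# (assumption: "EValue" is the CRAN name of the E-value package)
pkgs <- c("tidyverse", "tableone", "naniar",
          "MatchIt", "cobalt", "boot", "survey", "gfoRmula", "EValue",
          "SuperLearner", "xgboost", "bartCause", "caret", "glmnet")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install call is left commented out so that running the sketch only reports what is missing.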
Dataset - The Right Heart Catheterization
For this tutorial, we will use the same right heart catheterization (RHC) dataset you saw this morning. The original JAMA paper (Connors et al. 1996) and the CSV data file can be found in the tutorial repo.
- We follow Brice’s morning session and this tutorial paper (Smith et al. 2022), which both used the same rhc dataset, to guide our data processing and causal analysis.
Data import and processing
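The import step recodes `swang1` and `dth30` into 0/1 indicators `A` and `Y` with `ifelse()`. Cross-tabulating the original and recoded columns is a quick way to confirm the coding. The sketch below uses a small made-up data frame (the four rows are illustrative, not taken from the RHC data) so that it runs without the CSV file.

```r
# Toy data with the same labels as the RHC columns (values are illustrative)
toy <- data.frame(
  swang1 = c("No RHC", "RHC", "RHC", "No RHC"),
  dth30  = c("No", "Yes", "No", "No")
)

# Same recoding as in the tutorial: 0 = reference level, 1 = otherwise
toy$A <- ifelse(toy$swang1 == "No RHC", 0, 1)
toy$Y <- ifelse(toy$dth30 == "No", 0, 1)

# Each original label should map to exactly one recoded value
table(original = toy$swang1, recoded = toy$A)
table(original = toy$dth30,  recoded = toy$Y)
```

Note that `ifelse()` returns `NA` when the input is `NA`, so missing labels would stay missing rather than being coded as 1.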
Data visualization on missing values
library(naniar)
gg_miss_var(data, facet=A, show_pct = TRUE)
# try changing facet to dth30, this examines missingness by outcome;
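The percentages that `gg_miss_var()` plots can also be computed as plain numbers, which helps when deciding which variables to drop. This is a minimal base R sketch on a made-up data frame; the variable names `x` and `z` are hypothetical stand-ins for covariates.

```r
# Toy data: A is the exposure, x and z are covariates with missing values
toy <- data.frame(
  A = c(0, 0, 1, 1),
  x = c(1, NA, 3, NA),
  z = c(NA, 2, 3, 4)
)

# Percentage of missing values per variable (overall)
pct_missing <- 100 * colMeans(is.na(toy))

# Percentage of missing values per variable within each exposure group;
# rows are variables, columns are the levels of A
by_group <- sapply(split(toy[c("x", "z")], toy$A),
                   function(d) 100 * colMeans(is.na(d)))

pct_missing
by_group
```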
Finalizing dataset for causal analysis
# we create our analysis data by removing variables with a large proportion of missing values;
# and variables not used in the analysis;
data2 <- select(data, -c(cat2, adld3p, urin1, swang1,
                         sadmdte, dschdte, dthdte, lstctdte, death, dth30,
                         surv2md1, das2d3pc, t3d30, ptid))
data2 <- rename(data2, id = X)
# display data on Quarto page;
library(DT)
data2 %>% datatable(
  rownames = FALSE,
  options = list(
    columnDefs = list(list(className = 'dt-center',
                           targets = 0:4))))
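For readers less familiar with dplyr, the `select()` and `rename()` step that builds `data2` can be sketched in base R. The toy data frame below is made up for illustration and keeps only a few of the RHC column names.

```r
# Toy data with a subset of the RHC column names (values are illustrative)
toy <- data.frame(
  X      = 1:3,
  age    = c(70, 78, 46),
  cat2   = c(NA, NA, "Colon Cancer"),
  swang1 = c("No RHC", "RHC", "RHC")
)

# Drop variables not used in the analysis (base R analogue of select(-c(...)))
drop_vars <- c("cat2", "swang1")
toy2 <- toy[, !(names(toy) %in% drop_vars), drop = FALSE]

# Rename X to id (base R analogue of rename(id = X))
names(toy2)[names(toy2) == "X"] <- "id"

names(toy2)
```

Checking `names()` after the step is a cheap way to confirm that only the intended columns remain.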
# verify data structure;
str(data2)
'data.frame': 5735 obs. of 51 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ cat1 : chr "COPD" "MOSF w/Sepsis" "MOSF w/Malignancy" "ARF" ...
$ ca : chr "Yes" "No" "Yes" "No" ...
$ cardiohx: int 0 1 0 0 0 0 0 0 0 0 ...
$ chfhx : int 0 1 0 0 0 1 0 0 0 0 ...
$ dementhx: int 0 0 0 0 0 0 0 0 0 0 ...
$ psychhx : int 0 0 0 0 0 0 0 0 0 0 ...
$ chrpulhx: int 1 0 0 0 0 1 0 0 0 0 ...
$ renalhx : int 0 0 0 0 0 0 0 0 0 0 ...
$ liverhx : int 0 0 0 0 0 0 0 0 0 0 ...
$ gibledhx: int 0 0 0 0 0 0 0 0 0 0 ...
$ malighx : int 1 0 1 0 0 0 1 0 0 1 ...
$ immunhx : int 0 1 1 1 0 0 0 0 0 0 ...
$ transhx : int 0 1 0 0 0 0 0 1 0 0 ...
$ amihx : int 0 0 0 0 0 0 0 0 0 0 ...
$ age : num 70.3 78.2 46.1 75.3 67.9 ...
$ sex : chr "Male" "Female" "Female" "Female" ...
$ edu : num 12 12 14.07 9 9.95 ...
$ aps1 : int 46 50 82 48 72 38 29 25 47 48 ...
$ scoma1 : int 0 0 0 0 41 0 26 100 0 0 ...
$ meanbp1 : num 41 63 57 55 65 115 67 128 53 73 ...
$ wblc1 : num 22.1 28.9 0.05 23.3 29.7 ...
$ hrt1 : int 124 137 130 58 125 134 135 102 118 141 ...
$ resp1 : num 10 38 40 26 27 36 10 34 30 40 ...
$ temp1 : num 38.7 38.9 36.4 35.8 34.8 ...
$ pafi1 : num 68 218 276 157 478 ...
$ alb1 : num 3.5 2.6 3.5 3.5 3.5 ...
$ hema1 : num 58 32.5 21.1 26.3 24 ...
$ bili1 : num 1.01 0.7 1.01 0.4 1.01 ...
$ crea1 : num 1.2 0.6 2.6 1.7 3.6 ...
$ sod1 : int 145 137 146 117 126 138 136 136 136 146 ...
$ pot1 : num 4 3.3 2.9 5.8 5.8 ...
$ paco21 : num 40 34 16 30 17 68 45 26 40 30 ...
$ ph1 : num 7.36 7.33 7.36 7.46 7.23 ...
$ wtkilo1 : num 64.7 45.7 0 54.6 78.4 ...
$ dnr1 : chr "No" "No" "No" "No" ...
$ ninsclas: chr "Medicare" "Private & Medicare" "Private" "Private & Medicare" ...
$ resp : chr "Yes" "No" "No" "Yes" ...
$ card : chr "Yes" "No" "Yes" "No" ...
$ neuro : chr "No" "No" "No" "No" ...
$ gastr : chr "No" "No" "No" "No" ...
$ renal : chr "No" "No" "No" "No" ...
$ meta : chr "No" "No" "No" "No" ...
$ hema : chr "No" "No" "No" "No" ...
$ seps : chr "No" "Yes" "No" "No" ...
$ trauma : chr "No" "No" "No" "No" ...
$ ortho : chr "No" "No" "No" "No" ...
$ race : chr "white" "white" "white" "white" ...
$ income : chr "Under $11k" "Under $11k" "$25-$50k" "$11-$25k" ...
$ A : num 0 1 1 0 1 0 0 0 0 1 ...
$ Y : num 0 0 0 0 1 0 0 0 0 0 ...
saveRDS(data2,file="data/data2")
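A file written with `saveRDS()` is read back with `readRDS()`, which returns the object so it can be assigned to any name. The sketch below shows the round trip on a toy data frame and a temporary file. The `.rds` extension is a common convention and an assumption here, since the tutorial saves to `data/data2` without an extension.

```r
# Toy stand-in for data2 (values are illustrative)
toy <- data.frame(id = 1:3, A = c(0, 1, 1), Y = c(0, 0, 1))

# Write to a temporary file so the sketch does not touch the data/ folder
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)

# Read the object back under a new name and confirm nothing changed
toy_reloaded <- readRDS(path)
identical(toy, toy_reloaded)

unlink(path)  # remove the temporary file
```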