A quick LCA example in R using mclust package

Author

Kuan Liu

Revisiting the PBC dataset from the survival R package

  • The PBC dataset contains baseline clinical/lab measures for patients with primary biliary cirrhosis.
Variable Description Units / Levels
age Age years
albumin Serum albumin g/dL
alk.phos Alkaline phosphatase U/L
ascites Presence of ascites yes / no
ast Aspartate aminotransferase (AST; formerly SGOT) U/L
bili Serum bilirubin mg/dL
chol Serum cholesterol mg/dL
copper Urine copper \(\mu\)g/day
edema Edema severity 0 = none; 0.5 = untreated or successfully treated; 1 = edema despite diuretics
hepato Hepatomegaly (enlarged liver) yes / no
id Case number -
platelet Platelet count -
protime Standardized prothrombin time seconds
sex Sex m / f
spiders Cutaneous spider angiomas (“spiders”) yes / no
stage Histologic stage of disease (biopsy-based) -
status Endpoint status 0 = censored; 1 = transplant; 2 = dead
time Days from registration to death, transplantation, or study analysis (July 1986) days
trt Treatment assignment 1 = D-penicillamine; 2 = placebo; NA = not randomized
trig Triglycerides mg/dL
  • We use three continuous lab measures (bili, albumin, protime). We apply a log transform to bilirubin to reduce skew, then standardize all indicators to zero mean and unit variance for interpretability in profile plots.
library(mclust)
library(survival)
library(tidyverse)
library(forcats)
library(survminer)
library(gtsummary)

set.seed(123)

needed_vars <- c(
  "time","status","age_years","sex","trt","edema","ascites","hepato","spiders",
  "bili","albumin","protime"
)

dat <- survival::pbc %>%
  as_tibble() %>%
  transmute(
    id,
    time, status, death = as.integer(status == 2),
    age_years = age / 365.25,
    sex, trt, edema, ascites, hepato, spiders,
    bili, albumin, protime
  ) %>%
  tidyr::drop_na(dplyr::all_of(needed_vars)) %>%
  mutate(log_bili = log1p(bili))

# Standardized continuous indicators for LPA (continuous-only)
X <- scale(dat[, c("log_bili", "albumin", "protime")])

head(dat)
# A tibble: 6 × 15
     id  time status death age_years sex     trt edema ascites hepato spiders
  <int> <int>  <int> <int>     <dbl> <fct> <int> <dbl>   <int>  <int>   <int>
1     1   400      2     1     0.161 f         1   1         1      1       1
2     2  4500      0     0     0.155 f         1   0         0      1       1
3     3  1012      2     1     0.192 m         1   0.5       0      0       0
4     4  1925      2     1     0.150 f         1   0.5       0      1       1
5     5  1504      1     0     0.104 f         2   0         0      1       1
6     6  2503      2     1     0.181 f         2   0         0      1       0
# ℹ 4 more variables: bili <dbl>, albumin <dbl>, protime <dbl>, log_bili <dbl>

Quick summary of the data

tbl_overall <-
  dat %>%
  select(age_years, sex, trt, edema, ascites, hepato, spiders,
         bili, albumin, protime, log_bili, time, status, death) %>%
  tbl_summary(
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(all_continuous() ~ 2, all_categorical() ~ 1),
    missing = "ifany"
  ) %>%
  add_n()

tbl_overall
Characteristic N N = 3121
age_years 312 0.14 (0.03)
sex 312
    m
36.0 (11.5%)
    f
276.0 (88.5%)
trt 312
    1
158.0 (50.6%)
    2
154.0 (49.4%)
edema 312
    0
263.0 (84.3%)
    0.5
29.0 (9.3%)
    1
20.0 (6.4%)
ascites 312 24.0 (7.7%)
hepato 312 160.0 (51.3%)
spiders 312 90.0 (28.8%)
bili 312 3.26 (4.53)
albumin 312 3.52 (0.42)
protime 312 10.73 (1.00)
log_bili 312 1.13 (0.71)
time 312 2,006.36 (1,123.28)
status 312
    0
168.0 (53.8%)
    1
19.0 (6.1%)
    2
125.0 (40.1%)
death 312 125.0 (40.1%)
1 Mean (SD); n (%)

Latent profile analysis (clustering continuous variables)

  • LPA/LCA assumes that, conditional on class, indicators follow a multivariate normal distribution with class-specific means and covariance structure.

  • Consider candidate number of clusters from 2 clusters to 5 clusters

  • Best model is determined by

    1. Primary: largest BIC
    2. Supported by higher entropy and larger posterior membership probability;
  • Posterior class membership probabilities (per subject): For each individual, the model gives a vector of probabilities-one for each class (e.g., Person A: 0.72 in Class 1, 0.25 in Class 2, 0.03 in Class 3).

  • Entropy:takes all those individual probability vectors and boils them down to one summary number that reflects how concentrated they are on average.

Ks <- 2:5
fits <- vector("list", length(Ks))
names(fits) <- as.character(Ks)

entropy_norm <- function(Z) {
  p <- pmax(as.matrix(Z), .Machine$double.eps)
  N <- nrow(p); K <- ncol(p)
  -sum(p * log(p)) / (N * log(K))   # [0,1], higher = crisper
}

metrics <- lapply(Ks, function(K){
  fitK <- Mclust(X, G = K)  # lets mclust choose best covariance for this K
  fits[[as.character(K)]] <<- fitK
  post <- fitK$z
  tibble(
    K          = K,
    BIC        = fitK$bic,
    Entropy    = entropy_norm(post),
    AvgMaxPost = mean(apply(post, 1, max))
  )
}) %>%
  bind_rows()

metrics
# A tibble: 4 × 4
      K    BIC Entropy AvgMaxPost
  <int>  <dbl>   <dbl>      <dbl>
1     2 -2429.   0.260      0.930
2     3 -2405.   0.339      0.853
3     4 -2433.   0.382      0.783
4     5 -2445.   0.300      0.816
  • Given this example five number of classes is the best model (lower bic).

Interpret clustering results

fit5 <- fits[["5"]]

post5 <- fit5$z
post_df <- as_tibble(post5) %>% setNames(paste0("post_class", 1:5))
class5  <- factor(fit5$classification, levels = 1:5, labels = paste0("Class", 1:5))

dat_lpa <- dat %>%
  bind_cols(post_df) %>%
  mutate(class_hat = class5)

avg_max_post <- dat_lpa %>%
  select(starts_with("post_class")) %>%
  { do.call(pmax, .) } %>%
  mean()

dat_lpa %>%
  count(class_hat) 
# A tibble: 5 × 2
  class_hat     n
  <fct>     <int>
1 Class1       57
2 Class2       84
3 Class3       63
4 Class4       48
5 Class5       60
profile_df <- as.data.frame(X) %>%
  setNames(c("log_bili","albumin","protime")) %>%
  mutate(class = dat_lpa$class_hat) %>%
  pivot_longer(-class, names_to = "indicator", values_to = "z") %>%
  group_by(class, indicator) %>%
  summarize(mean_z = mean(z), .groups = "drop") %>%
  mutate(indicator = forcats::fct_relevel(indicator, c("log_bili","albumin","protime")))

p_profile <- ggplot(profile_df, aes(indicator, mean_z, group = class, linetype = class)) +
  geom_line() + geom_point() +
  labs(title = "Latent Profile Means (standardized), K = 5",
       x = NULL, y = "Mean (z-score)", linetype = "Class") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))
p_profile

dens_df <- as.data.frame(X) %>%
  setNames(c("log_bili","albumin","protime")) %>%
  mutate(class = dat_lpa$class_hat) %>%
  pivot_longer(cols = c(log_bili, albumin, protime),
               names_to = "indicator", values_to = "z") %>%
  mutate(indicator = forcats::fct_relevel(indicator, c("log_bili","albumin","protime")))

p_density <- ggplot(dens_df, aes(x = z, fill = class)) +
  geom_density(alpha = 0.35) +
  facet_wrap(~ indicator, scales = "free", nrow = 1) +
  labs(title = "Standardized Indicator Distributions by Class (K = 3)",
       x = "Z-score", y = "Density", fill = "Class") +
  theme_minimal()
p_density

  • examining association with other variables
tbl_by_class <-
  dat_lpa %>%
  select(
    class_hat,
    age_years, sex, trt, edema, ascites, hepato, spiders,
    bili, albumin, protime, log_bili,
    status, death
  ) %>%
  gtsummary::tbl_summary(
    by = class_hat,
    statistic = list(
      gtsummary::all_continuous()  ~ "{mean} ({sd})",
      gtsummary::all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(gtsummary::all_continuous() ~ 2),
    missing = "ifany"
  ) %>%
  gtsummary::add_n() %>%           # total N per column
  gtsummary::add_p() %>%           
  gtsummary::bold_labels()

tbl_by_class
Characteristic N Class1
N = 571
Class2
N = 841
Class3
N = 631
Class4
N = 481
Class5
N = 601
p-value2
age_years 312 0.15 (0.03) 0.14 (0.03) 0.14 (0.03) 0.13 (0.03) 0.13 (0.03) 0.005
sex 312




<0.001
    m
6 (11%) 3 (3.6%) 18 (29%) 6 (13%) 3 (5.0%)
    f
51 (89%) 81 (96%) 45 (71%) 42 (88%) 57 (95%)
trt 312




0.3
    1
23 (40%) 40 (48%) 36 (57%) 28 (58%) 31 (52%)
    2
34 (60%) 44 (52%) 27 (43%) 20 (42%) 29 (48%)
edema 312





    0
27 (47%) 79 (94%) 58 (92%) 44 (92%) 55 (92%)
    0.5
13 (23%) 5 (6.0%) 3 (4.8%) 4 (8.3%) 4 (6.7%)
    1
17 (30%) 0 (0%) 2 (3.2%) 0 (0%) 1 (1.7%)
ascites 312 18 (32%) 0 (0%) 5 (7.9%) 0 (0%) 1 (1.7%) <0.001
hepato 312 45 (79%) 23 (27%) 42 (67%) 26 (54%) 24 (40%) <0.001
spiders 312 31 (54%) 10 (12%) 22 (35%) 17 (35%) 10 (17%) <0.001
bili 312 9.23 (7.36) 0.60 (0.13) 3.43 (1.98) 3.29 (2.02) 1.09 (0.16) <0.001
albumin 312 3.08 (0.47) 3.71 (0.34) 3.56 (0.35) 3.47 (0.31) 3.67 (0.29) <0.001
protime 312 12.14 (1.19) 10.43 (0.53) 11.04 (0.41) 9.86 (0.21) 10.15 (0.51) <0.001
log_bili 312 2.03 (0.83) 0.47 (0.09) 1.40 (0.43) 1.37 (0.40) 0.73 (0.08) <0.001
status 312





    0
6 (11%) 72 (86%) 17 (27%) 28 (58%) 45 (75%)
    1
0 (0%) 1 (1.2%) 7 (11%) 8 (17%) 3 (5.0%)
    2
51 (89%) 11 (13%) 39 (62%) 12 (25%) 12 (20%)
death 312 51 (89%) 11 (13%) 39 (62%) 12 (25%) 12 (20%) <0.001
1 Mean (SD); n (%)
2 Kruskal-Wallis rank sum test; Pearson’s Chi-squared test; NA; Fisher’s exact test
km_fit <- survfit(Surv(time, death) ~ class_hat, data = dat_lpa)

# Nicely formatted KM with CI, p-value, and risk table
km_plot <- ggsurvplot(
  km_fit,
  data              = dat_lpa,
  conf.int          = TRUE,          # show confidence bands
  pval              = TRUE,          # log-rank p-value
  xlab              = "Days",
  ylab              = "Survival probability",
  legend.title      = "Class",
  legend            = "right",
  ggtheme           = theme_minimal(),
  break.time.by     = 365            # yearly ticks; adjust as you like
)

km_plot