A quick LCA example in R using mclust package

Author

Kuan Liu

Revisiting the PBC dataset from the survival R package

The PBC dataset contains baseline clinical/lab measures for patients with primary biliary cirrhosis.

Variable	Description	Units / Levels
age	Age	years
albumin	Serum albumin	g/dL
alk.phos	Alkaline phosphatase	U/L
ascites	Presence of ascites	yes / no
ast	Aspartate aminotransferase (AST; formerly SGOT)	U/L
bili	Serum bilirubin	mg/dL
chol	Serum cholesterol	mg/dL
copper	Urine copper	\(\mu\)g/day
edema	Edema severity	0 = none; 0.5 = untreated or successfully treated; 1 = edema despite diuretics
hepato	Hepatomegaly (enlarged liver)	yes / no
id	Case number	-
platelet	Platelet count	-
protime	Standardized prothrombin time	seconds
sex	Sex	m / f
spiders	Cutaneous spider angiomas (“spiders”)	yes / no
stage	Histologic stage of disease (biopsy-based)	-
status	Endpoint status	0 = censored; 1 = transplant; 2 = dead
time	Days from registration to death, transplantation, or study analysis (July 1986)	days
trt	Treatment assignment	1 = D-penicillamine; 2 = placebo; NA = not randomized
trig	Triglycerides	mg/dL

We use three continuous lab measures (bili, albumin, protime). We apply a log transform to bilirubin to reduce skew, then standardize all indicators to zero mean and unit variance for interpretability in profile plots.

library(mclust)
library(survival)
library(tidyverse)
library(forcats)
library(survminer)
library(gtsummary)

set.seed(123)

needed_vars <- c(
  "time","status","age_years","sex","trt","edema","ascites","hepato","spiders",
  "bili","albumin","protime"
)

dat <- survival::pbc %>%
  as_tibble() %>%
  transmute(
    id,
    time, status, death = as.integer(status == 2),
    age_years = age / 365.25,
    sex, trt, edema, ascites, hepato, spiders,
    bili, albumin, protime
  ) %>%
  tidyr::drop_na(dplyr::all_of(needed_vars)) %>%
  mutate(log_bili = log1p(bili))

# Standardized continuous indicators for LPA (continuous-only)
X <- scale(dat[, c("log_bili", "albumin", "protime")])

head(dat)

# A tibble: 6 × 15
     id  time status death age_years sex     trt edema ascites hepato spiders
  <int> <int>  <int> <int>     <dbl> <fct> <int> <dbl>   <int>  <int>   <int>
1     1   400      2     1     0.161 f         1   1         1      1       1
2     2  4500      0     0     0.155 f         1   0         0      1       1
3     3  1012      2     1     0.192 m         1   0.5       0      0       0
4     4  1925      2     1     0.150 f         1   0.5       0      1       1
5     5  1504      1     0     0.104 f         2   0         0      1       1
6     6  2503      2     1     0.181 f         2   0         0      1       0
# ℹ 4 more variables: bili <dbl>, albumin <dbl>, protime <dbl>, log_bili <dbl>

Quick summary of the data

tbl_overall <-
  dat %>%
  select(age_years, sex, trt, edema, ascites, hepato, spiders,
         bili, albumin, protime, log_bili, time, status, death) %>%
  tbl_summary(
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(all_continuous() ~ 2, all_categorical() ~ 1),
    missing = "ifany"
  ) %>%
  add_n()

tbl_overall

Characteristic	N	N = 312¹
age_years	312	0.14 (0.03)
sex	312
m		36.0 (11.5%)
f		276.0 (88.5%)
trt	312
1		158.0 (50.6%)
2		154.0 (49.4%)
edema	312
0		263.0 (84.3%)
0.5		29.0 (9.3%)
1		20.0 (6.4%)
ascites	312	24.0 (7.7%)
hepato	312	160.0 (51.3%)
spiders	312	90.0 (28.8%)
bili	312	3.26 (4.53)
albumin	312	3.52 (0.42)
protime	312	10.73 (1.00)
log_bili	312	1.13 (0.71)
time	312	2,006.36 (1,123.28)
status	312
0		168.0 (53.8%)
1		19.0 (6.1%)
2		125.0 (40.1%)
death	312	125.0 (40.1%)
¹ Mean (SD); n (%)

Latent profile analysis (clustering continuous variables)

LPA/LCA assumes that, conditional on class, indicators follow a multivariate normal distribution with class-specific means and covariance structure.
Consider candidate number of clusters from 2 clusters to 5 clusters
Best model is determined by
1. Primary: largest BIC
2. Supported by higher entropy and larger posterior membership probability;
Posterior class membership probabilities (per subject): For each individual, the model gives a vector of probabilities-one for each class (e.g., Person A: 0.72 in Class 1, 0.25 in Class 2, 0.03 in Class 3).
Entropy:takes all those individual probability vectors and boils them down to one summary number that reflects how concentrated they are on average.

Ks <- 2:5
fits <- vector("list", length(Ks))
names(fits) <- as.character(Ks)

entropy_norm <- function(Z) {
  p <- pmax(as.matrix(Z), .Machine$double.eps)
  N <- nrow(p); K <- ncol(p)
  -sum(p * log(p)) / (N * log(K))   # [0,1], higher = crisper
}

metrics <- lapply(Ks, function(K){
  fitK <- Mclust(X, G = K)  # lets mclust choose best covariance for this K
  fits[[as.character(K)]] <<- fitK
  post <- fitK$z
  tibble(
    K          = K,
    BIC        = fitK$bic,
    Entropy    = entropy_norm(post),
    AvgMaxPost = mean(apply(post, 1, max))
  )
}) %>%
  bind_rows()

metrics

# A tibble: 4 × 4
      K    BIC Entropy AvgMaxPost
  <int>  <dbl>   <dbl>      <dbl>
1     2 -2429.   0.260      0.930
2     3 -2405.   0.339      0.853
3     4 -2433.   0.382      0.783
4     5 -2445.   0.300      0.816

Given this example five number of classes is the best model (lower bic).

Interpret clustering results

fit5 <- fits[["5"]]

post5 <- fit5$z
post_df <- as_tibble(post5) %>% setNames(paste0("post_class", 1:5))
class5  <- factor(fit5$classification, levels = 1:5, labels = paste0("Class", 1:5))

dat_lpa <- dat %>%
  bind_cols(post_df) %>%
  mutate(class_hat = class5)

avg_max_post <- dat_lpa %>%
  select(starts_with("post_class")) %>%
  { do.call(pmax, .) } %>%
  mean()

dat_lpa %>%
  count(class_hat)

# A tibble: 5 × 2
  class_hat     n
  <fct>     <int>
1 Class1       57
2 Class2       84
3 Class3       63
4 Class4       48
5 Class5       60

profile_df <- as.data.frame(X) %>%
  setNames(c("log_bili","albumin","protime")) %>%
  mutate(class = dat_lpa$class_hat) %>%
  pivot_longer(-class, names_to = "indicator", values_to = "z") %>%
  group_by(class, indicator) %>%
  summarize(mean_z = mean(z), .groups = "drop") %>%
  mutate(indicator = forcats::fct_relevel(indicator, c("log_bili","albumin","protime")))

p_profile <- ggplot(profile_df, aes(indicator, mean_z, group = class, linetype = class)) +
  geom_line() + geom_point() +
  labs(title = "Latent Profile Means (standardized), K = 5",
       x = NULL, y = "Mean (z-score)", linetype = "Class") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))
p_profile

dens_df <- as.data.frame(X) %>%
  setNames(c("log_bili","albumin","protime")) %>%
  mutate(class = dat_lpa$class_hat) %>%
  pivot_longer(cols = c(log_bili, albumin, protime),
               names_to = "indicator", values_to = "z") %>%
  mutate(indicator = forcats::fct_relevel(indicator, c("log_bili","albumin","protime")))

p_density <- ggplot(dens_df, aes(x = z, fill = class)) +
  geom_density(alpha = 0.35) +
  facet_wrap(~ indicator, scales = "free", nrow = 1) +
  labs(title = "Standardized Indicator Distributions by Class (K = 3)",
       x = "Z-score", y = "Density", fill = "Class") +
  theme_minimal()
p_density

examining association with other variables

tbl_by_class <-
  dat_lpa %>%
  select(
    class_hat,
    age_years, sex, trt, edema, ascites, hepato, spiders,
    bili, albumin, protime, log_bili,
    status, death
  ) %>%
  gtsummary::tbl_summary(
    by = class_hat,
    statistic = list(
      gtsummary::all_continuous()  ~ "{mean} ({sd})",
      gtsummary::all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(gtsummary::all_continuous() ~ 2),
    missing = "ifany"
  ) %>%
  gtsummary::add_n() %>%           # total N per column
  gtsummary::add_p() %>%           
  gtsummary::bold_labels()

tbl_by_class

Characteristic	N	Class1 N = 57¹	Class2 N = 84¹	Class3 N = 63¹	Class4 N = 48¹	Class5 N = 60¹	p-value²
age_years	312	0.15 (0.03)	0.14 (0.03)	0.14 (0.03)	0.13 (0.03)	0.13 (0.03)	0.005
sex	312						<0.001
m		6 (11%)	3 (3.6%)	18 (29%)	6 (13%)	3 (5.0%)
f		51 (89%)	81 (96%)	45 (71%)	42 (88%)	57 (95%)
trt	312						0.3
1		23 (40%)	40 (48%)	36 (57%)	28 (58%)	31 (52%)
2		34 (60%)	44 (52%)	27 (43%)	20 (42%)	29 (48%)
edema	312
0		27 (47%)	79 (94%)	58 (92%)	44 (92%)	55 (92%)
0.5		13 (23%)	5 (6.0%)	3 (4.8%)	4 (8.3%)	4 (6.7%)
1		17 (30%)	0 (0%)	2 (3.2%)	0 (0%)	1 (1.7%)
ascites	312	18 (32%)	0 (0%)	5 (7.9%)	0 (0%)	1 (1.7%)	<0.001
hepato	312	45 (79%)	23 (27%)	42 (67%)	26 (54%)	24 (40%)	<0.001
spiders	312	31 (54%)	10 (12%)	22 (35%)	17 (35%)	10 (17%)	<0.001
bili	312	9.23 (7.36)	0.60 (0.13)	3.43 (1.98)	3.29 (2.02)	1.09 (0.16)	<0.001
albumin	312	3.08 (0.47)	3.71 (0.34)	3.56 (0.35)	3.47 (0.31)	3.67 (0.29)	<0.001
protime	312	12.14 (1.19)	10.43 (0.53)	11.04 (0.41)	9.86 (0.21)	10.15 (0.51)	<0.001
log_bili	312	2.03 (0.83)	0.47 (0.09)	1.40 (0.43)	1.37 (0.40)	0.73 (0.08)	<0.001
status	312
0		6 (11%)	72 (86%)	17 (27%)	28 (58%)	45 (75%)
1		0 (0%)	1 (1.2%)	7 (11%)	8 (17%)	3 (5.0%)
2		51 (89%)	11 (13%)	39 (62%)	12 (25%)	12 (20%)
death	312	51 (89%)	11 (13%)	39 (62%)	12 (25%)	12 (20%)	<0.001
¹ Mean (SD); n (%)
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test; NA; Fisher’s exact test

km_fit <- survfit(Surv(time, death) ~ class_hat, data = dat_lpa)

# Nicely formatted KM with CI, p-value, and risk table
km_plot <- ggsurvplot(
  km_fit,
  data              = dat_lpa,
  conf.int          = TRUE,          # show confidence bands
  pval              = TRUE,          # log-rank p-value
  xlab              = "Days",
  ylab              = "Survival probability",
  legend.title      = "Class",
  legend            = "right",
  ggtheme           = theme_minimal(),
  break.time.by     = 365            # yearly ticks; adjust as you like
)

km_plot