Making Friends with {gtsummary}

R

While I have some inkling of how {gt} works fresh in my mind, it would be good to also come to grips with {gtsummary}, so here we go on another journey! From a cursory reading of examples I have seen it is a promising package that may replace {stargazer} (one of the workhorses I have come to love).

Ani Ruhil true
02-13-2021

While I have some inkling of how {gt} works fresh in my mind, it would be good to also come to grips with {gtsummary}, a package that ” provides an elegant and flexible way to create publication-ready analytical and summary tables using the R programming language. The {gtsummary} package summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities.” The package’s tutorials are available here. Let us get going and use the trial data bundled with the package:

Variable Class Label
trt character Chemotherapy Treatment
age numeric Age
marker numeric Marker Level (ng/mL)
stage factor T Stage
grade factor Grade
response integer Tumor Response
death integer Patient Died
ttdeath numeric Months to Death/Censor

Summary Statistics Tables

library(gtsummary)
data(trial)

trial %>% 
  select(trt, age) %>%
  tbl_summary()
Characteristic N = 2001
Chemotherapy Treatment
    Drug A 98 (49%)
    Drug B 102 (51%)
Age 47 (38, 57)
    Unknown 11
1 n (%); Median (IQR)

The tbl_summary() command will parse the variable type and then generate either the count and relative frequency (as a percentage of the total) for each unique level of a character or factor, and the median and inter-quartile range (IQR) if it is a numeric or integer variable. Missing values will show up as Unknown for each variable in question, unless the label is tweaked (as shown below).

If we want to view the summary statistics by treatment (or any other categorical) group, then the by = switch can be used. The add_p() switch will calculate and report p-values, with the test(s) run added as footnote(s). The add_overall() switch will give you an additional column with the statistics for the full data-set (i.e., ignoring the by = switch). Similarly, add_n() will give you the length of the data-set. If you want to customize the statistics and other elements, a lot is possible. Titles and subtitles are easy to add by leaning on gt commands.

trial %>% 
  select(trt, age, stage, ttdeath) %>%
  tbl_summary(
    by = trt,
    statistic = list(
      all_continuous() ~ "{median}; {mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
      ),
    digits = all_continuous() ~ 2,
    label = c(age ~ "Patient's Age", stage ~ "T Stage"),
    missing_text = "(Missing)"
    ) %>%
  add_p() %>%
  add_n() %>%
  modify_header(label ~ "**Variable**") %>%
  as_gt() %>%
  gt::tab_header(
    title = "Table 1. Summary Statistics",
    subtitle = "(N = 200)"
    )  
Table 1. Summary Statistics
(N = 200)
Variable N Drug A, N = 981 Drug B, N = 1021 p-value2
Patient's Age 189 46.00; 47.01 (14.71) 48.00; 47.45 (14.01) 0.7
    (Missing) 7 4
T Stage 200 0.9
    T1 28 / 98 (29%) 25 / 102 (25%)
    T2 25 / 98 (26%) 29 / 102 (28%)
    T3 22 / 98 (22%) 21 / 102 (21%)
    T4 23 / 98 (23%) 27 / 102 (26%)
Months to Death/Censor 200 23.51; 20.23 (4.99) 21.22; 19.04 (5.50) 0.14
1 Median; Mean (SD); n / N (%)
2 Wilcoxon rank sum test; Pearson’s Chi-squared test

Model-based Estimates’ Tables

Summary tables are all well and good, but what about regression results? {stargazer} has been the package I have used for presenting model results but {gtsummary} may change that if it fits well in the work-flow. I’ll switch data-sets here and work with the diamonds data-set. We can fit a simple linear regression model to have some estimates to work with.

library(tidyverse)
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, …
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 33…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49…
diamonds %>%
  mutate(cut.f = factor(cut, ordered = FALSE)) -> diamonds

lm(price ~ carat + cut.f, data = diamonds) -> lm01

lm01 %>%
  tbl_regression(
    intercept = TRUE
  ) %>%
  add_n() %>%
  bold_labels() %>%
  italicize_levels() %>%
  modify_header(label ~ "**Independent Variables**") %>%
  add_glance_source_note(
    label = list(
      df.residual  ~ "Residual df",
      adj.r.squared ~ "Adjusted R-Squared"
      ),
    fmt_fun = df ~ style_number#,
    #include = c(adj.r.squared, df.residual)
  ) %>% 
  as_gt() %>%
  gt::tab_header(
    title = "Table 2. Linear Regression Estimates",
    subtitle = "The Diamonds Data-set"
    ) 
Table 2. Linear Regression Estimates
The Diamonds Data-set
Independent Variables N Beta 95% CI1 p-value
(Intercept) 53,940 -3,875 -3,955, -3,796 <0.001
carat 53,940 7,871 7,844, 7,898 <0.001
cut.f 53,940
    Fair
    Good 1,120 1,035, 1,206 <0.001
    Very Good 1,510 1,431, 1,589 <0.001
    Premium 1,439 1,361, 1,517 <0.001
    Ideal 1,801 1,724, 1,878 <0.001
R² = 0.856; Adjusted R-Squared = 0.856; Sigma = 1,511; Statistic = 64,369; p-value = <0.001; df = 5; Log-likelihood = -471,420; AIC = 942,854; BIC = 942,916; Deviance = 123,212,493,961; Residual df = 53,934; No. Obs. = 53,940
1 CI = Confidence Interval

What about a logit model, for example?

glm(
  response ~ trt + age + grade, data = trial, family = binomial
  ) -> glm01

glm01 %>%
  tbl_regression(
    intercept = TRUE,
    exponentiate = TRUE
    ) %>%
  add_n() %>%
  bold_labels() %>%
  italicize_levels() %>%
  modify_header(label ~ "**Independent Variables**") %>%
    add_glance_source_note() %>%
  as_gt() %>%
  gt::tab_header(
    title = "Table 3. Logit Estimates",
    subtitle = "The Trial Data-set"
    )  
Table 3. Logit Estimates
The Trial Data-set
Independent Variables N OR1 95% CI1 p-value
(Intercept) 183 0.18 0.05, 0.61 0.007
Chemotherapy Treatment 183
    Drug A
    Drug B 1.13 0.60, 2.13 0.7
Age 183 1.02 1.00, 1.04 0.10
Grade 183
    I
    II 0.85 0.39, 1.85 0.7
    III 1.01 0.47, 2.15 >0.9
Null deviance = 229; Null df = 182; Log-likelihood = -113; AIC = 235; BIC = 251; Deviance = 225; Residual df = 178; No. Obs. = 183
1 OR = Odds Ratio, CI = Confidence Interval

So, just the basics and little more than a quick tour but things are looking good with {gtsummary} thus far.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Ruhil (2021, Feb. 13). From an Attican Hollow: Making Friends with {gtsummary}. Retrieved from https://aniruhil.org/posts/2021-02-13-making-friends-with-gtsummary/

BibTeX citation

@misc{ruhil2021making,
  author = {Ruhil, Ani},
  title = {From an Attican Hollow: Making Friends with {gtsummary}},
  url = {https://aniruhil.org/posts/2021-02-13-making-friends-with-gtsummary/},
  year = {2021}
}