From an Attican Hollow: Making Friends with {gtsummary}

Ani Ruhil

While I have some inkling of how {gt} works fresh in my mind, it would be good to also come to grips with {gtsummary}, a package that ” provides an elegant and flexible way to create publication-ready analytical and summary tables using the R programming language. The {gtsummary} package summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities.” The package’s tutorials are available here. Let us get going and use the trial data bundled with the package:

Variable	Class	Label
trt	character	Chemotherapy Treatment
age	numeric	Age
marker	numeric	Marker Level (ng/mL)
stage	factor	T Stage
grade	factor	Grade
response	integer	Tumor Response
death	integer	Patient Died
ttdeath	numeric	Months to Death/Censor

Summary Statistics Tables

library(gtsummary)
data(trial)

trial %>% 
  select(trt, age) %>%
  tbl_summary()

Characteristic	N = 200¹
Chemotherapy Treatment
Drug A	98 (49%)
Drug B	102 (51%)
Age	47 (38, 57)
Unknown	11
¹ n (%); Median (IQR)

The tbl_summary() command will parse the variable type and then generate either the count and relative frequency (as a percentage of the total) for each unique level of a character or factor, and the median and inter-quartile range (IQR) if it is a numeric or integer variable. Missing values will show up as Unknown for each variable in question, unless the label is tweaked (as shown below).

If we want to view the summary statistics by treatment (or any other categorical) group, then the by = switch can be used. The add_p() switch will calculate and report p-values, with the test(s) run added as footnote(s). The add_overall() switch will give you an additional column with the statistics for the full data-set (i.e., ignoring the by = switch). Similarly, add_n() will give you the length of the data-set. If you want to customize the statistics and other elements, a lot is possible. Titles and subtitles are easy to add by leaning on gt commands.

trial %>% 
  select(trt, age, stage, ttdeath) %>%
  tbl_summary(
    by = trt,
    statistic = list(
      all_continuous() ~ "{median}; {mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
      ),
    digits = all_continuous() ~ 2,
    label = c(age ~ "Patient's Age", stage ~ "T Stage"),
    missing_text = "(Missing)"
    ) %>%
  add_p() %>%
  add_n() %>%
  modify_header(label ~ "**Variable**") %>%
  as_gt() %>%
  gt::tab_header(
    title = "Table 1. Summary Statistics",
    subtitle = "(N = 200)"
    )

Variable	N	Drug A, N = 98¹	Drug B, N = 102¹	p-value²
Table 1. Summary Statistics
(N = 200)
Patient's Age	189	46.00; 47.01 (14.71)	48.00; 47.45 (14.01)	0.7
(Missing)		7	4
T Stage	200			0.9
T1		28 / 98 (29%)	25 / 102 (25%)
T2		25 / 98 (26%)	29 / 102 (28%)
T3		22 / 98 (22%)	21 / 102 (21%)
T4		23 / 98 (23%)	27 / 102 (26%)
Months to Death/Censor	200	23.51; 20.23 (4.99)	21.22; 19.04 (5.50)	0.14
¹ Median; Mean (SD); n / N (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

Model-based Estimates’ Tables

Summary tables are all well and good, but what about regression results? {stargazer} has been the package I have used for presenting model results but {gtsummary} may change that if it fits well in the work-flow. I’ll switch data-sets here and work with the diamonds data-set. We can fit a simple linear regression model to have some estimates to work with.

library(tidyverse)
glimpse(diamonds)

Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, …
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 33…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49…

diamonds %>%
  mutate(cut.f = factor(cut, ordered = FALSE)) -> diamonds

lm(price ~ carat + cut.f, data = diamonds) -> lm01

lm01 %>%
  tbl_regression(
    intercept = TRUE
  ) %>%
  add_n() %>%
  bold_labels() %>%
  italicize_levels() %>%
  modify_header(label ~ "**Independent Variables**") %>%
  add_glance_source_note(
    label = list(
      df.residual  ~ "Residual df",
      adj.r.squared ~ "Adjusted R-Squared"
      ),
    fmt_fun = df ~ style_number#,
    #include = c(adj.r.squared, df.residual)
  ) %>% 
  as_gt() %>%
  gt::tab_header(
    title = "Table 2. Linear Regression Estimates",
    subtitle = "The Diamonds Data-set"
    )

Independent Variables	N	Beta	95% CI¹	p-value
Table 2. Linear Regression Estimates
The Diamonds Data-set
(Intercept)	53,940	-3,875	-3,955, -3,796	<0.001
carat	53,940	7,871	7,844, 7,898	<0.001
cut.f	53,940
Fair		—	—
Good		1,120	1,035, 1,206	<0.001
Very Good		1,510	1,431, 1,589	<0.001
Premium		1,439	1,361, 1,517	<0.001
Ideal		1,801	1,724, 1,878	<0.001
R² = 0.856; Adjusted R-Squared = 0.856; Sigma = 1,511; Statistic = 64,369; p-value = <0.001; df = 5; Log-likelihood = -471,420; AIC = 942,854; BIC = 942,916; Deviance = 123,212,493,961; Residual df = 53,934; No. Obs. = 53,940
¹ CI = Confidence Interval

What about a logit model, for example?

glm(
  response ~ trt + age + grade, data = trial, family = binomial
  ) -> glm01

glm01 %>%
  tbl_regression(
    intercept = TRUE,
    exponentiate = TRUE
    ) %>%
  add_n() %>%
  bold_labels() %>%
  italicize_levels() %>%
  modify_header(label ~ "**Independent Variables**") %>%
    add_glance_source_note() %>%
  as_gt() %>%
  gt::tab_header(
    title = "Table 3. Logit Estimates",
    subtitle = "The Trial Data-set"
    )

Independent Variables	N	OR¹	95% CI¹	p-value
Table 3. Logit Estimates
The Trial Data-set
(Intercept)	183	0.18	0.05, 0.61	0.007
Chemotherapy Treatment	183
Drug A		—	—
Drug B		1.13	0.60, 2.13	0.7
Age	183	1.02	1.00, 1.04	0.10
Grade	183
I		—	—
II		0.85	0.39, 1.85	0.7
III		1.01	0.47, 2.15	>0.9
Null deviance = 229; Null df = 182; Log-likelihood = -113; AIC = 235; BIC = 251; Deviance = 225; Residual df = 178; No. Obs. = 183
¹ OR = Odds Ratio, CI = Confidence Interval

So, just the basics and little more than a quick tour but things are looking good with {gtsummary} thus far.

Making Friends with {gtsummary}

Summary Statistics Tables

Model-based Estimates’ Tables

Reuse

Citation