While I have some inkling of how {gt} works fresh in my mind, it would be good to also come to grips with {gtsummary}, so here we go on another journey! From a cursory reading of examples I have seen it is a promising package that may replace {stargazer} (one of the workhorses I have come to love).
While I have some inkling of how {gt} works fresh in my mind, it would be good to also come to grips with {gtsummary}, a package that ” provides an elegant and flexible way to create publication-ready analytical and summary tables using the R programming language. The {gtsummary} package summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities.” The package’s tutorials are available here. Let us get going and use the trial
data bundled with the package:
Variable | Class | Label |
---|---|---|
trt | character | Chemotherapy Treatment |
age | numeric | Age |
marker | numeric | Marker Level (ng/mL) |
stage | factor | T Stage |
grade | factor | Grade |
response | integer | Tumor Response |
death | integer | Patient Died |
ttdeath | numeric | Months to Death/Censor |
library(gtsummary)
data(trial)
%>%
trial select(trt, age) %>%
tbl_summary()
Characteristic | N = 2001 |
---|---|
Chemotherapy Treatment | |
Drug A | 98 (49%) |
Drug B | 102 (51%) |
Age | 47 (38, 57) |
Unknown | 11 |
1 n (%); Median (IQR) |
The tbl_summary()
command will parse the variable type and then generate either the count and relative frequency (as a percentage of the total) for each unique level of a character or factor, and the median and inter-quartile range (IQR) if it is a numeric or integer variable. Missing values will show up as Unknown
for each variable in question, unless the label is tweaked (as shown below).
If we want to view the summary statistics by treatment (or any other categorical) group, then the by =
switch can be used. The add_p()
switch will calculate and report p-values, with the test(s) run added as footnote(s). The add_overall()
switch will give you an additional column with the statistics for the full data-set (i.e., ignoring the by =
switch). Similarly, add_n()
will give you the length of the data-set. If you want to customize the statistics and other elements, a lot is possible. Titles and subtitles are easy to add by leaning on gt
commands.
%>%
trial select(trt, age, stage, ttdeath) %>%
tbl_summary(
by = trt,
statistic = list(
all_continuous() ~ "{median}; {mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"
),digits = all_continuous() ~ 2,
label = c(age ~ "Patient's Age", stage ~ "T Stage"),
missing_text = "(Missing)"
%>%
) add_p() %>%
add_n() %>%
modify_header(label ~ "**Variable**") %>%
as_gt() %>%
::tab_header(
gttitle = "Table 1. Summary Statistics",
subtitle = "(N = 200)"
)
Table 1. Summary Statistics | ||||
(N = 200) | ||||
Variable | N | Drug A, N = 981 | Drug B, N = 1021 | p-value2 |
---|---|---|---|---|
Patient's Age | 189 | 46.00; 47.01 (14.71) | 48.00; 47.45 (14.01) | 0.7 |
(Missing) | 7 | 4 | ||
T Stage | 200 | 0.9 | ||
T1 | 28 / 98 (29%) | 25 / 102 (25%) | ||
T2 | 25 / 98 (26%) | 29 / 102 (28%) | ||
T3 | 22 / 98 (22%) | 21 / 102 (21%) | ||
T4 | 23 / 98 (23%) | 27 / 102 (26%) | ||
Months to Death/Censor | 200 | 23.51; 20.23 (4.99) | 21.22; 19.04 (5.50) | 0.14 |
1 Median; Mean (SD); n / N (%) | ||||
2 Wilcoxon rank sum test; Pearson’s Chi-squared test |
Summary tables are all well and good, but what about regression results? {stargazer}
has been the package I have used for presenting model results but {gtsummary}
may change that if it fits well in the work-flow. I’ll switch data-sets here and work with the diamonds
data-set. We can fit a simple linear regression model to have some estimates to work with.
library(tidyverse)
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22…
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very…
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, …
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1…
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, …
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 33…
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87…
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78…
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49…
%>%
diamonds mutate(cut.f = factor(cut, ordered = FALSE)) -> diamonds
lm(price ~ carat + cut.f, data = diamonds) -> lm01
%>%
lm01 tbl_regression(
intercept = TRUE
%>%
) add_n() %>%
bold_labels() %>%
italicize_levels() %>%
modify_header(label ~ "**Independent Variables**") %>%
add_glance_source_note(
label = list(
~ "Residual df",
df.residual ~ "Adjusted R-Squared"
adj.r.squared
),fmt_fun = df ~ style_number#,
#include = c(adj.r.squared, df.residual)
%>%
) as_gt() %>%
::tab_header(
gttitle = "Table 2. Linear Regression Estimates",
subtitle = "The Diamonds Data-set"
)
Table 2. Linear Regression Estimates | ||||
The Diamonds Data-set | ||||
Independent Variables | N | Beta | 95% CI1 | p-value |
---|---|---|---|---|
(Intercept) | 53,940 | -3,875 | -3,955, -3,796 | <0.001 |
carat | 53,940 | 7,871 | 7,844, 7,898 | <0.001 |
cut.f | 53,940 | |||
Fair | — | — | ||
Good | 1,120 | 1,035, 1,206 | <0.001 | |
Very Good | 1,510 | 1,431, 1,589 | <0.001 | |
Premium | 1,439 | 1,361, 1,517 | <0.001 | |
Ideal | 1,801 | 1,724, 1,878 | <0.001 | |
R² = 0.856; Adjusted R-Squared = 0.856; Sigma = 1,511; Statistic = 64,369; p-value = <0.001; df = 5; Log-likelihood = -471,420; AIC = 942,854; BIC = 942,916; Deviance = 123,212,493,961; Residual df = 53,934; No. Obs. = 53,940 | ||||
1 CI = Confidence Interval |
What about a logit model, for example?
glm(
~ trt + age + grade, data = trial, family = binomial
response -> glm01
)
%>%
glm01 tbl_regression(
intercept = TRUE,
exponentiate = TRUE
%>%
) add_n() %>%
bold_labels() %>%
italicize_levels() %>%
modify_header(label ~ "**Independent Variables**") %>%
add_glance_source_note() %>%
as_gt() %>%
::tab_header(
gttitle = "Table 3. Logit Estimates",
subtitle = "The Trial Data-set"
)
Table 3. Logit Estimates | ||||
The Trial Data-set | ||||
Independent Variables | N | OR1 | 95% CI1 | p-value |
---|---|---|---|---|
(Intercept) | 183 | 0.18 | 0.05, 0.61 | 0.007 |
Chemotherapy Treatment | 183 | |||
Drug A | — | — | ||
Drug B | 1.13 | 0.60, 2.13 | 0.7 | |
Age | 183 | 1.02 | 1.00, 1.04 | 0.10 |
Grade | 183 | |||
I | — | — | ||
II | 0.85 | 0.39, 1.85 | 0.7 | |
III | 1.01 | 0.47, 2.15 | >0.9 | |
Null deviance = 229; Null df = 182; Log-likelihood = -113; AIC = 235; BIC = 251; Deviance = 225; Residual df = 178; No. Obs. = 183 | ||||
1 OR = Odds Ratio, CI = Confidence Interval |
So, just the basics and little more than a quick tour but things are looking good with {gtsummary}
thus far.
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ruhil (2021, Feb. 13). From an Attican Hollow: Making Friends with {gtsummary}. Retrieved from https://aniruhil.org/posts/2021-02-13-making-friends-with-gtsummary/
BibTeX citation
@misc{ruhil2021making, author = {Ruhil, Ani}, title = {From an Attican Hollow: Making Friends with {gtsummary}}, url = {https://aniruhil.org/posts/2021-02-13-making-friends-with-gtsummary/}, year = {2021} }