The last example we worked through was to use regex
to convert data to the long format. Let us redo that exercise.
library(tidyverse)
library(tidylog)
library(DT)
read_csv(
"https://extranet.who.int/tme/generateCSV.asp?ds=notifications"
) %>%
filter(year == 2010) %>%
select(
country, year, contains("new_sp_")
) -> tb_sp
tb_sp %>%
pivot_longer(
cols = new_sp_m04:new_sp_fu,
names_to = c("test_type", "sex", "age_group"),
names_pattern = "(new_sp_)(.)(.*)",
values_to = "frequency"
) -> tb_sp.long
datatable(tb_sp.long[1:100, ])
Focus on names_to = c("test_type", "sex", "age_group")
… this is specifying the three name the three new columns should have.
In turn, names_pattern = "(new_sp_)(.)(.*)"
is specifying that the existing columns should be broken up into three pieces as follows:
new_sp_
demarcate the first new column test_type
(.)
demarcates the second new column sex
(.*)
demarcates the third new column age_group
The names_pattern()
will be tricky to decipher without good working knowledge of regular expressions (regex). Say we did not know regex. What could we do?
This will still pivot the columns so that the data are in the long
format, but we still need to split the test_type_sex_age_group
column into the three pieces of information it encapsulates. That is the subject of our next encounter with tidyr
– understanding how separate()
and unite()
work on data columns.
separate()
I would like to split test_type_sex_age_group
into three variables, and separate()
can do that with ease. All we need to do is to specify the column that needs to be split, the new column names that should be created, what separates the columns, and a few other bits.
col
indicates the column that is to be splitinto
spells out the names of the new columns that will be createdsep
indicates the column separator. This could be any alphanumeric character or, as in this case, specific positions. For example, here, we are saying split after the \(7^{th}\) position, which makes the first split at new_sp_
. The second split is supposed to be after the \(8^{th}\) position, which makes the second split at m
or f
. The rest is then the age-group indicators of 04
, 514
, etc.I left remove = FALSE
so that the original column was retained for illustration purposes. If I had set remove = TRUE
instead the original column wwould have been dropped after being split.
convert = TRUE
allows the operation to decipher if the new column should be numeriuc or a character. If you set it to FALSE then what should be numeric columns will be retained as character columns.
unite()
This is the opposite of separate()
, and allows us to combine the contents of two or more columns into a single column. In the example below, I am combining the state and county FIPS codes into a single column, and the state and county names into another column. Here the data to start with:
mydf <- cbind.data.frame(
statefips = c(39, 39, 39, 39),
countyfips = c("001", "003", "005", "007"),
state = c("Ohio", "Ohio", "Ohio", "Ohio"),
county = c("Adams", "Allen", "Ashland", "Ashtabula")
)
mydf
statefips countyfips state county
1 39 001 Ohio Adams
2 39 003 Ohio Allen
3 39 005 Ohio Ashland
4 39 007 Ohio Ashtabula
Now we lean on unite()
…
mydf %>%
unite(
col = "scfips",
statefips, countyfips,
sep = "",
remove = FALSE
) -> mydf.unite.01
mydf.unite.01
scfips statefips countyfips state county
1 39001 39 001 Ohio Adams
2 39003 39 003 Ohio Allen
3 39005 39 005 Ohio Ashland
4 39007 39 007 Ohio Ashtabula
Watch how the next example allows you to include a string and characters, if that is what you need:
mydf %>%
unite(
col = "scnames",
county, state,
sep = " County, ",
remove = FALSE
) -> mydf.unite.02
mydf.unite.02
statefips countyfips scnames state county
1 39 001 Adams County, Ohio Ohio Adams
2 39 003 Allen County, Ohio Ohio Allen
3 39 005 Ashland County, Ohio Ohio Ashland
4 39 007 Ashtabula County, Ohio Ohio Ashtabula
Note that here col
specifies the name of the new column to be created.
pivot_wider()
This is the opposite of pivot_longer()
and converts long data to the wide format. For example, say we have the following data:
datatable(fish_encounters, caption = "The Fish Encounters Data")
Note that in the code below, the names of the new columns aree being taken from the station
column while the values these columns will be populated with are being taken from the seen
column.
fish_encounters %>%
pivot_wider(
names_from = station,
values_from = seen
) -> fish.wide.01
datatable(fish.wide.01, caption = "The Fish Encounters Data in Wide Format (with NAs)")
Every fish is not seen at every station, leaving us some blank cells. If we want these blank cells to be populated with 0
, that is easy to do:
fish_encounters %>%
pivot_wider(
names_from = station,
values_from = seen,
values_fill = list(seen = 0)
) -> fish.wide.02
datatable(fish.wide.02, caption = "The Fish Encounters Data in Wide Format (with Zeroes)")
What if there are multiple unique things to be pivoted to the wide format? An example of the target data is shown below.
datatable(us_rent_income)
Notice that variable
assumes two unique values, (1) income, and (2) rent. Each also has two values – estimate
and an moe
(which stands for margin of error).
us_rent_income %>%
pivot_wider(
names_from = variable,
values_from = c(estimate, moe)
) -> rent.01
datatable(rent.01)
Notice how values_from
was used to make sure that the estimate and moe for income were attached to it while the estimate and moe for rent were attached to it, respectively.
values_fn()
We can also convert data to wide format and populate the new columns with aggregated values such as the mean, the sum, etc. See the data created below:
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = list(breaks = median)
) -> warps.wide
warps.wide
# A tibble: 3 x 3
tension A B
<fct> <dbl> <dbl>
1 L 51 29
2 M 21 28
3 H 24 17