Tidy Data (2/2): Wide to Long with pivot_longer()

@aniruhil
Updated on 2020-12-18

Continuing to tidy your data

The last example we worked through was to use regex to convert data to the long format. Let us redo that exercise.

library(tidyverse)
library(tidylog)
library(DT)

read_csv(
  "https://extranet.who.int/tme/generateCSV.asp?ds=notifications"
  ) %>%
  filter(year == 2010) %>%
  select(
    country, year, contains("new_sp_")
    ) -> tb_sp

tb_sp %>%
  pivot_longer(
    cols = new_sp_m04:new_sp_fu,
    names_to = c("test_type", "sex", "age_group"),
    names_pattern = "(new_sp_)(.)(.*)",
    values_to = "frequency"
  ) -> tb_sp.long

datatable(tb_sp.long[1:100, ]) 

Focus on names_to = c("test_type", "sex", "age_group") … this is specifying the three name the three new columns should have.

In turn, names_pattern = "(new_sp_)(.)(.*)" is specifying that the existing columns should be broken up into three pieces as follows:

The names_pattern() will be tricky to decipher without good working knowledge of regular expressions (regex). Say we did not know regex. What could we do?

tb_sp %>%
  group_by(country, year) %>%
  gather(test_type_sex_age_group, frequency, 3:22,
         convert = TRUE) -> tab_sp_long_old

datatable(tab_sp_long_old)

This will still pivot the columns so that the data are in the long format, but we still need to split the test_type_sex_age_group column into the three pieces of information it encapsulates. That is the subject of our next encounter with tidyr – understanding how separate() and unite() work on data columns.

separate()

I would like to split test_type_sex_age_group into three variables, and separate() can do that with ease. All we need to do is to specify the column that needs to be split, the new column names that should be created, what separates the columns, and a few other bits.

tab_sp_long_old %>%
  separate(
    col = test_type_sex_age_group,
    into = c("test_type", "sex", "age_group"),
    sep = c(7, 8),
    remove = FALSE,
    convert = TRUE
  ) -> tab_sp_split

datatable(tab_sp_split[1:100, ])

I left remove = FALSE so that the original column was retained for illustration purposes. If I had set remove = TRUE instead the original column wwould have been dropped after being split.

convert = TRUE allows the operation to decipher if the new column should be numeriuc or a character. If you set it to FALSE then what should be numeric columns will be retained as character columns.

unite()

This is the opposite of separate(), and allows us to combine the contents of two or more columns into a single column. In the example below, I am combining the state and county FIPS codes into a single column, and the state and county names into another column. Here the data to start with:

mydf <- cbind.data.frame(
  statefips = c(39, 39, 39, 39),
  countyfips = c("001", "003", "005", "007"),
  state = c("Ohio", "Ohio", "Ohio", "Ohio"),
  county = c("Adams", "Allen", "Ashland", "Ashtabula")
  )

mydf
  statefips countyfips state    county
1        39        001  Ohio     Adams
2        39        003  Ohio     Allen
3        39        005  Ohio   Ashland
4        39        007  Ohio Ashtabula

Now we lean on unite()

mydf %>%
  unite(
    col = "scfips",
    statefips, countyfips,
    sep = "",
    remove = FALSE
    ) -> mydf.unite.01

mydf.unite.01
  scfips statefips countyfips state    county
1  39001        39        001  Ohio     Adams
2  39003        39        003  Ohio     Allen
3  39005        39        005  Ohio   Ashland
4  39007        39        007  Ohio Ashtabula

Watch how the next example allows you to include a string and characters, if that is what you need:

mydf %>%
    unite(
        col = "scnames", 
        county, state,
        sep = " County, ",
        remove = FALSE
    ) -> mydf.unite.02

mydf.unite.02
  statefips countyfips                scnames state    county
1        39        001     Adams County, Ohio  Ohio     Adams
2        39        003     Allen County, Ohio  Ohio     Allen
3        39        005   Ashland County, Ohio  Ohio   Ashland
4        39        007 Ashtabula County, Ohio  Ohio Ashtabula

Note that here col specifies the name of the new column to be created.

pivot_wider()

This is the opposite of pivot_longer() and converts long data to the wide format. For example, say we have the following data:

datatable(fish_encounters, caption = "The Fish Encounters Data")

Note that in the code below, the names of the new columns aree being taken from the station column while the values these columns will be populated with are being taken from the seen column.

fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen
    ) -> fish.wide.01

datatable(fish.wide.01, caption = "The Fish Encounters Data in Wide Format (with NAs)")

Every fish is not seen at every station, leaving us some blank cells. If we want these blank cells to be populated with 0, that is easy to do:

fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = list(seen = 0)
    ) -> fish.wide.02

datatable(fish.wide.02, caption = "The Fish Encounters Data in Wide Format (with Zeroes)")

What if there are multiple unique things to be pivoted to the wide format? An example of the target data is shown below.

datatable(us_rent_income)

Notice that variable assumes two unique values, (1) income, and (2) rent. Each also has two values – estimate and an moe (which stands for margin of error).

us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
    ) -> rent.01

datatable(rent.01)

Notice how values_from was used to make sure that the estimate and moe for income were attached to it while the estimate and moe for rent were attached to it, respectively.

values_fn()

We can also convert data to wide format and populate the new columns with aggregated values such as the mean, the sum, etc. See the data created below:

warpbreaks <- as_tibble(
  warpbreaks[c("wool", "tension", "breaks")]
  )

datatable(warpbreaks)
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = list(breaks = median)
    ) -> warps.wide

warps.wide
# A tibble: 3 x 3
  tension     A     B
  <fct>   <dbl> <dbl>
1 L          51    29
2 M          21    28
3 H          24    17