Tidy Data (1/2): Wide to Long with pivot_longer()

@aniruhil
Updated on 2020-12-18

Why tidy your data?

My goal here is to introduce you to some basic tidy operations that are often necessary when working with ‘untidy’ data. What are tidy data? Well, tidy data have the following characteristics

What then are untidy data? Here are two examples:

library(tidyverse)
library(DT)
datatable(relig_income, caption = "Untidy Data Example #1")
datatable(billboard[, c(1:11)], caption = "Untidy Data Example #2")

In Example 1, why is each income-level a column? Would it not make more sense to have the data structured as follows:

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "frequency")
# A tibble: 180 x 3
   religion income             frequency
   <chr>    <chr>                  <dbl>
 1 Agnostic <$10k                     27
 2 Agnostic $10-20k                   34
 3 Agnostic $20-30k                   60
 4 Agnostic $30-40k                   81
 5 Agnostic $40-50k                   76
 6 Agnostic $50-75k                  137
 7 Agnostic $75-100k                 122
 8 Agnostic $100-150k                109
 9 Agnostic >150k                     84
10 Agnostic Don't know/refused        96
# … with 170 more rows

Note that now each column is a unique variable.

What about the billboard data? Same thing; why not have the weeks in rows? After all they are measuring the same thing – measurement at specific time intervals – aren’t they?

billboard %>% 
  pivot_longer(
    wk1:wk76, 
    names_to = "week", 
    values_to = "rank"
    )
# A tibble: 24,092 x 5
   artist track                   date.entered week   rank
   <chr>  <chr>                   <date>       <chr> <dbl>
 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
 9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
# … with 24,082 more rows

Both these examples had data in the wide format that were tidied up by converting the data into the long format.

The World Health Organization’s Influenza Database

Here is a small example of untidy data from the WHO, grabbed as available on February 6, 2020. Data are in the file titled Rpt_LabSurveillanceDataLatestWeekByCtry.csv. Make sure you place this csv file in the data sub-folder and then read it in as shown below. If you have your project setup correctly, you should be all set. If you missed our first meeting or forgot to create a project, follow the steps outlined below:

  1. Create a directory and call it athensR
  2. Inside this directory create a sub-folder called data
  3. Launch RStudio and then use File -> New Project..., choosing Existing Directory to find your athensr folder.

RStudio will restart and you should see a project file called athensR.Rproj. Next time you want to work on this material, find this file and double-click; RStudio will start in your athensR folder.

All data sent to you should be saved in the data sub-folder.

I am loading up some packages we will need –

library(here)
library(janitor)
library(tidylog)

read_csv(
  here("workshops/athensr/handouts/data",
       "Rpt_LabSurveillanceDataLatestWeekByCtry.csv"),
  skip = 3
  )  -> flu

Notice an important switch used here: skip = 3. This is being used because the csv file has some header information that occupies the first three rows. If you open the csv file in MS Excel you will see this problem. But the information in these initial rows is not needed and so is skipped when reading in the data. The result? See below:

datatable(flu, caption = "The WHO Data")

Here are the column names and their meaning:

Name Meaning
SP_RECEIVED Number of specimens received/collected
SP_PROCESSED Number of specimens processed
AH1N12009 Number of A(H1N1)pdm09 influenza viruses detected
AH3 Number of A(H3) influenza viruses detected
AH5 Number of A(H5) influenza viruses detected
ANOTSUBTYPED Number of A(not subtyped) influenza viruses detected
BYAMAGATA Number of influenza B(yamagata lineage) viruses detected
BVICTORIA Number of influenza B(victoria lineage) viruses detected
BNOTDETERMINED Number of influenza B(lineage not determined) viruses detected
INF_B Total Number of influenza B viruses detected
INF_TOTAL Total Number of influenza positive viruses
INF_TOTAL2 Total Number of influenza negative viruses
Title ILI Activity

One way of tidying these data would be to create three columns, one that lists, for each country, the type of specimen (received versus processed), the second listing for each country the type of A virus detected, and the third listing for each country the type of B virus detected.

We can do this the ‘long’ way (no pun intended), by first working on just the type of specimen. In the code below I am selecting the columns I would like retained, and this is being done with the select() command. In pivot_longer() I am specifying that the Country column should not be pivoted from wide to long by prefacing the column name with -, as in -Country.

flu %>%
  select(Country, SP_RECEIVED, SP_PROCESSED) %>%
  pivot_longer(
    -Country, 
    names_to = c("Specimen Status"),
    values_to = "Number"
    ) -> flu_long_1

datatable(flu_long_1)

Now for influenza virus type A.

flu %>%
  select(Country, AH1N12009:INF_A) %>%
  pivot_longer(
    -Country, 
    names_to = c("Virus Type A"),
    values_to = "Number"
    ) -> flu_long_2

datatable(flu_long_2)

… and now for influenza virus type B.

flu %>%
  select(Country, BYAMAGATA:INF_B) %>%
  pivot_longer(
    -Country, 
    names_to = c("Virus Type B"),
    values_to = "Number"
    ) -> flu_long_3

datatable(flu_long_3)

What if we wanted all columns to be long in one go?

flu %>%
  select(Country:INF_B) %>%
  pivot_longer(
    -Country, 
    names_to = c("Indicator"),
    values_to = "Number"
    ) -> flu_long_4

datatable(flu_long_4)

Some more functions to lean on …

pivot_longer() has several other options that can be handy if needed; here they are:

  pivot_longer(
  data, cols, names_to = "name", names_prefix = NULL,
  names_sep = NULL, names_pattern = NULL, names_ptypes = list(),
  names_repair = "check_unique", values_to = "value",
  values_drop_na = FALSE, values_ptypes = list()
  )

names_prefix()

I’d like to walk through each of these in turn. Let us start with names_prefix = in the context of the following data-set that shows some general reasons people were admitted to hospital by financial year from July 1993 to June 1998 ([Source:]).

read_csv(
  "http://www.mm-c.me/mdsi/hospitals93to98.csv"
  ) -> hosp

datatable(hosp)
hosp %>%
  pivot_longer(
    -c(IcdChapter, Field),
    names_to = "Fiscal Year",
    names_prefix = "FY",
    values_to = "value"
    ) -> hosp2

datatable(hosp2, caption = "Long format of hosp dataframe")

What if there are deferentially named columns, as in the following example?

mydf <- tibble(
    name = c("Jack", "Jill"),
    sex = c("Male", "Female"),
    test_pre = c(3.21, 3.85),
    test_post = c(3.82, 3.97)
    )

mydf
# A tibble: 2 x 4
  name  sex    test_pre test_post
  <chr> <chr>     <dbl>     <dbl>
1 Jack  Male       3.21      3.82
2 Jill  Female     3.85      3.97

The goal is to move test_pre and test_post to rows, holding all other columns fixed.

mydf %>%
  pivot_longer(
    -c(1:2),
    names_prefix = "test_",
    names_to = "pre_post",
    values_to = "score"
  ) -> mydf.long

mydf.long
# A tibble: 4 x 4
  name  sex    pre_post score
  <chr> <chr>  <chr>    <dbl>
1 Jack  Male   pre       3.21
2 Jack  Male   post      3.82
3 Jill  Female pre       3.85
4 Jill  Female post      3.97

WHO’s Tuberculosis Data and names_pattern()

There is a more complicated example that involves the use of regular expressions (i.e., regex). See the following data that comes from the WHO. You should access the data dictionary here and the most recently – as of February 8, 2020 – available dataset here. Let us load the data first and then review the data dictionary.

read_csv(
  "https://extranet.who.int/tme/generateCSV.asp?ds=notifications"
  ) -> tb_data
read_csv(
  "https://extranet.who.int/tme/generateCSV.asp?ds=dictionary"
  ) -> tb_dictionary

datatable(tb_dictionary, caption = "Data Dictionary for WHO's Tuberculosis Case Notifications Data")

The data dictionary has more detail than we need so it will help to focus on a few variables, say new_sp_m04 through new_sp_fu, keeping the country and year columns of course. I will also select just one year (= 2010) to make things easier to follow.

tb_data %>%
  filter(year == 2010) %>%
  select(
    country, year, contains("new_sp_")
    ) -> tb_sp

glimpse(tb_sp)
Rows: 214
Columns: 22
$ country      <chr> "Afghanistan", "Albania", "Algeria", "American…
$ year         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010…
$ new_sp_m04   <dbl> 4, 0, NA, 0, 0, NA, 0, 0, 13, 0, NA, 0, 0, 0, …
$ new_sp_m514  <dbl> 193, 0, NA, 0, 0, NA, 0, 0, 43, 0, NA, 2, 0, 0…
$ new_sp_m014  <dbl> 197, 0, 52, 0, 0, 448, 0, 0, 56, 0, NA, 2, 0, …
$ new_sp_m1524 <dbl> 986, 28, 1203, 0, 0, 2900, 0, 0, 536, 36, NA, …
$ new_sp_m2534 <dbl> 819, 17, 1669, 0, 0, 3584, 0, 2, 491, 75, NA, …
$ new_sp_m3544 <dbl> 491, 14, 825, 0, 0, 2415, 0, 0, 309, 49, NA, 2…
$ new_sp_m4554 <dbl> 490, 16, 513, 0, 0, 1424, 0, 2, 302, 68, NA, 2…
$ new_sp_m5564 <dbl> 641, 16, 392, 0, 0, 691, 0, 1, 340, 27, NA, 9,…
$ new_sp_m65   <dbl> 622, 15, 397, 0, 0, 355, 1, 0, 282, 15, NA, 27…
$ new_sp_mu    <dbl> 0, 0, NA, 0, 0, NA, NA, 0, 2, 0, NA, 0, 0, NA,…
$ new_sp_f04   <dbl> 16, 1, NA, 0, 0, NA, 0, 0, 7, 0, NA, 1, 0, 0, …
$ new_sp_f514  <dbl> 429, 1, NA, 0, 0, NA, 0, 0, 52, 1, NA, 3, 1, 3…
$ new_sp_f014  <dbl> 445, 2, 79, 0, 0, 558, 0, 0, 59, 1, NA, 4, 1, …
$ new_sp_f1524 <dbl> 2107, 11, 1086, 0, 0, 2763, 0, 0, 421, 24, NA,…
$ new_sp_f2534 <dbl> 2263, 7, 826, 0, 0, 2594, 0, 1, 426, 17, NA, 4…
$ new_sp_f3544 <dbl> 1455, 6, 417, 0, 0, 1688, 0, 0, 233, 4, NA, 12…
$ new_sp_f4554 <dbl> 1112, 3, 251, 0, 0, 958, 0, 0, 184, 7, NA, 2, …
$ new_sp_f5564 <dbl> 831, 2, 222, 0, 0, 482, 0, 0, 153, 8, NA, 5, 5…
$ new_sp_f65   <dbl> 488, 8, 367, 0, 0, 286, 0, 0, 176, 8, NA, 12, …
$ new_sp_fu    <dbl> 0, 0, NA, 0, 0, NA, NA, 0, 1, 0, NA, 0, 0, NA,…

The new_sp_* variable names provide information for males and females of specific age-groups. For example, new_sp_m04 provides the number of new cases of males in the 0-4 year age-group testing positive on the pulmonary smear test. Similarly, new_sp_f04 is for females 0-4 years of age. Let us tidy these data.

tb_sp %>%
  pivot_longer(
    cols = new_sp_m04:new_sp_fu,
    names_to = c("test_type", "sex", "age_group"),
    names_pattern = "(new_sp_)(.)(.*)",
    values_to = "frequency"
  ) -> tb_sp.long

datatable(tb_sp.long[1:100, ]) 

Focus on names_to = c("test_type", "sex", "age_group") … this is specifying the three name the three new columns should have.

In turn, names_pattern = "(new_sp_)(.)(.*)" is specifying that the existing columns should be broken up into three pieces as follows:

The names_pattern() will be tricky to decipher without good working knowledge of regular expressions (regex). Say we did not know regex. What could we do?

tb_sp %>%
  group_by(country, year) %>%
  gather(test_type_sex_age_group, frequency, 3:22,
         convert = TRUE) -> tab_sp_long_old

datatable(tab_sp_long_old)

This will still pivot the columns so that the data are in the long format, but we still need to split the test_type_sex_age_group column into the three pieces of information it encapsulates. That is the subject of our next encounter with tidyr – understanding how separate() and unite() work on data columns.