Tidy Data (1/2): Wide to Long with pivot_longer()

Author

Affiliation

Published

Dec. 18, 2020

DOI

Why tidy your data?

My goal here is to introduce some basic tidying operations that are often necessary when working with ‘untidy’ data. What are tidy data? Tidy data have the following characteristics:

Each variable must be in its own column.
Each observation must be in its own row.
Each value must be in its own cell.

What then are untidy data? Here are two examples:

library(tidyverse)
library(DT)
datatable(relig_income, caption = "Untidy Data Example #1")

Untidy Data Example #1
	religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k	Don't know/refused
1	Agnostic	27	34	60	81	76	137	122	109	84	96
2	Atheist	12	27	37	52	35	70	73	59	74	76
3	Buddhist	27	21	30	34	33	58	62	39	53	54
4	Catholic	418	617	732	670	638	1116	949	792	633	1489
5	Don’t know/refused	15	14	15	11	10	35	21	17	18	116
6	Evangelical Prot	575	869	1064	982	881	1486	949	723	414	1529
7	Hindu	1	9	7	9	11	34	47	48	54	37
8	Historically Black Prot	228	244	236	238	197	223	131	81	78	339
9	Jehovah's Witness	20	27	24	24	21	30	15	11	6	37
10	Jewish	19	19	25	25	30	95	69	87	151	162

Showing 1 to 10 of 18 entries

datatable(billboard[, c(1:11)], caption = "Untidy Data Example #2")

Untidy Data Example #2
	artist	track	date.entered	wk1	wk2	wk3	wk4	wk5	wk6	wk7	wk8
1	2 Pac	Baby Don't Cry (Keep...	2000-02-26	87	82	72	77	87	94	99
2	2Ge+her	The Hardest Part Of ...	2000-09-02	91	87	92
3	3 Doors Down	Kryptonite	2000-04-08	81	70	68	67	66	57	54	53
4	3 Doors Down	Loser	2000-10-21	76	76	72	69	67	65	55	59
5	504 Boyz	Wobble Wobble	2000-04-15	57	34	25	17	17	31	36	49
6	98^0	Give Me Just One Nig...	2000-08-19	51	39	34	26	26	19	2	2
7	A*Teens	Dancing Queen	2000-07-08	97	97	96	95	100
8	Aaliyah	I Don't Wanna	2000-01-29	84	62	51	41	38	35	35	38
9	Aaliyah	Try Again	2000-03-18	59	53	38	28	21	18	16	14
10	Adams, Yolanda	Open My Heart	2000-08-26	76	76	74	69	68	67	61	58

Showing 1 to 10 of 317 entries

In Example 1, why is each income level a column? Income is a single variable, so would it not make more sense to have the data structured as follows?

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "frequency")

# A tibble: 180 x 3
   religion income             frequency
   <chr>    <chr>                  <dbl>
 1 Agnostic <$10k                     27
 2 Agnostic $10-20k                   34
 3 Agnostic $20-30k                   60
 4 Agnostic $30-40k                   81
 5 Agnostic $40-50k                   76
 6 Agnostic $50-75k                  137
 7 Agnostic $75-100k                 122
 8 Agnostic $100-150k                109
 9 Agnostic >150k                     84
10 Agnostic Don't know/refused        96
# … with 170 more rows

Note that each column now holds exactly one variable: religion, income bracket, and frequency.
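
One practical payoff of the long layout is that ordinary grouped summaries work directly on the income column. The sketch below is only an illustration; the names relig_long, income_totals and total are my own choices and do not appear elsewhere in this handout.

```r
library(tidyr)
library(dplyr)

# relig_income ships with tidyr; pivot it exactly as in the text
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "frequency") -> relig_long

# Respondents per income bracket, summed across all religions
relig_long %>%
  group_by(income) %>%
  summarise(total = sum(frequency)) -> income_totals

income_totals
```

In the wide layout the same summary would require summing each of the ten income columns separately.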

What about the billboard data? Same thing: why not have the weeks in rows? After all, the wk columns all measure the same thing, a track’s chart rank at successive weekly intervals.

billboard %>% 
  pivot_longer(
    wk1:wk76, 
    names_to = "week", 
    values_to = "rank"
    )

# A tibble: 24,092 x 5
   artist track                   date.entered week   rank
   <chr>  <chr>                   <date>       <chr> <dbl>
 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
 9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
# … with 24,082 more rows

Both these examples had data in the wide format that were tidied up by converting the data into the long format.

The World Health Organization’s Influenza Database

Here is a small example of untidy data from the WHO, downloaded on February 6, 2020. The data are in the file titled Rpt_LabSurveillanceDataLatestWeekByCtry.csv. Place this csv file in the data sub-folder and then read it in as shown below. If your project is set up correctly, you should be all set. If you missed our first meeting or forgot to create a project, follow the steps outlined below:

Create a directory and call it athensR

Inside this directory create a sub-folder called data

Launch RStudio and then use File -> New Project..., choosing Existing Directory to find your athensR folder.

RStudio will restart and you should see a project file called athensR.Rproj. Next time you want to work on this material, find this file and double-click; RStudio will start in your athensR folder.

All data sent to you should be saved in the data sub-folder.

I am loading some packages we will need:

here will help manage data and other files so that you do not waste time solving file-path problems
janitor will help clean up our data-set and is a package we will use often
tidylog will report what almost every tidyverse command did to the data

library(here)
library(janitor)
library(tidylog)

read_csv(
  here("workshops/athensr/handouts/data",
       "Rpt_LabSurveillanceDataLatestWeekByCtry.csv"),
  skip = 3
  )  -> flu

Notice an important switch used here: skip = 3. This is being used because the csv file has some header information that occupies the first three rows. If you open the csv file in MS Excel you will see this problem. But the information in these initial rows is not needed and so is skipped when reading in the data. The result? See below:

datatable(flu, caption = "The WHO Data")

The WHO Data
	Country	SP_RECEIVED	SP_PROCESSED	AH1N12009	AH3	AH5	ANOTSUBTYPED	INF_A	BYAMAGATA	BVICTORIA	BNOTDETERMINED	INF_B	INF_TOTAL	INF_TOTAL2	Title
1	Afghanistan	106	106	0	4	0	0	4	1	3	0	4	8	98	No Report
2	Albania		113	6	3		16	25	0	0	7	7	32		No Report
3	Algeria	36	36	7	2	0	0	9	0	6	0	6	15	20	Widespread Outbreak
4	Armenia		102	42	0		0	42	0	11	0	11	53		No Report
5	Aruba		41	35				35	5	0		5	40	1	No Report
6	Australia		284	37	1		3	41			5	5	46	238	Local Outbreak
7	Austria		1650	69	74		478	621	0	0	121	121	742		No Report
8	Azerbaijan		15	0	0		3	3	0	0	1	1	4		No Report
9	Bangladesh	51	51	1	1	0	0	2	0	0	0	0	2	49	Sporadic
10	Belize		9	1				1		0	1	1	2	7	No Report

Showing 1 to 10 of 91 entries
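
To see what skip = 3 does in isolation, here is a minimal sketch that writes a small, made-up csv file with three lines of preamble and reads it back. The file contents and the object names tmp and demo are hypothetical; only the column names and the two data rows are borrowed from the WHO table above.

```r
library(readr)

# A hypothetical file: three preamble lines sit above the real header row
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Report title",
    "Report subtitle",
    "Report date",
    "Country,SP_RECEIVED,SP_PROCESSED",
    "Afghanistan,106,106",
    "Algeria,36,36"),
  tmp
)

# skip = 3 discards the preamble, so line 4 is used as the header
read_csv(tmp, skip = 3) -> demo

names(demo)
```

Without skip = 3, the first preamble line would be treated as the header and the column names would be wrong.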

Here are the column names and their meaning:

Name	Meaning
SP_RECEIVED	Number of specimens received/collected
SP_PROCESSED	Number of specimens processed
AH1N12009	Number of A(H1N1)pdm09 influenza viruses detected
AH3	Number of A(H3) influenza viruses detected
AH5	Number of A(H5) influenza viruses detected
ANOTSUBTYPED	Number of A(not subtyped) influenza viruses detected
BYAMAGATA	Number of influenza B(yamagata lineage) viruses detected
BVICTORIA	Number of influenza B(victoria lineage) viruses detected
BNOTDETERMINED	Number of influenza B(lineage not determined) viruses detected
INF_B	Total Number of influenza B viruses detected
INF_TOTAL	Total Number of influenza positive viruses
INF_TOTAL2	Total Number of influenza negative viruses
Title	ILI Activity

One way of tidying these data would be to create three columns, one that lists, for each country, the type of specimen (received versus processed), the second listing for each country the type of A virus detected, and the third listing for each country the type of B virus detected.

We can do this the ‘long’ way (no pun intended), by first working on just the type of specimen. In the code below I am selecting the columns I would like retained, and this is being done with the select() command. In pivot_longer() I am specifying that the Country column should not be pivoted from wide to long by prefacing the column name with -, as in -Country.

flu %>%
  select(Country, SP_RECEIVED, SP_PROCESSED) %>%
  pivot_longer(
    -Country, 
    names_to = c("Specimen Status"),
    values_to = "Number"
    ) -> flu_long_1

datatable(flu_long_1)

	Country	Specimen Status	Number
1	Afghanistan	SP_RECEIVED	106
2	Afghanistan	SP_PROCESSED	106
3	Albania	SP_RECEIVED
4	Albania	SP_PROCESSED	113
5	Algeria	SP_RECEIVED	36
6	Algeria	SP_PROCESSED	36
7	Armenia	SP_RECEIVED
8	Armenia	SP_PROCESSED	102
9	Aruba	SP_RECEIVED
10	Aruba	SP_PROCESSED	41

Showing 1 to 10 of 182 entries
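
The -Country shorthand is one of two equivalent ways to say which columns get pivoted. As a small sketch, using two rows copied from the WHO table and object names (flu_demo, by_exclusion, by_inclusion) that are my own, you can either exclude the identifier column or list the columns to pivot:

```r
library(tidyr)
library(dplyr)

# Two rows copied from the WHO table shown earlier
flu_demo <- tibble(
  Country = c("Afghanistan", "Algeria"),
  SP_RECEIVED = c(106, 36),
  SP_PROCESSED = c(106, 36)
)

# Option 1: pivot everything except Country
flu_demo %>%
  pivot_longer(
    -Country,
    names_to = "Specimen Status",
    values_to = "Number"
  ) -> by_exclusion

# Option 2: name the columns to pivot explicitly
flu_demo %>%
  pivot_longer(
    c(SP_RECEIVED, SP_PROCESSED),
    names_to = "Specimen Status",
    values_to = "Number"
  ) -> by_inclusion

# Both routes should give the same long table
all.equal(by_exclusion, by_inclusion)
```

Excluding is usually shorter when only one or two columns stay fixed; listing is safer when the data may gain extra columns later.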

Now for influenza virus type A.

flu %>%
  select(Country, AH1N12009:INF_A) %>%
  pivot_longer(
    -Country, 
    names_to = c("Virus Type A"),
    values_to = "Number"
    ) -> flu_long_2

datatable(flu_long_2)

	Country	Virus Type A	Number
1	Afghanistan	AH1N12009	0
2	Afghanistan	AH3	4
3	Afghanistan	AH5	0
4	Afghanistan	ANOTSUBTYPED	0
5	Afghanistan	INF_A	4
6	Albania	AH1N12009	6
7	Albania	AH3	3
8	Albania	AH5
9	Albania	ANOTSUBTYPED	16
10	Albania	INF_A	25

Showing 1 to 10 of 455 entries

… and now for influenza virus type B.

flu %>%
  select(Country, BYAMAGATA:INF_B) %>%
  pivot_longer(
    -Country, 
    names_to = c("Virus Type B"),
    values_to = "Number"
    ) -> flu_long_3

datatable(flu_long_3)

	Country	Virus Type B	Number
1	Afghanistan	BYAMAGATA	1
2	Afghanistan	BVICTORIA	3
3	Afghanistan	BNOTDETERMINED	0
4	Afghanistan	INF_B	4
5	Albania	BYAMAGATA	0
6	Albania	BVICTORIA	0
7	Albania	BNOTDETERMINED	7
8	Albania	INF_B	7
9	Algeria	BYAMAGATA	0
10	Algeria	BVICTORIA	6

Showing 1 to 10 of 364 entries

What if we wanted all columns to be long in one go?

flu %>%
  select(Country:INF_B) %>%
  pivot_longer(
    -Country, 
    names_to = c("Indicator"),
    values_to = "Number"
    ) -> flu_long_4

datatable(flu_long_4)

	Country	Indicator	Number
1	Afghanistan	SP_RECEIVED	106
2	Afghanistan	SP_PROCESSED	106
3	Afghanistan	AH1N12009	0
4	Afghanistan	AH3	4
5	Afghanistan	AH5	0
6	Afghanistan	ANOTSUBTYPED	0
7	Afghanistan	INF_A	4
8	Afghanistan	BYAMAGATA	1
9	Afghanistan	BVICTORIA	3
10	Afghanistan	BNOTDETERMINED	0

Showing 1 to 10 of 1,001 entries

Some more functions to lean on …

pivot_longer() has several other options that can be handy if needed. Here is the full signature as it stood when this was written; later tidyr releases add further arguments:

  pivot_longer(
  data, cols, names_to = "name", names_prefix = NULL,
  names_sep = NULL, names_pattern = NULL, names_ptypes = list(),
  names_repair = "check_unique", values_to = "value",
  values_drop_na = FALSE, values_ptypes = list()
  )
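
As one hedged illustration of these options, values_drop_na = TRUE removes rows whose value is missing. Applied to the billboard data from earlier, it drops the week and track combinations that have no chart rank, so the result is far shorter than the 24,092 rows shown above. The name billboard_long is my own.

```r
library(tidyr)
library(dplyr)

billboard %>%
  pivot_longer(
    wk1:wk76,
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE   # drop weeks in which a track had no rank
  ) -> billboard_long

# Every remaining row now carries an actual rank
nrow(billboard_long)
anyNA(billboard_long$rank)
```

Dropping these rows is reasonable here because the NAs only mean the track was no longer on the chart; when a missing value is itself informative, leave values_drop_na at its default of FALSE.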

The names_prefix argument

Let us walk through two of these, starting with names_prefix, which strips a leading piece of text (treated as a regular expression) from the column names before they become values. Consider it in the context of the following data-set that shows some general reasons people were admitted to hospital by financial year from July 1993 to June 1998.

read_csv(
  "http://www.mm-c.me/mdsi/hospitals93to98.csv"
  ) -> hosp

datatable(hosp)

	IcdChapter	Field	FY1993	FY1994	FY1995	FY1996	FY1997	FY1998
1	0. Not Reported	PatientDays	257965	55582	128507	182226	61599	685879
2	0. Not Reported	Separations	37178	6146	3832	4861	1558	53575
3	1. Infectious and Parasitic Diseases	PatientDays	311221	313386	324693	311560	306688	1567548
4	1. Infectious and Parasitic Diseases	Separations	75857	78323	84631	80864	79148	398823
5	2. Neoplasms	PatientDays	1686919	1707437	1795751	1770559	1777452	8738118
6	2. Neoplasms	Separations	301928	336447	348905	360578	378070	1725928
7	3. Endocrine Nutritional, and Metabolic Diseases and Immunity Disorders	PatientDays	328354	326877	349671	351119	354723	1710744
8	3. Endocrine Nutritional, and Metabolic Diseases and Immunity Disorders	Separations	50365	54292	60655	65483	68605	299400
9	4. Diseases of the Blood and Blood?Forming Organs	PatientDays	142332	147120	156280	163412	166802	775946
10	4. Diseases of the Blood and Blood?Forming Organs	Separations	46969	50769	56758	62771	67672	284939

Showing 1 to 10 of 38 entries

hosp %>%
  pivot_longer(
    -c(IcdChapter, Field),
    names_to = "Fiscal Year",
    names_prefix = "FY",
    values_to = "value"
    ) -> hosp2

datatable(hosp2, caption = "Long format of hosp dataframe")

Long format of hosp dataframe
	IcdChapter	Field	Fiscal Year	value
1	0. Not Reported	PatientDays	1993	257965
2	0. Not Reported	PatientDays	1994	55582
3	0. Not Reported	PatientDays	1995	128507
4	0. Not Reported	PatientDays	1996	182226
5	0. Not Reported	PatientDays	1997	61599
6	0. Not Reported	PatientDays	1998	685879
7	0. Not Reported	Separations	1993	37178
8	0. Not Reported	Separations	1994	6146
9	0. Not Reported	Separations	1995	3832
10	0. Not Reported	Separations	1996	4861

Showing 1 to 10 of 228 entries
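
One thing to keep in mind: stripping the FY prefix leaves the years as character strings, because column names are always text. If you want to sort or plot by year you will probably want a number. A minimal sketch, using the first row of the hosp table with only three of its year columns (hosp_demo and hosp_demo_long are hypothetical names):

```r
library(tidyr)
library(dplyr)

# First row of the hosp table shown earlier, first three years only
hosp_demo <- tibble(
  IcdChapter = "0. Not Reported",
  Field = "PatientDays",
  FY1993 = 257965,
  FY1994 = 55582,
  FY1995 = 128507
)

hosp_demo %>%
  pivot_longer(
    -c(IcdChapter, Field),
    names_to = "Fiscal Year",
    names_prefix = "FY",
    values_to = "value"
  ) %>%
  # the stripped names are still text; convert them to integers
  mutate(`Fiscal Year` = as.integer(`Fiscal Year`)) -> hosp_demo_long

hosp_demo_long
```

The backticks are needed only because the column name contains a space.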

What if there are differently named columns, as in the following example?

mydf <- tibble(
    name = c("Jack", "Jill"),
    sex = c("Male", "Female"),
    test_pre = c(3.21, 3.85),
    test_post = c(3.82, 3.97)
    )

mydf

# A tibble: 2 x 4
  name  sex    test_pre test_post
  <chr> <chr>     <dbl>     <dbl>
1 Jack  Male       3.21      3.82
2 Jill  Female     3.85      3.97

The goal is to move test_pre and test_post to rows, holding all other columns fixed.

mydf %>%
  pivot_longer(
    -c(1:2),
    names_prefix = "test_",
    names_to = "pre_post",
    values_to = "score"
  ) -> mydf.long

mydf.long

# A tibble: 4 x 4
  name  sex    pre_post score
  <chr> <chr>  <chr>    <dbl>
1 Jack  Male   pre       3.21
2 Jack  Male   post      3.82
3 Jill  Female pre       3.85
4 Jill  Female post      3.97
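
If you would rather keep the test part of the name than throw it away, names_sep (also listed in the signature above) is an alternative worth knowing. This is a sketch rather than a recommendation; the column name measure and the object mydf_sep are my own inventions.

```r
library(tidyr)
library(dplyr)

mydf <- tibble(
    name = c("Jack", "Jill"),
    sex = c("Male", "Female"),
    test_pre = c(3.21, 3.85),
    test_post = c(3.82, 3.97)
    )

mydf %>%
  pivot_longer(
    -c(name, sex),
    names_to = c("measure", "pre_post"),  # one new column per piece
    names_sep = "_",                      # split each name at the underscore
    values_to = "score"
  ) -> mydf_sep

mydf_sep
```

Here test_pre becomes measure = "test" and pre_post = "pre". This only works cleanly when every pivoted name splits into the same number of pieces.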

WHO’s Tuberculosis Data and the names_pattern argument

There is a more complicated example that involves the use of regular expressions (i.e., regex). It uses the WHO’s tuberculosis case notifications data. The WHO publishes both a data dictionary and the dataset itself; the version used here was the most recent one available as of February 8, 2020. Let us load the data first and then review the data dictionary.

read_csv(
  "https://extranet.who.int/tme/generateCSV.asp?ds=notifications"
  ) -> tb_data

read_csv(
  "https://extranet.who.int/tme/generateCSV.asp?ds=dictionary"
  ) -> tb_dictionary

datatable(tb_dictionary, caption = "Data Dictionary for WHO's Tuberculosis Case Notifications Data")

Data Dictionary for WHO's Tuberculosis Case Notifications Data
	variable_name	dataset	definition
1	budget_cpp_dstb	Budget	Average cost of drugs budgeted per patient for drug-susceptible TB treatment, excluding buffer stock (US Dollars)
2	budget_cpp_mdr	Budget	Average cost of drugs budgeted per patient for MDR-TB treatment, excluding buffer stock (US Dollars)
3	budget_cpp_tpt	Budget	Average cost of drugs budgeted per patient for TB preventive treatment, excluding buffer stock (US Dollars)
4	budget_cpp_xdr	Budget	Average cost of drugs budgeted per patient for XDR-TB treatment, excluding buffer stock (US Dollars)
5	budget_fld	Budget	Budget required for drugs to treat drug-susceptible TB (US Dollars)
6	budget_lab	Budget	Budget required for laboratory infrastructure, equipment and supplies (US Dollars)
7	budget_mdrmgt	Budget	Budget required for programme costs to treat drug-resistant TB (US Dollars)
8	budget_orsrvy	Budget	Budget required for operational research and surveys (US Dollars)
9	budget_oth	Budget	Budget required for all other budget line items (US Dollars)
10	budget_patsup	Budget	Budget required for patient support (US Dollars)

Showing 1 to 10 of 532 entries

The data dictionary has more detail than we need so it will help to focus on a few variables, say new_sp_m04 through new_sp_fu, keeping the country and year columns of course. I will also select just one year (= 2010) to make things easier to follow.

tb_data %>%
  filter(year == 2010) %>%
  select(
    country, year, contains("new_sp_")
    ) -> tb_sp

glimpse(tb_sp)

Rows: 214
Columns: 22
$ country      <chr> "Afghanistan", "Albania", "Algeria", "American…
$ year         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010…
$ new_sp_m04   <dbl> 4, 0, NA, 0, 0, NA, 0, 0, 13, 0, NA, 0, 0, 0, …
$ new_sp_m514  <dbl> 193, 0, NA, 0, 0, NA, 0, 0, 43, 0, NA, 2, 0, 0…
$ new_sp_m014  <dbl> 197, 0, 52, 0, 0, 448, 0, 0, 56, 0, NA, 2, 0, …
$ new_sp_m1524 <dbl> 986, 28, 1203, 0, 0, 2900, 0, 0, 536, 36, NA, …
$ new_sp_m2534 <dbl> 819, 17, 1669, 0, 0, 3584, 0, 2, 491, 75, NA, …
$ new_sp_m3544 <dbl> 491, 14, 825, 0, 0, 2415, 0, 0, 309, 49, NA, 2…
$ new_sp_m4554 <dbl> 490, 16, 513, 0, 0, 1424, 0, 2, 302, 68, NA, 2…
$ new_sp_m5564 <dbl> 641, 16, 392, 0, 0, 691, 0, 1, 340, 27, NA, 9,…
$ new_sp_m65   <dbl> 622, 15, 397, 0, 0, 355, 1, 0, 282, 15, NA, 27…
$ new_sp_mu    <dbl> 0, 0, NA, 0, 0, NA, NA, 0, 2, 0, NA, 0, 0, NA,…
$ new_sp_f04   <dbl> 16, 1, NA, 0, 0, NA, 0, 0, 7, 0, NA, 1, 0, 0, …
$ new_sp_f514  <dbl> 429, 1, NA, 0, 0, NA, 0, 0, 52, 1, NA, 3, 1, 3…
$ new_sp_f014  <dbl> 445, 2, 79, 0, 0, 558, 0, 0, 59, 1, NA, 4, 1, …
$ new_sp_f1524 <dbl> 2107, 11, 1086, 0, 0, 2763, 0, 0, 421, 24, NA,…
$ new_sp_f2534 <dbl> 2263, 7, 826, 0, 0, 2594, 0, 1, 426, 17, NA, 4…
$ new_sp_f3544 <dbl> 1455, 6, 417, 0, 0, 1688, 0, 0, 233, 4, NA, 12…
$ new_sp_f4554 <dbl> 1112, 3, 251, 0, 0, 958, 0, 0, 184, 7, NA, 2, …
$ new_sp_f5564 <dbl> 831, 2, 222, 0, 0, 482, 0, 0, 153, 8, NA, 5, 5…
$ new_sp_f65   <dbl> 488, 8, 367, 0, 0, 286, 0, 0, 176, 8, NA, 12, …
$ new_sp_fu    <dbl> 0, 0, NA, 0, 0, NA, NA, 0, 1, 0, NA, 0, 0, NA,…

The new_sp_* variable names encode both sex and age-group. For example, new_sp_m04 gives the number of new cases among males in the 0-4 year age-group who tested positive on the pulmonary smear test. Similarly, new_sp_f04 is for females 0-4 years of age. Let us tidy these data.

tb_sp %>%
  pivot_longer(
    cols = new_sp_m04:new_sp_fu,
    names_to = c("test_type", "sex", "age_group"),
    names_pattern = "(new_sp_)(.)(.*)",
    values_to = "frequency"
  ) -> tb_sp.long

datatable(tb_sp.long[1:100, ])

	country	year	test_type	sex	age_group	frequency
1	Afghanistan	2010	new_sp_	m	04	4
2	Afghanistan	2010	new_sp_	m	514	193
3	Afghanistan	2010	new_sp_	m	014	197
4	Afghanistan	2010	new_sp_	m	1524	986
5	Afghanistan	2010	new_sp_	m	2534	819
6	Afghanistan	2010	new_sp_	m	3544	491
7	Afghanistan	2010	new_sp_	m	4554	490
8	Afghanistan	2010	new_sp_	m	5564	641
9	Afghanistan	2010	new_sp_	m	65	622
10	Afghanistan	2010	new_sp_	m	u	0

Showing 1 to 10 of 100 entries

Focus on names_to = c("test_type", "sex", "age_group"): this specifies the names the three new columns should have.

In turn, names_pattern = "(new_sp_)(.)(.*)" specifies that each existing column name should be broken up into three pieces, one per parenthesized group, as follows:

(new_sp_) captures the literal prefix and fills the first new column, test_type
(.) captures the single character that follows and fills the second new column, sex
(.*) captures everything that remains and fills the third new column, age_group
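
If the pattern still feels opaque, it can help to try it on a few column names outside of pivot_longer(). The sketch below uses only base R; the objects cols, pieces and parts are hypothetical names of mine.

```r
# Three of the column names from tb_sp
cols <- c("new_sp_m04", "new_sp_f1524", "new_sp_fu")

# regexec() finds the whole match plus each parenthesized group;
# regmatches() pulls out the matching text
pieces <- regmatches(cols, regexec("(new_sp_)(.)(.*)", cols))

# One row per column name: whole match, then the three groups
parts <- do.call(rbind, pieces)
parts
```

The first column of parts is the whole match; the remaining three are what pivot_longer() would place in test_type, sex and age_group.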

The names_pattern argument will be tricky to decipher without a good working knowledge of regular expressions (regex). Say we did not know regex. What could we do?

tb_sp %>%
  group_by(country, year) %>%
  gather(test_type_sex_age_group, frequency, 3:22,
         convert = TRUE) -> tab_sp_long_old

datatable(tab_sp_long_old)

	country	year	test_type_sex_age_group	frequency
1	Afghanistan	2010	new_sp_m04	4
2	Albania	2010	new_sp_m04	0
3	Algeria	2010	new_sp_m04
4	American Samoa	2010	new_sp_m04	0
5	Andorra	2010	new_sp_m04	0
6	Angola	2010	new_sp_m04
7	Anguilla	2010	new_sp_m04	0
8	Antigua and Barbuda	2010	new_sp_m04	0
9	Argentina	2010	new_sp_m04	13
10	Armenia	2010	new_sp_m04	0

Showing 1 to 10 of 4,280 entries

This will still pivot the columns so that the data are in the long format, but we still need to split the test_type_sex_age_group column into the three pieces of information it encapsulates. That is the subject of our next encounter with tidyr – understanding how separate() and unite() work on data columns.
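
The same un-split pivot can also be written with pivot_longer(), which the tidyr documentation recommends over gather() for new code. A minimal sketch on two countries and two columns, with values copied from the glimpse() output above (tb_demo and tb_demo_long are hypothetical names):

```r
library(tidyr)
library(dplyr)

# Two countries, two of the new_sp_ columns
tb_demo <- tibble(
  country = c("Afghanistan", "Albania"),
  year = c(2010, 2010),
  new_sp_m04 = c(4, 0),
  new_sp_f04 = c(16, 1)
)

tb_demo %>%
  pivot_longer(
    starts_with("new_sp_"),
    names_to = "test_type_sex_age_group",  # one combined column, no regex
    values_to = "frequency"
  ) -> tb_demo_long

tb_demo_long
```

One visible difference: pivot_longer() keeps each country's rows together, whereas gather() stacks the output one original column at a time, as in the table above.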
