Tidy Data (2/2): Wide to Long with pivot_longer()

The last example we worked through used a regular expression (regex) to convert data to the long format. Let us redo that exercise.

library(tidyverse)
library(tidylog)
library(DT)

read_csv(
  "https://extranet.who.int/tme/generateCSV.asp?ds=notifications"
  ) %>%
  filter(year == 2010) %>%
  select(
    country, year, contains("new_sp_")
    ) -> tb_sp

tb_sp %>%
  pivot_longer(
    cols = new_sp_m04:new_sp_fu,
    names_to = c("test_type", "sex", "age_group"),
    names_pattern = "(new_sp_)(.)(.*)",
    values_to = "frequency"
  ) -> tb_sp.long

datatable(tb_sp.long[1:100, ])


	country	year	test_type	sex	age_group	frequency
1	Afghanistan	2010	new_sp_	m	04	4
2	Afghanistan	2010	new_sp_	m	514	193
3	Afghanistan	2010	new_sp_	m	014	197
4	Afghanistan	2010	new_sp_	m	1524	986
5	Afghanistan	2010	new_sp_	m	2534	819
6	Afghanistan	2010	new_sp_	m	3544	491
7	Afghanistan	2010	new_sp_	m	4554	490
8	Afghanistan	2010	new_sp_	m	5564	641
9	Afghanistan	2010	new_sp_	m	65	622
10	Afghanistan	2010	new_sp_	m	u	0

Showing 1 to 10 of 100 entries


Focus on names_to = c("test_type", "sex", "age_group"). This specifies the names that the three new columns should have.

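In turn, names_pattern = "(new_sp_)(.)(.*)" specifies that each existing column name should be broken into three pieces, one per capture group: (new_sp_) matches the literal prefix and becomes test_type, (.) matches the single character that follows and becomes sex, and (.*) matches whatever remains and becomes age_group.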
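
To see what the three capture groups pick up, the pattern can be tried on a few column names on their own. This is only a minimal sketch: it assumes the stringr package (loaded with the tidyverse), and cols is a hypothetical sample of names rather than the full set of columns.

```r
library(stringr)

# A hypothetical sample of column names in the same style as the WHO data
cols <- c("new_sp_m04", "new_sp_f1524", "new_sp_fu")

# str_match() returns a character matrix: the full match first,
# then one column per capture group in the pattern
pieces <- str_match(cols, "(new_sp_)(.)(.*)")
colnames(pieces) <- c("column", "test_type", "sex", "age_group")

pieces
```

Each row of pieces lines up with one original column name, which is how pivot_longer() assigns the parts to the columns listed in names_to.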

The names_pattern argument will be tricky to decipher without a good working knowledge of regular expressions (regex). Say we did not know regex. What could we do?

tb_sp %>%
  group_by(country, year) %>%
  gather(test_type_sex_age_group, frequency, 3:22,
         convert = TRUE) -> tab_sp_long_old

datatable(tab_sp_long_old)


	country	year	test_type_sex_age_group	frequency
1	Afghanistan	2010	new_sp_m04	4
2	Albania	2010	new_sp_m04	0
3	Algeria	2010	new_sp_m04
4	American Samoa	2010	new_sp_m04	0
5	Andorra	2010	new_sp_m04	0
6	Angola	2010	new_sp_m04
7	Anguilla	2010	new_sp_m04	0
8	Antigua and Barbuda	2010	new_sp_m04	0
9	Argentina	2010	new_sp_m04	13
10	Armenia	2010	new_sp_m04	0

Showing 1 to 10 of 4,280 entries


This will still pivot the columns so that the data are in the long format, but we still need to split the test_type_sex_age_group column into the three pieces of information it encapsulates. That is the subject of our next encounter with tidyr – understanding how separate() and unite() work on data columns.
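
gather() still works, but the tidyr documentation marks it as superseded in favor of pivot_longer(). As a sketch, the same regex-free reshaping can be written with pivot_longer() by asking for a single names column. The small tibble toy below is made up for illustration and merely stands in for tb_sp.

```r
library(tidyr)
library(tibble)

# Made-up stand-in for tb_sp with just two of the new_sp_ columns
toy <- tibble(
  country = c("A", "B"),
  year = c(2010, 2010),
  new_sp_m04 = c(4, 0),
  new_sp_f04 = c(2, NA)
)

# One names column and no regex: the old column names are kept whole
toy_long <- pivot_longer(
  toy,
  cols = starts_with("new_sp_"),
  names_to = "test_type_sex_age_group",
  values_to = "frequency"
)

toy_long
```

The rows come out in a different order than with gather() (all columns for one country before moving on to the next country), but the contents are the same.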

separate()

I would like to split test_type_sex_age_group into three variables, and separate() can do that with ease. All we need to do is to specify the column that needs to be split, the new column names that should be created, what separates the columns, and a few other bits.

tab_sp_long_old %>%
  separate(
    col = test_type_sex_age_group,
    into = c("test_type", "sex", "age_group"),
    sep = c(7, 8),
    remove = FALSE,
    convert = TRUE
  ) -> tab_sp_split

datatable(tab_sp_split[1:100, ])


	country	year	test_type_sex_age_group	test_type	sex	age_group	frequency
1	Afghanistan	2010	new_sp_m04	new_sp_	m	04	4
2	Albania	2010	new_sp_m04	new_sp_	m	04	0
3	Algeria	2010	new_sp_m04	new_sp_	m	04
4	American Samoa	2010	new_sp_m04	new_sp_	m	04	0
5	Andorra	2010	new_sp_m04	new_sp_	m	04	0
6	Angola	2010	new_sp_m04	new_sp_	m	04
7	Anguilla	2010	new_sp_m04	new_sp_	m	04	0
8	Antigua and Barbuda	2010	new_sp_m04	new_sp_	m	04	0
9	Argentina	2010	new_sp_m04	new_sp_	m	04	13
10	Armenia	2010	new_sp_m04	new_sp_	m	04	0

Showing 1 to 10 of 100 entries


I set remove = FALSE so that the original column was retained for illustration purposes. If I had set remove = TRUE instead, the original column would have been dropped after being split.

convert = TRUE asks separate() to work out whether each new column should be numeric or character. If you set it to FALSE, then columns that ought to be numeric will be retained as character columns.
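
The effect of these two arguments is easiest to see on a tiny example. The sketch below uses a made-up two-row tibble (toy is not part of the WHO data) and compares the column types with and without conversion.

```r
library(tidyr)
library(tibble)

# Made-up data: two column-style keys with all-digit age groups
toy <- tibble(
  key = c("new_sp_m04", "new_sp_f65"),
  frequency = c(4, 622)
)

# remove is left at its default (TRUE), so `key` is dropped;
# convert = FALSE leaves every new column as character
as_chr <- separate(
  toy,
  col = key,
  into = c("test_type", "sex", "age_group"),
  sep = c(7, 8),
  convert = FALSE
)

# convert = TRUE turns the all-digit age_group column into integers
as_num <- separate(
  toy,
  col = key,
  into = c("test_type", "sex", "age_group"),
  sep = c(7, 8),
  convert = TRUE
)

sapply(as_chr, class)
sapply(as_num, class)
```

Note that a column mixing digits and letters, such as an age_group column that also contains u, stays character either way.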

unite()

This is the opposite of separate(), and allows us to combine the contents of two or more columns into a single column. In the example below, I am combining the state and county FIPS codes into a single column, and the state and county names into another column. Here are the data to start with:

mydf <- cbind.data.frame(
  statefips = c(39, 39, 39, 39),
  countyfips = c("001", "003", "005", "007"),
  state = c("Ohio", "Ohio", "Ohio", "Ohio"),
  county = c("Adams", "Allen", "Ashland", "Ashtabula")
  )

mydf

mydf %>%
  unite(
    col = "scfips",
    statefips, countyfips,
    sep = "",
    remove = FALSE
    ) -> mydf.unite.01

mydf.unite.01

The separator does not have to be a single character. The next example uses the string " County, " as the separator, so that each county and state are combined into a readable label:

mydf %>%
  unite(
    col = "scnames",
    county, state,
    sep = " County, ",
    remove = FALSE
    ) -> mydf.unite.02

mydf.unite.02
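
Since unite() and separate() are opposites, a united column can be split apart again by handing the same separator to separate(). The sketch below uses a made-up two-row data frame (it mirrors mydf but is defined here so the block runs on its own). separate() treats a character sep as a regular expression, which is harmless here because " County, " contains no special characters.

```r
library(tidyr)

# Made-up data in the same shape as the county/state columns of mydf
df <- data.frame(
  county = c("Adams", "Allen"),
  state = c("Ohio", "Ohio")
)

# Combine the two columns into one label such as "Adams County, Ohio"
united <- unite(df, col = "scnames", county, state, sep = " County, ")

# Split the label back into its two parts using the same separator
restored <- separate(
  united,
  col = scnames,
  into = c("county", "state"),
  sep = " County, "
)

united
restored
```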

pivot_wider()

This is the opposite of pivot_longer() and converts long data to the wide format. For example, say we have the following data:

datatable(fish_encounters, caption = "The Fish Encounters Data")


The Fish Encounters Data
	fish	station	seen
1	4842	Release	1
2	4842	I80_1	1
3	4842	Lisbon	1
4	4842	Rstr	1
5	4842	Base_TD	1
6	4842	BCE	1
7	4842	BCW	1
8	4842	BCE2	1
9	4842	BCW2	1
10	4842	MAE	1

Showing 1 to 10 of 114 entries


Note that in the code below, the names of the new columns are taken from the station column, while the values that populate these columns are taken from the seen column.

fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen
    ) -> fish.wide.01

datatable(fish.wide.01, caption = "The Fish Encounters Data in Wide Format (with NAs)")


The Fish Encounters Data in Wide Format (with NAs)
	fish	Release	I80_1	Lisbon	Rstr	Base_TD	BCE	BCW	BCE2	BCW2	MAE	MAW
1	4842	1	1	1	1	1	1	1	1	1	1	1
2	4843	1	1	1	1	1	1	1	1	1	1	1
3	4844	1	1	1	1	1	1	1	1	1	1	1
4	4845	1	1	1	1	1
5	4847	1	1	1
6	4848	1	1	1	1
7	4849	1	1
8	4850	1	1		1	1	1	1
9	4851	1	1
10	4854	1	1

Showing 1 to 10 of 19 entries


Not every fish is seen at every station, which leaves us with some blank cells. If we want these blank cells to be populated with 0, that is easy to do:

fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = list(seen = 0)
    ) -> fish.wide.02

datatable(fish.wide.02, caption = "The Fish Encounters Data in Wide Format (with Zeroes)")


The Fish Encounters Data in Wide Format (with Zeroes)
	fish	Release	I80_1	Lisbon	Rstr	Base_TD	BCE	BCW	BCE2	BCW2	MAE	MAW
1	4842	1	1	1	1	1	1	1	1	1	1	1
2	4843	1	1	1	1	1	1	1	1	1	1	1
3	4844	1	1	1	1	1	1	1	1	1	1	1
4	4845	1	1	1	1	1	0	0	0	0	0	0
5	4847	1	1	1	0	0	0	0	0	0	0	0
6	4848	1	1	1	1	0	0	0	0	0	0	0
7	4849	1	1	0	0	0	0	0	0	0	0	0
8	4850	1	1	0	1	1	1	1	0	0	0	0
9	4851	1	1	0	0	0	0	0	0	0	0	0
10	4854	1	1	0	0	0	0	0	0	0	0	0

Showing 1 to 10 of 19 entries
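
Recent versions of tidyr also accept a single value for values_fill, which is then applied to every new column. The sketch below assumes such a version is installed and uses a made-up three-row tibble (toy) rather than fish_encounters, so that the missing cell is easy to spot.

```r
library(tidyr)
library(tibble)

# Made-up encounters: fish 2 is never seen at the Lisbon station
toy <- tibble(
  fish = c(1, 1, 2),
  station = c("Release", "Lisbon", "Release"),
  seen = c(1, 1, 1)
)

# Without values_fill, the missing combination becomes NA
wide_na <- pivot_wider(toy, names_from = station, values_from = seen)

# With a scalar values_fill, the missing combination becomes 0
wide_zero <- pivot_wider(
  toy,
  names_from = station,
  values_from = seen,
  values_fill = 0
)

wide_na
wide_zero
```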


What if more than one column of values needs to be pivoted to the wide format? An example of such data is shown below.

datatable(us_rent_income)


	GEOID	NAME	variable	estimate	moe
1	01	Alabama	income	24476	136
2	01	Alabama	rent	747	3
3	02	Alaska	income	32940	508
4	02	Alaska	rent	1200	13
5	04	Arizona	income	27517	148
6	04	Arizona	rent	972	4
7	05	Arkansas	income	23789	165
8	05	Arkansas	rent	709	5
9	06	California	income	29454	109
10	06	California	rent	1358	3

Showing 1 to 10 of 104 entries


Notice that variable takes two unique values: (1) income and (2) rent. Each of these has two numbers recorded against it: an estimate and an moe (which stands for margin of error).

us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
    ) -> rent.01

datatable(rent.01)


	GEOID	NAME	estimate_income	estimate_rent	moe_income	moe_rent
1	01	Alabama	24476	747	136	3
2	02	Alaska	32940	1200	508	13
3	04	Arizona	27517	972	148	4
4	05	Arkansas	23789	709	165	5
5	06	California	29454	1358	109	3
6	08	Colorado	32401	1125	109	5
7	09	Connecticut	35326	1123	195	5
8	10	Delaware	31560	1076	247	10
9	11	District of Columbia	43198	1424	681	17
10	12	Florida	25952	1077	70	3

Showing 1 to 10 of 52 entries


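Notice how passing both estimate and moe to values_from produces one new column for every combination of value column and variable: estimate_income, estimate_rent, moe_income, and moe_rent. The estimate and moe for income stay attached to income, and those for rent stay attached to rent.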
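
If a different naming scheme is preferred, pivot_wider() has a names_glue argument that builds the new column names from a template, where .value stands for the name of the value column. The sketch below uses a made-up tibble (toy) holding the Alabama and Alaska rows shown earlier, and compares the default names with glued ones.

```r
library(tidyr)
library(tibble)

# The first four rows of the rent/income data shown above, typed in by hand
toy <- tibble(
  NAME = c("Alabama", "Alabama", "Alaska", "Alaska"),
  variable = c("income", "rent", "income", "rent"),
  estimate = c(24476, 747, 32940, 1200),
  moe = c(136, 3, 508, 13)
)

# Default names: <value column>_<variable>
default_names <- pivot_wider(
  toy,
  names_from = variable,
  values_from = c(estimate, moe)
)

# names_glue puts the variable first: <variable>_<value column>
glued_names <- pivot_wider(
  toy,
  names_from = variable,
  values_from = c(estimate, moe),
  names_glue = "{variable}_{.value}"
)

names(default_names)
names(glued_names)
```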

values_fn

We can also convert data to wide format and populate the new columns with aggregated values such as the mean, the sum, etc. See the data created below:

warpbreaks <- as_tibble(
  warpbreaks[c("wool", "tension", "breaks")]
  )

datatable(warpbreaks)

	wool	tension	breaks
1	A	L	26
2	A	L	30
3	A	L	54
4	A	L	25
5	A	L	70
6	A	L	52
7	A	L	51
8	A	L	26
9	A	L	67
10	A	M	18

warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = list(breaks = median)
    ) -> warps.wide

warps.wide
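
values_fn is needed here because each combination of tension and wool has several breaks values, so a single cell cannot hold them without a summary. Without values_fn, pivot_wider() warns that the values are not uniquely identified and returns list-columns instead. The sketch below uses made-up numbers (toy is not the warpbreaks data) chosen so that the medians are easy to check by hand.

```r
library(tidyr)
library(tibble)

# Made-up data: two breaks values for each wool at a single tension level
toy <- tibble(
  wool = c("A", "A", "B", "B"),
  tension = c("L", "L", "L", "L"),
  breaks = c(26, 30, 27, 14)
)

# Collapse the duplicate values in each cell to their median
toy_wide <- pivot_wider(
  toy,
  names_from = wool,
  values_from = breaks,
  values_fn = list(breaks = median)
)

toy_wide
```

Swapping median for mean or sum in values_fn gives the corresponding summary instead.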
