My goal here is to introduce you to some basic tidy
operations that are often necessary when working with ‘untidy’ data. What are tidy data? Well, tidy data have the following characteristics
What then are untidy
data? Here are two examples:
In Example 1, why is each income-level a column? Would it not make more sense to have the data structured as follows:
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "frequency")
# A tibble: 180 x 3
religion income frequency
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# … with 170 more rows
Note that now each column is a unique variable.
What about the billboard data? Same thing; why not have the weeks in rows? After all they are measuring the same thing – measurement at specific time intervals – aren’t they?
billboard %>%
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank"
)
# A tibble: 24,092 x 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
# … with 24,082 more rows
Both these examples had data in the wide
format that were tidied up by converting the data into the long
format.
Here is a small example of untidy data from the WHO, grabbed as available on February 6, 2020. Data are in the file titled Rpt_LabSurveillanceDataLatestWeekByCtry.csv
. Make sure you place this csv file in the data sub-folder and then read it in as shown below. If you have your project setup correctly, you should be all set. If you missed our first meeting or forgot to create a project, follow the steps outlined below:
- Create a directory and call it
athensR
- Inside this directory create a sub-folder called
data
- Launch RStudio and then use
File -> New Project...
, choosingExisting Directory
to find yourathensr
folder.
RStudio will restart and you should see a project file called
athensR.Rproj
. Next time you want to work on this material, find this file and double-click; RStudio will start in yourathensR
folder.
All data sent to you should be saved in the data sub-folder.
I am loading up some packages we will need –
here
will help manage data and other files so that you do not waste time solving file-path problemsjanitor
will help clean up our data-set and is a package we will use oftentidylog
will track the consequence of almost all tidyverse commandsNotice an important switch used here: skip = 3
. This is being used because the csv file has some header information that occupies the first three rows. If you open the csv file in MS Excel you will see this problem. But the information in these initial rows is not needed and so is skipped when reading in the data. The result? See below:
datatable(flu, caption = "The WHO Data")
Here are the column names and their meaning:
Name | Meaning |
---|---|
SP_RECEIVED | Number of specimens received/collected |
SP_PROCESSED | Number of specimens processed |
AH1N12009 | Number of A(H1N1)pdm09 influenza viruses detected |
AH3 | Number of A(H3) influenza viruses detected |
AH5 | Number of A(H5) influenza viruses detected |
ANOTSUBTYPED | Number of A(not subtyped) influenza viruses detected |
BYAMAGATA | Number of influenza B(yamagata lineage) viruses detected |
BVICTORIA | Number of influenza B(victoria lineage) viruses detected |
BNOTDETERMINED | Number of influenza B(lineage not determined) viruses detected |
INF_B | Total Number of influenza B viruses detected |
INF_TOTAL | Total Number of influenza positive viruses |
INF_TOTAL2 | Total Number of influenza negative viruses |
Title | ILI Activity |
One way of tidying these data would be to create three columns, one that lists, for each country, the type of specimen (received versus processed), the second listing for each country the type of A virus detected, and the third listing for each country the type of B virus detected.
We can do this the ‘long’ way (no pun intended), by first working on just the type of specimen. In the code below I am selecting the columns I would like retained, and this is being done with the select()
command. In pivot_longer()
I am specifying that the Country column should not be pivoted from wide to long by prefacing the column name with -
, as in -Country
.
flu %>%
select(Country, SP_RECEIVED, SP_PROCESSED) %>%
pivot_longer(
-Country,
names_to = c("Specimen Status"),
values_to = "Number"
) -> flu_long_1
datatable(flu_long_1)
Now for influenza virus type A.
flu %>%
select(Country, AH1N12009:INF_A) %>%
pivot_longer(
-Country,
names_to = c("Virus Type A"),
values_to = "Number"
) -> flu_long_2
datatable(flu_long_2)
… and now for influenza virus type B.
flu %>%
select(Country, BYAMAGATA:INF_B) %>%
pivot_longer(
-Country,
names_to = c("Virus Type B"),
values_to = "Number"
) -> flu_long_3
datatable(flu_long_3)
What if we wanted all columns to be long in one go?
flu %>%
select(Country:INF_B) %>%
pivot_longer(
-Country,
names_to = c("Indicator"),
values_to = "Number"
) -> flu_long_4
datatable(flu_long_4)
pivot_longer()
has several other options that can be handy if needed; here they are:
pivot_longer(
data, cols, names_to = "name", names_prefix = NULL,
names_sep = NULL, names_pattern = NULL, names_ptypes = list(),
names_repair = "check_unique", values_to = "value",
values_drop_na = FALSE, values_ptypes = list()
)
names_prefix()
I’d like to walk through each of these in turn. Let us start with names_prefix =
in the context of the following data-set that shows some general reasons people were admitted to hospital by financial year from July 1993 to June 1998 ([Source:]).
read_csv(
"http://www.mm-c.me/mdsi/hospitals93to98.csv"
) -> hosp
datatable(hosp)
hosp %>%
pivot_longer(
-c(IcdChapter, Field),
names_to = "Fiscal Year",
names_prefix = "FY",
values_to = "value"
) -> hosp2
datatable(hosp2, caption = "Long format of hosp dataframe")
What if there are deferentially named columns, as in the following example?
mydf <- tibble(
name = c("Jack", "Jill"),
sex = c("Male", "Female"),
test_pre = c(3.21, 3.85),
test_post = c(3.82, 3.97)
)
mydf
# A tibble: 2 x 4
name sex test_pre test_post
<chr> <chr> <dbl> <dbl>
1 Jack Male 3.21 3.82
2 Jill Female 3.85 3.97
The goal is to move test_pre
and test_post
to rows, holding all other columns fixed.
mydf %>%
pivot_longer(
-c(1:2),
names_prefix = "test_",
names_to = "pre_post",
values_to = "score"
) -> mydf.long
mydf.long
# A tibble: 4 x 4
name sex pre_post score
<chr> <chr> <chr> <dbl>
1 Jack Male pre 3.21
2 Jack Male post 3.82
3 Jill Female pre 3.85
4 Jill Female post 3.97
names_pattern()
There is a more complicated example that involves the use of regular expressions (i.e., regex). See the following data that comes from the WHO. You should access the data dictionary here and the most recently – as of February 8, 2020 – available dataset here. Let us load the data first and then review the data dictionary.
read_csv(
"https://extranet.who.int/tme/generateCSV.asp?ds=notifications"
) -> tb_data
read_csv(
"https://extranet.who.int/tme/generateCSV.asp?ds=dictionary"
) -> tb_dictionary
datatable(tb_dictionary, caption = "Data Dictionary for WHO's Tuberculosis Case Notifications Data")
The data dictionary has more detail than we need so it will help to focus on a few variables, say new_sp_m04
through new_sp_fu
, keeping the country
and year
columns of course. I will also select just one year (= 2010) to make things easier to follow.
tb_data %>%
filter(year == 2010) %>%
select(
country, year, contains("new_sp_")
) -> tb_sp
glimpse(tb_sp)
Rows: 214
Columns: 22
$ country <chr> "Afghanistan", "Albania", "Algeria", "American…
$ year <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010…
$ new_sp_m04 <dbl> 4, 0, NA, 0, 0, NA, 0, 0, 13, 0, NA, 0, 0, 0, …
$ new_sp_m514 <dbl> 193, 0, NA, 0, 0, NA, 0, 0, 43, 0, NA, 2, 0, 0…
$ new_sp_m014 <dbl> 197, 0, 52, 0, 0, 448, 0, 0, 56, 0, NA, 2, 0, …
$ new_sp_m1524 <dbl> 986, 28, 1203, 0, 0, 2900, 0, 0, 536, 36, NA, …
$ new_sp_m2534 <dbl> 819, 17, 1669, 0, 0, 3584, 0, 2, 491, 75, NA, …
$ new_sp_m3544 <dbl> 491, 14, 825, 0, 0, 2415, 0, 0, 309, 49, NA, 2…
$ new_sp_m4554 <dbl> 490, 16, 513, 0, 0, 1424, 0, 2, 302, 68, NA, 2…
$ new_sp_m5564 <dbl> 641, 16, 392, 0, 0, 691, 0, 1, 340, 27, NA, 9,…
$ new_sp_m65 <dbl> 622, 15, 397, 0, 0, 355, 1, 0, 282, 15, NA, 27…
$ new_sp_mu <dbl> 0, 0, NA, 0, 0, NA, NA, 0, 2, 0, NA, 0, 0, NA,…
$ new_sp_f04 <dbl> 16, 1, NA, 0, 0, NA, 0, 0, 7, 0, NA, 1, 0, 0, …
$ new_sp_f514 <dbl> 429, 1, NA, 0, 0, NA, 0, 0, 52, 1, NA, 3, 1, 3…
$ new_sp_f014 <dbl> 445, 2, 79, 0, 0, 558, 0, 0, 59, 1, NA, 4, 1, …
$ new_sp_f1524 <dbl> 2107, 11, 1086, 0, 0, 2763, 0, 0, 421, 24, NA,…
$ new_sp_f2534 <dbl> 2263, 7, 826, 0, 0, 2594, 0, 1, 426, 17, NA, 4…
$ new_sp_f3544 <dbl> 1455, 6, 417, 0, 0, 1688, 0, 0, 233, 4, NA, 12…
$ new_sp_f4554 <dbl> 1112, 3, 251, 0, 0, 958, 0, 0, 184, 7, NA, 2, …
$ new_sp_f5564 <dbl> 831, 2, 222, 0, 0, 482, 0, 0, 153, 8, NA, 5, 5…
$ new_sp_f65 <dbl> 488, 8, 367, 0, 0, 286, 0, 0, 176, 8, NA, 12, …
$ new_sp_fu <dbl> 0, 0, NA, 0, 0, NA, NA, 0, 1, 0, NA, 0, 0, NA,…
The new_sp_*
variable names provide information for males and females of specific age-groups. For example, new_sp_m04
provides the number of new cases of males in the 0-4 year age-group testing positive on the pulmonary smear test. Similarly, new_sp_f04
is for females 0-4 years of age. Let us tidy these data.
tb_sp %>%
pivot_longer(
cols = new_sp_m04:new_sp_fu,
names_to = c("test_type", "sex", "age_group"),
names_pattern = "(new_sp_)(.)(.*)",
values_to = "frequency"
) -> tb_sp.long
datatable(tb_sp.long[1:100, ])
Focus on names_to = c("test_type", "sex", "age_group")
… this is specifying the three name the three new columns should have.
In turn, names_pattern = "(new_sp_)(.)(.*)"
is specifying that the existing columns should be broken up into three pieces as follows:
new_sp_
demarcate the first new column test_type
(.)
demarcates the second new column sex
(.*)
demarcates the third new column age_group
The names_pattern()
will be tricky to decipher without good working knowledge of regular expressions (regex). Say we did not know regex. What could we do?
This will still pivot the columns so that the data are in the long
format, but we still need to split the test_type_sex_age_group
column into the three pieces of information it encapsulates. That is the subject of our next encounter with tidyr
– understanding how separate()
and unite()
work on data columns.