Reading and Cleaning Data

Ani Ruhil
2020-07-07

Reading in data

Some 99% of the time you will be working with primary or secondary data gathered in the course of a program evaluation. Often these data will have been stored in one or more Excel sheets. The file extension may be *.xlsx or the older *.xls, but I would encourage you to consider saving your data in *.csv or *.txt formats because these do not mangle the file with needless bells and whistles. Every now and then you may be working with colleagues who lean on SPSS, Stata, or SAS and hence must work with data stored in each software’s proprietary format. On the rare occasion you may even have to wrestle with flat ASCII files, those relics from the days of mainframe computing. R will read all of these with ease.1

Let us see how R reads in data from existing files, starting with the two most common Excel formats. In the commands that follow you will see me using the here library to tell R where the data file is located and what it is named. This is useful for many reasons, not least that it no longer matters whether someone is on Windows, macOS, or Linux, or running RStudio Cloud. Otherwise you have to specify the full filepath, and that leads to more headaches than blessings!
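Under the hood, here() is simply building a path from the project root. Base R’s file.path() shows the same idea of assembling a path from pieces rather than hard-coding separators (a minimal base-R sketch, shown only for illustration):

```r
# file.path() joins path pieces with "/" on every platform;
# here() does the same but anchors the path at the project root.
file.path("workshops", "ropeg", "handouts", "data", "popn2019.xlsx")
# [1] "workshops/ropeg/handouts/data/popn2019.xlsx"
```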

Importing xlsx and xls data

The file popn2019.xlsx in the data folder has data on the estimated annual population in 2019 of incorporated places in Ohio. Reading it should be simple with the readxl library:

library(readxl)
library(here)
read_excel(
  here("workshops/ropeg/handouts/data", "popn2019.xlsx")
  ) -> popn.xlsx

names(popn.xlsx) # show me the name of each column
[1] "Geography" "2019"     
summary(popn.xlsx) # show me summary statistics 
  Geography              2019         
 Length:930         Min.   :    36.0  
 Class :character   1st Qu.:   446.8  
 Mode  :character   Median :  1421.0  
                    Mean   :  8176.7  
                    3rd Qu.:  5905.5  
                    Max.   :898553.0  

If instead we wanted the *.xls file:

read_excel(
  here("workshops/ropeg/handouts/data", "popn2019.xls")
  ) -> popn.xls
names(popn.xls) # show me the name of each column
[1] "Geography" "2019"     
summary(popn.xls) # show me summary statistics 
  Geography              2019         
 Length:930         Min.   :    36.0  
 Class :character   1st Qu.:   446.8  
 Mode  :character   Median :  1421.0  
                    Mean   :  8176.7  
                    3rd Qu.:  5905.5  
                    Max.   :898553.0  

Importing csv and txt data

The *.csv file can be read with readr as follows:

library(readr)
read_csv(
  here("workshops/ropeg/handouts/data", "popn2019.csv")
  ) -> popn.csv
names(popn.csv) # show me the name of each column
[1] "Geography" "2019"     
summary(popn.csv) # show me summary statistics 
  Geography              2019         
 Length:930         Min.   :    36.0  
 Class :character   1st Qu.:   446.8  
 Mode  :character   Median :  1421.0  
                    Mean   :  8176.7  
                    3rd Qu.:  5905.5  
                    Max.   :898553.0  

And that *.txt file?

read_delim(
  here("workshops/ropeg/handouts/data", "popn2019.txt"),
  "\t"
  ) -> popn.txt
names(popn.txt) # show me the name of each column
[1] "Geography" "2019"     
summary(popn.txt) # show me summary statistics 
  Geography              2019         
 Length:930         Min.   :    36.0  
 Class :character   1st Qu.:   446.8  
 Mode  :character   Median :  1421.0  
                    Mean   :  8176.7  
                    3rd Qu.:  5905.5  
                    Max.   :898553.0  

Here the "\t" argument is telling R that the columns are tab-delimited.

Importing SPSS, Stata, and SAS data

Now we lean on the haven library but we could have also leaned on a few other libraries.2 Let us read in the three formats …

library(haven)
read_spss(
  here("workshops/ropeg/handouts/data", "popn2019.sav")
  ) -> popn.sav
summary(popn.sav)
  Geography             v2019         
 Length:930         Min.   :    36.0  
 Class :character   1st Qu.:   446.8  
 Mode  :character   Median :  1421.0  
                    Mean   :  8176.7  
                    3rd Qu.:  5905.5  
                    Max.   :898553.0  
read_dta(
  here("workshops/ropeg/handouts/data", "popn2019.dta")
  ) -> popn.stata
summary(popn.stata)
  Geography             v2019         
 Length:930         Min.   :    36.0  
 Class :character   1st Qu.:   446.8  
 Mode  :character   Median :  1421.0  
                    Mean   :  8176.7  
                    3rd Qu.:  5905.5  
                    Max.   :898553.0  
read_sas(
  here("workshops/ropeg/handouts/data", "popn2019.sas7bdat")
  ) -> popn.sas
summary(popn.sas)
  Geography             v2019         
 Length:930         Min.   :    36.0  
 Class :character   1st Qu.:   446.8  
 Mode  :character   Median :  1421.0  
                    Mean   :  8176.7  
                    3rd Qu.:  5905.5  
                    Max.   :898553.0  

Notice that the year column is titled v2019 in the SPSS, Stata, and SAS files as read; none of those packages allow a variable name to start with a number, so 2019 was renamed when the files were created. We will fix this in a little bit, but the lesson here is that it is always good to have column names that do not start with a number; many software suites dislike that and will rename the column or assume it is a character vector.
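One quick way to fix such a name in place is base R’s names() replacement. A minimal sketch, using a hypothetical two-row stand-in for the population data (the values are made up for illustration):

```r
# Toy stand-in for popn.sav; the numbers are made up for illustration.
popn <- data.frame(Geography = c("Ada village", "Akron city"),
                   v2019 = c(5500, 197000))
# Rename the offending column to something more descriptive
names(popn)[names(popn) == "v2019"] <- "year_2019"
names(popn)
# [1] "Geography" "year_2019"
```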

Importing data from the web

This is a useful facility when you have to keep pulling updated versions of the same data, or when you don’t want to keep local copies of the data file. The file I am going to grab is the master population data file we have been reading in.

library(readxl)
"https://www2.census.gov/programs-surveys/popest/tables/2010-2019/cities/totals/SUB-IP-EST2019-ANNRES-39.xlsx" -> url
"data/SUB-IP-EST2019-ANNRES-39.xlsx" -> destfile 
curl::curl_download(url, destfile)
read_excel(destfile, skip = 3, n_max = 930) -> SUB_IP_EST2019_ANNRES_39

names(SUB_IP_EST2019_ANNRES_39)
 [1] "...1"           "Census"         "Estimates Base"
 [4] "2010"           "2011"           "2012"          
 [7] "2013"           "2014"           "2015"          
[10] "2016"           "2017"           "2018"          
[13] "2019"          

Fixing column names

Look at the column names. Terrible! Let us fix these with the colnames() command.

colnames(SUB_IP_EST2019_ANNRES_39) = c(
  "Geography", "Census_2010", "Estimates_base", "year_2010",
  "year_2011", "year_2012", "year_2013", "year_2014",
  "year_2015", "year_2016", "year_2017", "year_2018",
  "year_2019"
  )

names(SUB_IP_EST2019_ANNRES_39)
 [1] "Geography"      "Census_2010"    "Estimates_base"
 [4] "year_2010"      "year_2011"      "year_2012"     
 [7] "year_2013"      "year_2014"      "year_2015"     
[10] "year_2016"      "year_2017"      "year_2018"     
[13] "year_2019"     

The important thing is to get the order of the column names right. That is, if I put "Geography" anywhere but as the name of the first column my data will get messed up.

You can also rename individual columns by referencing their positions. For example, the following renames only the first, second, and third columns, respectively.

colnames(SUB_IP_EST2019_ANNRES_39)[1] = "geography"
colnames(SUB_IP_EST2019_ANNRES_39)[2] = "census2010"
colnames(SUB_IP_EST2019_ANNRES_39)[3] = "base"

names(SUB_IP_EST2019_ANNRES_39)
 [1] "geography"  "census2010" "base"       "year_2010"  "year_2011" 
 [6] "year_2012"  "year_2013"  "year_2014"  "year_2015"  "year_2016" 
[11] "year_2017"  "year_2018"  "year_2019" 

Cleaning your data

If you work with data you know what this entails … making sure there are no missing values where there should be none, adding value labels to categorical variables, making sure categories are ordinal where they should be ordinal, creating grouped frequency tables for age, income, etc., maybe transforming numerical variables into z-scores, collapsing some categories, making sure strings have the proper case (sentence case or title case), and so on. To work through these basic steps I will lean on a small dataset that is often used for teaching statistics – the hsb2 dataset, with 200 observations covering each student’s scaled scores on five standardized tests, plus their race/ethnicity, sex, whether they attended a private or public school, and so on.

read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE, 
  sep = ","
  ) -> hsb2 # read straight from the web ...

read_csv(here("workshops/ropeg/handouts/data", "hsb2.csv")) -> hsb2 # ... or from a local copy

library(tidyverse)

glimpse(hsb2)
Rows: 200
Columns: 11
$ id      <dbl> 70, 121, 86, 141, 172, 113, 50, 11, 84, 48, 75, 60,…
$ female  <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ race    <dbl> 4, 4, 4, 4, 4, 4, 3, 1, 4, 3, 4, 4, 4, 4, 3, 4, 4, …
$ ses     <dbl> 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 1, 1, 3, …
$ schtyp  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ prog    <dbl> 1, 3, 1, 3, 2, 2, 1, 2, 1, 2, 3, 2, 2, 2, 2, 1, 2, …
$ read    <dbl> 57, 68, 44, 63, 47, 44, 50, 34, 63, 57, 60, 57, 73,…
$ write   <dbl> 52, 59, 33, 44, 52, 52, 59, 46, 57, 55, 46, 65, 60,…
$ math    <dbl> 41, 53, 54, 47, 57, 51, 42, 45, 54, 52, 51, 51, 71,…
$ science <dbl> 47, 63, 58, 53, 53, 63, 53, 39, 58, 50, 53, 63, 61,…
$ socst   <dbl> 57, 61, 31, 56, 61, 61, 61, 36, 51, 51, 61, 61, 71,…

Creating labelled factors

Notice the absence of value labels. Luckily we know what these should be on the categorical variables:

female: 0 = Male, 1 = Female
race: 1 = Hispanic, 2 = Asian, 3 = African American, 4 = White
ses: 1 = Low, 2 = Middle, 3 = High
schtyp: 1 = Public, 2 = Private
prog: 1 = General, 2 = Academic, 3 = Vocational

Now we can add these labels to each variable. When we do this we create what R calls factors – categorical variables with value levels (0, 1, 2, etc.) and labels (Male, Female, etc.).

Rather than overwrite each variable I am going to assign new names to each factor I create by appending .f to each original variable name. This is a good policy to adopt because, if something gets messed up, you still have the original variables to rework.

factor(
  hsb2$female,
  levels = c(0, 1),
  labels = c("Male", "Female")
  ) -> hsb2$female.f 

Notice the sequence here: factor(dataset$variable, levels, labels)

Let us finish the rest:

factor(hsb2$race,
       levels = c(1:4),
       labels = c("Hispanic", "Asian", "African American", "White")
       ) -> hsb2$race.f

factor(hsb2$ses, 
       levels = c(1:3),
       labels = c("Low", "Middle", "High")
       ) -> hsb2$ses.f

factor(hsb2$schtyp,
       levels = c(1:2),
       labels = c("Public", "Private")
       ) -> hsb2$schtyp.f

factor(hsb2$prog,
       levels = c(1:3), 
       labels = c("General", "Academic", "Vocational")
       ) -> hsb2$prog.f

Dropping spare columns

Now we could drop the original columns that lack value labels. How? Well, one way is to list the columns we want to keep by their column numbers. Run the names(hsb2) command and you will see each column’s position.

names(hsb2)
 [1] "id"       "female"   "race"     "ses"      "schtyp"   "prog"    
 [7] "read"     "write"    "math"     "science"  "socst"    "female.f"
[13] "race.f"   "ses.f"    "schtyp.f" "prog.f"  

Okay, so I do not want columns 2 through 6 and I want all the rows. I can specify this by listing the rows and columns I want:

hsb2[c(1:200), c(1, 7:16)] -> myhsb

names(myhsb)
 [1] "id"       "read"     "write"    "math"     "science"  "socst"   
 [7] "female.f" "race.f"   "ses.f"    "schtyp.f" "prog.f"  
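An equivalent base-R idiom drops columns with negative indices instead of listing the keepers. A minimal sketch, using a hypothetical three-column stand-in for hsb2:

```r
# Toy stand-in for hsb2, trimmed to three columns for brevity
df <- data.frame(id = 1:3, female = c(0, 1, 0), read = c(57, 68, 44))
df[, -2] -> mydf   # a negative index drops that column; all rows are kept
names(mydf)
# [1] "id"   "read"
```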

Changing the order of a factor

One of the common tasks with categorical measures tends to be the need to organize the levels of the factor so that they retain their ordinal property. Let us assume there is the following hierarchy (in descending order) of the levels of prog.f: Academic > General > Vocational. We can enforce this ordinal structure as follows:

ordered(
  myhsb$prog.f,
  levels = c("Vocational", "General", "Academic")
  ) -> myhsb$prog.ordered

This is a reasonable solution when you are basically making frequency tables, cross-tabulations, charts, etc.3
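A quick self-contained check that an ordered factor really behaves ordinally – comparisons against a level now work (a sketch with made-up program values):

```r
prog <- ordered(c("General", "Academic", "Vocational"),
                levels = c("Vocational", "General", "Academic"))
prog > "General"  # only Academic sits above General in this ordering
# [1] FALSE  TRUE FALSE
```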

Basic Frequency Tables and Cross-tabulations

Let us look at some simple frequency tables. Mind you, there are several ways to build these, but I will lean on what I think is easiest for those new to R: the janitor package.

library(janitor)

tabyl(myhsb$prog.ordered) %>% 
  adorn_totals(where = "row") %>% # add a Total row
  adorn_pct_formatting(digits = 0, affix_sign = TRUE) # show percentages with % 
 myhsb$prog.ordered   n percent
         Vocational  50     25%
            General  45     22%
           Academic 105     52%
              Total 200    100%

Now say I wanted a simple cross-tabulation of prog.ordered by female.f.

myhsb %>%
  tabyl(female.f, prog.ordered) # the cross-tabulation
 female.f Vocational General Academic
     Male         23      21       47
   Female         27      24       58

This is a simple table with no row or column totals or percentages. Let us add row and col totals.

myhsb %>%
  tabyl(female.f, prog.ordered) %>% # the cross-tabulation
  adorn_totals(where = c("row", "col")) # row & column totals
 female.f Vocational General Academic Total
     Male         23      21       47    91
   Female         27      24       58   109
    Total         50      45      105   200

Ideally you will want to show the frequencies and the percentages. Say we go with column percentages.

myhsb %>%
  tabyl(female.f, prog.ordered) %>% # the cross-tabulation
  adorn_totals(where = c("row", "col")) %>% # row & column totals
  adorn_percentages("col") %>% # column percentages
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined")
 female.f/prog.ordered  Vocational     General     Academic
                  Male  46.0% (23)  46.7% (21)  44.8%  (47)
                Female  54.0% (27)  53.3% (24)  55.2%  (58)
                 Total 100.0% (50) 100.0% (45) 100.0% (105)
        Total
  45.5%  (91)
  54.5% (109)
 100.0% (200)

Grouped Frequency Tables

Often you have age, income, standardized test scores, etc. that you want to flip into a grouped frequency table. This is easily done with the santoku package. Now, bear in mind that we can construct the groups in various ways, so let us start with fixed-width intervals via chop_evenly:

library(santoku)

chop_evenly(myhsb$read, intervals = 5) -> myhsb$read.evenly

tabyl(myhsb$read.evenly) %>% # class-width is exactly 9.6
  adorn_pct_formatting()
 myhsb$read.evenly  n percent
        [28, 37.6) 14    7.0%
      [37.6, 47.2) 68   34.0%
      [47.2, 56.8) 48   24.0%
      [56.8, 66.4) 50   25.0%
        [66.4, 76] 20   10.0%

Maybe you want equal-sized groups with chop_equally?

chop_equally(myhsb$read, groups = 5) -> myhsb$read.equally

tabyl(myhsb$read.equally) %>% # Here you have quintiles
  adorn_pct_formatting()
 myhsb$read.equally  n percent
          [0%, 20%) 39   19.5%
         [20%, 40%) 16    8.0%
         [40%, 60%) 62   31.0%
         [60%, 80%) 37   18.5%
        [80%, 100%] 46   23.0%

No? You wanted the quartiles instead?

chop_equally(myhsb$read, groups = 4) -> myhsb$read.quartiles

tabyl(myhsb$read.quartiles) %>% # Here you have quartiles
  adorn_pct_formatting()
 myhsb$read.quartiles  n percent
            [0%, 25%) 39   19.5%
           [25%, 50%) 44   22.0%
           [50%, 75%) 61   30.5%
          [75%, 100%] 56   28.0%

Perhaps we had certain breaks in mind, for example, a reading score below 40 is “Low”, 40–60 is “Medium”, and above 60 is “High”:

chop(myhsb$read, breaks = c(40, 60)) -> myhsb$read.breaks

tabyl(myhsb$read.breaks) %>% # Here you have specified breaks
  adorn_pct_formatting()
 myhsb$read.breaks   n percent
          [28, 40)  22   11.0%
          [40, 60) 122   61.0%
          [60, 76]  56   28.0%

Note that when you create the groups, you have to be sure of how values that fall on a boundary are included. For example, if read = 60, should it be in the 40–60 group or the 60–76 group? It turns out the default here is to include it in the 60–76 group.4
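The same boundary choice exists in base R’s cut(), where right = FALSE gives left-closed intervals matching chop()’s default. A minimal sketch on made-up scores (cut() here is a base-R analogue, not santoku):

```r
x <- c(39, 40, 59, 60, 61)
# right = FALSE makes the intervals [-Inf, 40), [40, 60), [60, Inf),
# so 40 lands in Medium and 60 lands in High
table(cut(x, breaks = c(-Inf, 40, 60, Inf), right = FALSE,
          labels = c("Low", "Medium", "High")))
#    Low Medium   High 
#      1      2      2 
```

The labels argument also shows how to attach the “Low”/“Medium”/“High” names directly to the groups.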

Other Simple Operations

You may need to operate on some variables, converting them into z-scores, multiplying them by some number, taking the natural logarithm, and so on.

(myhsb$math / 2) -> myhsb$math.halved
(myhsb$math)^2 -> myhsb$math.squared
log(myhsb$math) -> myhsb$math.ln
sqrt(myhsb$math) -> myhsb$math.sqrt
scale(myhsb$math) -> myhsb$math.zscore
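One small gotcha worth knowing: scale() returns a one-column matrix rather than a plain vector, which can surprise you downstream. A quick base-R check on toy numbers:

```r
x <- c(2, 4, 6)
z <- scale(x)   # z-scores: (x - mean(x)) / sd(x)
class(z)        # a matrix, not a numeric vector
as.vector(z)    # flatten it if you need a plain vector
# [1] -1  0  1
```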

Basic Summary Statistics

You will often want means, medians, standard deviations, minimum/maximum, maybe quartiles, etc. Again, there are many ways to do this but the easiest way might be with the skimr package.

library(skimr)

skim(myhsb)
Table 1: Data summary
Name myhsb
Number of rows 200
Number of columns 21
_______________________
Column type frequency:
factor 10
numeric 11
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
female.f 0 1 FALSE 2 Fem: 109, Mal: 91
race.f 0 1 FALSE 4 Whi: 145, His: 24, Afr: 20, Asi: 11
ses.f 0 1 FALSE 3 Mid: 95, Hig: 58, Low: 47
schtyp.f 0 1 FALSE 2 Pub: 168, Pri: 32
prog.f 0 1 FALSE 3 Aca: 105, Voc: 50, Gen: 45
prog.ordered 0 1 TRUE 3 Aca: 105, Voc: 50, Gen: 45
read.evenly 0 1 FALSE 5 [37: 68, [56: 50, [47: 48, [66: 20
read.equally 0 1 FALSE 5 [40: 62, [80: 46, [0%: 39, [60: 37
read.quartiles 0 1 FALSE 4 [50: 61, [75: 56, [25: 44, [0%: 39
read.breaks 0 1 FALSE 3 [40: 122, [60: 56, [28: 22

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 100.50 57.88 1.00 50.75 100.50 150.25 200.00 ▇▇▇▇▇
read 0 1 52.23 10.25 28.00 44.00 50.00 60.00 76.00 ▂▇▆▆▂
write 0 1 52.77 9.48 31.00 45.75 54.00 60.00 67.00 ▂▅▆▇▇
math 0 1 52.65 9.37 33.00 45.00 52.00 59.00 75.00 ▃▆▇▅▂
science 0 1 51.85 9.90 26.00 44.00 53.00 58.00 74.00 ▂▅▇▇▂
socst 0 1 52.41 10.74 26.00 46.00 52.00 61.00 71.00 ▂▃▇▇▃
math.halved 0 1 26.32 4.68 16.50 22.50 26.00 29.50 37.50 ▃▆▇▅▂
math.squared 0 1 2858.82 1016.05 1089.00 2025.00 2704.00 3481.00 5625.00 ▆▇▇▃▂
math.ln 0 1 3.95 0.18 3.50 3.81 3.95 4.08 4.32 ▁▆▇▇▃
math.sqrt 0 1 7.23 0.64 5.74 6.71 7.21 7.68 8.66 ▂▆▇▆▂
math.zscore 0 1 0.00 1.00 -2.10 -0.82 -0.07 0.68 2.39 ▃▆▇▅▂

Saving Data in R

Once we have cleaned and otherwise modified our data we usually want to store it in a native R data format. There are a couple of ways to do this but the simplest is to save it with the *.RData or *.rdata file extension. I will use the former (my habit).

save(myhsb,
     file = here("workshops/ropeg/handouts/data", "myhsb.RData"))

Check your data folder and you will see this file we just saved.

Sometimes you may want to output the data or share it with someone as a csv, xlsx, or SPSS file. No problem.5

write.csv(myhsb,
     file = here("workshops/ropeg/handouts/data", "myhsb.csv"),
     row.names = FALSE)

library(xlsx)
write.xlsx(myhsb, file = here("workshops/ropeg/handouts/data", "myhsb.xlsx")) # Excel format

write_sav(myhsb, here("workshops/ropeg/handouts/data", "myhsb.sav")) # SPSS format

When we start working on data visualizations we will see how to load our myhsb.RData file.


  1. In fact, R will also read files directly from the web, and even download and expand compressed (*.zip or *.tar.gz) files.↩︎

  2. Check out sas7bdat, foreign, and the rio libraries if you have trouble with haven.↩︎

  3. But if you need to use an ordinal measure in a regression-type setting, it will not work as expected because R applies polynomial contrasts to an ordered factor.↩︎

  4. By default, intervals are left-closed, i.e. they include their left endpoints. If you want right-closed intervals, set left = FALSE.↩︎

  5. Other file formats are possible as well, including but not limited to Stata and SAS.↩︎