In this module you will learn how to
(a) access data files in R, both locally (i.e., from a folder on your computer) and from the web (i.e., without manually downloading a data file);
(b) explore the contents of a data file with specific comands,
(c) add value labels and create a new qualitative variable,
(d) save data files in native R format,
(e) load RData files, and
(f) create, populate, and save a Jupyter Notebook.
Base R comes bundled with some built-in functions (pieces of computer code) that do a variety of things. But one often needs additional pieces of computer code to accomplish a task. Often, base R could dprobably have done the task but we might be able to complete the task faster\easier by relying on these additional pieces of computer code. You will see these additioal pieces of code being used time to time, and we call these little bundles of code packages.
We start by loading up a package -- tidyverse -- "an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures."
Well, what exactly is a package in the R language? A package is best thought of as a collection of R functions, data, and documentation that explains what each function does. R has over 14,000 packages, each designed to do something very specific such as, for example, make maps, read large data files very speedily, make animated graphics, and so on.
Each week you will see several new packages come into play because we will be relying on each to do something very specific for us. The tidyverse will be a regular feature.
How do we load a package? Well, first it has to be installed, but that has been taken care of for you. Once installed, we load it via the library(packaganame)
command, as shown below.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ✔ ggplot2 3.3.5 ✔ purrr 0.3.4 ✔ tibble 3.1.4 ✔ dplyr 1.0.7 ✔ tidyr 1.1.3 ✔ stringr 1.4.0 ✔ readr 2.0.1 ✔ forcats 0.5.1 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
The message you see with a pinkish background results from our library(tidyverse)
command, and is just a message indicating what packages were loaded (ggplot2, tibble, tidyr, readr, purr, dplyr, stringr, forcats
), and if there is anything else we should be aware of. In this case, there is -- some commands (filter()
and lag()
) used by the dplyr
package conflict with identically named commands in the stats
package.
readr
is an excellent package designed to read data that might be available in various formats. Since it is loaded as a part of the tidyverse
, let us move on to seeing how to read data.
One of the golden rules to follow when reading data is to make your life easy by not having to memorize or discover through frantic trial-and-error runs where a particular data file is located. So how do we make our life easy? By placing all data in a data folder.
Data come in several formats but I will walk you through the formats you are most likely to encounter -- MS Excel, CSV, TXT, fixed-width, and then in any one of these commercial software formats: SAS, Stata, or SPSS.
We start by reading a simple comma-separated variable
format file and then a tab-delimited variable
format file. A CSV file has each column separated by a comma
while a tab-delimited file has every column separated by a tab
-- ,
versus \t
read.csv(
"data/ImportDataCSV.csv",
sep = ",",
header = TRUE
) -> df.csv
read.csv(
"data/ImportDataTAB.txt",
sep = "\t",
header = TRUE
) -> df.tab
Note ...
sep = ","
switch says the individual variables are separated by a comma header = TRUE
switch indicates that the first row in the file that is being read in has the column names sep = "\t"
because the columns are separated by a tab If no errors are thrown, then the files should have been read into memory and we can check their contents. The names(filename)
command will show you the column names, and glimpse(filename)
will show you a snippet of the data in the file.
names(df.csv)
glimpse(df.csv)
Rows: 7 Columns: 3 $ x <int> 1, 4, 7, 10, 13, 16, 19 $ y <int> 2, 5, 8, 11, 14, 17, 20 $ z <int> 3, 6, 9, 12, 15, 18, 21
# And now we repeat the preceding commands for the df.tab file
names(df.tab)
glimpse(df.tab)
Rows: 7 Columns: 3 $ x <int> 1, 4, 7, 10, 13, 16, 19 $ y <int> 2, 5, 8, 11, 14, 17, 20 $ z <int> 3, 6, 9, 12, 15, 18, 21
Microsoft Excel files can be read via the readxl
package, and you see two versions below -- one for the older *.xls
format and the other for the newer *.xlsx
format.
As a rule, I would recommend against storing data in Excel formats since Excel tends to do some funny things. If you must store and share data, try to use the CSV format.
library(readxl)
read_excel(
"data/ImportDataXLS.xls"
) -> df.xls
read_excel(
"data/ImportDataXLSX.xlsx"
) -> df.xlsx
names(df.xls)
names(df.xlsx)
glimpse(df.xlsx)
glimpse(df.xlsx)
Rows: 7 Columns: 3 $ x <dbl> 1, 4, 7, 10, 13, 16, 19 $ y <dbl> 2, 5, 8, 11, 14, 17, 20 $ z <dbl> 3, 6, 9, 12, 15, 18, 21 Rows: 7 Columns: 3 $ x <dbl> 1, 4, 7, 10, 13, 16, 19 $ y <dbl> 2, 5, 8, 11, 14, 17, 20 $ z <dbl> 3, 6, 9, 12, 15, 18, 21
Some governmental agencies and other sources tend to use SAS, Stata, or SPSS for storing data and for carrying out various data analyses. This is a legacy issue that is changing but a little too slowly for most of us who do not use these commercial software packages as the mainstays of our data work. But, even if you do not use these packages, you should know how to read in data created in their native formats. As it turns out, SPSS, Stata, SAS files can be read via the haven
package, and with relative ease.
library(haven)
read_stata(
"data/ImportDataStata.dta"
) -> df.stata
read_sas(
"data/ImportDataSAS.sas7bdat"
) -> df.sas
read_sav(
"data/ImportDataSPSS.sav"
) -> df.spss
# Check the files
names(df.stata)
glimpse(df.sas)
Rows: 7 Columns: 3 $ x <dbl> 1, 4, 7, 10, 13, 16, 19 $ y <dbl> 2, 5, 8, 11, 14, 17, 20 $ z <dbl> 3, 6, 9, 12, 15, 18, 21
It is also common to encounter fixed-width files where the raw data are stored without any gaps between successive columns. This is also a legacy format from the early days of computers and punch cards, and one of the most efficient ways of storing large amounts of data. These files come with documentation that will give you the necessary details about where each column starts and ends, etc. See here for some examples of layouts from the Census Bureau.
read.fwf(
"data/fwfdata.txt",
widths = c(4, 9, 2, 4),
header = FALSE,
col.names = c("Name", "Month", "Day", "Year")
) -> df.fw
Notice we need widths = c(4,9,2,4)
to indicate how many slots each column takes and then col.names = c("Name", "Month", "Day", "Year")
to label the columns since the data file does not have variable names. if you mess up with widths =
you end up with garbage because R does not know where any column starts or ends so be careful.
glimpse(df.fw)
One of the benefits of software like R is its ability to read data files over the web without requiring you to manually download the file and save a physical copy to be read in. Specifically, it is possible to list the full web-path for a file and read it in. This ability is invaluable when the data tend to be periodically updated by the source (for example, by the Census Bureau, Bureau of Labor, Bureau of Economic Analysis, etc.). Here are a few examples.
read.table(
"http://data.princeton.edu/wws509/datasets/effort.dat"
) -> fpe
fpe # This command asks R to show us what fpe contains
setting | effort | change | |
---|---|---|---|
<int> | <int> | <int> | |
Bolivia | 46 | 0 | 1 |
Brazil | 74 | 0 | 10 |
Chile | 89 | 16 | 29 |
Colombia | 77 | 16 | 25 |
CostaRica | 84 | 21 | 29 |
Cuba | 89 | 15 | 40 |
DominicanRep | 68 | 14 | 21 |
Ecuador | 70 | 6 | 0 |
ElSalvador | 60 | 13 | 13 |
Guatemala | 55 | 9 | 4 |
Haiti | 35 | 3 | 0 |
Honduras | 51 | 7 | 7 |
Jamaica | 87 | 23 | 21 |
Mexico | 83 | 4 | 9 |
Nicaragua | 68 | 0 | 7 |
Panama | 84 | 19 | 22 |
Paraguay | 74 | 3 | 6 |
Peru | 73 | 0 | 2 |
TrinidadTobago | 84 | 15 | 29 |
Venezuela | 91 | 7 | 11 |
# Some more files read from the web
read.table(
"https://stats.idre.ucla.edu/stat/data/test.txt",
header = TRUE
) -> test.txt
read_csv(
"https://stats.idre.ucla.edu/stat/data/test.csv"
) -> test.csv
read_sav(
"https://stats.idre.ucla.edu/stat/data/hsb2.sav"
) -> hsb2.spss
Rows: 8 Columns: 6 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): prgtype dbl (5): gender, id, ses, schtyp, level ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you look at fpe
you will notice that there are three columns but the rows each have a unique name. These are row names
that are remembered by R but will not show up with a column name.
We can also ask R to both download and unzip a *.zip
file from the web. Here I am pulling down a file from a collection maintained and updated by the Centers for Disease Control and Prevention (CDC). The file has more than 4 million
rows of data and 14 columns of information.
temp = tempfile()
download.file("ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NVSS/bridgepop/2016/pcen_v2016_y1016.sas7bdat.zip", temp)
oursasdata = haven::read_sas(unz(temp, "pcen_v2016_y1016.sas7bdat"))
unlink(temp)
Now we will see three new commands -- dim(filename)
, head(filename)
, and tail(filename)
.
dim(filename)
shows the number of rows and number of columns in the dataset -- i.e., the dimensions of the dataset.
head(filename)
show the first 6 rows and all columns whiletail(filename)
shows the last 6 rows and all columns
dim(oursasdata)
head(oursasdata)
age | hisp | RACESEX | VINTAGE | POP2010_apr | POP2010_jul | POP2011 | POP2012 | POP2013 | POP2014 | POP2015 | POP2016 | ST_FIPS | CO_FIPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
0 | 1 | 1 | 2016 | 236 | 242 | 237 | 229 | 224 | 240 | 236 | 225 | 1 | 1 |
1 | 1 | 1 | 2016 | 299 | 284 | 241 | 254 | 231 | 249 | 238 | 230 | 1 | 1 |
2 | 1 | 1 | 2016 | 287 | 292 | 294 | 233 | 240 | 237 | 234 | 255 | 1 | 1 |
3 | 1 | 1 | 2016 | 286 | 286 | 310 | 283 | 241 | 244 | 229 | 240 | 1 | 1 |
4 | 1 | 1 | 2016 | 270 | 273 | 280 | 319 | 278 | 247 | 252 | 239 | 1 | 1 |
5 | 1 | 1 | 2016 | 279 | 277 | 285 | 279 | 305 | 287 | 265 | 254 | 1 | 1 |
tail(oursasdata)
age | hisp | RACESEX | VINTAGE | POP2010_apr | POP2010_jul | POP2011 | POP2012 | POP2013 | POP2014 | POP2015 | POP2016 | ST_FIPS | CO_FIPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
80 | 2 | 8 | 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 45 |
81 | 2 | 8 | 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 45 |
82 | 2 | 8 | 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 45 |
83 | 2 | 8 | 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 45 |
84 | 2 | 8 | 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 45 |
85 | 2 | 8 | 2016 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 45 |
Often the data contain columns that record information with numeric codes that reflect some qualitative attributes. For example, whether an individual is Male or Female may be stored as 0 and 1, respectively. We can append the correct qualitative labels to these columns, and then hen we create graphs or tables, what is being displayed will be readily apparent (i.e., the reader\viewer does not have to work hard to figure out what is a 0
and what is a 1
).
We will read in a small dataset that has information on 200 students. The data come from the High School and Beyond study
Column Name | Values and Labels\Meanings |
---|---|
female | (0/1) |
race | (1=hispanic 2=asian 3=african-amer 4=white) |
ses | socioeconomic status (1=low 2=middle 3=high) |
schtyp | type of school (1=public 2=private) |
prog | type of program (1=general 2=academic 3=vocational) |
read | standardized reading score |
write | standardized writing score |
math | standardized math score |
science | standardized science score |
socst | standardized social studies score |
You can save your data in a format that R will recognize, giving it the RData or rdata extension
Check the data directory to confirm that both files are present
Working with the hsb2 data: 200 students from the High School and Beyond study. The variables in this file are:
Let us start by reading in these data.
read.table(
'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
header = TRUE,
sep = ","
) -> hsb2
We can easily check the basic descriptive statistics of each column in a data file by running the summary(filename)
command. Let us see what results if we do this with the hsb2
data set.
summary(hsb2)
id female race ses schtyp Min. : 1.00 Min. :0.000 Min. :1.00 Min. :1.000 Min. :1.00 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.00 Median :100.50 Median :1.000 Median :4.00 Median :2.000 Median :1.00 Mean :100.50 Mean :0.545 Mean :3.43 Mean :2.055 Mean :1.16 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:3.000 3rd Qu.:1.00 Max. :200.00 Max. :1.000 Max. :4.00 Max. :3.000 Max. :2.00 prog read write math Min. :1.000 Min. :28.00 Min. :31.00 Min. :33.00 1st Qu.:2.000 1st Qu.:44.00 1st Qu.:45.75 1st Qu.:45.00 Median :2.000 Median :50.00 Median :54.00 Median :52.00 Mean :2.025 Mean :52.23 Mean :52.77 Mean :52.65 3rd Qu.:2.250 3rd Qu.:60.00 3rd Qu.:60.00 3rd Qu.:59.00 Max. :3.000 Max. :76.00 Max. :67.00 Max. :75.00 science socst Min. :26.00 Min. :26.00 1st Qu.:44.00 1st Qu.:46.00 Median :53.00 Median :52.00 Mean :51.85 Mean :52.41 3rd Qu.:58.00 3rd Qu.:61.00 Max. :74.00 Max. :71.00
summary(hsb2)
Look at the output! R mistakenly treats several of the qualitative variables as if they are numeric, rendering incorrect information. Let us fix this problem and then we can rerun the summary()
command to see if things have improved.
The way we append value labels is by converting a numeric variable\column to what R calls a factor
. When we do so, we tell R to assign a particular label (the word or the phrase) to a particular level (the number) as shown below.
It is good habit to create a new variable each time you do some conversion so that the original variable does not get overwritten. If you do not get into the habit and end up overwriting the original variable, you will have to start from scratch if you end up making a mistake.
factor(hsb2$female,
levels = c(0, 1),
labels = c("Male", "Female")
) -> hsb2$female.f
factor(hsb2$race,
levels = c(1:4),
labels = c("Hispanic", "Asian", "African American", "White")
) -> hsb2$race.f
factor(hsb2$ses,
levels = c(1:3),
labels = c("Low", "Middle", "High")
) -> hsb2$ses.f
factor(hsb2$schtyp,
levels = c(1:2),
labels = c("Public", "Private")
) -> hsb2$schtyp.f
factor(hsb2$prog,
levels = c(1:3),
labels = c("General", "Academic", "Vocational")
) -> hsb2$prog.f
Now we check our work by re-running the summary(hsb2)
command.
summary(hsb2)
id female race ses schtyp Min. : 1.00 Min. :0.000 Min. :1.00 Min. :1.000 Min. :1.00 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.00 Median :100.50 Median :1.000 Median :4.00 Median :2.000 Median :1.00 Mean :100.50 Mean :0.545 Mean :3.43 Mean :2.055 Mean :1.16 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:3.000 3rd Qu.:1.00 Max. :200.00 Max. :1.000 Max. :4.00 Max. :3.000 Max. :2.00 prog read write math Min. :1.000 Min. :28.00 Min. :31.00 Min. :33.00 1st Qu.:2.000 1st Qu.:44.00 1st Qu.:45.75 1st Qu.:45.00 Median :2.000 Median :50.00 Median :54.00 Median :52.00 Mean :2.025 Mean :52.23 Mean :52.77 Mean :52.65 3rd Qu.:2.250 3rd Qu.:60.00 3rd Qu.:60.00 3rd Qu.:59.00 Max. :3.000 Max. :76.00 Max. :67.00 Max. :75.00 science socst female.f race.f Min. :26.00 Min. :26.00 Male : 91 Hispanic : 24 1st Qu.:44.00 1st Qu.:46.00 Female:109 Asian : 11 Median :53.00 Median :52.00 African American: 20 Mean :51.85 Mean :52.41 White :145 3rd Qu.:58.00 3rd Qu.:61.00 Max. :74.00 Max. :71.00 ses.f schtyp.f prog.f Low :47 Public :168 General : 45 Middle:95 Private: 32 Academic :105 High :58 Vocational: 50
Aha! Everything is as it should be.
Before we move on though, let us save our hsb2 data file that has new columns. We will save it in the native R
format. The command to save a file is very simple -- save(filename, file = "filepath/filename.RData")
You could also do
save(filename, file = "filepath/filename.rdata")
if that is what you intuitively prefer.
save(hsb2, file = "data/hsb2.RData")
When we want to work with RData files, we have to load them with the load("filepath/filename.RData")
command. Here I am doing it for the hsb2 data.
load("data/hsb2.RData")
summary(hsb2)
id female race ses schtyp Min. : 1.00 Min. :0.000 Min. :1.00 Min. :1.000 Min. :1.00 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.00 Median :100.50 Median :1.000 Median :4.00 Median :2.000 Median :1.00 Mean :100.50 Mean :0.545 Mean :3.43 Mean :2.055 Mean :1.16 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:3.000 3rd Qu.:1.00 Max. :200.00 Max. :1.000 Max. :4.00 Max. :3.000 Max. :2.00 prog read write math Min. :1.000 Min. :28.00 Min. :31.00 Min. :33.00 1st Qu.:2.000 1st Qu.:44.00 1st Qu.:45.75 1st Qu.:45.00 Median :2.000 Median :50.00 Median :54.00 Median :52.00 Mean :2.025 Mean :52.23 Mean :52.77 Mean :52.65 3rd Qu.:2.250 3rd Qu.:60.00 3rd Qu.:60.00 3rd Qu.:59.00 Max. :3.000 Max. :76.00 Max. :67.00 Max. :75.00 science socst female.f race.f Min. :26.00 Min. :26.00 Male : 91 Hispanic : 24 1st Qu.:44.00 1st Qu.:46.00 Female:109 Asian : 11 Median :53.00 Median :52.00 African American: 20 Mean :51.85 Mean :52.41 White :145 3rd Qu.:58.00 3rd Qu.:61.00 Max. :74.00 Max. :71.00 ses.f schtyp.f prog.f Low :47 Public :168 General : 45 Middle:95 Private: 32 Academic :105 High :58 Vocational: 50
Almost all R packages come bundled with data-sets, too many of them to walk you through but
To load data from a package, if you know the data-set's name, it is easy to load it, as shown below.
library(palmerpenguins)
data(penguins, package = 'palmerpenguins')
head(penguins)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> | <int> | <int> | <fct> | <int> |
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
library(ggplot2)
data(diamonds, package = 'ggplot2')
head(diamonds)
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
<dbl> | <ord> | <ord> | <ord> | <dbl> | <dbl> | <int> | <dbl> | <dbl> | <dbl> |
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
These are some exercises you can use to practice and build your R skills. They are not for grade.
Create a new Notebook by going to File -> New -> Notebook.
When prompted to select a kernel, use the dropdown menu to select the R kernel.
The notebook will be untitled, so go ahead and save it with a name, something like yourlastname_ex01
and you will see yourlastname_ex01.ipynb
as the name.
Now read in the Stata
data file found here
Create a new cell and run the summary
command to check the contents of this data file and look at the first 6 rows of data with the appropriate head
command.and look at the first 6 rows of data with the appropriate head
command.
In a new cell, read in tthe SPSS
file found here
In a new cell, run the summary
command and look at the first 6 rows of data with the appropriate head
command.
Download this dataset, extract the file inside the zip archive and upload it to the data
folder.
In a new cell, read in this uploaded data file with the appropriate commands.
The variable gender
has the following codes: Zero = unknown; 1 = male; 2 = female
. Use this coding scheme to create a new column that shows gender
as a factor
with these value labels
Check the first 6 rows of the dataset and also run summary
to check the new column was created as desired.
In a new cell, write the commands necessary to save each of the three data-sets as separate RData
files.
Go to this page on Kaggle and read the description of the data-set on mass shootings in the United States that occurred during the 1966-2017 period. once you have read the overview of the data, click the "Data" tab and download the file called Mass Shootings Dataset.csv
. Be careful; there are several versions so the one you want is the very last one and not any that have a version number attached, such as "Mass Shootings Dataset Ver 2.csv" for example.
Now read this file, perhaps naming it shootings
and run the summary()
command on it. Note the number of observations and the number of variables in the data-set.
Make sure you save the file in RData format as well.
Go to this page on Kaggle and download the file called aac_shelter_outcomes.zip
, unzip it, and AFTER reading the data overview, read in the file and generate the usual summary
and also save it as an RData file.