MPA 5830 - Module 01 (Fall 2021)

What will you learn?

In this module you will learn how to

(a) access data files in R, both locally (i.e., from a folder on your computer) and from the web (i.e., without manually downloading a data file);

(b) explore the contents of a data file with specific comands,

(c) add value labels and create a new qualitative variable,

(d) save data files in native R format,

(e) load RData files, and

(f) create, populate, and save a Jupyter Notebook.

Base R comes bundled with some built-in functions (pieces of computer code) that do a variety of things. But one often needs additional pieces of computer code to accomplish a task. Often, base R could dprobably have done the task but we might be able to complete the task faster\easier by relying on these additional pieces of computer code. You will see these additioal pieces of code being used time to time, and we call these little bundles of code packages.

We start by loading up a package -- tidyverse -- "an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures."

Well, what exactly is a package in the R language? A package is best thought of as a collection of R functions, data, and documentation that explains what each function does. R has over 14,000 packages, each designed to do something very specific such as, for example, make maps, read large data files very speedily, make animated graphics, and so on.

Each week you will see several new packages come into play because we will be relying on each to do something very specific for us. The tidyverse will be a regular feature.

How do we load a package? Well, first it has to be installed, but that has been taken care of for you. Once installed, we load it via the library(packaganame) command, as shown below.

The message you see with a pinkish background results from our library(tidyverse) command, and is just a message indicating what packages were loaded (ggplot2, tibble, tidyr, readr, purr, dplyr, stringr, forcats), and if there is anything else we should be aware of. In this case, there is -- some commands (filter() and lag()) used by the dplyr package conflict with identically named commands in the stats package.

readr is an excellent package designed to read data that might be available in various formats. Since it is loaded as a part of the tidyverse, let us move on to seeing how to read data.

Reading data

One of the golden rules to follow when reading data is to make your life easy by not having to memorize or discover through frantic trial-and-error runs where a particular data file is located. So how do we make our life easy? By placing all data in a data folder.

Data come in several formats but I will walk you through the formats you are most likely to encounter -- MS Excel, CSV, TXT, fixed-width, and then in any one of these commercial software formats: SAS, Stata, or SPSS.

CSV data files

We start by reading a simple comma-separated variable format file and then a tab-delimited variable format file. A CSV file has each column separated by a comma while a tab-delimited file has every column separated by a tab -- , versus \t

Note ...

If no errors are thrown, then the files should have been read into memory and we can check their contents. The names(filename) command will show you the column names, and glimpse(filename) will show you a snippet of the data in the file.

MS Excel files

Microsoft Excel files can be read via the readxl package, and you see two versions below -- one for the older *.xls format and the other for the newer *.xlsx format.

As a rule, I would recommend against storing data in Excel formats since Excel tends to do some funny things. If you must store and share data, try to use the CSV format.

SPSS, Stata, SAS files

Some governmental agencies and other sources tend to use SAS, Stata, or SPSS for storing data and for carrying out various data analyses. This is a legacy issue that is changing but a little too slowly for most of us who do not use these commercial software packages as the mainstays of our data work. But, even if you do not use these packages, you should know how to read in data created in their native formats. As it turns out, SPSS, Stata, SAS files can be read via the haven package, and with relative ease.

Fixed-width files

It is also common to encounter fixed-width files where the raw data are stored without any gaps between successive columns. This is also a legacy format from the early days of computers and punch cards, and one of the most efficient ways of storing large amounts of data. These files come with documentation that will give you the necessary details about where each column starts and ends, etc. See here for some examples of layouts from the Census Bureau.

Notice we need widths = c(4,9,2,4) to indicate how many slots each column takes and then col.names = c("Name", "Month", "Day", "Year") to label the columns since the data file does not have variable names. if you mess up with widths = you end up with garbage because R does not know where any column starts or ends so be careful.

Reading Files from the Web

One of the benefits of software like R is its ability to read data files over the web without requiring you to manually download the file and save a physical copy to be read in. Specifically, it is possible to list the full web-path for a file and read it in. This ability is invaluable when the data tend to be periodically updated by the source (for example, by the Census Bureau, Bureau of Labor, Bureau of Economic Analysis, etc.). Here are a few examples.

If you look at fpe you will notice that there are three columns but the rows each have a unique name. These are row names that are remembered by R but will not show up with a column name.

We can also ask R to both download and unzip a *.zip file from the web. Here I am pulling down a file from a collection maintained and updated by the Centers for Disease Control and Prevention (CDC). The file has more than 4 million rows of data and 14 columns of information.

Now we will see three new commands -- dim(filename), head(filename), and tail(filename).

dim(filename) shows the number of rows and number of columns in the dataset -- i.e., the dimensions of the dataset.

head(filename) show the first 6 rows and all columns while tail(filename) shows the last 6 rows and all columns

Labeling data values

Often the data contain columns that record information with numeric codes that reflect some qualitative attributes. For example, whether an individual is Male or Female may be stored as 0 and 1, respectively. We can append the correct qualitative labels to these columns, and then hen we create graphs or tables, what is being displayed will be readily apparent (i.e., the reader\viewer does not have to work hard to figure out what is a 0 and what is a 1).

We will read in a small dataset that has information on 200 students. The data come from the High School and Beyond study

Column Name Values and Labels\Meanings
female (0/1)
race (1=hispanic 2=asian 3=african-amer 4=white)
ses socioeconomic status (1=low 2=middle 3=high)
schtyp type of school (1=public 2=private)
prog type of program (1=general 2=academic 3=vocational)
read standardized reading score
write standardized writing score
math standardized math score
science standardized science score
socst standardized social studies score

Saving R data files

You can save your data in a format that R will recognize, giving it the RData or rdata extension

Check the data directory to confirm that both files are present

Minimal example of data processing

Working with the hsb2 data: 200 students from the High School and Beyond study. The variables in this file are:

Let us start by reading in these data.

We can easily check the basic descriptive statistics of each column in a data file by running the summary(filename) command. Let us see what results if we do this with the hsb2 data set.

Look at the output! R mistakenly treats several of the qualitative variables as if they are numeric, rendering incorrect information. Let us fix this problem and then we can rerun the summary() command to see if things have improved.

The way we append value labels is by converting a numeric variable\column to what R calls a factor. When we do so, we tell R to assign a particular label (the word or the phrase) to a particular level (the number) as shown below.

It is good habit to create a new variable each time you do some conversion so that the original variable does not get overwritten. If you do not get into the habit and end up overwriting the original variable, you will have to start from scratch if you end up making a mistake.

Now we check our work by re-running the summary(hsb2) command.

Aha! Everything is as it should be.

Before we move on though, let us save our hsb2 data file that has new columns. We will save it in the native R format. The command to save a file is very simple -- save(filename, file = "filepath/filename.RData")

You could also do save(filename, file = "filepath/filename.rdata") if that is what you intuitively prefer.

Loading RData files

When we want to work with RData files, we have to load them with the load("filepath/filename.RData") command. Here I am doing it for the hsb2 data.

Data in packages

Almost all R packages come bundled with data-sets, too many of them to walk you through but

To load data from a package, if you know the data-set's name, it is easy to load it, as shown below.


Exercises for practice

These are some exercises you can use to practice and build your R skills. They are not for grade.

Exercise 01: Reading in some data files

  1. Create a new Notebook by going to File -> New -> Notebook.

  2. When prompted to select a kernel, use the dropdown menu to select the R kernel.

  3. The notebook will be untitled, so go ahead and save it with a name, something like yourlastname_ex01 and you will see yourlastname_ex01.ipynb as the name.

  4. Now read in the Stata data file found here

  5. Create a new cell and run the summary command to check the contents of this data file and look at the first 6 rows of data with the appropriate head command.and look at the first 6 rows of data with the appropriate head command.

  6. In a new cell, read in tthe SPSS file found here

  7. In a new cell, run the summary command and look at the first 6 rows of data with the appropriate head command.

Exercise 02: Reading in local data and labeling some values

  1. Download this dataset, extract the file inside the zip archive and upload it to the data folder.

  2. In a new cell, read in this uploaded data file with the appropriate commands.

  3. The variable gender has the following codes: Zero = unknown; 1 = male; 2 = female. Use this coding scheme to create a new column that shows gender as a factor with these value labels

  4. Check the first 6 rows of the dataset and also run summary to check the new column was created as desired.

  5. In a new cell, write the commands necessary to save each of the three data-sets as separate RData files.

Exercise 03: Welcome to Kaggle & Mass Shootings

Go to this page on Kaggle and read the description of the data-set on mass shootings in the United States that occurred during the 1966-2017 period. once you have read the overview of the data, click the "Data" tab and download the file called Mass Shootings Dataset.csv. Be careful; there are several versions so the one you want is the very last one and not any that have a version number attached, such as "Mass Shootings Dataset Ver 2.csv" for example.

Now read this file, perhaps naming it shootings and run the summary() command on it. Note the number of observations and the number of variables in the data-set.

Make sure you save the file in RData format as well.

Exercise 04: Animal Shelters

Go to this page on Kaggle and download the file called aac_shelter_outcomes.zip, unzip it, and AFTER reading the data overview, read in the file and generate the usual summary and also save it as an RData file.