Introduction to R and RStudio

Workshop Session 01 @ Ohio University

Ani Ruhil

2019-05-19

Introduction

We begin by installing the tools we will need – R and RStudio. Why two tools? Chester Ismay, a prominent R developer, describes it thus:

“R is like the engine of your car, and RStudio is the dashboard of your car.”
— Chester Ismay

If everything installs without a hitch, look for the RStudio shortcut/icon on your machine. This is the shortcut/icon you will click every time you need to use R/RStudio.

Go ahead and launch RStudio. The following screenshot is how RStudio looks on my screen at the time of assembling this material.

You see three window panes. Panes can be detached, which is very helpful when you want another application next to or behind a pane, or when you are using multiple monitors, since you can then execute commands on one monitor and watch the output on another. Each pane serves an important purpose, so let us look at some core functions nested in these panes.

  1. Console = This is where commands are issued to R, either by typing and hitting Enter or by running commands from a script (like your R Markdown file)
  2. Environment = stores and shows you all the objects created
  3. History = shows you a running list of all commands issued to R
  4. Connections = shows you any databases/servers you are connected to and also allows you to initiate a new connection
  5. Files = shows you files and folders in your current working directory, and you can move up/down in the folder hierarchy
  6. Plots = shows you all plots that have been generated
  7. Packages = shows you installed packages
  8. Help = allows you to get the help pages by typing in keywords
  9. Viewer = shows you “live” documents running on the local “server”
  10. Knit = allows you to generate html/pdf/word documents from a script
  11. Insert = allows you to insert a vanilla R code chunk
  12. Run = allows you to run lines/chunks of R code
  13. You also have a spellcheck; use it to catch typos
  14. Other tabs come alive as you do more advanced work, for example, when you are pushing content to online repositories, publishing a document to RPubs, using bookdown to write a “book”, and so on

You can customize the panes via Tools -> Global Options...

Installing packages

What are packages, or libraries as they are also called? Well, once again, Chester Ismay describes them thus:

“The latest version of R is like the latest smartphone you bought, and the libraries/packages are like the apps you installed on your smartphone to enhance its functionality.”
— Chester Ismay

Now we install some packages via Tools -> Install Packages... The initial list of packages to be installed is shown below. Other packages will be installed as needed. Copy-and-paste the following commands into the R Console prompt and hit Enter.

my.pkgs <- c("devtools", "reshape2", "tidyverse", "lubridate", "Hmisc",
             "gapminder", "leaflet", "DT", "data.table", "htmltools",
             "scales", "ggridges", "here", "knitr", "kableExtra", "haven",
             "readxl", "ggthemes", "janitor")

install.packages(my.pkgs)
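
If you come back to this list later, you may not want to reinstall everything. Here is a minimal sketch, assuming you only want the packages that are not yet on your machine. Note that find_missing() is a hypothetical helper written for this illustration, not a function that ships with R, and the package list is shortened.

```r
# Hypothetical helper: return the packages in `wanted` that are not installed
find_missing <- function(wanted) {
  installed <- rownames(installed.packages())
  setdiff(wanted, installed)
}

my.pkgs <- c("devtools", "reshape2", "tidyverse")  # shortened list for illustration
to.install <- find_missing(my.pkgs)
to.install

# Install only what is missing, and only when working interactively
# (the interactive() guard is an assumption made so scripts do not install silently)
if (interactive() && length(to.install) > 0) {
  install.packages(to.install)
}
```

If to.install comes back empty, everything in the list is already installed and nothing happens.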

It is a good idea to update packages regularly. Every now and then an update might break something, but the developer usually fixes it sooner rather than later. You can update packages via Tools -> Check for Package Updates...

Creating RProjects

  1. Create a folder called ouir
  2. Inside the ouir folder create a subfolder called data. The folder structure will now be as shown below
ouir/
    ├── session01.Rmd
    ├── session02.Rmd
    ├── session03.Rmd
    └── data/
        ├── some data file
        └── another data file
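
The two folders can also be created from the R Console instead of your file manager. This is only a sketch: it builds the folders under a temporary directory (my assumption, so the example is safe to run anywhere), whereas you would normally choose a permanent location.

```r
# Assumption: build the structure under a temporary directory for safety;
# in practice you would use a location of your own choosing
base_dir <- file.path(tempdir(), "ouir")

# recursive = TRUE creates ouir/ and ouir/data/ in one call
dir.create(file.path(base_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Confirm that both folders now exist
dir.exists(base_dir)
dir.exists(file.path(base_dir, "data"))

# List what is inside the ouir folder
list.files(base_dir)
```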

All data you download or create go into the data folder. All R code files reside in the ouir folder. Open the Rmd file I sent you: ouir-day01.Rmd and save it in the ouir folder. Save the data I sent you in the data folder.

  3. Now create a project via File -> New Project and choose Existing Directory. Browse to the ouir folder and click Create Project. RStudio will restart and, when it does, you will be in the project folder and will see a file called ouir.Rproj.

RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents. There are several options that can be set on a per-project basis to customize the behavior of RStudio. You can edit these options using the Project Options command on the Project menu. From now on, when you start an RStudio session and want to work on the materials developed for this workshop, do so by clicking the ouir.Rproj file or icon. Trust me, this makes working with R/RStudio a lot easier even for advanced R users.

Of course, if you are going to work on something else, for work perhaps, you should create a folder, name it, create the data sub-folder, and then create a project, much as we did here, and use that *.Rproj file. I create projects for everything of any consequence.

Let us shut down RStudio now. As you do this, if you are asked whether you want to save the workspace, etc., always say “No”; otherwise you will end up with a very cluttered Environment and your machine will slow down as well.

RMarkdown Files

  1. Go to File -> New File -> R Markdown... and enter My First Rmd File as the title and your name as the author.
  2. Click OK
  3. Now go to File -> Save As... and save it as testing_rmd in the ouir folder, then click the Knit to HTML button

You may see a message that says some packages need to be installed/updated. Allow these to be installed/updated.

If all goes well, and the document knits, you should see an html file that has some code, a plot and other results. As the document knits, watch for error messages and copy these verbatim since we can hunt for solutions if we know the error message word for word.

If you need to create PDF documents, then you will need a working LaTeX setup on your machine. There are other ways to set up a LaTeX system but the easiest might be to run the following code:

install.packages('tinytex')
tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex() 

Now restart RStudio and this time try to knit to PDF, and then shut down RStudio once again.

Specific R Markdown code block commands

Golden Rule: Give every code chunk a unique name, which can be an alphanumeric string with no whitespace. If you forget, use the namer package to assign names to every code chunk that lacks one. This can be done via

library(namer)
name_chunks("myfilename.Rmd")

You will see the code chunks have several options that could be invoked. Here are some of the more common ones we will use.

Other options can be found in the cheatsheet available here.
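
As one illustration of how chunk options work, they can be set once for the whole document with knitr::opts_chunk$set(), typically in a setup chunk at the top of the Rmd file. The particular options and values below are my own selection of widely used knitr options, not necessarily the ones used in this workshop.

```r
# Assumption: the knitr package is installed (it is in the package list above)
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = TRUE,       # show the R code in the knitted document
    eval = TRUE,       # actually run the code
    warning = FALSE,   # hide warnings
    message = FALSE,   # hide messages (e.g., package start-up notes)
    fig.width = 6,     # figure width in inches
    fig.height = 4     # figure height in inches
  )
}
```

Any single chunk can override these defaults in its own header, for example by adding echo = FALSE after the chunk name.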

Working with data

Data will generally mirror one of the following types: integer, numeric/double, character, logical, date, or factor.

library(tibble)     # provides tibble()
library(lubridate)  # provides ymd() for parsing dates
# tibble() replaces data_frame(), which is deprecated
tibble(
    variable1 = c(1L, 2L, 3L, 4L),                        # integer
    variable2 = c(2.1, 3.4, 5.6, 7.8),                    # numeric/double
    variable4 = c("Low", "Medium", "High", "Missing"),    # character
    variable5 = c(TRUE, FALSE, FALSE, TRUE),              # logical
    variable6 = ymd(c("2017-05-23", "1776/07/04",         # date
                 "1983-05/31", "1908/04-01")),
    variable7 = as.factor(c("Male", "Female", "Trans", "Trans"))  # factor
  )
## # A tibble: 4 x 6
##   variable1 variable2 variable4 variable5 variable6  variable7
##       <int>     <dbl> <chr>     <lgl>     <date>     <fct>    
## 1         1       2.1 Low       TRUE      2017-05-23 Male     
## 2         2       3.4 Medium    FALSE     1776-07-04 Female   
## 3         3       5.6 High      FALSE     1983-05-31 Trans    
## 4         4       7.8 Missing   TRUE      1908-04-01 Trans

Check out the lubridate package if you need to work with dates and time intervals. A date variable has a very specific meaning for R; the data point must reflect a year, a month, and a day before it is deemed a valid date format.
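
To see how R itself labels each type, and why an incomplete date is rejected, here is a small base R sketch. All object names and values are made up for illustration, and it uses as.Date() rather than lubridate so that it runs without any extra packages.

```r
# One small object of each type discussed above
an_integer  <- 3L
a_double    <- 2.1
a_character <- "Low"
a_logical   <- TRUE
a_date      <- as.Date("2017-05-23")   # year-month-day is parsed by default
a_factor    <- factor(c("Male", "Female", "Male"))

class(an_integer)   # "integer"
class(a_double)     # "numeric"
class(a_character)  # "character"
class(a_logical)    # "logical"
class(a_date)       # "Date"
class(a_factor)     # "factor"

# Other layouts need an explicit format string
another_date <- as.Date("07/04/1776", format = "%m/%d/%Y")
format(another_date, "%Y-%m-%d")   # "1776-07-04"

# A year and a month alone are not a complete date, so the result is NA
incomplete <- as.Date("2017-05", format = "%Y-%m")
is.na(incomplete)
```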

Reading data

Make sure you have the following data sets in the data folder. If you don’t, then the commands that follow will not work. We start by reading a simple comma-separated values (CSV) file and then a tab-delimited file.

library(here) # loaded once per session 
df.csv <- read.csv("data/ImportDataCSV.csv", sep=",", header=TRUE) # note sep = ","
df.tab <- read.csv("data/ImportDataTAB.txt", sep="\t", header=TRUE) # note sep = "\t"

If the files were read then the Environment should show objects called df.csv and df.tab. If you don’t see these then run through the following checklist:
  - Are the csv/txt files in your data folder?
  - Is the folder correctly named (no blank spaces before or after, all lowercase, etc.)?
  - Is the data folder inside the ouir folder?
  - Are you in the ouir.Rproj project?
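
Most of this checklist can be run from the Console. The sketch below is one way to do it; check_data_file() is a hypothetical helper written for this illustration, and the file path is the one used in the commands above.

```r
# Hypothetical helper: report where R is looking and whether a file is there
check_data_file <- function(path) {
  cat("Working directory:", getwd(), "\n")
  cat("Looking for:      ", path, "\n")
  found <- file.exists(path)
  cat("File found:       ", found, "\n")
  invisible(found)
}

# Run this from inside the ouir project
check_data_file("data/ImportDataCSV.csv")

# List everything in the data folder (empty result if the folder is missing)
list.files("data")
```

If the working directory printed is not your ouir folder, you are most likely not inside the project.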

Excel files can be read via the readxl package.

library(readxl)
df.xls <- read_excel("data/ImportDataXLS.xls")
df.xlsx <- read_excel("data/ImportDataXLSX.xlsx")

SAS, SPSS, Stata files can be read with the haven package.

library(haven)
df.stata <- read_stata("data/ImportDataStata.dta")
df.sas <- read_sas("data/ImportDataSAS.sas7bdat")
df.spss <- read_sav("data/ImportDataSPSS.sav")

Fixed-width files: It is also common to encounter fixed-width files where the raw data are stored without any gaps between successive variables. However, these files will come with documentation that will tell you where each variable starts and ends, along with other details about each variable.

df.fw <- read.fwf("data/fwfdata.txt", widths = c(4, 9, 2, 4), header = FALSE, 
                 col.names = c("Name", "Month", "Day", "Year"))

Notice we need widths = c() and col.names = c()
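
If you do not have fwfdata.txt handy, the same idea can be tried on a tiny file you create yourself. The two records below are made up for illustration; they follow the same layout of 4, 9, 2, and 4 characters.

```r
# Two made-up records laid out as Name (4), Month (9), Day (2), Year (4)
fw_lines <- c("JohnJanuary  151990",
              "MarySeptember032001")

# Write them to a temporary file so the example is self-contained
fw_file <- tempfile(fileext = ".txt")
writeLines(fw_lines, fw_file)

# strip.white = TRUE (passed on to read.table) trims the padding blanks
df.demo <- read.fwf(fw_file, widths = c(4, 9, 2, 4), header = FALSE,
                    col.names = c("Name", "Month", "Day", "Year"),
                    strip.white = TRUE)
df.demo

unlink(fw_file)  # remove the temporary file
```

The widths must add up to the length of each record, here 19 characters.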

Optional Exercise: Now an example of an even larger fixed-width file
  1. Download the BRFSS data from here
  2. Extract the ascii data file and place it in your data directory
  3. Now copy-and-paste the code I sent you via Slack into a standalone R code-chunk
  4. Run the code chunk

Reading Files from the Web

It is possible to specify the full web-path for a file and read it in, rather than storing a local copy. This is often useful when the data are updated regularly by the source (Census Bureau, Bureau of Labor Statistics, Bureau of Economic Analysis, etc.).

fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
test <- read.table("https://stats.idre.ucla.edu/stat/data/test.txt", 
                  header = TRUE)
test.csv <- read.csv("https://stats.idre.ucla.edu/stat/data/test.csv", 
                    header = TRUE)

library(foreign)
hsb2.spss <- read.spss("https://stats.idre.ucla.edu/stat/data/hsb2.sav")
df.hsb2.spss <- as.data.frame(hsb2.spss)

The foreign package will also read Stata and other formats. I end up defaulting to haven now. There are other packages for reading SPSS, SAS, etc. files … sas7bdat, rio, data.table, xlsx, XLConnect, gdata and others.

Reading compressed files

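temp <- tempfile()
# Build the URL in pieces so no line break or blank space ends up inside it
zip_url <- paste0("ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/",
                  "Datasets/NVSS/bridgepop/2016/pcen_v2016_y1016.sas7bdat.zip")
download.file(zip_url, temp, mode = "wb") # "wb" keeps the zip intact on Windows
oursasdata <- haven::read_sas(unz(temp, "pcen_v2016_y1016.sas7bdat"))
unlink(temp) # delete the temporary download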
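
The same pattern of reading through a connection can be tried without downloading anything. This sketch swaps in a gzip-compressed CSV (not a zip archive) and a made-up data frame, simply because base R can write and read that format on its own.

```r
# A small made-up data frame to stand in for a real download
demo <- data.frame(id = 1:3, value = c(10.5, 20.1, 30.7))

# Write it as a gzip-compressed CSV in a temporary location
gz_path <- tempfile(fileext = ".csv.gz")
gz_out <- gzfile(gz_path, open = "w")
write.csv(demo, gz_out, row.names = FALSE)
close(gz_out)

# Read it back through a gzfile() connection; R decompresses on the fly
demo_back <- read.csv(gzfile(gz_path))
demo_back

unlink(gz_path)  # remove the temporary file
```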

You can save your data in a format that R will recognize, giving it the RData or rdata extension:

save(oursasdata, file = "data/oursasdata.RData")
save(oursasdata, file = "data/oursasdata.rdata")

Check your data directory to confirm both files are present.

Minimal example of data processing

Working with the hsb2 data: 200 students from the High school and Beyond study

hsb2 <- read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv',
                  header = TRUE, sep = ",")

There are no value labels for the qualitative/categorical variables (female, race, ses, schtyp, and prog) so we create these (with base R commands).

hsb2$female.f <- factor(hsb2$female,
                      levels = c(0, 1),
                      labels = c("Male", "Female"))

hsb2$race.f <- factor(hsb2$race,
                    levels = c(1:4),
                    labels = c("Hispanic", "Asian", "African American", "White"))

hsb2$ses.f <- factor(hsb2$ses,
                   levels = c(1:3),
                   labels = c("Low", "Middle", "High"))

hsb2$schtyp.f <- factor(hsb2$schtyp,
                      levels = c(1:2),
                      labels = c("Public", "Private"))

hsb2$prog.f <- factor(hsb2$prog,
                   levels = c(1:3),
                   labels = c("General", "Academic", "Vocational"))

Having added labels to the factors in hsb2 we can now save the data for later use.

save(hsb2, file = "data/hsb2.RData") 
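
To see what factor() does with levels and labels without downloading hsb2, here is a tiny sketch on made-up codes. The 9 is included on purpose to show what happens to a code that has no matching level.

```r
# Made-up numeric codes, including one (9) that has no label
codes <- c(0, 1, 1, 0, 9)

sex.f <- factor(codes,
                levels = c(0, 1),
                labels = c("Male", "Female"))

sex.f                          # codes become labels; the unmatched 9 becomes NA
levels(sex.f)                  # "Male" "Female"
table(sex.f, useNA = "ifany")  # counts per label, plus the NA
```

This is why it pays to run table() on a new factor: unexpected NA counts usually mean a code was left out of levels.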

Saving data, objects, and workspaces

You save your data via

save(dataname, file = "filepath/filename.RData") or save(dataname, file = "filepath/filename.rdata")

data(mtcars)
save(mtcars, file = "data/mtcars.RData")
rm(list = ls()) # To clear the Environment
load("data/mtcars.RData")

You can also save multiple data files as follows:

data(mtcars)
library(ggplot2)
data(diamonds)
save(mtcars, diamonds, file = "data/mydata.RData")
rm(list = ls()) # To clear the Environment
load("data/mydata.RData")

If you want to save just a single object from the environment and then load it in a later session, maybe with a different name, then you should use saveRDS() and readRDS():

data(mtcars)
saveRDS(mtcars, file = "data/mydata.RDS")
rm(list = ls()) # To clear the Environment
ourdata <- readRDS("data/mydata.RDS")

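If instead you did the following, the data would be restored under its original name, mtcars, even though you assigned the result to ourdata. This is because load() returns only the names of the objects it restored, so ourdata ends up holding the character string "mtcars" rather than the data.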

data(mtcars)
save(mtcars, file = "data/mtcars.RData")
rm(list = ls())  # To clear the Environment
ourdata <- load("data/mtcars.RData") # Note ourdata holds the string "mtcars", not the data
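
The difference between the two approaches is easier to see side by side. This sketch uses a made-up data frame and temporary files, and it loads into a separate environment (restore_env, a name chosen for this illustration) so that nothing in your own Environment is cleared or overwritten.

```r
# A small made-up data frame
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

rdata_path <- file.path(tempdir(), "scores.RData")
rds_path   <- file.path(tempdir(), "scores.RDS")

save(scores, file = rdata_path)     # stores the object together with its name
saveRDS(scores, file = rds_path)    # stores the object only

# load() restores the object under its original name and returns that name
restore_env <- new.env()
returned <- load(rdata_path, envir = restore_env)
returned                            # "scores"
ls(restore_env)                     # "scores"

# readRDS() returns the object itself, so you choose the name
renamed <- readRDS(rds_path)
renamed
```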

If you want to save everything you have done in the work session you can do so via save.image()

save.image(file = "mywork_jan182018.RData")

Some useful housekeeping commands

summary(dataname) will give you a snapshot of your data

glimpse(dataname) will give you a compact listing of every column, its type, and its first few values if you are using the tidyverse library

dim(dataname) will give you the dimensions of the data frame

str(dataname) will give you the structure of the data frame … each variable’s type and other details

names(dataname) will give you the names of all columns as well as the column position (i.e., number)

head(dataname, x) will give you the first \(x\) rows of the data frame

tail(dataname, x) will give you the last \(x\) rows of the data frame

clean_names(dataname) from the janitor package will clean up messy column names (i.e., ensuring that all column names are lowercase and have no blank spaces, etc)
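
Here is how several of these commands look when run on mtcars, a data set that ships with R. Only base R functions are used, so glimpse() and clean_names() are left out of this sketch.

```r
data(mtcars)         # built-in data set that ships with R

dim(mtcars)          # 32 rows and 11 columns
names(mtcars)        # column names, in column order
str(mtcars)          # type and first few values of every column
head(mtcars, 3)      # first 3 rows
tail(mtcars, 3)      # last 3 rows
summary(mtcars$mpg)  # minimum, quartiles, mean, and maximum of one column
```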

Calculating some basic statistics

mean(varname, na.rm = TRUE) will give you the mean of a variable

median(varname, na.rm = TRUE) will give you the median of a variable

sd(varname, na.rm = TRUE) will give you the standard deviation of a variable

var(varname, na.rm = TRUE) will give you the variance of a variable

min(varname, na.rm = TRUE) will give you the minimum of a variable

max(varname, na.rm = TRUE) will give you the maximum of a variable

quantile(varname, probs = c(0.25, 0.75), na.rm = TRUE) will give you the first and third quartiles of a variable

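scale(varname) will give you the z-scores of a variable; scale() has no na.rm argument and ignores missing values on its own when computing the mean and standard deviation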

Note that na.rm = TRUE drops all missing values before calculating the quantity of interest. If you forget this switch and the variable has missing values, you will get NA or, worse, an error message.
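
A short sketch of what happens with and without the switch, using a made-up vector with one missing value. The z-score at the end is computed by hand from the mean and standard deviation.

```r
x <- c(2, 4, NA, 8)    # made-up values with one missing

mean(x)                # NA, because of the missing value
mean(x, na.rm = TRUE)  # mean of 2, 4, and 8
sd(x, na.rm = TRUE)

# quantile() stops with an error rather than returning NA
result <- tryCatch(quantile(x, probs = c(0.25, 0.75)),
                   error = function(e) "error")
result

quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)

# A z-score computed by hand from the pieces above
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
z                      # the missing value stays NA
```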