Introduction to R & RStudio

Agenda

Installing R and RStudio

First install the latest version of R from here

Then install the latest version of RStudio from here

Launch RStudio and check that it shows

R version 3.5.2 (2018-12-20) – “Eggshell Igloo”
Copyright (C) 2018 The R Foundation for Statistical Computing

Understand your RStudio Environment

  1. Console = where commands are issued to R, either by typing them and hitting Enter or by running them from a script (like your R Markdown file)
  2. Environment = stores and shows you all the objects created
  3. History = shows you a running list of all commands issued to R
  4. Connections = shows you any databases/servers you are connected to and also allows you to initiate a new connection
  5. Files = shows you files and folders in your current working directory, and you can move up/down in the folder hierarchy
  6. Plots = shows you all plots that have been generated
  7. Packages = shows you installed packages
  8. Help = allows you to look up help pages by typing in keywords
  9. Viewer = shows you any “live” documents running on the server
  10. Knit = allows you to generate html/pdf/word documents from a script
  11. Insert = allows you to insert a vanilla R chunk. You can (and should) give a unique name to each code chunk so that you can easily diagnose which chunk is not working
  12. Run = allows you to run lines/chunks

You can customize the panes via Tools -> Global Options...

Panes can be detached. This is very helpful when you want another application next to the pane or behind it, or when you are using multiple monitors, since you can then execute commands on one monitor and watch the output on another.

You also have a spellcheck; use it to catch typos.

Installing packages

Now we install some packages via Tools -> Install Packages... The initial list of packages to be installed is shown below. Other packages will be installed as needed.

devtools, reshape2, lubridate, car, Hmisc, gapminder, leaflet, DT, data.table, htmltools, scales, ggridges, here, knitr, kableExtra, haven, readr, readxl, ggplot2, vembedr

If we need to, we can update packages via Tools -> Check for Package Updates... It is a good idea to update packages regularly. Every now and then an update might break something, but the developer usually fixes it sooner rather than later.
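The same installation can also be scripted. The sketch below is one possible approach, not part of the original handout: it compares the package list above against what is already installed and installs only what is missing. The `interactive()` guard is an assumption added here so that nothing gets installed while a document is knitting.

```r
# Packages listed above
pkgs <- c("devtools", "reshape2", "lubridate", "car", "Hmisc", "gapminder",
          "leaflet", "DT", "data.table", "htmltools", "scales", "ggridges",
          "here", "knitr", "kableExtra", "haven", "readr", "readxl",
          "ggplot2", "vembedr")

# Keep only the ones that are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install from CRAN only in an interactive session and only if needed
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Running `update.packages(ask = FALSE)` in the Console is the scripted counterpart of the update menu item.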

Rprojects

  1. Create a folder called ohio2019
  2. Inside the ohio2019 folder create a subfolder called data. The folder structure will now be as shown below
ohio2019/
    ├── my-rmarkdown-file-01.Rmd
    ├── my-rmarkdown-file-02.Rmd
    └── data/
        ├── some data file
        └── another data file

All data you download or create go into the data folder. All R code files reside in the ohio2019 folder. Open the Rmd file I sent you: Module01_forClass.Rmd and save it in the ohio2019 folder. Save the data I sent you to the data folder.

  3. Now create a project via File -> New Project and choose Existing Directory. Browse to the ohio2019 folder and click Create Project. RStudio will restart and, when it does, you will be in the project folder and will see a file called ohio2019.Rproj

From now until you leave, whenever you start an RStudio session, do so by clicking the ohio2019.Rproj file. If you do this, life will be fairly easy when working with the files I send you.
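The folder layout can also be created from R rather than by hand. This is only a sketch: `project_dir` is a hypothetical name, and a temporary directory stands in for wherever you actually keep the ohio2019 folder.

```r
# Hypothetical location: replace tempdir() with the folder that should
# contain ohio2019 on your machine
project_dir <- file.path(tempdir(), "ohio2019")

# Create ohio2019/ and its data/ subfolder in one call
dir.create(file.path(project_dir, "data"),
           recursive = TRUE, showWarnings = FALSE)

# Confirm that the data subfolder now exists
dir.exists(file.path(project_dir, "data"))
```

Opening ohio2019.Rproj sets the working directory to the project folder, which is why a relative path such as "data/ImportDataCSV.csv" works without spelling out the full location.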

R Markdown files

  1. Go to New File -> R Markdown ... and enter My First Rmd File in title and your name.
  2. Click OK
  3. Now File -> Save As... and save it as testing_rmd in the ohio2019 folder. Then click the Knit button

You may see a message that says some packages need to be installed/updated. Allow these to be installed/updated.

If all goes well and the document knits, you should see an html file that has some code, a plot, and other results. As the document knits, watch for error messages and copy these verbatim, since we can hunt for solutions if we know the error message word for word.

Specific R Markdown code block commands

Golden Rule: Give every code chunk a unique name, which can be an alphanumeric string with no whitespace. If you forget, use the namer package to assign names to every unnamed code chunk. This can be done via

library(namer)
name_chunks("myfilename.Rmd")

You will see the code chunks have several options that could be invoked. Here are some of the more common ones we will use.

eval = If FALSE, knitr will not run the code in the code chunk.
include = If FALSE, knitr will run the chunk but not include the chunk in the final document.
echo = If FALSE, knitr will not display the code in the code chunk above its results in the final document.
error = If FALSE, knitting stops at the first error in the chunk; if TRUE, the error message is shown in the document and knitting continues.
message = If FALSE, knitr will not display any messages generated by the code.
warning = If FALSE, knitr will not display any warning messages generated by the code.
cache = If TRUE, knitr will cache the results and reuse them in future knits until the code chunk is altered.
dev = The name of the R function that will be used as the graphical device to record plots, e.g. dev = 'CairoPDF'.
dpi = A number for knitr to use as the dots per inch (dpi) in graphics (when applicable).
fig.align = Alignment of the figure in the knit document: 'center', 'left', or 'right'.
fig.height = Height of the figure, in inches.
fig.width = Width of the figure, in inches.
out.height and out.width = The height and width to scale plots to in the final output.
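Options that should apply to every chunk do not have to be repeated in each chunk header. A common pattern, sketched here with illustrative values that are assumptions rather than the workshop's settings, is to set defaults once with `knitr::opts_chunk$set()` in the first chunk of the document.

```r
library(knitr)

# Defaults for every chunk that follows; any single chunk can still
# override them in its own header
opts_chunk$set(
  echo = TRUE,          # show the code
  message = FALSE,      # hide messages
  warning = FALSE,      # hide warnings
  fig.align = "center",
  fig.width = 6,        # inches
  fig.height = 4,       # inches
  dpi = 300
)

# Inspect the value currently in force for one option
opts_chunk$get("fig.align")
```

Setting an option in an individual chunk header takes precedence over these defaults for that chunk only.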

Other options can be found in the cheatsheet available here. There is an excellent R Markdown in RStudio tutorial on Vimeo. If the video does not show up below (because of privacy restrictions), click through to view it on Vimeo. You may need to sign up (for free) with an email address.

Reading data

Make sure the data-sets sent to you via Slack are in your data folder. If they are not, the commands that follow will not work. We start by reading a comma-separated values (CSV) file and then a tab-delimited file.

library(readr)
df.csv <- read_csv("data/ImportDataCSV.csv") 

df.tab <- read_tsv("data/ImportDataTAB.txt")

If both files were read then Environment should show objects called df.csv and df.tab. If you don’t see these, check the following:
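One quick check, offered as a sketch rather than a complete list, is whether R is looking in the right folder and whether both files are really there under the names used above. The object names `needed` and `found` are made up for this illustration.

```r
# Where is R looking? Inside the project this should end in "ohio2019"
getwd()

# File names taken from the read commands above
needed <- file.path("data", c("ImportDataCSV.csv", "ImportDataTAB.txt"))

# TRUE/FALSE for each file; FALSE means it is missing or misnamed
found <- file.exists(needed)
names(found) <- needed
found
```

A FALSE usually points to a file saved outside the data folder, a misspelled file name, or a session that was not started from the .Rproj file.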

Excel files can be read via the readxl package

library(readxl)
df.xls <- read_excel("data/ImportDataXLS.xls")
df.xlsx <- read_excel("data/ImportDataXLSX.xlsx")

SPSS, Stata, SAS files can be read via the haven package

library(haven)
df.stata <- read_stata("data/ImportDataStata.dta")
df.sas <- read_sas("data/ImportDataSAS.sas7bdat")
df.spss <- read_sav("data/ImportDataSPSS.sav")

It is also common to encounter fixed-width files where the raw data are stored without any gaps between successive variables. However, these files will come with documentation that will tell you where each variable starts and ends, along with other details about each variable.

df.fw <- read.fwf(
  here::here("workshops/fulbright/handouts/data/", "fwfdata.txt"),
  widths = c(4, 9, 2, 4),
  header = FALSE,
  col.names = c("Name", "Month", "Day", "Year")
)

Notice we need widths = c() to indicate how many slots each variable takes and then col.names = c() to label the columns since the data file does not have variable names.
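To see how the widths carve up each line, here is a self-contained sketch that writes two made-up records to a temporary file and reads them back. The records and the object names `fwf_file` and `toy_fw` are invented for illustration, and `strip.white = TRUE` is added so that padding spaces are dropped.

```r
# Two made-up records: Name (4 chars), Month (9), Day (2), Year (4)
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("JohnJanuary  012001",
             "MarySeptember151999"),
           fwf_file)

# Same widths and column names as in the call above
toy_fw <- read.fwf(fwf_file,
                   widths = c(4, 9, 2, 4),
                   header = FALSE,
                   col.names = c("Name", "Month", "Day", "Year"),
                   strip.white = TRUE)
toy_fw

# Remove the temporary file
unlink(fwf_file)
```

Characters 1 to 4 of each line become Name, the next 9 become Month, the next 2 become Day, and the last 4 become Year.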

Reading Files from the Web

It is possible to specify the full web path for a file and read it in directly, rather than storing a local copy. This is often useful when the file is updated periodically by the source (Census Bureau, Bureau of Labor Statistics, Bureau of Economic Analysis, etc.).

fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")

test <- read.table("https://stats.idre.ucla.edu/stat/data/test.txt",
                   header = TRUE)

test.csv <- read_csv("https://stats.idre.ucla.edu/stat/data/test.csv")

hsb2.spss <- read_spss("https://stats.idre.ucla.edu/stat/data/hsb2.sav")

There are other packages as well – for example, the foreign package will also read Stata, SAS, SPSS, and other file formats. In addition, there are some specialist packages for reading SAS, SPSS, etc. data files – sas7bdat, rio, data.table, xlsx, XLConnect, gdata, etc.

Reading compressed files

Large files may sit in compressed archives on the web, and R has a neat way of allowing you to download the file, unzip it, and read it. Why is this useful? Because if these files are updated periodically, the same piece of R code will download, unzip, and read the updated file. The tedious alternative would be to manually download the archive, unzip it, place the file in the appropriate data folder, and then read it.

temp <- tempfile()
download.file("ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NVSS/bridgepop/2016/pcen_v2016_y1016.sas7bdat.zip", temp)
oursasdata <- read_sas(
  unz(temp,
      "pcen_v2016_y1016.sas7bdat"
      )
  )
unlink(temp)
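The idea of reading through a decompressing connection can be tried without any download. The sketch below is an illustration under different assumptions than the code above: it uses a gzip file created locally instead of a zip archive from the web, with `gzfile()` playing the role that `unz()` plays for zip archives. The names `gz_path` and `toy_gz` are made up.

```r
# Write a small data frame to a gzip-compressed CSV in a temporary file
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "wt")
write.csv(head(mtcars, 3), con)
close(con)

# Read it back through a decompressing connection, no manual unzip step
toy_gz <- read.csv(gzfile(gz_path), row.names = 1)
toy_gz

# Remove the temporary file
unlink(gz_path)
```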

You can save your data in R's own format with save(), giving the file the RData or rdata extension.

save(oursasdata,
     file = here::here("workshops/fulbright/handouts/data/", "oursasdata.RData")
     )
save(oursasdata,
     file = here::here("workshops/fulbright/handouts/data/", "oursasdata.rdata")
     )

Check your data directory to confirm that both files are present

Minimal example of data processing

Working with the hsb2 data: 200 students from the High School and Beyond study

hsb2 <- read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv',
                   header = TRUE, 
                   sep = ","
                   )

There are no value labels for the various qualitative/categorical variables (female, race, ses, schtyp, and prog) so we next create these.

hsb2$female <- factor(hsb2$female, levels = c(0, 1),
                      labels=c("Male", "Female")
                      )

hsb2$race <- factor(hsb2$race, levels = c(1:4),
                    labels=c("Hispanic", "Asian", "African American", "White")
                    )

hsb2$ses <- factor(hsb2$ses, levels = c(1:3),
                   labels=c("Low", "Middle", "High")
                   )

hsb2$schtyp <- factor(hsb2$schtyp, levels = c(1:2),
                      labels=c("Public", "Private")
                      )

hsb2$prog <- factor(hsb2$prog, levels = c(1:3), 
                    labels=c("General", "Academic", "Vocational")
                    )

I am overwriting each variable, indicating to R that variable x will show up as numeric with values 0 and 1, and that a 0 should be treated as male and a 1 as female, and so on. There are four values for race, 3 for ses, 2 for schtyp, and 3 for prog, so the mapping has to reflect this. Note that this is just a quick run-through of creating value labels; we will cover this in greater detail in a later module.
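The mapping from codes to labels is easiest to see on a toy vector. This is a small sketch with made-up values (the names `codes`, `sex`, and `with_bad` are not from the hsb2 data).

```r
# Made-up numeric codes, as they might appear in a raw data file
codes <- c(0, 1, 1, 0, 1)

# levels lists the codes; labels gives the text for each code, in order
sex <- factor(codes, levels = c(0, 1), labels = c("Male", "Female"))
sex
table(sex)

# A code that is not listed in levels becomes NA
with_bad <- factor(c(0, 1, 9), levels = c(0, 1), labels = c("Male", "Female"))
with_bad
```

The last call shows why the number of levels and labels has to match the codes actually present: any unlisted code is silently turned into a missing value.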

Save your work!

Having added labels to the factors in hsb2 we can now save the data for later use.

save(hsb2,
     file = here::here("workshops/fulbright/handouts/data/", "hsb2.RData")
     )

Let us test if this R Markdown file will knit to html. If all is good then we can Close Project, and when we do so, RStudio will close the project and reopen in a vanilla session.

Data in packages

Almost all R packages come bundled with data-sets, too many of them to walk you through here.

To load data from a package, if you know the data-set’s name, run
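The original command is not shown at this point in the handout, so what follows is a sketch of standard usage rather than the author's exact code. It relies on base R's `data()` function and on mtcars, which ships with the built-in datasets package; `available` is a name made up for this example.

```r
# List the data-sets shipped with one package (here the built-in datasets package)
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load a data-set by name; it then appears in the Environment
data(mtcars)
str(mtcars)

# For a package that is installed but not attached, name it explicitly:
# data(diamonds, package = "ggplot2")
```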

Saving data and workspaces

You can save your data via

data(mtcars)
save(mtcars, file = "workshops/fulbright/handouts/data/mtcars.RData")
rm(list = ls()) # To clear the Environment
load("workshops/fulbright/handouts/data/mtcars.RData")

You can also save multiple data files as follows:

data(mtcars)
library(ggplot2)
data(diamonds)
save(mtcars, diamonds, file = "workshops/fulbright/handouts/data/mydata.RData") 
rm(list = ls()) # To clear the Environment
load("workshops/fulbright/handouts/data/mydata.RData")

If you want to save just a single object from the environment and then load it in a later session, maybe with a different name, then you should use saveRDS() and readRDS()

data(mtcars)
saveRDS(mtcars,
        file = here::here("workshops/fulbright/handouts/data/", "mydata.RDS"))

rm(list = ls()) # To clear the Environment

ourdata = readRDS(
  here::here("workshops/fulbright/handouts/data/", "mydata.RDS"))

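If instead you did the following, the object would be restored under the name it had when it was saved, not under a name of your choosing, because load() returns only the names of the objects it restores: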

data(mtcars)
save(mtcars,
     file = here::here("workshops/fulbright/handouts/data/", "mtcars.RData"))

rm(list = ls())  # To clear the Environment
ourdata <- load(
  here::here("workshops/fulbright/handouts/data/", "mtcars.RData")
  ) # mtcars is restored under its own name; ourdata only holds the string "mtcars"
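The difference between the two approaches can be checked side by side with a small object and temporary files. This is a sketch with made-up names (`x`, `y`, `restored_names`, and the two paths), not part of the workshop data.

```r
x <- 1:3
rdata_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".RDS")

save(x, file = rdata_path)     # stores the object together with its name
saveRDS(x, file = rds_path)    # stores the object only
rm(x)

restored_names <- load(rdata_path)  # recreates x and returns its name
y <- readRDS(rds_path)              # returns the object, so any name works

restored_names
identical(x, y)

# Remove the temporary files
unlink(c(rdata_path, rds_path))
```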

If you want to save everything you have done in the work session, you can do so via save.image().

save.image(file = "mywork_aug142019.RData")

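A workspace image saved under a custom name like this is not loaded automatically; bring it back in a later session with load("mywork_aug142019.RData"). Only an image saved under the default name .RData in the startup directory is restored automatically when R starts. Saving the image is useful if you have written a lot of R code and generated various objects and do not want to start from scratch the next time around.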

If you are not in a project and you try to close RStudio after some code has been run, you will be prompted to save (or not) the workspace. Say “no” by default unless you really want to save the workspace.

A Small Map with Leaflet

There are several packages that allow us to build maps in R, from simple to complicated. Of late I have been really fascinated by leaflet, an easy-to-learn JavaScript library that generates interactive maps, so let us see that package in action. Later on, when we move to more advanced visualizations, we will look at a variety of mapping options. For the moment we keep it simple and fun.

library(leaflet)
library(leaflet.extras)
library(widgetframe)

m1 <- leaflet() %>% setView(lat = 39.322577,
                            lng = -82.106336,
                            zoom = 14) %>% 
  addTiles() %>%
  setMapWidgetStyle() %>%
  frameWidget(height = '275')
saveWidget(m1, 'leaflet1.html')
m1

Notice how this was built:

Now, since I ended up picking the general area around Richland Avenue, I could drop a marker on Building 21 on The Ridges. This is done with addMarkers(), and popup specifies what should be displayed when someone clicks on the marker.

m2 <- leaflet() %>% setView(lat = 39.322577,
                            lng = -82.106336,
                            zoom = 15) %>% 
  addMarkers(lat = 39.319984, lng = -82.107084,
             popup = c("The Ridges, Building 21")) %>% 
  addTiles() %>%
  setMapWidgetStyle() %>%
  frameWidget(height = '275')
saveWidget(m2, 'leaflet2.html')
m2

Let us build one for Egypt, shall we?

m3 <- leaflet() %>% setView(lat = 30.049677, 
                            lng = 31.236318,
                            zoom = 8) %>% 
  addMarkers(lat = 30.049677, lng = 31.236318,
             popup = c("Cairo, Egypt")) %>% 
  addTiles() %>%
  setMapWidgetStyle() %>%
  frameWidget(height = '500')
saveWidget(m3, 'leaflet3.html')
m3
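When there are several markers, typing one addMarkers() call per location gets tedious. A possible alternative, sketched here as an assumption rather than as part of the original handout, is to keep the locations in a data frame and let leaflet read the columns through its formula interface. The data frame `places` and the map `m4` are names invented for this example; the coordinates are the two marker locations used above.

```r
library(leaflet)

# Marker locations taken from the maps above
places <- data.frame(
  label = c("The Ridges, Building 21", "Cairo, Egypt"),
  lat = c(39.319984, 30.049677),
  lng = c(-82.107084, 31.236318),
  stringsAsFactors = FALSE
)

# ~lng, ~lat and ~label refer to columns of the data frame
m4 <- leaflet(data = places) %>%
  addTiles() %>%
  addMarkers(lng = ~lng, lat = ~lat, popup = ~label)
m4
```

No setView() call is used here, so the map fits its initial view to the markers.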