Introduction to R and RStudio

This module introduces you to R and RStudio, both in terms of installing the needed software on your computer, and by explaining the very basics of working within RStudio. By the end of the module you will understand the fundamentals of RMarkdown, how to create html, pdf, and Word files that incorporate code, interpretive text, and tables/figures, how to read local and web-based data-files stored in any format, and how to save data in R’s native format.

Ani Ruhil
2021-12-20

Installing R and RStudio

First install the latest version of R from here

Then install the latest version of RStudio from here

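Launch RStudio and check that the Console shows a startup banner like the one below (your version number, date, and platform will differ depending on what you installed)

R version 3.6.1 (2019-07-05) – “Action of the Toes”
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)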

Understand your RStudio Environment

Option       Description
Console      This is where commands are issued to R, either by typing and hitting enter or running commands from a script (like your R Markdown file)
Environment  stores and shows you all the objects created
History      shows you a running list of all commands issued to R
Connections  shows you any databases/servers you are connected to and also allows you to initiate a new connection
Files        shows you files and folders in your current working directory, and you can move up/down in the folder hierarchy
Plots        shows you all plots that have been generated
Packages     shows you installed packages
Help         allows you to get the help pages by typing in keywords
Viewer       shows you any “live” documents running on the server
Knit         allows you to generate html/pdf/Word documents from a script
Insert       allows you to insert a vanilla R chunk. You can (and should) give a unique name to each code chunk so that you can easily diagnose which chunk is not working
Run          allows you to run lines/chunks

You can customize the panes via Tools -> Global Options...

Panes can be detached. This is very helpful when you want another application next to the pane or behind it, or if you are using multiple monitors, since you can then execute commands on one monitor and watch the output on another.

You also have a spellcheck; use it to catch typos.

Installing packages

Now we install some packages via Tools -> Install Packages... The initial list of packages to be installed is shown below. Other packages will be installed as needed.

devtools, reshape2, lubridate, car, Hmisc, gapminder, leaflet,
DT, data.table, htmltools, scales, ggridges, here, knitr,
kableExtra, haven, readr, readxl, ggplot2, vembedr, namer
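
The same installation can be scripted from the Console instead of the menu. The sketch below is one way to do it, assuming a CRAN mirror is reachable; course_pkgs and missing_pkgs are names chosen here for illustration.

```r
# Packages named in the list above
c(
  "devtools", "reshape2", "lubridate", "car", "Hmisc", "gapminder", "leaflet",
  "DT", "data.table", "htmltools", "scales", "ggridges", "here", "knitr",
  "kableExtra", "haven", "readr", "readxl", "ggplot2", "vembedr", "namer"
  ) -> course_pkgs

# Keep only the packages that are not installed yet
setdiff(
  course_pkgs,
  rownames(installed.packages())
  ) -> missing_pkgs

# Install from the Console only; knitting a document should not trigger installs
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Checking for missing packages first means the code can be rerun without reinstalling everything each time.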

If we need to, we could update packages via Tools -> Check for Package Updates... It is a good idea to update packages regularly. Every now and then an update might break something, but the developer usually fixes it sooner rather than later.

Rprojects

  1. Create a folder called mpa6020
  2. Inside the mpa6020 folder create a subfolder called data. The folder structure will now be as shown below
mpa6020/
    └── my-rmarkdown-file-01.Rmd
    └── my-rmarkdown-file-02.Rmd
    └── data/
        └── some data file
        └── another data file

All data you download or create go into the data subfolder. All R code files reside in the mpa6020 folder. Open the Rmd file I sent you: Module01_forClass.Rmd and save it in the mpa6020 folder. Save the data I sent you to the data subfolder.

  3. Now create a project via File -> New Project and choose Existing Directory. Browse to the mpa6020 folder and click Create Project. RStudio will restart and when it does you will be in the project folder and will see a file called mpa6020.Rproj

From now on, when you start an RStudio session, do so by clicking the mpa6020.Rproj file. If you do this life will be fairly easy when working with the files I send you.
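
The folder structure described above can also be created from the Console. This is only a sketch: a temporary directory stands in for wherever you actually keep your course files, and course_root and data_dir are illustrative names.

```r
# A temporary directory stands in for your real course location (assumption)
file.path(tempdir(), "mpa6020") -> course_root
file.path(course_root, "data") -> data_dir

# recursive = TRUE creates mpa6020/ and mpa6020/data/ in one call
dir.create(
  data_dir,
  recursive = TRUE,
  showWarnings = FALSE
  )

# List the folders that now exist under the course folder
list.dirs(course_root, full.names = FALSE)
```

On your own computer you would replace tempdir() with the folder where you want mpa6020 to live.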

Note: If you use the RStudio Cloud instead of your own computer you will see a different folder structure.

Cloud > project/
            └── .Rhistory
            └── data/
                  └── some data file
                  └── another data file
            └── project.Rproj     

In addition, the data folder will have all the data-sets listed in this Module so you do not need to copy any out of Slack.

R Markdown files

  1. Go to New File -> R Markdown ... and enter My First Rmd File in title and your name.
  2. Click OK
  3. Now File -> Save As... and save it as testing_rmd in the mpa6020 folder and click the Knit button

You may see a message that says some packages need to be installed/updated. Allow these to be installed/updated.

If all goes well, and the document knits, you should see an html file that has some code, a plot and other results. As the document knits, watch for error messages and copy these verbatim since we can hunt for solutions if we know the error message word for word.

Specific R Markdown Code Chunk commands

Golden Rule: Give every code chunk a unique name, which can be an alphanumeric string with no whitespace. If you forget, use the namer package to assign names to every unnamed code chunk. This can be done via

library(namer)

name_chunks("myfilename.Rmd")

Code chunks have several options that could be invoked. Here are some of the more common ones we will use.

Option                 What it does
eval                   If FALSE, knitr will not run the code in the code chunk
include                If FALSE, knitr will run the chunk but not include the chunk in the final document
echo                   If FALSE, knitr will not display the code in the code chunk above its results in the final document
error                  If FALSE, knitting stops at the first error; if TRUE, knitr displays the error message in the document and keeps going
message                If FALSE, knitr will not display any messages generated by the code
warning                If FALSE, knitr will not display any warning messages generated by the code
cache                  If TRUE, knitr will cache the results to reuse in future knits. Knitr will reuse the results until the code chunk is altered
dev                    The R function name that will be used as a graphical device to record plots, e.g. dev = 'CairoPDF'
dpi                    A number for knitr to use as the dots per inch (dpi) in graphics (when applicable)
fig.align              ‘center’, ‘left’, ‘right’ alignment in the knit document
fig.height             height of the figure (in inches, for example)
fig.width              width of the figure (in inches, for example)
out.height, out.width  The height and width to scale plots to in the final output
Other options can be found in the cheatsheet available here

Reading data

Make sure you have the data-sets sent to you via Slack in your data folder. If you don’t then the commands that follow will not work. We start by reading a simple comma-separated values (CSV) file and then a tab-delimited file.

library(here)

read.csv(
  here(
    "data", 
    "ImportDataCSV.csv"
    ),
  sep = ",",
  header = TRUE
  ) -> df.csv

read.csv(
  here(
    "data", 
    "ImportDataTAB.txt"
    ),
  sep = "\t", 
  header = TRUE
  ) -> df.tab

The sep = "," switch says the individual variables are separated by a comma, and the header = TRUE switch indicates that the first row includes variable names. The tab-delimited file needs sep = "\t". If both files were read then Environment should show objects called df.csv and df.tab. If you don’t see these, check that both files are in your data folder and that the file names in the code match them exactly.
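
To see what sep does without relying on the course files, the sketch below writes two tiny made-up files to a temporary folder and reads them back. The file names and the toy.* objects are illustrative only.

```r
# Write two tiny files to a temporary folder so the example is self-contained
file.path(tempdir(), "toy.csv") -> csv_path
file.path(tempdir(), "toy.txt") -> tab_path

writeLines(c("id,score", "1,52", "2,47"), csv_path)
writeLines(c("id\tscore", "1\t52", "2\t47"), tab_path)

read.csv(csv_path, sep = ",", header = TRUE) -> toy.csv
read.csv(tab_path, sep = "\t", header = TRUE) -> toy.tab

# Same data, so the two data frames should match
identical(toy.csv, toy.tab)

# With the wrong separator each row is lumped into a single column
read.csv(tab_path, sep = ",", header = TRUE) -> toy.wrong
ncol(toy.wrong)
```

A data frame with one oddly named column is a common sign that the separator does not match the file.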

Excel files can be read via the readxl package

library(readxl)

read_excel(
  here(
    "data", 
    "ImportDataXLS.xls"
    )
  ) -> df.xls

read_excel(
  here(
    "data", 
    "ImportDataXLSX.xlsx"
    )
  ) -> df.xlsx

SPSS, Stata, SAS files can be read via the haven package

library(haven)

read_stata(
  here(
    "data", 
    "ImportDataStata.dta"
    )
  ) -> df.stata 

read_sas(
  here(
    "data", 
    "ImportDataSAS.sas7bdat"
    )
  ) -> df.sas

read_sav(
  here(
    "data", 
    "ImportDataSPSS.sav"
    )
  ) -> df.spss

It is also common to encounter fixed-width files where the raw data are stored without any gaps between successive variables. However, these files will come with documentation that will tell you where each variable starts and ends, along with other details about each variable.

read.fwf(
  here(
    "data", 
    "fwfdata.txt"
    ),
  widths = c(4, 9, 2, 4),
  header = FALSE,
  col.names = c("Name", "Month", "Day", "Year")
  ) -> df.fw

Note: we need widths = c() to indicate how many slots each variable takes and then col.names = c() to label the columns since the data file does not have variable names.
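
Here is a self-contained sketch of the same idea with two made-up records, so you can see how widths carves each line into columns. The file and its contents are invented for illustration; strip.white = TRUE is an extra argument, not used above, that read.fwf() passes on to read.table().

```r
file.path(tempdir(), "toy_fwf.txt") -> fwf_path

# Each line packs Name (4), Month (9), Day (2), Year (4) with no separators;
# shorter month names are padded with spaces to fill their 9 slots
writeLines(
  c(
    "JohnSeptember151990",
    "AnneMarch    072001"
    ),
  fwf_path
  )

read.fwf(
  fwf_path,
  widths = c(4, 9, 2, 4),
  header = FALSE,
  col.names = c("Name", "Month", "Day", "Year"),
  strip.white = TRUE  # drop the padding around text fields
  ) -> toy.fw

str(toy.fw)
```

The widths must add up to the length of each line; if a width is off by one, every column after it shifts.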

Reading Files from the Web

It is possible to specify the full web-path for a file and read it in, rather than storing a local copy. This is often useful when the data are updated periodically by the source (Census Bureau, Bureau of Labor Statistics, Bureau of Economic Analysis, etc.).

read.table(
  "http://data.princeton.edu/wws509/datasets/effort.dat"
  ) -> fpe

read.table(
  "https://stats.idre.ucla.edu/stat/data/test.txt",
  header = TRUE
  ) -> test.txt 

read.csv(
  "https://stats.idre.ucla.edu/stat/data/test.csv",
  header = TRUE
  ) -> test.csv

library(foreign)

read.spss(
  "https://stats.idre.ucla.edu/stat/data/hsb2.sav"
  ) -> hsb2.spss

as.data.frame(hsb2.spss) -> df.hsb2.spss

Note that hsb2.spss was read with the foreign package, an alternative to haven. The same file can be read with haven:

library(haven)

read_spss(
  "https://stats.idre.ucla.edu/stat/data/hsb2.sav"
  ) -> hsb2.spss.haven

The foreign package will also read Stata and other formats, and it was the one I used a lot before switching to haven as my default. There are other packages for reading SAS, SPSS, Excel, and other data files – sas7bdat, rio, data.table, xlsx, XLConnect, gdata, etc.

Reading compressed files

Large files may sit in compressed archives on the web and R has a neat way of allowing you to download the file, unzip it, and read it. Why is this useful? Because if these files are updated periodically, this ability lets you use the same piece of R code to download/unzip/read the updated file. The tedious way would be to manually download, unzip, place in the appropriate data folder, and then read it.

temp <- tempfile()

download.file(
  "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nvss/bridged_race/pcen_v2018_y1018.sas7bdat.zip",
  temp, 
  mode = "wb"
  )

haven::read_sas(
  unzip(
    temp,
    "pcen_v2018_y1018.sas7bdat/pcen_v2018_y1018.sas7bdat"
    )
  ) -> oursasdata 

unlink(temp)

Note that in the code above I didn’t run

library(haven)

read_sas(
  unzip(
    temp, 
    "pcen_v2018_y1018.sas7bdat/pcen_v2018_y1018.sas7bdat"
    )
  ) -> oursasdata

This is because you can skip the library() command before using some function from the package by directly invoking the package as in haven::.
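
A quick way to try the package::function form without installing anything is to use a package that ships with R. The sketch below uses tools, which is installed with R but not attached at startup; ext is just an illustrative name.

```r
# tools ships with R but is not attached when a session starts,
# so file_ext() is normally reached either via library(tools) or via tools::
tools::file_ext("oursasdata.RData") -> ext

ext
```

The package still has to be installed for this to work; the :: form only saves you the library() call.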

You can save your data in a format that R will recognize, giving it the RData or rdata extension

save(
  oursasdata, 
  file = here(
    "data", 
    "oursasdata.RData"
    )
  )

save(
  oursasdata, 
  file = here(
    "data", 
    "oursasdata.rdata"
    )
  )

Check your data directory to confirm both files are present

Minimal example of data processing

Working with the hsb2 data: 200 students from the High School and Beyond study

read.table(
  'https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-2.csv',
  header = TRUE, 
  sep = ","
  ) -> hsb2 

Variable  Description
female    (0 = Male 1 = Female)
race      (1=Hispanic 2=Asian 3=African-American 4=White)
ses       socioeconomic status (1=Low 2=Middle 3=High)
schtyp    type of school (1=Public 2=Private)
prog      type of program (1=General 2=Academic 3=Vocational)
read      standardized reading score
write     standardized writing score
math      standardized math score
science   standardized science score
socst     standardized social studies score

There are no value labels for the various qualitative/categorical variables (female, race, ses, schtyp, and prog) so we next create these.

factor(
  hsb2$female,
  levels = c(0, 1),
  labels=c("Male", "Female")
  ) -> hsb2$female 

factor(
  hsb2$race,
  levels = c(1:4),
  labels=c("Hispanic", "Asian", "African American", "White")
  ) -> hsb2$race

factor(
  hsb2$ses, 
  levels = c(1:3),
  labels=c("Low", "Middle", "High")
  ) -> hsb2$ses

factor(
  hsb2$schtyp,
  levels = c(1:2),
  labels=c("Public", "Private")
  ) -> hsb2$schtyp

factor(
  hsb2$prog,
  levels = c(1:3), 
  labels=c("General", "Academic", "Vocational")
  ) -> hsb2$prog

I am overwriting each variable with a factor: R is told which numeric codes the variable takes (0 and 1 for female) and what label to attach to each code (0 is Male, 1 is Female), and so on for the others. There are four values for race, three for ses, two for schtyp, and three for prog, so the mapping has to reflect this. Note that this is just a quick run through with creating value labels; we will cover this in greater detail in a later module.
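
To check how the levels-to-labels mapping behaves, here is a small sketch on a made-up vector of codes (not the hsb2 values); ses_codes, ses_labelled, and with_unlisted are illustrative names.

```r
c(1, 3, 2, 2, 1) -> ses_codes  # made-up codes for illustration

factor(
  ses_codes,
  levels = c(1:3),
  labels = c("Low", "Middle", "High")
  ) -> ses_labelled

# Counts are reported by label, in the order given in levels
table(ses_labelled)

# A code that is not listed in levels becomes NA rather than raising an error
factor(
  c(1, 4),
  levels = c(1:3),
  labels = c("Low", "Middle", "High")
  ) -> with_unlisted

is.na(with_unlisted)
```

Because unlisted codes silently turn into NA, it is worth running table() on a variable before and after labelling it.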

Save your work!

Having added labels to the factors in hsb2 we can now save the data for later use.

save(
  hsb2, 
  file = here(
    "data", 
    "hsb2.RData"
    )
  ) 

Let us test if this R Markdown file will knit to html. If all is good then we can Close Project, and when we do so, RStudio will close the project and reopen in a vanilla session.

Data in packages

Many R packages come bundled with data-sets, too many to walk you through here.

To load data from a package, if you know the data-set’s name, run

library(HistData)

data("Galton")

names(Galton)
[1] "parent" "child" 

or you can run

data(
  "GaltonFamilies", 
  package = "HistData"
  )

names(GaltonFamilies)
[1] "family"          "father"          "mother"         
[4] "midparentHeight" "children"        "childNum"       
[7] "gender"          "childHeight"    
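
If you do not know a data-set’s name, you can ask R to list what a package provides. The sketch below uses the datasets package, which ships with R; available is an illustrative name.

```r
# data() with a package argument returns a listing; $results is a character matrix
data(package = "datasets")$results -> available

# The Item column holds data-set names, Title a one-line description
head(available[, c("Item", "Title")])

# Check whether a particular data-set is available
"mtcars" %in% available[, "Item"]
```

Swapping "datasets" for another installed package name, such as "HistData", lists that package’s data-sets instead.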

Saving data and workspaces

You can save your data via

data(mtcars)

save(
  mtcars, 
  file = here(
    "data", 
    "mtcars.RData"
    )
  )

rm(list = ls()) # To clear the Environment

load(
  here(
    "data", 
    "mtcars.RData"
    )
  )

You can also save multiple data-sets in a single file as follows:

data(mtcars)

library(ggplot2)

data(diamonds)

save(
  mtcars, 
  diamonds, 
  file = here(
    "data", 
    "mydata.RData"
    )
  )

rm(list = ls()) # To clear the Environment

load(
  here(
    "data", 
    "mydata.RData"
    )
  )

If you want to save just a single object from the environment and then load it in a later session, maybe with a different name, then you should use saveRDS() and readRDS()

data(mtcars)


saveRDS(
  mtcars, 
  file = here(
    "data", 
    "mydata.RDS"
    )
  )

rm(list = ls()) # To clear the Environment

readRDS(
  here(
    "data", 
    "mydata.RDS"
    )
  ) -> ourdata 

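If instead you use save() and load(), the object comes back under the name it had when you saved it. Assigning the result of load() does not rename it, because load() returns only the names of the objects it restored.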

data(mtcars)

save(
  mtcars, 
  file = here(
    "data", 
    "mtcars.RData"
    )
  )

rm(list = ls())  # To clear the Environment

load(
  here(
    "data", 
    "mtcars.RData"
    )
  ) -> ourdata # Note: the data come back as mtcars; ourdata holds only the string "mtcars"
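
The difference between the two approaches is easiest to see side by side. This sketch writes to temporary files so it does not touch your data folder; cars_copy, renamed, and restored_names are illustrative names.

```r
tempfile(fileext = ".RDS") -> rds_path
tempfile(fileext = ".RData") -> rdata_path

mtcars -> cars_copy

saveRDS(cars_copy, file = rds_path)  # stores the object only, not its name
save(cars_copy, file = rdata_path)   # stores the object together with its name

rm(cars_copy)

# readRDS() returns the object, so you choose the name
readRDS(rds_path) -> renamed

# load() recreates cars_copy and returns only a character vector of names
load(rdata_path) -> restored_names

restored_names
exists("cars_copy")
```

One practical consequence: load() can silently overwrite an object of the same name already in your Environment, while readRDS() never does.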

If you want to save everything you have done in the work session, you can do so via save.image()

save.image(
  file = here(
    "data", 
    "mywork_jan182018.RData"
    )
  )

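To pick up where you left off, load this file in a later session with load(). By default a workspace is restored automatically only when it is saved as .RData in the folder where the session starts. Saving the image is useful if you have written a lot of R code and generated various objects and do not want to start from scratch the next time around.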

If you are not in a project and you try to close RStudio after some code has been run, you will be prompted to save (or not) the workspace; say “no” unless you really want to keep the workspace.

A Small Map with Leaflet

There are several packages that allow us to build maps in R, from the simple to the complicated. Of late I have been really fascinated by leaflet – an easy-to-learn JavaScript library that generates interactive maps – so let us see that package in action. Later on, when we move to more advanced visualizations we will look at a variety of mapping options. For the moment we keep it simple and fun.

library(leaflet)

library(leaflet.extras)

library(widgetframe)

leaflet() %>%
  setView(
    lat = 39.322577,
    lng = -82.106336,
    zoom = 14) %>% 
  addTiles() %>%
  setMapWidgetStyle() %>%
  frameWidget(
    height = '275'
    ) -> m1 

saveWidget(
  m1, 
  'leaflet.html'
  )

m1

Notice how this was built: We used setView() to center the map with given latitude and longitude and then picked a reasonable zoom factor with zoom =. If you set the zoom factor too low you will be seeing the place from outer space and if you set it too high then you might be standing on a street corner, so experiment with it.

Now, since I ended up picking the general area around Richland Avenue, I could drop a marker on Building 21 on The Ridges. This is done with addMarkers(), and popup sets the text displayed when someone clicks on the marker.

leaflet() %>%
  setView(
    lat = 39.322577,
    lng = -82.106336,
    zoom = 15) %>% 
  addMarkers(
    lat = 39.319984,
    lng = -82.107084,
    popup = c("The Ridges, Building 21")
    ) %>% 
  addTiles() %>%
  setMapWidgetStyle() %>%
  frameWidget(
    height = '275'
    ) -> m2 

saveWidget(
  m2, 
  'leaflet2.html'
  )

m2

RStudio webinars

The fantastic team at RStudio runs free webinars that are often very helpful, so be sure to sign up with your email. Here are some video recordings of webinars that are relevant to what we have covered so far.


Exercises for practice

Ex. 1: Creating and knitting a new RMarkdown file

Open a fresh session by launching RStudio and then running File -> Open Project...

Create a new R Markdown file via New File -> R Markdown ..., give it a title and your name as the author, and then save it with the following name: m1ex1.Rmd

Delete all content starting with line 12.

Add this level 1 heading The Starwars Data and then insert your first code chunk exactly as shown below

library(dplyr)

data(starwars)

str(starwars)

Add this level 2 heading Character Heights and Weights and then your second code chunk

plot(starwars$height, starwars$mass)

Now knit this file to html

Ex. 2: Lorem Ipsum paragraphs and graphs

Go to this website and generate five Lorem Ipsum placeholder text paragraphs. Insert these five paragraphs in a new RMarkdown file.

Using the starwars data, create five code chunks, one after each paragraph

plot(starwars$height, starwars$mass)

Now knit this file to html

Ex. 3: Reading in three data files

Create a new RMarkdown file.

Insert a code chunk that reads in both these files found on the web

In a follow-up code chunk, run the summary() command on each data-set

In a separate code chunk, read in this dataset after you download it and save the unzipped file in your data folder.

In a follow-up chunk run both the following commands on this data-set

In a final chunk, run the commands necessary to save each of the three data-sets as separate RData files. Make sure you save them in your data folder.

Now knit the complete Rmd file to html

Ex. 4: Knitting with prettydoc

I’d like you to use a specific R Markdown format because the resulting html files are very readable

Install the prettydoc package. Now create a prettydoc Rmd file via New File -> RMarkdown... -> From Template -> Lightweight and Pretty Document (HTML)

Now take all the text and code chunks you created in Ex. 3 and insert them in this file. Make sure you add a title, etc. in the YAML and then knit the file to html

You can play with the theme: and highlight: fields, choosing from the options displayed here

You should consider using either the prettydoc format or the default RMarkdown templates baked into RStudio. You can explore those here and find the settings that can be tweaked via experimentation.

Ex. 5: Mapping your “happy” place

Think about the one physical location that is, for you, your “happy” place. It could be a beach, a house you grew up in, a friend or relative’s house, a hiking trail, a restaurant, a resort, etc.

Use Google Maps to find this place and then write down the latitude and longitude Google gives you.

Now build a leaflet map that shows this place with a popup. The popup must reflect, when clicked, a brief description of what this place is and why it means so much to you.


Next Module:
Graphics with ggplot2

Citation

For attribution, please cite this work as

Ruhil (2021, Dec. 20). Introduction to R and RStudio. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module01.html

BibTeX citation

@misc{ruhil2021introduction,
  author = {Ruhil, Ani},
  title = {Introduction to R and RStudio},
  url = {https://aniruhil.org/courses/mpa6020/handouts/module01.html},
  year = {2021}
}