class: title-slide, center, middle background-image: url(images/ouaerial.jpeg) background-size: cover # .fat[.fancy[Basic Data Operations in R]] ## .fat[.fancy[Ani Ruhil]] --- name: agenda ## Basic data operations - Common data types and modifications - Case operations - Subsetting data and factor levels - Exporting data - Merging data - Dealing with duplicates - Rearranging a data-frame - Working with strings --- class: inverse, center, middle # .heat[.fancy[ Common data types ]] --- For the most part you'll focus on variables that are - numeric -- the usual quantitative measures, also called "continuous" variables - factors -- categorical variables such as sex, BMI categories, and so on - dates -- but we'll deal with these in detail later look at the variables in the `hsb` data ```r read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = ",") -> hsb names(hsb) ``` ``` ## [1] "id" "female" "race" "ses" "schtyp" "prog" "read" "write" ## [9] "math" "science" "socst" ``` ```r str(hsb) ``` ``` ## 'data.frame': 200 obs. of 11 variables: ## $ id : int 70 121 86 141 172 113 50 11 84 48 ... ## $ female : int 0 1 0 0 0 0 0 0 0 0 ... ## $ race : int 4 4 4 4 4 4 3 1 4 3 ... ## $ ses : int 1 2 3 3 2 2 2 2 2 2 ... ## $ schtyp : int 1 1 1 1 1 1 1 1 1 1 ... ## $ prog : int 1 3 1 3 2 2 1 2 1 2 ... ## $ read : int 57 68 44 63 47 44 50 34 63 57 ... ## $ write : int 52 59 33 44 52 52 59 46 57 55 ... ## $ math : int 41 53 54 47 57 51 42 45 54 52 ... ## $ science: int 47 63 58 53 53 63 53 39 58 50 ... ## $ socst : int 57 61 31 56 61 61 61 36 51 51 ... ``` --- Some of these variables are not really numeric, in particular: - female = (0/1) - race = (1=hispanic 2=asian 3=african-amer 4=white) - ses = socioeconomic status (1=low 2=middle 3=high) - schtyp = type of school (1=public 2=private) - prog = type of program (1=general 2=academic 3=vocational) We will have to convert these pseudo-numeric variables and label what the values represent ```r factor(hsb$female, levels = c(0, 1), labels = c("Male", "Female")) -> hsb$female factor(hsb$race, levels = c(1:4), labels = c("Hispanic", "Asian", "African American", "White")) -> hsb$race factor(hsb$ses, levels = c(1:3), labels = c("Low", "Middle", "High")) -> hsb$ses factor(hsb$schtyp, levels = c(1:2), labels = c("Public", "Private")) -> hsb$schtyp factor(hsb$prog, levels = c(1:3), labels = c("General", "Academic", "Vocational")) -> hsb$prog str(hsb) ``` ``` ## 'data.frame': 200 obs. of 11 variables: *## $ id : int 70 121 86 141 172 113 50 11 84 48 ... *## $ female : Factor w/ 2 levels "Male","Female": 1 2 1 1 1 1 1 1 1 1 ... *## $ race : Factor w/ 4 levels "Hispanic","Asian",..: 4 4 4 4 4 4 3 1 4 3 ... *## $ ses : Factor w/ 3 levels "Low","Middle",..: 1 2 3 3 2 2 2 2 2 2 ... *## $ schtyp : Factor w/ 2 levels "Public","Private": 1 1 1 1 1 1 1 1 1 1 ... *## $ prog : Factor w/ 3 levels "General","Academic",..: 1 3 1 3 2 2 1 2 1 2 ... ## $ read : int 57 68 44 63 47 44 50 34 63 57 ... ## $ write : int 52 59 33 44 52 52 59 46 57 55 ... ## $ math : int 41 53 54 47 57 51 42 45 54 52 ... ## $ science: int 47 63 58 53 53 63 53 39 58 50 ... ## $ socst : int 57 61 31 56 61 61 61 36 51 51 ... 
```

---

If we wanted to be prudent we could have created new factors, preserving the original data (my preference and now a habit)

```r
read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = ",") -> hsb

factor(hsb$female, levels = c(0, 1), labels = c("Male", "Female")) -> hsb$female.f
factor(hsb$race, levels = c(1:4), labels = c("Hispanic", "Asian", "African American", "White")) -> hsb$race.f
factor(hsb$ses, levels = c(1:3), labels = c("Low", "Middle", "High")) -> hsb$ses.f
factor(hsb$schtyp, levels = c(1:2), labels = c("Public", "Private")) -> hsb$schtyp.f
factor(hsb$prog, levels = c(1:3), labels = c("General", "Academic", "Vocational")) -> hsb$prog.f
```

---

At times, we may have numbers that get read in as characters or factors, as is evident from the example below:

.pull-left[

```r
c(1, 2, 3, 4, 5) -> x1
c(1, 2, 3, 4, "a5") -> x2
c(1, 2, 3, 4, "hello") -> x3
c(1, 2, 3, 4, NA) -> x4
```

]

.pull-right[

<center><img src = "./images/df4.png", width = 400px></center>

]

Notice that x1 and x4 are read in as numeric but x2 and x3 show up as `chr`, short for `character`. You could convert x2 and x3 into numeric as follows:

```r
as.numeric(x2) -> x2.num
as.numeric(x3) -> x3.num
```

The `NA` values are `not available` values, essentially `missing values`.

---

At times the variable may in fact be numeric but because of some anomalies in the variable, R will read it in as a factor. When this happens, converting the variable into numeric has to be done with care. Just running `as.numeric()` will not work. See the example below where `age` is flagged as a factor, perhaps because the variable was stored and exported with the double-quotation marks. (R 4.0 and later no longer convert strings to factors by default, so `stringsAsFactors = TRUE` is set explicitly here to reproduce that behavior.)

```r
c(100, 101, 102, 103, 104, 105, 106) -> score
c("Male", "Female", "Male", "Female", "Female", "Male", "Female") -> sex
c("18", "18", "19", "19", "21", "21", "NA") -> age

cbind.data.frame(score, sex, age, stringsAsFactors = TRUE) -> my.df
```

If I try `as.numeric()` the conversion uses the factor codes 1, 2, 3, 4 instead of the actual age values. So what is recommended is that you run `as.numeric(levels(data$variable))[data$variable]` instead. You can see the difference by comparing `age.num1` against `age.num2`.

```r
as.numeric(my.df$age) -> my.df$age.num1
as.numeric(levels(my.df$age))[my.df$age] -> my.df$age.num2
```

Open up the data-frame and then look at these two variables to understand how `age.num1` is invalid but `age.num2` is valid.

---

We often need to transform an original numeric variable, stored perhaps as a proportion that we want to convert into a percentage. Or we may want to square it, divide it by 10, and so on. These operations are very straightforward.

.pull-left[

```r
data.frame(x = c(0.1, 0.2, 0.3, 0.4)) -> my.df

# Convert x from proportion to percent
my.df$x * 100 -> my.df$x.pct

# Divide x by 10
my.df$x / 10 -> my.df$x.div

# Square x
my.df$x^2 -> my.df$x.sqrd

# Take the square-root of x
sqrt(my.df$x) -> my.df$x.sqrt

# Multiply x by 2
my.df$x * 2 -> my.df$x.dbled
```

]

.pull-right[

This is the result ...
<table>
<thead>
<tr>
<th style="text-align:right;"> x </th>
<th style="text-align:right;"> x% </th>
<th style="text-align:right;"> x/10 </th>
<th style="text-align:right;"> x^2 </th>
<th style="text-align:right;"> x^(1/2) </th>
<th style="text-align:right;"> 2x </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0.1 </td>
<td style="text-align:right;"> 10 </td>
<td style="text-align:right;"> 0.01 </td>
<td style="text-align:right;"> 0.01 </td>
<td style="text-align:right;"> 0.3162278 </td>
<td style="text-align:right;"> 0.2 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.2 </td>
<td style="text-align:right;"> 20 </td>
<td style="text-align:right;"> 0.02 </td>
<td style="text-align:right;"> 0.04 </td>
<td style="text-align:right;"> 0.4472136 </td>
<td style="text-align:right;"> 0.4 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.3 </td>
<td style="text-align:right;"> 30 </td>
<td style="text-align:right;"> 0.03 </td>
<td style="text-align:right;"> 0.09 </td>
<td style="text-align:right;"> 0.5477226 </td>
<td style="text-align:right;"> 0.6 </td>
</tr>
<tr>
<td style="text-align:right;"> 0.4 </td>
<td style="text-align:right;"> 40 </td>
<td style="text-align:right;"> 0.04 </td>
<td style="text-align:right;"> 0.16 </td>
<td style="text-align:right;"> 0.6324555 </td>
<td style="text-align:right;"> 0.8 </td>
</tr>
</tbody>
</table>

]

---

We can also group numeric data into `bins`. Let us see this with our reading scores from the `hsb2` data-set.

```r
load("data/hsb2.RData")

cut(hsb2$read, breaks = c(28, 38, 48, 58, 68, 78)) -> hsb2$grouped_read
table(hsb2$grouped_read)
```

```
## 
## (28,38] (38,48] (48,58] (58,68] (68,78] 
##      13      69      61      47       9
```

Uh oh! We have a total of 199, so what happened to the 200th? The one that wasn't grouped is a reading score of 28. Why? You have to choose `left-open` or `right-open` intervals ... should 38 go in the 38-48 group or the 28-38 group? This decision can be invoked by adding `right = FALSE` or `right = TRUE` (which is the default). So let me set `right = FALSE`.

```r
cut(hsb2$read, breaks = c(28, 38, 48, 58, 68, 78), right = FALSE) -> hsb2$grouped_read2
table(hsb2$grouped_read2)
```

```
## 
## [28,38) [38,48) [48,58) [58,68) [68,78) 
##      14      68      62      36      20
```

Open `hsb2` and arrange rows in ascending order of `read`. Find `read = 48`. With the default `right = TRUE`, 48 is put in the 38-48 group, but with `right = FALSE`, 48 is put in the 48-58 group. So with `right = FALSE` a group `[a, b)` only includes values where `\(a <= x < b\)`, while with `right = TRUE` a group `(a, b]` only includes values where `\(a < x <= b\)`.
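As a quick check on the boundary behavior, here is a minimal sketch (a toy vector, not the `hsb2` data) that cuts the two boundary scores both ways:

```r
# right = TRUE (the default) builds intervals of the form (a, b],
# so 38 falls in (28,38] and 48 falls in (38,48]
cut(c(38, 48), breaks = c(28, 38, 48, 58), right = TRUE)

# right = FALSE builds intervals of the form [a, b),
# so 38 falls in [38,48) and 48 falls in [48,58)
cut(c(38, 48), breaks = c(28, 38, 48, 58), right = FALSE)
```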
---

*(An interactive `DT` datatable is rendered here, listing each student's `read` score alongside `grouped_read` and `grouped_read2` so the two groupings can be compared.)*

---

More generally you'll see folks just specify the number of cuts they want.

```r
cut(hsb2$read, breaks = 5) -> hsb2$grouped_read3
table(hsb2$grouped_read3)
```

```
## 
##   (28,37.6] (37.6,47.2] (47.2,56.8] (56.8,66.4]   (66.4,76] 
##          14          68          48          50          20
```

But since reading scores cannot include a decimal value this may be unsuitable for our purposes.
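If you would rather let R pick round-number break points, one possible workaround is to hand `cut()` the breaks suggested by base R's `pretty()`. A minimal sketch (the `grouped_read_pretty` name is just for illustration):

```r
# pretty() suggests "nice" (typically whole-number) break points spanning the data,
# which avoids the decimal-valued cut points seen above
pretty(hsb2$read, n = 5)

cut(hsb2$read, breaks = pretty(hsb2$read, n = 5)) -> hsb2$grouped_read_pretty
table(hsb2$grouped_read_pretty)
```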
Of course, we could have avoided this open/closed business of the interval by simply doing this:

```r
cut(hsb2$read, breaks = c(25, 35, 45, 55, 65, 75, 85)) -> hsb2$grouped_read4
table(hsb2$grouped_read4)
```

```
## 
## (25,35] (35,45] (45,55] (55,65] (65,75] (75,85] 
##       9      45      76      49      19       2
```

---

class: inverse, center, middle

### .fat[.fancy[ chop() with {santoku} ]]

---

Say we want the reading scores grouped in 10-point intervals.

```r
library(santoku)

chop_width(hsb$read, width = 10) -> hsb$rchop1
table(hsb$rchop1)
```

```
## 
## [28, 38) [38, 48) [48, 58) [58, 68) [68, 78) 
##       14       68       62       36       20
```

If you want 5 evenly-spaced intervals, each group will have a width of exactly `\(\dfrac{76 - 28}{5} = 9.6\)`

```r
chop_evenly(hsb$read, intervals = 5) -> hsb$rchop2
table(hsb$rchop2)
```

```
## 
##   [28, 37.6) [37.6, 47.2) [47.2, 56.8) [56.8, 66.4)   [66.4, 76] 
##           14           68           48           50           20
```

---

What if we want groups with (roughly) equal numbers of observations, say quintiles or quartiles, respectively?

```r
chop_equally(hsb$read, groups = 5) -> hsb$rchop3
table(hsb$rchop3)
```

```
## 
##   [0%, 20%)  [20%, 40%)  [40%, 60%)  [60%, 80%) [80%, 100%] 
##          39          16          62          37          46
```

```r
chop_equally(hsb$read, groups = 4) -> hsb$rchop4
table(hsb$rchop4)
```

```
## 
##   [0%, 25%)  [25%, 50%)  [50%, 75%) [75%, 100%] 
##          39          44          61          56
```

What if you want specific quantiles?

```r
chop_quantiles(hsb$read, c(0.05, 0.25, 0.50, 0.75, 0.95)) -> hsb$rchop5
table(hsb$rchop5)
```

```
## 
##    [0%, 5%)   [5%, 25%)  [25%, 50%)  [50%, 75%)  [75%, 95%] (95%, 100%] 
##           9          30          44          61          47           9
```

---

Cut-points in terms of the mean and standard deviation?

```r
chop_mean_sd(hsb$read) -> hsb$rchop6
table(hsb$rchop6)
```

```
## 
## [-3 sd, -2 sd) [-2 sd, -1 sd)  [-1 sd, 0 sd)   [0 sd, 1 sd)   [1 sd, 2 sd)   [2 sd, 3 sd) 
##              2             22             91             39             39              7
```

If you want custom labels ...

.panelset[
.panel[.panel-name[with lbl_dash()]

```r
chop_quantiles(
  hsb$read,
  c(0.05, 0.25, 0.50, 0.75, 0.95),
  labels = lbl_dash()
  ) -> hsb$rchop7
table(hsb$rchop7)
```

```
## 
##    0% - 5%   5% - 25%  25% - 50%  50% - 75%  75% - 95% 95% - 100% 
##          9         30         44         61         47          9
```

]

.panel[.panel-name[with lbl_seq()]

```r
chop_quantiles(
  hsb$read,
  c(0.05, 0.25, 0.50, 0.75, 0.95),
  labels = lbl_seq("(A)")
  ) -> hsb$rchop8
table(hsb$rchop8)
```

```
## 
## (A) (B) (C) (D) (E) (F) 
##   9  30  44  61  47   9
```

]
]

And there you have it!

---

class: inverse, center, middle

## .fat[.fancy[ Changing case: lower/upper/title/sentence/etc ]]

---

```r
data.frame(
  sex = c("M", "F", "M", "F"),
  NAMES = c("Andy", "Jill", "Jack", "Madison"),
  age = c(24, 48, 72, 96)
  ) -> my.df
```

Maybe I want the values of `sex` to be all uppercase, or perhaps all lowercase. I can do this via:

```r
tolower(my.df$sex) -> my.df$sex.lower
toupper(my.df$sex) -> my.df$sex.upper
```

Maybe it is not the values but the column names that I want to convert to lowercase. This is easily done via

```r
"names" -> colnames(my.df)[2]
```

Note: I asked for the second column's name to be changed to "names".
I could have achieved the same thing by running:

```r
data.frame(
  sex = c("M", "F", "M", "F"),
  NAMES = c("Andy", "Jill", "Jack", "Madison"),
  age = c(24, 48, 72, 96)
  ) -> my.df

tolower(colnames(my.df)) -> colnames(my.df)
```

---

By the same logic, I could convert each column name to uppercase by typing

```r
toupper(colnames(my.df)) -> colnames(my.df)
```

And then, of course, I could use the following to change a specific column name to uppercase:

```r
data.frame(
  sex = c("M", "F", "M", "F"),
  NAMES = c("Andy", "Jill", "Jack", "Madison"),
  age = c(24, 48, 72, 96)
  ) -> my.df

toupper(colnames(my.df)[1]) -> colnames(my.df)[1]

"AGE" -> colnames(my.df)[3]
```

What about title case?

```r
c("somewhere over the rainbow", "to kill a mockingbird", "the reluctant fundamentalist") -> x

library(tools)
toTitleCase(x) -> x.title
x.title
```

```
## [1] "Somewhere over the Rainbow"   "To Kill a Mockingbird"       
## [3] "The Reluctant Fundamentalist"
```

When we work with `stringi` and `stringr` we'll see other ways of switching letter cases.

---

class: inverse, middle, center

# .fat[.fancy[ A little bit about factors ]]

---

With the `hsb` data we created some factors. Here I want to show you a couple of things to pay attention to. First, I can create a new variable and store it as a factor as follows:

```r
my.df$female1[my.df$SEX == "M"] = 1
my.df$female1[my.df$SEX == "F"] = 2

factor(
  my.df$female1,
  levels = c(1, 2),
  labels = c("Male", "Female")
  ) -> my.df$female1
```

I could have also skipped the `1/2` business and just done this:

```r
my.df$female2[my.df$SEX == "M"] = "Male"
my.df$female2[my.df$SEX == "F"] = "Female"

factor(my.df$female2) -> my.df$female2
```

Notice the difference between `female1` and `female2`

- female1: Since you specified `levels = c(1, 2)` and indicated the lowest level be mapped to Male, the factor is built in that order ... Male first, then Female
- female2: Here we did not specify the levels and hence R falls back on its default of building them up alphabetically ... Female first, then Male

---

Let us see factor levels with a different example. Say we had the following variable, 10 responses to a survey question.

```r
data.frame(x = c(
  rep("Disagree", 3),
  rep("Neutral", 2),
  rep("Agree", 5)
  )
) -> fdf

factor(fdf$x) -> fdf$responses
levels(fdf$responses)
```

```
## [1] "Agree"    "Disagree" "Neutral"
```

```r
table(fdf$responses)
```

```
## 
##    Agree Disagree  Neutral 
##        5        3        2
```

---

This makes no sense since we'd like the logical ordering of Disagree -> Neutral -> Agree.

```r
ordered(
  fdf$responses,
  levels = c("Disagree", "Neutral", "Agree")
  ) -> fdf$newresponses

levels(fdf$newresponses)
```

```
## [1] "Disagree" "Neutral"  "Agree"
```

```r
min(fdf$newresponses)
```

```
## [1] Disagree
## Levels: Disagree < Neutral < Agree
```

```r
table(fdf$newresponses)
```

```
## 
## Disagree  Neutral    Agree 
##        3        2        5
```

---

I could have also generated this desired order when creating the factor as, for example, via

```r
factor(fdf$x,
       ordered = TRUE,
       levels = c("Disagree", "Neutral", "Agree")
       ) -> fdf$xordered

levels(fdf$xordered)
```

```
## [1] "Disagree" "Neutral"  "Agree"
```

```r
min(fdf$xordered)
```

```
## [1] Disagree
## Levels: Disagree < Neutral < Agree
```

```r
table(fdf$xordered)
```

```
## 
## Disagree  Neutral    Agree 
##        3        2        5
```

Notice the `min()` command ... which asks for the minimum level of a factor. This works with ordered factors but not with unordered factors; try it with `x` and with `responses`.
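To see the contrast, here is a minimal sketch; the `try()` wrapper (not part of the original code) simply lets the error surface without halting a script:

```r
# min() is well-defined for the ordered factor ...
min(fdf$xordered)

# ... but the unordered factor has no "smallest" level, so this call throws an error
try(min(fdf$responses))
```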
---

Before we move on, a word about the command we used to generate a new factor from an existing variable

```r
"Male" -> my.df$female2[my.df$SEX == "M"]
"Female" -> my.df$female2[my.df$SEX == "F"]
```

You may run into this a lot when generating new variables, and even when fixing problems with an existing variable. For example, say we were given the data on the sex of these 10 individuals but the values included typos.

```r
data.frame(
  mf = c(
    rep("Male", 3),
    rep("male", 2),
    rep("Female", 3),
    rep("femalE", 2)
    )
  ) -> sexdf

sexdf$mf = factor(sexdf$mf)
levels(sexdf$mf)
```

```
## [1] "femalE" "Female" "male"   "Male"
```

---

Obviously not what we'd like so we can go in and clean it up a bit. How? As follows:

```r
"Male" -> sexdf$mf[sexdf$mf == "male"]
"Female" -> sexdf$mf[sexdf$mf == "femalE"]

levels(sexdf$mf)
```

```
## [1] "femalE" "Female" "male"   "Male"
```

```r
table(sexdf$mf)
```

```
## 
## femalE Female   male   Male 
##      0      5      0      5
```

---

Wait a second, we still see `femalE` and `male` but with 0 counts; how should we get rid of these ghosts?

```r
droplevels(sexdf$mf) -> sexdf$newmf

levels(sexdf$newmf)
```

```
## [1] "Female" "Male"
```

```r
table(sexdf$newmf)
```

```
## 
## Female   Male 
##      5      5
```

Having fixed this issue we can drop the original variable via

```r
NULL -> sexdf$mf
```

---

class: inverse, middle, center

# .fat[.fancy[ Subsetting data and factor levels ]]

---

What if we want only a subset of the data? For example, say I am working with the `diamonds` data-set and need to subset my analysis to diamonds that are Very Good or better.

```r
library(ggplot2)
data(diamonds)

str(diamonds$cut)
```

```
##  Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
```

```r
table(diamonds$cut)
```

```
## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
```

I can subset as follows:

```r
subset(diamonds, cut != "Fair" & cut != "Good") -> dia.sub1

str(dia.sub1$cut)
```

```
##  Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 4 3 3 3 3 5 4 5 ...
```

```r
table(dia.sub1$cut)
```

```
## 
##      Fair      Good Very Good   Premium     Ideal 
##         0         0     12082     13791     21551
```

---

Aha! I see no diamonds in the excluded cut ratings, but `str()` shows that R still remembers `cut` as having 5 ratings and the frequency table shows zero counts for the two excluded ones, so we drop these levels explicitly.

```r
droplevels(dia.sub1$cut) -> dia.sub1$cutnew

str(dia.sub1$cutnew)
```

```
##  Ord.factor w/ 3 levels "Very Good"<"Premium"<..: 3 2 2 1 1 1 1 3 2 3 ...
```

```r
table(dia.sub1$cutnew)
```

```
## 
## Very Good   Premium     Ideal 
##     12082     13791     21551
```

---

Now a bit more on sub-setting data. You can subset with as simple or as complicated a condition as you need. For example, maybe you only want Ideal diamonds with a certain minimum clarity and price.

```r
subset(diamonds, cut == "Ideal" & clarity == "VVS1" & price > 3933) -> dia.sub2

table(dia.sub2$cut)
```

```
## 
##      Fair      Good Very Good   Premium     Ideal 
##         0         0         0         0       283
```

```r
table(dia.sub2$clarity)
```

```
## 
##   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
##    0    0    0    0    0    0  283    0
```

```r
min(dia.sub2$price)
```

```
## [1] 3955
```

Here the sub-setting is generating a data-frame that will only include observations that meet all three requirements (since you used `&` in the command). If you had run the following instead you would have had very different results.
```r subset(diamonds, cut == "Ideal" | clarity == "VVS1" | price > 3933) -> dia.sub3 subset(diamonds, cut == "Ideal" & clarity == "VVS1" | price > 3933) -> dia.sub4 subset(diamonds, cut == "Ideal" | clarity == "VVS1" & price > 3933) -> dia.sub5 ``` --- class: inverse, middle, center # .fat[.fancy[ Exporting data ]] --- Often, after data have been processed in R, we need to share them with folks who don't use R or want the data in a specific format. This is easily done, as shown below. ```r data.frame( Person = c("John", "Timothy", "Olivia", "Sebastian", "Serena"), Age = c(22, 24, 18, 24, 35) ) -> out.df write.csv(out.df, file = "data/out.csv", row.names = FALSE) # CSV format library(haven) write_dta(out.df, "out.df.dta") # Stata format write_sav(out.df, "out.df.sav") # SPSS format write_sas(out.df, "data/out.df.sas") # SAS format library(writexl) write_xlsx(out.df, "data/out.df.xlsx") # Excel format ``` There are other packages that import/export data so be sure to check them out as well, in particular [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html). --- class: inverse, middle, center # .fat[.fancy[ Merging data ]] --- Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called `ID`. How can I create a single data-set? Let us see each data-set first. ```r data.frame( Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) -> data1 data.frame( Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) -> data2 ``` Open up both data-sets and note that students C31 and Q22 are missing from `data2` and students E52 and H71 are missing from `data1`. --- ### ... here are the two data-sets .pull-left[ ```{=html} <div id="htmlwidget-b171b565011c082c9d51" style="width:100%;height:auto;" class="datatables html-widget"></div> <script type="application/json" data-for="htmlwidget-b171b565011c082c9d51">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],[10,21,33,12,35,67,43,99],["A12","A23","Z14","WX1","Y31","D66","C31","Q22"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Score<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script> ``` ] .pull-right[ ```{=html} <div id="htmlwidget-bcf4d900b28a98d02dfa" style="width:100%;height:auto;" class="datatables html-widget"></div> <script type="application/json" data-for="htmlwidget-bcf4d900b28a98d02dfa">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8"],["Male","Female","Male","Female","Male","Female","Male","Female"],["A12","A23","Z14","WX1","Y31","D66","E52","H71"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Sex<\/th>\n <th>ID<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-center","targets":[1,2]},{"orderable":false,"targets":0}],"filter":"top","rownames":false,"class":"compact","order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script> ``` ] --- What about the merge? To merge data-frames we need to specify the `merge key(s)` -- the variable(s) that identifies unique observations in each file. 
This variable(s) must be present in all the files you want to merge. So long as it exists, you now need to decide: Do you only want to - merge observations that show up in both files `(natural join)` - merge everything even if some cases show up in one but not the other `(full outer join)` - merge such that all cases in file `x` are retained but only those seen in file `y` are joined `(left outer join)` - merge to keep all cases in `y` and only those cases are to be merged from `x` that show up in `y`, referred to as `(right outer join)` ```r merge(x = data1, y = data2, by = c("ID"), all = FALSE) -> natural # only merge if ID seen in both files merge(x = data1, y = data2, by = c("ID"), all = TRUE) -> full # merge everything merge(x = data1, y = data2, by = c("ID"), all.x = TRUE) -> left # make sure all cases from file x are retained merge(x = data1, y = data2, by = c("ID"), all.y = TRUE) -> right # make sure all cases from file y are retained ``` [Garrett and Hadley](http://r4ds.had.co.nz) have a wonderful chapter on joins and the diagram below is one of their many descriptors. <center><img src = "images/joins.png", width = 500px></center> --- What if the ID variables had different names in the files? You could rename them to have a common name or you could use `by.x =` and `by.y = ` as shown below. ```r data.frame( Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) -> data3 data.frame( Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) -> data4 merge( data3, data4, by.x = c("ID1"), by.y = c("ID2"), all = FALSE ) -> diffids ``` You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number `(dirn)`, each school has building ID number `(birn)`, and then the district's name `(district)`, the building's name `(building)`, and so on. If I am merging these data then my `by = ` statement will be `by = c("dirn", "birn", "district", "building")`. --- If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the `Reduce()` command, as shown below. ```r data.frame( Score = c(10, 21, 33, 12, 35, 67, 43, 99), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22") ) -> df1 data.frame( Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) -> df2 data.frame( Age = c(6, 7, 6, 8, 8, 9, 10, 5), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71") ) -> df3 list(df1, df2, df3) -> my.list Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list) -> df.123 ``` --- Here is the merged data-frame ... 
<table> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:right;"> Score </th> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Age </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A12 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> A23 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> C31 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> D66 </td> <td style="text-align:right;"> 67 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> E52 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> H71 </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Q22 </td> <td style="text-align:right;"> 99 </td> <td style="text-align:left;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> WX1 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Y31 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Z14 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 6 </td> </tr> </tbody> </table> --- class: inverse, middle, center # .fat[.fancy[ Dealing with duplicates ]] --- Every now and then you also run into duplicate rows with no good explanation for how that could have happened. Here is an example: ```r data.frame( Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"), ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "A12", "WX1") ) -> dups.df ``` Students A12 and WX1 show up twice! There are two basic commands that we can use to find duplicates -- `unique` and `duplicated`. ```r duplicated(dups.df) # flag duplicate rows with TRUE !duplicated(dups.df) # flag not duplicated rows with TRUE dups.df[duplicated(dups.df), ] # reveal the duplicated rows dups.df[!duplicated(dups.df), ] # reveal the not duplicated rows dups.df[!duplicated(dups.df), ] -> nodups.df # save a dataframe without duplicated rows dups.df[duplicated(dups.df), ] -> onlydups.df # save a dataframe with ONLY the duplicated rows ``` --- class: inverse, middle, center # .fat[.fancy[ Rearranging the data-set ]] --- We may, at times, want to move the variables around so they are in a particular order (age next to race and so on, for example), perhaps drop some because there are too many variables and we don't need all of them. We may also want to arrange the data-frame in ascending or descending order of some variable(s) (increasing order of age or price for example. 
```r
mtcars[order(mtcars$mpg), ] -> df1 # arrange the data-frame in ascending order of mpg
mtcars[order(-mtcars$mpg), ] -> df2 # arrange the data-frame in descending order of mpg
mtcars[order(mtcars$am, mtcars$mpg), ] -> df3 # arrange the data-frame in ascending order of automatic/manual and then in ascending order of mpg
```

What if I want to order the columns such that the first column is `am` and then comes `mpg`, followed by `qsec` and then the rest of the columns? Well, the first thing I want to do is see what variable is in which column via the `names()` command.

```r
names(mtcars) # the original structure of the mtcars data-frame
```

```
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
```

So `mpg` is in column 1, `cyl` is in column 2, and `carb` is the last one in column 11. Now I'll specify that I want all rows by doing this `[, c()]`, and then specify the order of the columns by entering the column numbers in `c()`.

```r
mtcars[, c(9, 1, 7, 2:6, 8, 10:11)] -> df4 # specifying the position of the columns

names(df4)
```

```
##  [1] "am"   "mpg"  "qsec" "cyl"  "disp" "hp"   "drat" "wt"   "vs"   "gear" "carb"
```

---

If I only wanted certain columns I could simply not list the other columns.

```r
mtcars[, c(9, 1, 7)] -> df5 # specifying the position of the columns
```

Remember the comma in `mtcars[,` ... this tells R to keep all rows in mtcars. If, instead, I did the following, I would only get the first four rows of mtcars, not every row.

```r
mtcars[1:4, c(9, 1, 7)] -> df6 # keep first four rows and the specified columns
```

So try to remember the order: `dataframe[i, j]`, where `i` references the rows and `j` references the columns.

---

class: inverse, middle, center

# .fat[.fancy[ Working with strings ]]

---

Often you run into strings that have to be manipulated.
Often, when working with county data, for example, county names are followed by ` County, Ohio`. This is superfluous, so we end up excising it.

```r
read.csv("data/strings.csv") -> c.data

head(c.data)
```

```
##           GEO.id GEO.id2      GEO.display.label respop72016
## 1 0500000US39001   39001     Adams County, Ohio       27907
## 2 0500000US39003   39003     Allen County, Ohio      103742
## 3 0500000US39005   39005   Ashland County, Ohio       53652
## 4 0500000US39007   39007 Ashtabula County, Ohio       98231
## 5 0500000US39009   39009    Athens County, Ohio       66186
## 6 0500000US39011   39011  Auglaize County, Ohio       45894
```

Let us just retain the county name from `GEO.display.label`

```r
gsub(" County, Ohio", "", c.data$GEO.display.label) -> c.data$county

head(c.data[, c(3, 5)])
```

```
##        GEO.display.label    county
## 1     Adams County, Ohio     Adams
## 2     Allen County, Ohio     Allen
## 3   Ashland County, Ohio   Ashland
## 4 Ashtabula County, Ohio Ashtabula
## 5    Athens County, Ohio    Athens
## 6  Auglaize County, Ohio  Auglaize
```

Note the sequence: `gsub("look for this pattern", "replace with this pattern", mydata$myvariable)`.

---

#### `with` and `within`

If you need to add variables to an existing data frame, you can do it in several ways:

```r
data.frame(x = seq(0, 5, by = 1), y = seq(10, 15, by = 1)) -> dfw

dfw$x * dfw$y -> dfw$xy1 # the most basic creation of xy

with(dfw, x * y) -> dfw$xy2 # using with

within(dfw, xy3 <- x * y) -> dfw # using within
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> x </th>
<th style="text-align:right;"> y </th>
<th style="text-align:right;"> xy1 </th>
<th style="text-align:right;"> xy2 </th>
<th style="text-align:right;"> xy3 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 10 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 11 </td>
<td style="text-align:right;"> 11 </td>
<td style="text-align:right;"> 11 </td>
<td style="text-align:right;"> 11 </td>
</tr>
<tr>
<td style="text-align:right;"> 2 </td>
<td style="text-align:right;"> 12 </td>
<td style="text-align:right;"> 24 </td>
<td style="text-align:right;"> 24 </td>
<td style="text-align:right;"> 24 </td>
</tr>
<tr>
<td style="text-align:right;"> 3 </td>
<td style="text-align:right;"> 13 </td>
<td style="text-align:right;"> 39 </td>
<td style="text-align:right;"> 39 </td>
<td style="text-align:right;"> 39 </td>
</tr>
<tr>
<td style="text-align:right;"> 4 </td>
<td style="text-align:right;"> 14 </td>
<td style="text-align:right;"> 56 </td>
<td style="text-align:right;"> 56 </td>
<td style="text-align:right;"> 56 </td>
</tr>
<tr>
<td style="text-align:right;"> 5 </td>
<td style="text-align:right;"> 15 </td>
<td style="text-align:right;"> 75 </td>
<td style="text-align:right;"> 75 </td>
<td style="text-align:right;"> 75 </td>
</tr>
</tbody>
</table>

Notice that if using `within()` the variable name must be specified, as in `xy3 <- x * y`.

---

#### `ifelse` and `grepl`

The `ifelse` command is a useful one when creating new variables (invariably factors) from an existing variable.

The long way ...

```r
data(mtcars)

# In mtcars, am = 0 is an automatic transmission and am = 1 is manual
mtcars$automatic2[mtcars$am == 0] = "Automatic"
mtcars$automatic2[mtcars$am == 1] = "Manual"
```

The more efficient way ...
```r
ifelse(mtcars$am == 0, "Automatic", "Manual") -> mtcars$automatic
```

If you have strings, then `grepl` comes into play:

```r
load("data/movies.RData")

ifelse(grepl("Comedy", movies$genres), "Yes", "No") -> movies$comedy

ifelse(grepl("(1995)", movies$title), "Yes", "No") -> movies$in1995

ifelse(grepl("Night", movies$title), "Yes", "No") -> movies$night
```

In short, the function looks for the specified string (“Comedy”) in genres and, if found, adds a flag of “Yes”; if not found, it adds a flag of “No”.

---

But what if I wanted to run more complicated expressions, for example if the words “Night” or “Dark” or “Cool” are used in the title?

```r
ifelse(
  grepl("Night", movies$title), "Yes",
  ifelse(grepl("Dark", movies$title), "Yes",
         ifelse(grepl("Cool", movies$title), "Yes", "No"))
  ) -> movies$tricky
```

Ah `grepl`, you beauty!! Nesting our way to what we want ...

`grep()` and `grepl()` differ in that both functions find patterns but return different output types:

- `grep()` returns numeric values that index the locations where the patterns were found
- `grepl()` returns a logical vector in which TRUE represents a pattern match and FALSE does not

Run the two commands that follow and focus on the output that scrolls in the console.

```r
grep("Comedy", movies$genres)
grepl("Comedy", movies$genres)
```

---

You could use both for sub-setting data as well.

```r
movies[grep("Comedy", movies$genres), ] -> mov1

movies[-grep("Comedy", movies$genres), ] -> mov2
```

Say I only want columns that have the letter “t” in them. No problem:

```r
movies[, grep("t", colnames(movies))] -> mov3 # keep if t is in column name

movies[, -grep("t", colnames(movies))] -> mov4 # keep if t is not in column name
```

Try using `grepl` instead of `grep` with the preceding two commands and see what results ...

```r
movies[, grepl("t", colnames(movies))] -> mov3

movies[, -grepl("t", colnames(movies))] -> mov4
```

---

class: inverse, middle, center

# .fat[.fancy[ sums and means ]]

---

```r
data.frame(
  x = c(0, 1, 2, 3, 4, NA),
  y = c(10, 11, 12, NA, 14, 15)
  ) -> dfs

dfs
```

```
##    x  y
## 1  0 10
## 2  1 11
## 3  2 12
## 4  3 NA
## 5  4 14
## 6 NA 15
```

`rowSums()`: for each row, sum the columns so that you end up with a row total

```r
rowSums(dfs, na.rm = TRUE) # sum columns 1 and 2 for each row
```

```
## [1] 10 12 14  3 18 15
```

`colSums()`: for each column, calculate the sum of its values

```r
colSums(dfs, na.rm = TRUE)
```

```
##  x  y 
## 10 62
```

---

`rowMeans()`: for each row, calculate the mean of the columns

```r
rowMeans(dfs, na.rm = TRUE)
```

```
## [1]  5  6  7  3  9 15
```

`colMeans()`: for each column, calculate the mean

```r
colMeans(dfs, na.rm = TRUE)
```

```
##    x    y 
##  2.0 12.4
```

---

class: right, middle

<img class="circle" src="https://github.com/aniruhil.png" width="175px"/>

# Find me at...
[@aruhil](http://twitter.com/aruhil)

[aniruhil.org](https://aniruhil.org)

[ruhil@ohio.edu](mailto:ruhil@ohio.edu)