Basic Data Operations

Ani Ruhil

Basic Data Operations

The previous module introduced you to basic graphics with {ggplot2}. In this module we shift gears and look at some very basic operations often performed on our data prior to or during the analysisphase. How to create factors, convert across variable types, create a grouped version of a numeric variable, and so on.

Author

Affiliation

Ani Ruhil

Published

Feb. 16, 2025

Citation

Ruhil, 2025

Our goal in this module is to understand some basic operations in R. Specifically, we will look at combining two or more data-sets, modifying variables, creating variables, converting quantitative variables into qualitative variables, and so on. There is a lot more we could learn here but I will focus on the essential tasks you are likely to run into.

Data Types

A typical data-frame will look like the following:

library(here)
load(
  here("data", "hsb2.RData")
  )

hsb2 is a typical data-frame with columns, each representing a variable, and rows, each representing an observation. Variables can be numeric (num), such as id, write, read, etc. or then a categorical variable that R calls a factor (Factor). The numeric class includes an integer (a limited range of values) and double (double precision format) but these distinctions should not be of any consequence to you at the moment. If you have dates with/without times, the variable type will show up as dates, some variables may show up as logical (TRUE/FALSE/NA), and yet others as character (for example, “A”, “Bitcoin”, “p@12359” and so on).

You have to be careful though since your data might have numeric codes for categorical variables. For example, if we read in the following file you will see this issue in effect:

read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE,
  sep = ","
  ) -> hsb

These are the same data as we have in hsb2. Here, the categorical variables female, race, ses, schtyp, prog are carrying numeric codes that represent different attributes. Each also shows up as an integer (note the int) but that is obviously incorrect. The actual values of each should map to specific attributes as show below.

female = (0/1)
race = (1=hispanic 2=asian 3=african-amer 4=white)
ses = socioeconomic status (1=low 2=middle 3=high)
schtyp = type of school (1=public 2=private)
prog = type of program (1=general 2=academic 3=vocational)

This happens very often and as a result we end up having to convert these pseudo-numeric variables into proper categorical variables that R calls factors. This is done as follows:

factor(hsb$female,
       levels = c(0, 1), 
       labels = c("Male", "Female")
       ) -> hsb$female 

factor(hsb$race,
       levels = c(1:4), 
       labels = c("Hispanic", "Asian", "African American", "White")
       ) -> hsb$race 

factor(hsb$ses,
       levels = c(1:3), 
       labels = c("Low", "Middle", "High")
       ) -> hsb$ses 

factor(hsb$schtyp,
       levels = c(1:2), 
       labels = c("Public", "Private")
       ) -> hsb$schtyp 

factor(hsb$prog,
       levels = c(1:3),
       labels = c("General", "Academic", "Vocational")
       ) -> hsb$prog

factor(dataframe$variable, levels = c(...), labels = c(...))

Notice the pattern here: The command is asking R to convert the original variable into a factor() by specifying the target variable’s name hsb$female, then telling R what the unique values are via levels = c() and then listing how each unique value should be labeled via labels = c().

The only way to label things correctly is to know what each value maps to (is female = 1 a Female or a Male?) but luckily we have that information here.

Of course, we could have chosen not to overwrite the original variables but instead created new ones as factors. This is shown below, and in my opinion would be the more prudent approach (and has now become my habit).

read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE,
  sep = ","
  ) -> hsb 

factor(hsb$female,
       levels = c(0, 1), 
       labels = c("Male", "Female")
       ) -> hsb$female.f 

factor(hsb$race,
       levels = c(1:4), 
       labels = c("Hispanic", "Asian", "African American", "White")
       ) -> hsb$race.f 

factor(hsb$ses,
       levels = c(1:3), 
       labels = c("Low", "Middle", "High")
       ) -> hsb$ses.f 

factor(hsb$schtyp,
       levels = c(1:2), 
       labels = c("Public", "Private")
       ) -> hsb$schtyp.f 

factor(hsb$prog,
       levels = c(1:3),
       labels = c("General", "Academic", "Vocational")
       ) -> hsb$prog.f

Numbers masquerading as characters or factors

At times numbers will be read in as characters or factors, as is evident from the example below:

c(1, 2, 3, 4, 5) -> x1 

c(1, 2, 3, 4, "a5") -> x2 

c(1, 2, 3, 4, "hello") -> x3 

c(1, 2, 3, 4, NA) -> x4

 num [1:5] 1 2 3 4 5

 chr [1:5] "1" "2" "3" "4" "a5"

 chr [1:5] "1" "2" "3" "4" "hello"

 num [1:5] 1 2 3 4 NA

Notice that x1 and x4 are read in as num (numeric) but x2 and x3 show up as chr (character). You could convert x2 and x3 into numeric as follows:

as.numeric(x2) -> x2.num 

as.numeric(x3) -> x3.num

 num [1:5] 1 2 3 4 NA

 num [1:5] 1 2 3 4 NA

The NA values are not applicable values, essentially R-speak for missing values.

On other occasions, the variable may in fact be numeric but because of some anomalies in the variable, R may read it in as a factor. When this happens, converting the variable to numeric must be done with care. Just running as.numeric() will not work. See the example below where age is flagged as a factor because the original variable was originally recorded and exported with double-quotation marks around each value.

c(
  100, 101, 102, 103,
  104, 105, 106
  ) -> score 

c(
  "Male", "Female", "Male", "Female", 
  "Female", "Male", "Female"
  ) -> sex 

c(
  "18", "18", "19", "19", 
  "21", "21", "NA"
  ) -> age 

cbind.data.frame(
  score, sex, age
  ) -> my.df

I can flip age into a numeric measure via as.numeric() …

as.numeric(my.df$age) -> my.df$age.num1 

str(my.df$age.num1)

 num [1:7] 18 18 19 19 21 21 NA

Operating on numeric variables

We often need to transform an original numeric variable, stored perhaps as a proportion that we want to convert into a percentage. Or we may want to square it, divide it by 100, and so on. These operations are very straightforward.

data.frame(
  x = c(0.1, 0.2, 0.3, 0.4)
  ) -> my.df 

my.df$x.pct = my.df$x * 100
my.df$x.div10 = my.df$x / 10
my.df$x.squared = my.df$x^2
my.df$x.cubed = my.df$x^3
my.df$x.sqroot = sqrt(my.df$x)
my.df$x.doubled = my.df$x * 2
my.df$x.fourths = my.df$x ^ (1/4)
x.pctwrong = my.df$x * 100

Binning

We can also group numeric data into bins. Let us see this with our reading scores from the hsb data-set.

summary(hsb)

       id             female           race           ses       
 Min.   :  1.00   Min.   :0.000   Min.   :1.00   Min.   :1.000  
 1st Qu.: 50.75   1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.000  
 Median :100.50   Median :1.000   Median :4.00   Median :2.000  
 Mean   :100.50   Mean   :0.545   Mean   :3.43   Mean   :2.055  
 3rd Qu.:150.25   3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:3.000  
 Max.   :200.00   Max.   :1.000   Max.   :4.00   Max.   :3.000  
     schtyp          prog            read           write      
 Min.   :1.00   Min.   :1.000   Min.   :28.00   Min.   :31.00  
 1st Qu.:1.00   1st Qu.:2.000   1st Qu.:44.00   1st Qu.:45.75  
 Median :1.00   Median :2.000   Median :50.00   Median :54.00  
 Mean   :1.16   Mean   :2.025   Mean   :52.23   Mean   :52.77  
 3rd Qu.:1.00   3rd Qu.:2.250   3rd Qu.:60.00   3rd Qu.:60.00  
 Max.   :2.00   Max.   :3.000   Max.   :76.00   Max.   :67.00  
      math          science          socst        female.f  
 Min.   :33.00   Min.   :26.00   Min.   :26.0   Male  : 91  
 1st Qu.:45.00   1st Qu.:44.00   1st Qu.:46.0   Female:109  
 Median :52.00   Median :53.00   Median :52.0               
 Mean   :52.65   Mean   :51.85   Mean   :52.4               
 3rd Qu.:59.00   3rd Qu.:58.00   3rd Qu.:61.0               
 Max.   :75.00   Max.   :74.00   Max.   :71.0               
              race.f       ses.f       schtyp.f          prog.f   
 Hispanic        : 24   Low   :47   Public :168   General   : 45  
 Asian           : 11   Middle:95   Private: 32   Academic  :105  
 African American: 20   High  :58                 Vocational: 50  
 White           :145

Remember, we want meaningful groups, no fewer than 4/5 and staying below 8/9 unless we need more. To create the groups,

Start by calculating the range of the variable (maximum value - minimum value)
Divide this range by the number of groups we’d like
Round up to the resulting value, which becomes the width of the interval

For example, the range of the reading scores is 48. Say we want 5 groups. Dividing 48 by 5 yields 9.6 so we round up to 10. Now we see what the groups might be: 28-38, 38-48, 48-58, 58-68, 68-78. Now we ask R to create a new variable that represents these groups.

cut(
  hsb$read,
  breaks = c(28, 38, 48, 58, 68, 78)
  ) -> hsb$grouped_read 

table(hsb$grouped_read)


(28,38] (38,48] (48,58] (58,68] (68,78] 
     13      69      61      47       9

Uh oh! we have a total of 199 so what happened to the 200th? The one that wasn’t grouped is a reading score of 28. Why? Because when you group, you have to decide whether the intervals are left-open or right-open … If I run into a 38, should it go in the 38-48 group or the 28-38 group? This decision can be invoked by adding right = FALSE or right = TRUE (which is the default). So let me set right = FALSE.

cut(
  hsb$read, 
  breaks = c(28, 38, 48, 58, 68, 78),
  right = FALSE
  ) -> hsb$grouped_read2 

table(hsb$grouped_read2)


[28,38) [38,48) [48,58) [58,68) [68,78) 
     14      68      62      36      20

Now open up the data-frame and arrange the rows in ascending order of read. Find read = 48. With the default right = TRUE this score was put in the 38-48 group but with right = FALSE it was put in the 48-58 group. So right = FALSE only include values in a group (a,b] if the value is . With right = TRUE only include values in a group [a, b) if the value is .

More generally you’ll see folks just specify the number of cuts they want.

cut(
  hsb$read,
  breaks = 5
  ) -> hsb$grouped_read3 

table(hsb$grouped_read3)


  (28,37.6] (37.6,47.2] (47.2,56.8] (56.8,66.4]   (66.4,76] 
         14          68          48          50          20

But since age cannot include a decimal value this may be unsuitable for our purposes. Of course, we could have avoided this open/closed business of the interval by simply doing this:

cut(
  hsb$read,
  breaks = c(25, 35, 45, 55, 65, 75, 85)
  ) -> hsb$grouped_read4 

table(hsb$grouped_read4)


(25,35] (35,45] (45,55] (55,65] (65,75] (75,85] 
      9      45      76      49      19       2

And there you have it!

… with `{santoku}`

David Hugh-Jones has authored a package called santoku that looks to replace the cut() command with a flexible chop command.

Note that breaks are closed on the left by default. If you want to disable this, use the left = FALSE switch in the command.

Say we want the reading scores grouped in 10-point intervals.

library(santoku)
chop_width(
  hsb$read, width = 10
  ) -> hsb$rchop1

table(hsb$rchop1)


[28, 38) [38, 48) [48, 58) [58, 68) [68, 78] 
      14       68       62       36       20

If you want 5, evenly-spaced intervals, which will give each group with a width of exactly

chop_evenly(
  hsb$read, 
  intervals = 5
  ) -> hsb$rchop2

table(hsb$rchop2)


  [28, 37.6) [37.6, 47.2) [47.2, 56.8) [56.8, 66.4)   [66.4, 76] 
          14           68           48           50           20

What if we want equally spaced cut-points, maybe the quintiles or the quartiles, respectively??

chop_equally(
  hsb$read, groups = 5
  ) -> hsb$rchop3

table(hsb$rchop3)


[28, 44) [44, 47) [47, 55) [55, 63) [63, 76] 
      39       16       62       37       46

chop_equally(
  hsb$read, 
  groups = 4
  ) -> hsb$rchop4

table(hsb$rchop4)


[28, 44) [44, 50) [50, 60) [60, 76] 
      39       44       61       56

What if you want specific quantiles?

chop_quantiles(
  hsb$read, 
  c(0.05, 0.25, 0.50, 0.75, 0.95)
  ) -> hsb$rchop5

table(hsb$rchop5)


   [0%, 5%)   [5%, 25%)  [25%, 50%)  [50%, 75%)  [75%, 95%) 
          9          30          44          61          36 
[95%, 100%] 
         20

Cut-points in terms of the mean and standard deviation?

chop_mean_sd(
  hsb$read
  ) -> hsb$rchop6

table(hsb$rchop6)


[-3 sd, -2 sd) [-2 sd, -1 sd)  [-1 sd, 0 sd)   [0 sd, 1 sd) 
             2             22             91             39 
  [1 sd, 2 sd)   [2 sd, 3 sd] 
            39              7

If you want custom labels …

chop_quantiles(
  hsb$read,
  c(0.05, 0.25, 0.50, 0.75, 0.95),
  labels = lbl_dash()
  ) -> hsb$rchop7

table(hsb$rchop7)


   0%—5%   5%—25%  25%—50%  50%—75%  75%—95% 95%—100% 
       9       30       44       61       36       20

chop_quantiles(
  hsb$read,
  c(0.05, 0.25, 0.50, 0.75, 0.95),
  labels = lbl_seq("(A)")
  ) -> hsb$rchop8

table(hsb$rchop8)


(A) (B) (C) (D) (E) (F) 
  9  30  44  61  36  20

Lowercase and Uppercase

data.frame(
  sex = c("M", "F", "M", "F"), 
  NAMES = c("Andy", "Jill", "Jack", "Madison"), 
  age = c(24, 48, 72, 96)
  ) -> my.df

Maybe I want the names of the individuals to be all uppercase, or perhaps all lowercase. I can do this via:

tolower(my.df$sex) -> my.df$sex.lower

Maybe it is not the values but the column names that I want to convert to lowercase. This is easily done via

colnames(my.df)[2] = "names"

Note: I asked for the second column’s name to be changed to “names”. Since that was the only column that was in uppercase I could have achieved the same thing by running:

data.frame(
  sex = c("M", "F", "M", "F"), 
  NAMES = c("Andy", "Jill", "Jack", "Madison"), 
  age = c(24, 48, 72, 96)
  ) -> my.df 

tolower(
  colnames(my.df)
  ) -> colnames(my.df)

By the same logic, I could convert each column name to uppercase by typing

toupper(
  colnames(my.df)
  ) -> colnames(my.df)

And then of course, using the following to change a specific column name to uppercase:

data.frame(
  sex = c("M", "F", "M", "F"), 
  NAMES = c("Andy", "Jill", "Jack", "Madison"), 
  age = c(24, 48, 72, 96)
  ) -> my.df 

toupper(
  colnames(my.df)[1]
  ) -> colnames(my.df)[1] 

colnames(my.df)[3] = "AGE"

You may also want to experiment with janitor, a handy package that does a lot but I use it most often for cleaning up messy column names in a data-frame. What should be clear by now is that there are many ways to end up with the same result with R code. This is one of the wonderful features of R.

A little more on factors

With the hsb data we converted some variables into factors. Here I want to show you a couple of things to pay attention to.

First, I can create a new variable and store it as a factor as follows:

my.df$female1[my.df$SEX == "M"] = 0 

my.df$female1[my.df$SEX == "F"] = 1 

factor(
  my.df$female1, 
  levels = c(0, 1),
  labels = c("Male", "Female")
  ) -> my.df$female1

I could have also skipped the 0/1 business and just done this:

my.df$female2[my.df$SEX == "M"] = "Male" 
my.df$female2[my.df$SEX == "F"] = "Female" 

factor(my.df$female2) -> my.df$female2

Notice the difference between female1 and female2

female1: Since you specified levels = c(0, 1) and indicated the lowest level be mapped to Male, the factor is built in that order … Male first, then Female
female2: Here we did not specify the levels and hence R falls back on its default of building them up alphabetically … Female first, then Male

Lets see the question of factor levels with a different example. Say we had the following variable, 10 responses to a survey question.

data.frame(
  x = c(
    rep("Disagree", 3),
    rep("Neutral", 2),
    rep("Agree", 5)
    )
  ) -> fdf 

fdf$responses <- factor(fdf$x)

levels(fdf$responses)

[1] "Agree"    "Disagree" "Neutral"

table(fdf$responses)


   Agree Disagree  Neutral 
       5        3        2

This makes no sense since we’d like the logical ordering of Disagree -> Neutral -> Agree and this can be achieved via ordered(), making sure to specify the order of the levels with levels = c().

ordered(
  fdf$responses, 
  levels = c("Disagree", "Neutral", "Agree")
  ) -> fdf$newresponses 

levels(fdf$newresponses)

[1] "Disagree" "Neutral"  "Agree"

min(fdf$newresponses)

[1] Disagree
Levels: Disagree < Neutral < Agree

table(fdf$newresponses)


Disagree  Neutral    Agree 
       3        2        5

I could have also generated this desired order when creating the factor as, for example, via

factor(
  fdf$x, 
  ordered = TRUE,
  levels = c("Disagree", "Neutral", "Agree")
  ) -> fdf$xordered 

levels(fdf$xordered)

[1] "Disagree" "Neutral"  "Agree"

min(fdf$xordered)

[1] Disagree
Levels: Disagree < Neutral < Agree

table(fdf$xordered)


Disagree  Neutral    Agree 
       3        2        5

Notice the min() command … which asks for the minimum level of a factor. This works with ordered factors but not with unordered factors; try it with x and with responses.

Before we move on, a word about the command we used to generate a new factor from an existing variable –

my.df$female2[my.df$SEX == "M"] = "Male" 
my.df$female2[my.df$SEX == "F"] = "Female"

I use this a lot when generating new variables, and even when fixing problems with an existing variable. For example, say we were given the data on the sex of these 10 individuals but the values included typos.

data.frame(
  mf = c(
    rep("Male", 3),
    rep("male", 2),
    rep("Female", 3),
    rep("femalE", 2)
    )
  ) -> sexdf 

factor(sexdf$mf) -> sexdf$mf 

levels(sexdf$mf)

[1] "femalE" "Female" "male"   "Male"

Obviously not what we’d like so we can go in and clean it up a bit. How? As follows:

sexdf$mf[sexdf$mf == "male"] = "Male" 
sexdf$mf[sexdf$mf == "femalE"] = "Female" 

levels(sexdf$mf)

[1] "femalE" "Female" "male"   "Male"

table(sexdf$mf)


femalE Female   male   Male 
     0      5      0      5

Wait a second, we still see femalE and male but with 0 counts; how should we get rid of these ghosts?

droplevels(sexdf$mf) -> sexdf$newmf 

levels(sexdf$newmf)

[1] "Female" "Male"

table(sexdf$newmf)


Female   Male 
     5      5

Having fixed this issue we can drop the original variable via

sexdf$mf = NULL

Sub-setting data

I may also need to drop a factor level. For example, say I am working with the diamond data-set and need to subset my analysis to diamonds that are Very Good or better.

library(ggplot2)
data(diamonds)
str(diamonds$cut)

 Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...

table(diamonds$cut)


     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551

I can subset as follows:

subset(
  diamonds, 
  cut != "Fair" & cut != "Good"
  ) -> dia.sub1 

str(dia.sub1$cut)

 Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 4 3 3 3 3 5 4 5 ...

table(dia.sub1$cut)


     Fair      Good Very Good   Premium     Ideal 
        0         0     12082     13791     21551

Dropping factor levels

Aha! I see no diamonds in the excluded cut ratings but str() still shows R remembering cut as having 5 ratings and the frequency table shows up with zero counts for both so we drop these levels explicitly.

droplevels(dia.sub1$cut) -> dia.sub1$cutnew 
str(dia.sub1$cutnew)

 Ord.factor w/ 3 levels "Very Good"<"Premium"<..: 3 2 2 1 1 1 1 3 2 3 ...

table(dia.sub1$cutnew)


Very Good   Premium     Ideal 
    12082     13791     21551

Now a bit more on sub-setting data. You can subset with as simple or as complicated a condition you need. For example, maybe you only want Ideal diamonds with a certain minimum clarity and price. Sure, you can use this sub-setting condition as shown below:

subset(
  diamonds, 
  cut == "Ideal" & clarity == "VVS1" & price > 3933
  ) -> dia.sub2 

table(dia.sub2$cut)


     Fair      Good Very Good   Premium     Ideal 
        0         0         0         0       283

table(dia.sub2$clarity)


  I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
   0    0    0    0    0    0  283    0

min(dia.sub2$price)

[1] 3955

Here the sub-setting is generating a data-frame that will only include observations that meet all three requirements (since you used & in the command). If you had run the following instead you would have had very different results.

subset(
  diamonds, 
  cut == "Ideal" | clarity == "VVS1" | price > 3933
  ) -> dia.sub3 

subset(
  diamonds,
  cut == "Ideal" & clarity == "VVS1" | price > 3933
  ) -> dia.sub4 

subset(
  diamonds,
  cut == "Ideal" | clarity == "VVS1" & price > 3933
  ) -> dia.sub5

That is because & is saying “this criterion and that criterion both have to be met” while | is saying “either this criterion is met or that criterion is met”. So what you did was to ask for

dia.sub3 = either the cut is Ideal OR clarity is VVS1 OR price exceeds 3933. Let us check

table(dia.sub3$cut)


     Fair      Good Very Good   Premium     Ideal 
      656      2073      5310      6692     21551

table(dia.sub3$clarity)


  I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
 354 6106 8144 8108 5283 3248 3655 1384

min(dia.sub3$price)

[1] 326

dia.sub4 = either the cut is Ideal AND clarity is VVS1 OR price exceeds 3933

table(dia.sub4$cut)


     Fair      Good Very Good   Premium     Ideal 
      645      1922      4648      6182      8019

table(dia.sub4$clarity)


  I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
 261 4848 5450 4364 2660 1210 2321  302

min(dia.sub4$price)

[1] 414

dia.sub5 = either the cut is Ideal OR clarity is VVS1 AND price exceeds 3933

table(dia.sub5$cut)


     Fair      Good Very Good   Premium     Ideal 
        6        35       127       106     21551

table(dia.sub5$clarity)


  I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
 146 2598 4282 5071 3589 2606 2321 1212

min(dia.sub5$price)

[1] 326

You could also subset the data in different ways using, for example, %in%

diamonds[
  diamonds$cut == "Ideal" | diamonds$clarity == "VVS1" | diamonds$price > 3933, 
  ] -> dia.sub6 # same as dia.sub3 

diamonds[
  !(diamonds$cut %in% c("Fair", "Good")), 
  ] -> dia.sub7 # same as dia.sub1 

diamonds[
  diamonds$cut %in% c("Ideal", "Premium", "Very Good"), 
  ] -> dia.sub8 # same as dia.sub1 and dia.sub7

Exporting data

Often, after data have been processed in R, we need to share them with folks who don’t use R or want the data in a specific format. This is easily done, as shown below.

data.frame(
  Person = c("John", "Timothy", "Olivia", "Sebastian", "Serena"),
  Age = c(22, 24, 18, 24, 35)
  ) -> out.df 

write.csv(
  out.df,
  file = here("data", "out.csv"),
  row.names = FALSE
  ) # CSV format 

library(haven)
write_dta(out.df, here("data", "out.df.dta")) # Stata format
write_sav(out.df, here("data", "out.df.sav")) # SPSS format 
write_sas(out.df, here("data", "out.df.sas")) # SAS format 

library(writexl)
write_xlsx(out.df, here("data", "out.df.xlsx")) # Excel format

There are other packages that import/export data so be sure to check them out as well, in particular rio.

Merging data

Say I have data on some students, with test scores in one file and their gender in another file. Each file has a unique student identifier, called ID. How can I create a single data-set? Let us see each data-set first.

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
  ) -> data1 

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> data2

Open up both data-sets and note that students C31 and Q22 are missing from data2 and students E52 and H71 are missing from data1.

What about the merge? To merge data-frames we need to specify the merge key(s) – the variable(s) that identifies unique observations in each file. This variable(s) must be present in all the files you want to merge. So long as it exists, you now need to decide: Do you only want to

merge observations that show up in both files (natural join)
merge everything even if some cases show up in one but not the other (full outer join)
merge such that all cases in file x are retained but only those seen in file y are joined (left outer join)
merge to keep all cases in y and only those cases are to be merged from x that show up in y, referred to as (right outer join)

merge(
  x = data1,
  y = data2, 
  by = c("ID"),
  all = FALSE
  ) -> natural # only merge if ID is seen in both files

merge(
  x = data1,
  y = data2, 
  by = c("ID"),
  all = TRUE
  ) -> full # merge everything in both files

merge(
  x = data1,
  y = data2, 
  by = c("ID"),
  all.x = TRUE
  ) -> left # make sure all cases from file x are retained 

merge(
  x = data1,
  y = data2,
  by = c("ID"),
  all.y  = TRUE
  ) -> right # make sure all cases from file y are retained

What if the ID variables had different names in the files? You could rename them to have a common name or you could use by.x = and by.y = as shown below.

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
  ) -> data3 

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> data4 

merge(
  data3,
  data4,
  by.x = c("ID1"),
  by.y = c("ID2"),
  all = FALSE
  ) -> diffids

You can also have more than one merge key. For example, if I am merging data for Ohio schools, each district has a district ID number (dirn), each school has building ID number (birn), and then the district’s name (district), the building’s name (building), and so on. If I am merging these data then my by = statement will be by = c("dirn", "birn", "district", "building").

If we have multiple data-frames to merge we can do so in several ways but my default approach is to rely on the Reduce() command, as shown below.

data.frame(
  Score = c(10, 21, 33, 12, 35, 67, 43, 99),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
  ) -> df1 

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> df2 

data.frame(
  Age = c(6, 7, 6, 8, 8, 9, 10, 5),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
  ) -> df3 

list(
  df1, df2, df3
  ) -> my.list 

Reduce(
  function(...) 
    merge(..., 
          by = c("ID"), 
          all = TRUE
          ), 
  my.list
  ) -> df.123

Identifying and handling duplicates

Every now and then you also run into duplicate rows with no good explanation for how that could have happened. Here is an example:

data.frame(
  Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "A12", "WX1")
  ) -> dups.df

Students A12 and WX1 show up twice! There are two basic commands that we can use to find duplicates – unique and duplicated.

duplicated(dups.df) # flag duplicate rows with TRUE

[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

!duplicated(dups.df) # flag not duplicated rows with TRUE

[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

dups.df[duplicated(dups.df), ] # reveal the duplicated rows

     Sex  ID
7   Male A12
8 Female WX1

dups.df[!duplicated(dups.df), ] # reveal the not duplicated rows

     Sex  ID
1   Male A12
2 Female A23
3   Male Z14
4 Female WX1
5   Male Y31
6 Female D66

dups.df[!duplicated(dups.df), ] -> nodups.df # save a dataframe without duplicated rows

dups.df[duplicated(dups.df), ] -> onlydups.df # save a dataframe with ONLY the duplicated rows

Rearranging a data-frame

We may, at times, want to move the variables around so they are in a particular order (age next to race and so on, for example), perhaps drop some because there are too many variables and we don’t need all of them. We may also want to arrange the data-frame in ascending or descending order of some variable(s) (increasing order of age or price for example. These tasks are easily done.

mtcars[
  order(mtcars$mpg), 
  ] -> df1 # arrange the data-frame in ascending order of mpg 

mtcars[
  order(-mtcars$mpg), 
  ] -> df2 # arrange the data-frame in descending order of mpg   

mtcars[
  order(mtcars$am, mtcars$mpg), 
  ] -> df3 # arrange the data-frame in ascending order of automatic/manual and then in ascending order of mpg

What if I want to order the columns such that the first column is am and then comes mpg, followed by qsec and then the rest of the columns? Well, the first thing I want to do is see what variable is in which column via the names() command.

names(mtcars) # the original structure of the mtcars data-frame

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
[10] "gear" "carb"

So mpg is in column 1, cyl is in column 2, and carb is the last one in column 11.

Now I’ll specify that I want all rows by doing this [, c()] and then specify the order of the columns by entering the column numbers in c()

mtcars[
  ,
  c(9, 1, 7, 2:6, 8, 10:11)
  ] -> df4 # specifying the position of the columns 

names(df4)

 [1] "am"   "mpg"  "qsec" "cyl"  "disp" "hp"   "drat" "wt"   "vs"  
[10] "gear" "carb"

If I only wanted certain columns I could omit the other columns.

mtcars[
  ,
  c(9, 1, 7)
  ] -> df5 # specifying the position of the columns

Remember the comma in mtcars[, … this tells R to keep all rows in mtcars. If, instead, I did this I would only get the first four rows, not all.

mtcars[
  1:4,
  c(9, 1, 7)
  ] -> df6 # keep first four rows and the specified columns

I would end up with only the first four rows of mtcars, not every row. So try to remember the order: dataframe[i, j] where i reference the rows and j reference the columns. Folks often get tripped up over this switch.

Working with strings

Often you run into strings that have to be manipulated. Often, when working with county data, for example, county names are followed by County, Ohio and this is superfluous so we end up excising it

read.csv(
  here("data", "strings.csv")
  ) -> c.data 

head(c.data)

          GEO.id GEO.id2      GEO.display.label respop72016
1 0500000US39001   39001     Adams County, Ohio       27907
2 0500000US39003   39003     Allen County, Ohio      103742
3 0500000US39005   39005   Ashland County, Ohio       53652
4 0500000US39007   39007 Ashtabula County, Ohio       98231
5 0500000US39009   39009    Athens County, Ohio       66186
6 0500000US39011   39011  Auglaize County, Ohio       45894

Let us just retain the county name from GEO.display.label

gsub(
  " County, Ohio",
  "",
  c.data$GEO.display.label
  ) -> c.data$county 

head(c.data)

          GEO.id GEO.id2      GEO.display.label respop72016    county
1 0500000US39001   39001     Adams County, Ohio       27907     Adams
2 0500000US39003   39003     Allen County, Ohio      103742     Allen
3 0500000US39005   39005   Ashland County, Ohio       53652   Ashland
4 0500000US39007   39007 Ashtabula County, Ohio       98231 Ashtabula
5 0500000US39009   39009    Athens County, Ohio       66186    Athens
6 0500000US39011   39011  Auglaize County, Ohio       45894  Auglaize

Note the sequence gsub("look for this pattern", "replace with this pattern", mydata$myvariable)

Practice Tasks

Why are our best and most experienced employees leaving prematurely?

The data available here includes information on several current and former employees of an anonymous organization. Fields in the data-set include:

satisfaction_level = Level of satisfaction (0-1)
last_evaluation = Evaluation of employee performance (0-1)
number_project = Number of projects completed while at work
average_monthly_hours = Average monthly hours at workplace
time_spend_company = Number of years spent in the company
Work_accident = Whether the employee had a workplace accident (1 or 0)
left = Whether the employee left the workplace or not (1 or 0)
promotion_last_5years = Whether the employee was promoted in the last five years (1 or 0)
sales = Department in which they work for
salary = Relative level of salary (low med high)

Read in the data-set
Create new variables that represent all categorical variables as factors. Make sure you order the levels of the factor (if they need to be ordered).
Convert satisfaction_level from a 0-1 scale to a 0-100 scale
Convert time spent in the company to months rather than years
Order the columns such that left appears as the first column
Arrange the data-set such that the rows are in descending order of average_montly_hours (the column is misspelled in the original data)
Create a subset of these data that contains only records that match the following criteria:
1. The employee must not have had a work accident, AND
2. The employee must have been promoted in the last 5 years
3. Report the number of observations you end up with.
Make all column names lowercase for the data-set you created in (g)

Footnotes

The only way to label things correctly is to know what each value maps to (is female = 1 a Female or a Male?) but luckily we have that information here.[↩]

Citation

For attribution, please cite this work as

Ruhil (2025, Feb. 16). Basic Data Operations. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module03.html

BibTeX citation

@misc{ruhil2025basic,
  author = {Ruhil, Ani},
  title = {Basic Data Operations},
  url = {https://aniruhil.org/courses/mpa6020/handouts/module03.html},
  year = {2025}
}

Basic Data Operations

Author

Affiliation

Published

Citation

Data Types

Numbers masquerading as characters or factors

Operating on numeric variables

Binning

… with {santoku}

Lowercase and Uppercase

A little more on factors

Sub-setting data

Dropping factor levels

Exporting data

Merging data

Identifying and handling duplicates

Rearranging a data-frame

Working with strings

Practice Tasks

Footnotes

Citation

… with `{santoku}`