The previous module introduced you to basic graphics with
{ggplot2}
. In this module we shift gears and look at some
very basic operations often performed on our data prior to or during the
analysisphase. How to create factors, convert across variable types,
create a grouped version of a numeric variable, and so on.
Our goal in this module is to understand some basic operations in R. Specifically, we will look at combining two or more data-sets, modifying variables, creating variables, converting quantitative variables into qualitative variables, and so on. There is a lot more we could learn here but I will focus on the essential tasks you are likely to run into.
A typical data-frame will look like the following:
hsb2
is a typical data-frame with columns
,
each representing a variable, and rows
, each representing
an observation. Variables can be numeric (num
), such as
id
, write
, read
, etc. or then a
categorical variable that R calls a factor (Factor
). The
numeric
class includes an integer
(a limited
range of values) and double
(double precision format) but
these distinctions should not be of any consequence to you at the
moment. If you have dates with/without times, the variable type will
show up as dates
, some variables may show up as
logical
(TRUE/FALSE/NA), and yet others as
character
(for example, “A”, “Bitcoin”, “p@12359” and so on).
You have to be careful though since your data might have numeric codes for categorical variables. For example, if we read in the following file you will see this issue in effect:
read.table(
'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
header = TRUE,
sep = ","
) -> hsb
These are the same data as we have in hsb2
. Here, the
categorical variables female, race, ses, schtyp, prog
are
carrying numeric codes that represent different attributes. Each also
shows up as an integer (note the int
) but that is obviously
incorrect. The actual values of each should map to specific attributes
as show below.
This happens very often and as a result we end up having to convert
these pseudo-numeric variables into proper categorical variables that R
calls factors
. This is done as follows:
factor(hsb$female,
levels = c(0, 1),
labels = c("Male", "Female")
) -> hsb$female
factor(hsb$race,
levels = c(1:4),
labels = c("Hispanic", "Asian", "African American", "White")
) -> hsb$race
factor(hsb$ses,
levels = c(1:3),
labels = c("Low", "Middle", "High")
) -> hsb$ses
factor(hsb$schtyp,
levels = c(1:2),
labels = c("Public", "Private")
) -> hsb$schtyp
factor(hsb$prog,
levels = c(1:3),
labels = c("General", "Academic", "Vocational")
) -> hsb$prog
factor(dataframe$variable, levels = c(...), labels = c(...))
Notice the pattern here: The command is asking R to convert the
original variable into a factor()
by specifying the target
variable’s name hsb$female
, then telling R what the unique
values are via levels = c()
and then listing how each
unique value should be labeled via labels = c()
.1
Of course, we could have chosen not to overwrite the original variables but instead created new ones as factors. This is shown below, and in my opinion would be the more prudent approach (and has now become my habit).
read.table(
'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
header = TRUE,
sep = ","
) -> hsb
factor(hsb$female,
levels = c(0, 1),
labels = c("Male", "Female")
) -> hsb$female.f
factor(hsb$race,
levels = c(1:4),
labels = c("Hispanic", "Asian", "African American", "White")
) -> hsb$race.f
factor(hsb$ses,
levels = c(1:3),
labels = c("Low", "Middle", "High")
) -> hsb$ses.f
factor(hsb$schtyp,
levels = c(1:2),
labels = c("Public", "Private")
) -> hsb$schtyp.f
factor(hsb$prog,
levels = c(1:3),
labels = c("General", "Academic", "Vocational")
) -> hsb$prog.f
At times numbers will be read in as characters or factors, as is evident from the example below:
num [1:5] 1 2 3 4 5
chr [1:5] "1" "2" "3" "4" "a5"
chr [1:5] "1" "2" "3" "4" "hello"
num [1:5] 1 2 3 4 NA
Notice that x1 and x4 are read in as num
(numeric) but
x2 and x3 show up as chr
(character). You could convert x2
and x3 into numeric as follows:
as.numeric(x2) -> x2.num
as.numeric(x3) -> x3.num
num [1:5] 1 2 3 4 NA
num [1:5] 1 2 3 4 NA
The NA
values are not applicable
values,
essentially R-speak for missing values
.
On other occasions, the variable may in fact be numeric but because
of some anomalies in the variable, R may read it in as a factor. When
this happens, converting the variable to numeric must be done with care.
Just running as.numeric()
will not work. See the example
below where age
is flagged as a factor because the original
variable was originally recorded and exported with double-quotation
marks around each value.
c(
100, 101, 102, 103,
104, 105, 106
) -> score
c(
"Male", "Female", "Male", "Female",
"Female", "Male", "Female"
) -> sex
c(
"18", "18", "19", "19",
"21", "21", "NA"
) -> age
cbind.data.frame(
score, sex, age
) -> my.df
If I try as.numeric()
the conversion uses the factor
coding 1, 2, 3, 4 instead of the actual age values. So what is
recommended is that you run
as.numeric(levels(data$variable))[data$variable]
instead.
You can see the difference by comparing age.num1
against
age.num2
.
as.numeric(my.df$age) -> my.df$age.num1 # Incorrect approach
str(my.df$age.num1)
num [1:7] 18 18 19 19 21 21 NA
as.numeric(levels(my.df$age))[my.df$age] -> my.df$age.num2 # Correct approach
str(my.df$age.num2)
num [1:7] NA NA NA NA NA NA NA
We often need to transform an original numeric variable, stored perhaps as a proportion that we want to convert into a percentage. Or we may want to square it, divide it by 100, and so on. These operations are very straightforward.
data.frame(
x = c(0.1, 0.2, 0.3, 0.4)
) -> my.df
my.df$x.pct = my.df$x * 100
my.df$x.div10 = my.df$x / 10
my.df$x.squared = my.df$x^2
my.df$x.cubed = my.df$x^3
my.df$x.sqroot = sqrt(my.df$x)
my.df$x.doubled = my.df$x * 2
my.df$x.fourths = my.df$x ^ (1/4)
x.pctwrong = my.df$x * 100
We can also group numeric data into bins
. Let us see
this with our reading scores from the hsb
data-set.
summary(hsb)
id female race ses
Min. : 1.00 Min. :0.000 Min. :1.00 Min. :1.000
1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.000
Median :100.50 Median :1.000 Median :4.00 Median :2.000
Mean :100.50 Mean :0.545 Mean :3.43 Mean :2.055
3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:3.000
Max. :200.00 Max. :1.000 Max. :4.00 Max. :3.000
schtyp prog read write
Min. :1.00 Min. :1.000 Min. :28.00 Min. :31.00
1st Qu.:1.00 1st Qu.:2.000 1st Qu.:44.00 1st Qu.:45.75
Median :1.00 Median :2.000 Median :50.00 Median :54.00
Mean :1.16 Mean :2.025 Mean :52.23 Mean :52.77
3rd Qu.:1.00 3rd Qu.:2.250 3rd Qu.:60.00 3rd Qu.:60.00
Max. :2.00 Max. :3.000 Max. :76.00 Max. :67.00
math science socst female.f
Min. :33.00 Min. :26.00 Min. :26.00 Male : 91
1st Qu.:45.00 1st Qu.:44.00 1st Qu.:46.00 Female:109
Median :52.00 Median :53.00 Median :52.00
Mean :52.65 Mean :51.85 Mean :52.41
3rd Qu.:59.00 3rd Qu.:58.00 3rd Qu.:61.00
Max. :75.00 Max. :74.00 Max. :71.00
race.f ses.f schtyp.f prog.f
Hispanic : 24 Low :47 Public :168 General : 45
Asian : 11 Middle:95 Private: 32 Academic :105
African American: 20 High :58 Vocational: 50
White :145
Remember, we want meaningful groups, no fewer than 4/5 and staying below 8/9 unless we need more. To create the groups,
For example, the range of the reading scores is 48. Say we want 5 groups. Dividing 48 by 5 yields 9.6 so we round up to 10. Now we see what the groups might be: 28-38, 38-48, 48-58, 58-68, 68-78. Now we ask R to create a new variable that represents these groups.
(28,38] (38,48] (48,58] (58,68] (68,78]
13 69 61 47 9
Uh oh! we have a total of 199 so what happened to the 200th? The one
that wasn’t grouped is a reading score of 28. Why? Because when you
group, you have to decide whether the intervals are
left-open
or right-open
… If I run into a 38,
should it go in the 38-48 group or the 28-38 group? This decision can be
invoked by adding right = FALSE
or
right = TRUE
(which is the default). So let me set
right = FALSE
.
cut(
hsb$read,
breaks = c(28, 38, 48, 58, 68, 78),
right = FALSE
) -> hsb$grouped_read2
table(hsb$grouped_read2)
[28,38) [38,48) [48,58) [58,68) [68,78)
14 68 62 36 20
Now open up the data-frame and arrange the rows in ascending order of
read
. Find read = 48
. With the default
right = TRUE
this score was put in the 38-48 group but with
right = FALSE
it was put in the 48-58 group. So
right = FALSE
only include values in a group
(a,b]
if the value is \(a <= x
< b\). With right = TRUE
only include values in a
group [a, b)
if the value is \(a
< x <= b\).
More generally you’ll see folks just specify the number of cuts they want.
(28,37.6] (37.6,47.2] (47.2,56.8] (56.8,66.4] (66.4,76]
14 68 48 50 20
But since age cannot include a decimal value this may be unsuitable for our purposes. Of course, we could have avoided this open/closed business of the interval by simply doing this:
cut(
hsb$read,
breaks = c(25, 35, 45, 55, 65, 75, 85)
) -> hsb$grouped_read4
table(hsb$grouped_read4)
(25,35] (35,45] (45,55] (55,65] (65,75] (75,85]
9 45 76 49 19 2
And there you have it!
{santoku}
David Hugh-Jones has
authored a package called santoku that
looks to replace the cut()
command with a flexible
chop
command.
Say we want the reading scores grouped in 10-point intervals.
library(santoku)
chop_width(
hsb$read, width = 10
) -> hsb$rchop1
table(hsb$rchop1)
[28, 38) [38, 48) [48, 58) [58, 68) [68, 78)
14 68 62 36 20
If you want 5, evenly-spaced intervals, which will give each group with a width of exactly \(\dfrac{76 - 28}{5} = 9.6\)
chop_evenly(
hsb$read,
intervals = 5
) -> hsb$rchop2
table(hsb$rchop2)
[28, 37.6) [37.6, 47.2) [47.2, 56.8) [56.8, 66.4) [66.4, 76]
14 68 48 50 20
What if we want equally spaced cut-points, maybe the quintiles or the quartiles, respectively??
chop_equally(
hsb$read, groups = 5
) -> hsb$rchop3
table(hsb$rchop3)
[0%, 20%) [20%, 40%) [40%, 60%) [60%, 80%) [80%, 100%]
39 16 62 37 46
chop_equally(
hsb$read,
groups = 4
) -> hsb$rchop4
table(hsb$rchop4)
[0%, 25%) [25%, 50%) [50%, 75%) [75%, 100%]
39 44 61 56
What if you want specific quantiles?
chop_quantiles(
hsb$read,
c(0.05, 0.25, 0.50, 0.75, 0.95)
) -> hsb$rchop5
table(hsb$rchop5)
[0%, 5%) [5%, 25%) [25%, 50%) [50%, 75%) [75%, 95%]
9 30 44 61 47
(95%, 100%]
9
Cut-points in terms of the mean and standard deviation?
chop_mean_sd(
hsb$read
) -> hsb$rchop6
table(hsb$rchop6)
[-3 sd, -2 sd) [-2 sd, -1 sd) [-1 sd, 0 sd) [0 sd, 1 sd)
2 22 91 39
[1 sd, 2 sd) [2 sd, 3 sd)
39 7
If you want custom labels …
chop_quantiles(
hsb$read,
c(0.05, 0.25, 0.50, 0.75, 0.95),
labels = lbl_dash()
) -> hsb$rchop7
table(hsb$rchop7)
0% - 5% 5% - 25% 25% - 50% 50% - 75% 75% - 95% 95% - 100%
9 30 44 61 47 9
chop_quantiles(
hsb$read,
c(0.05, 0.25, 0.50, 0.75, 0.95),
labels = lbl_seq("(A)")
) -> hsb$rchop8
table(hsb$rchop8)
(A) (B) (C) (D) (E) (F)
9 30 44 61 47 9
data.frame(
sex = c("M", "F", "M", "F"),
NAMES = c("Andy", "Jill", "Jack", "Madison"),
age = c(24, 48, 72, 96)
) -> my.df
Maybe I want the names of the individuals to be all uppercase, or perhaps all lowercase. I can do this via:
tolower(my.df$sex) -> my.df$sex.lower
Maybe it is not the values but the column names that I want to convert to lowercase. This is easily done via
colnames(my.df)[2] = "names"
Note: I asked for the second column’s name to be changed to “names”. Since that was the only column that was in uppercase I could have achieved the same thing by running:
By the same logic, I could convert each column name to uppercase by typing
And then of course, using the following to change a specific column name to uppercase:
You may also want to experiment with janitor, a handy package that does a lot but I use it most often for cleaning up messy column names in a data-frame. What should be clear by now is that there are many ways to end up with the same result with R code. This is one of the wonderful features of R.
With the hsb
data we converted some variables into
factors. Here I want to show you a couple of things to pay attention
to.
First, I can create a new variable and store it as a factor as follows:
I could have also skipped the 0/1
business and just done
this:
my.df$female2[my.df$SEX == "M"] = "Male"
my.df$female2[my.df$SEX == "F"] = "Female"
factor(my.df$female2) -> my.df$female2
Notice the difference between female1
and
female2
levels = c(0, 1)
and
indicated the lowest level be mapped to Male, the factor is built in
that order … Male first, then FemaleLets see the question of factor levels with a different example. Say we had the following variable, 10 responses to a survey question.
data.frame(
x = c(
rep("Disagree", 3),
rep("Neutral", 2),
rep("Agree", 5)
)
) -> fdf
fdf$responses <- factor(fdf$x)
levels(fdf$responses)
[1] "Agree" "Disagree" "Neutral"
table(fdf$responses)
Agree Disagree Neutral
5 3 2
This makes no sense since we’d like the logical ordering of Disagree
-> Neutral -> Agree and this can be achieved via
ordered()
, making sure to specify the order of the levels
with levels = c()
.
ordered(
fdf$responses,
levels = c("Disagree", "Neutral", "Agree")
) -> fdf$newresponses
levels(fdf$newresponses)
[1] "Disagree" "Neutral" "Agree"
min(fdf$newresponses)
[1] Disagree
Levels: Disagree < Neutral < Agree
table(fdf$newresponses)
Disagree Neutral Agree
3 2 5
I could have also generated this desired order when creating the factor as, for example, via
factor(
fdf$x,
ordered = TRUE,
levels = c("Disagree", "Neutral", "Agree")
) -> fdf$xordered
levels(fdf$xordered)
[1] "Disagree" "Neutral" "Agree"
min(fdf$xordered)
[1] Disagree
Levels: Disagree < Neutral < Agree
table(fdf$xordered)
Disagree Neutral Agree
3 2 5
Notice the min()
command … which asks for the minimum
level of a factor. This works with ordered factors but not with
unordered factors; try it with x
and with
responses
.
Before we move on, a word about the command we used to generate a new factor from an existing variable –
my.df$female2[my.df$SEX == "M"] = "Male"
my.df$female2[my.df$SEX == "F"] = "Female"
I use this a lot when generating new variables, and even when fixing problems with an existing variable. For example, say we were given the data on the sex of these 10 individuals but the values included typos.
data.frame(
mf = c(
rep("Male", 3),
rep("male", 2),
rep("Female", 3),
rep("femalE", 2)
)
) -> sexdf
factor(sexdf$mf) -> sexdf$mf
levels(sexdf$mf)
[1] "femalE" "Female" "male" "Male"
Obviously not what we’d like so we can go in and clean it up a bit. How? As follows:
sexdf$mf[sexdf$mf == "male"] = "Male"
sexdf$mf[sexdf$mf == "femalE"] = "Female"
levels(sexdf$mf)
[1] "femalE" "Female" "male" "Male"
table(sexdf$mf)
femalE Female male Male
0 5 0 5
Wait a second, we still see femalE
and male
but with 0 counts; how should we get rid of these ghosts?
droplevels(sexdf$mf) -> sexdf$newmf
levels(sexdf$newmf)
[1] "Female" "Male"
table(sexdf$newmf)
Female Male
5 5
Having fixed this issue we can drop the original variable via
sexdf$mf = NULL
I may also need to drop a factor level. For example, say I am working
with the diamond
data-set and need to subset my analysis to
diamonds that are Very Good or better.
Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
table(diamonds$cut)
Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
I can subset as follows:
Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 4 3 3 3 3 5 4 5 ...
table(dia.sub1$cut)
Fair Good Very Good Premium Ideal
0 0 12082 13791 21551
Aha! I see no diamonds in the excluded cut ratings but
str()
still shows R remembering cut
as having
5 ratings and the frequency table shows up with zero counts for both so
we drop these levels explicitly.
droplevels(dia.sub1$cut) -> dia.sub1$cutnew
str(dia.sub1$cutnew)
Ord.factor w/ 3 levels "Very Good"<"Premium"<..: 3 2 2 1 1 1 1 3 2 3 ...
table(dia.sub1$cutnew)
Very Good Premium Ideal
12082 13791 21551
Now a bit more on sub-setting data. You can subset with as simple or as complicated a condition you need. For example, maybe you only want Ideal diamonds with a certain minimum clarity and price. Sure, you can use this sub-setting condition as shown below:
subset(
diamonds,
cut == "Ideal" & clarity == "VVS1" & price > 3933
) -> dia.sub2
table(dia.sub2$cut)
Fair Good Very Good Premium Ideal
0 0 0 0 283
table(dia.sub2$clarity)
I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
0 0 0 0 0 0 283 0
min(dia.sub2$price)
[1] 3955
Here the sub-setting is generating a data-frame that will only
include observations that meet all three requirements (since you used
&
in the command). If you had run the following instead
you would have had very different results.
That is because &
is saying “this criterion and that
criterion both have to be met” while |
is saying “either
this criterion is met or that criterion is met”. So what you did was to
ask for
table(dia.sub3$cut)
Fair Good Very Good Premium Ideal
656 2073 5310 6692 21551
table(dia.sub3$clarity)
I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
354 6106 8144 8108 5283 3248 3655 1384
min(dia.sub3$price)
[1] 326
table(dia.sub4$cut)
Fair Good Very Good Premium Ideal
645 1922 4648 6182 8019
table(dia.sub4$clarity)
I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
261 4848 5450 4364 2660 1210 2321 302
min(dia.sub4$price)
[1] 414
table(dia.sub5$cut)
Fair Good Very Good Premium Ideal
6 35 127 106 21551
table(dia.sub5$clarity)
I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
146 2598 4282 5071 3589 2606 2321 1212
min(dia.sub5$price)
[1] 326
You could also subset the data in different ways using, for example,
%in%
diamonds[
diamonds$cut == "Ideal" | diamonds$clarity == "VVS1" | diamonds$price > 3933,
] -> dia.sub6 # same as dia.sub3
diamonds[
!(diamonds$cut %in% c("Fair", "Good")),
] -> dia.sub7 # same as dia.sub1
diamonds[
diamonds$cut %in% c("Ideal", "Premium", "Very Good"),
] -> dia.sub8 # same as dia.sub1 and dia.sub7
Often, after data have been processed in R, we need to share them with folks who don’t use R or want the data in a specific format. This is easily done, as shown below.
data.frame(
Person = c("John", "Timothy", "Olivia", "Sebastian", "Serena"),
Age = c(22, 24, 18, 24, 35)
) -> out.df
write.csv(
out.df,
file = here("data", "out.csv"),
row.names = FALSE
) # CSV format
library(haven)
write_dta(out.df, here("data", "out.df.dta")) # Stata format
write_sav(out.df, here("data", "out.df.sav")) # SPSS format
write_sas(out.df, here("data", "out.df.sas")) # SAS format
library(writexl)
write_xlsx(out.df, here("data", "out.df.xlsx")) # Excel format
There are other packages that import/export data so be sure to check them out as well, in particular rio.
Say I have data on some students, with test scores in one file and
their gender in another file. Each file has a unique student identifier,
called ID
. How can I create a single data-set? Let us see
each data-set first.
data.frame(
Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data1
data.frame(
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data2
Open up both data-sets and note that students C31 and Q22 are missing
from data2
and students E52 and H71 are missing from
data1
.
What about the merge? To merge data-frames we need to specify the
merge key(s)
– the variable(s) that identifies unique
observations in each file. This variable(s) must be present in all the
files you want to merge. So long as it exists, you now need to decide:
Do you only want to
(natural join)
(full outer join)
x
are retained but
only those seen in file y
are joined
(left outer join)
y
and only those cases are
to be merged from x
that show up in y
,
referred to as (right outer join)
merge(
x = data1,
y = data2,
by = c("ID"),
all = FALSE
) -> natural # only merge if ID seen in both files
merge(
x = data1,
y = data2,
by = c("ID"),
all = TRUE
) -> full # merge everything
merge(
x = data1,
y = data2,
by = c("ID"),
all.x = TRUE
) -> left # make sure all cases from file x are retained
merge(
x = data1,
y = data2,
by = c("ID"),
all.y = TRUE
) -> right # make sure all cases from file y are retained
What if the ID variables had different names in the files? You could
rename them to have a common name or you could use by.x =
and by.y =
as shown below.
data.frame(
Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID1 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> data3
data.frame(
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID2 = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> data4
merge(
data3,
data4,
by.x = c("ID1"),
by.y = c("ID2"),
all = FALSE
) -> diffids
You can also have more than one merge key. For example, if I am
merging data for Ohio schools, each district has a district ID number
(dirn)
, each school has building ID number
(birn)
, and then the district’s name
(district)
, the building’s name (building)
,
and so on. If I am merging these data then my by =
statement will be
by = c("dirn", "birn", "district", "building")
.
If we have multiple data-frames to merge we can do so in several ways
but my default approach is to rely on the Reduce()
command,
as shown below.
data.frame(
Score = c(10, 21, 33, 12, 35, 67, 43, 99),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "C31", "Q22")
) -> df1
data.frame(
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df2
data.frame(
Age = c(6, 7, 6, 8, 8, 9, 10, 5),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "E52", "H71")
) -> df3
list(df1, df2, df3) -> my.list
Reduce(function(...) merge(..., by = c("ID"), all = TRUE), my.list) -> df.123
Every now and then you also run into duplicate rows with no good explanation for how that could have happened. Here is an example:
data.frame(
Sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
ID = c("A12", "A23", "Z14", "WX1", "Y31", "D66", "A12", "WX1")
) -> dups.df
Students A12 and WX1 show up twice! There are two basic commands that
we can use to find duplicates – unique
and
duplicated
.
duplicated(dups.df) # flag duplicate rows with TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
!duplicated(dups.df) # flag not duplicated rows with TRUE
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
dups.df[duplicated(dups.df), ] # reveal the duplicated rows
Sex ID
7 Male A12
8 Female WX1
dups.df[!duplicated(dups.df), ] # reveal the not duplicated rows
Sex ID
1 Male A12
2 Female A23
3 Male Z14
4 Female WX1
5 Male Y31
6 Female D66
dups.df[!duplicated(dups.df), ] -> nodups.df # save a dataframe without duplicated rows
dups.df[duplicated(dups.df), ] -> onlydups.df # save a dataframe with ONLY the duplicated rows
We may, at times, want to move the variables around so they are in a particular order (age next to race and so on, for example), perhaps drop some because there are too many variables and we don’t need all of them. We may also want to arrange the data-frame in ascending or descending order of some variable(s) (increasing order of age or price for example. These tasks are easily done.
mtcars[
order(mtcars$mpg),
] -> df1 # arrange the data-frame in ascending order of mpg
mtcars[
order(-mtcars$mpg),
] -> df2 # arrange the data-frame in descending order of mpg
mtcars[
order(mtcars$am, mtcars$mpg),
] -> df3 # arrange the data-frame in ascending order of automatic/manual and then in ascending order of mpg
What if I want to order the columns such that the first column is
am
and then comes mpg
, followed by
qsec
and then the rest of the columns? Well, the first
thing I want to do is see what variable is in which column via the
names()
command.
names(mtcars) # the original structure of the mtcars data-frame
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
[10] "gear" "carb"
So mpg
is in column 1, cyl
is in column 2,
and carb
is the last one in column 11.
Now I’ll specify that I want all rows by doing this
[, c()]
and then specify the order of the columns by
entering the column numbers in c()
[1] "am" "mpg" "qsec" "cyl" "disp" "hp" "drat" "wt" "vs"
[10] "gear" "carb"
If I only wanted certain columns I could omit the other columns.
mtcars[
,
c(9, 1, 7)
] -> df5 # specifying the position of the columns
Remember the comma in mtcars[,
… this tells R to keep
all rows in mtcars. If, instead, I did this I would only get the first
four rows, not all.
mtcars[
1:4,
c(9, 1, 7)
] -> df6 # keep first four rows and the specified columns
I would end up with only the first four rows of mtcars, not every
row. So try to remember the order: dataframe[i, j]
where
i
reference the rows and j
reference the
columns. Folks often get tripped up over this switch.
Often you run into strings that have to be manipulated. Often, when
working with county data, for example, county names are followed by
County, Ohio
and this is superfluous so we end up excising
it
GEO.id GEO.id2 GEO.display.label respop72016
1 0500000US39001 39001 Adams County, Ohio 27907
2 0500000US39003 39003 Allen County, Ohio 103742
3 0500000US39005 39005 Ashland County, Ohio 53652
4 0500000US39007 39007 Ashtabula County, Ohio 98231
5 0500000US39009 39009 Athens County, Ohio 66186
6 0500000US39011 39011 Auglaize County, Ohio 45894
Let us just retain the county name from
GEO.display.label
GEO.id GEO.id2 GEO.display.label respop72016 county
1 0500000US39001 39001 Adams County, Ohio 27907 Adams
2 0500000US39003 39003 Allen County, Ohio 103742 Allen
3 0500000US39005 39005 Ashland County, Ohio 53652 Ashland
4 0500000US39007 39007 Ashtabula County, Ohio 98231 Ashtabula
5 0500000US39009 39009 Athens County, Ohio 66186 Athens
6 0500000US39011 39011 Auglaize County, Ohio 45894 Auglaize
Note the sequence
gsub("look for this pattern", "replace with this pattern", mydata$myvariable)
Why are our best and most experienced employees leaving prematurely?
The data available here includes information on several current and former employees of an anonymous organization. Fields in the data-set include:
Read in the data-set
Create new variables that represent all categorical variables as factors. Make sure you order the levels of the factor (if they need to be ordered).
Convert satisfaction_level from a 0-1 scale to a 0-100 scale
Convert time spent in the company to months rather than years
Order the columns such that left
appears as the
first column
Arrange the data-set such that the rows are in descending order
of average_montly_hours
(the column is misspelled in the
original data)
Create a subset of these data that contains only records that match the following criteria:
Make all column names
lowercase for the data-set you
created in (g)
The only way to label things correctly is to know what each value maps to (is female = 1 a Female or a Male?) but luckily we have that information here.↩︎
For attribution, please cite this work as
Ruhil (2022, Feb. 3). Basic Data Operations. Retrieved from https://aniruhil.org/courses/mpa6020/handouts/module03.html
BibTeX citation
@misc{ruhil2022basic, author = {Ruhil, Ani}, title = {Basic Data Operations}, url = {https://aniruhil.org/courses/mpa6020/handouts/module03.html}, year = {2022} }