Using the cmhflights
data from last week, create a column that unites the three columns Year
, Month
, and DayofMonth
into a single column that we will name date_of_flight
. This column should separate the three fields by “-”.
Hint: You will have to use
load(...)
withhere(...)
.
library(tidyverse)
library(tidylog)
library(here)
load(
here("data", "cmhflights_01092017.RData")
)
cmhflights %>%
unite(
col = "date_of_flight", # name of the new column
c(Year, Month, DayofMonth), # columns to be united
sep = "-", # the separator to be used
remove = FALSE # keep the original column
) %>%
select(c(1, 2, 4, 5)) # only show these columns in the output
## # A tibble: 35,993 x 4
## date_of_flight Year Month DayofMonth
## <chr> <int> <int> <int>
## 1 2017-1-1 2017 1 1
## 2 2017-1-2 2017 1 2
## 3 2017-1-3 2017 1 3
## 4 2017-1-4 2017 1 4
## 5 2017-1-5 2017 1 5
## 6 2017-1-6 2017 1 6
## 7 2017-1-7 2017 1 7
## 8 2017-1-8 2017 1 8
## 9 2017-1-9 2017 1 9
## 10 2017-1-11 2017 1 11
## # … with 35,983 more rows
Sticking with cmhflights
, separate OriginCityName
into two new columns, origin_city
and origin_state
. Do the same for DestCityname
, calling the new columns destination_city
and destination_state
, respectively. Both city columns should only display the name of the city, while both state columns should only display the abbreviated state name (for example, “CA”, “OH”, etc.)
cmhflights %>%
separate(
col = OriginCityName, # column to be separated
into = c("origin_city", "origin_state"), # names of new columns
sep = ', ', # the separator
remove = FALSE # keep original column
) %>%
separate(
col = DestCityName, # column to be separated
into = c("destination_city", "destination_state"), # names of new columns
sep = ', ', # the separator
remove = FALSE # keep original column
) %>%
select(origin_city, origin_state, destination_city, destination_state) # display these columns
## # A tibble: 35,993 x 4
## origin_city origin_state destination_city destination_state
## <chr> <chr> <chr> <chr>
## 1 Los Angeles CA Columbus OH
## 2 Los Angeles CA Columbus OH
## 3 Los Angeles CA Columbus OH
## 4 Los Angeles CA Columbus OH
## 5 Los Angeles CA Columbus OH
## 6 Los Angeles CA Columbus OH
## 7 Los Angeles CA Columbus OH
## 8 Los Angeles CA Columbus OH
## 9 Los Angeles CA Columbus OH
## 10 Los Angeles CA Columbus OH
## # … with 35,983 more rows
Tidy the weather
data such that the resulting data-set, called wdf
, has the days
(the d1-d31 columns) as rows and TMIN
and TMAX
as columns. The end result should be as shown below:
read.delim(
file = "http://stat405.had.co.nz/data/weather.txt",
stringsAsFactors = FALSE
) -> weather
id | year | month | days | TMAX | TMIN |
---|---|---|---|---|---|
MX000017004 | 2010 | 1 | d1 | NA | NA |
MX000017004 | 2010 | 1 | d2 | NA | NA |
MX000017004 | 2010 | 1 | d3 | NA | NA |
MX000017004 | 2010 | 1 | d4 | NA | NA |
MX000017004 | 2010 | 1 | d5 | NA | NA |
MX000017004 | 2010 | 1 | d6 | NA | NA |
MX000017004 | 2010 | 1 | d7 | NA | NA |
MX000017004 | 2010 | 1 | d8 | NA | NA |
MX000017004 | 2010 | 1 | d9 | NA | NA |
MX000017004 | 2010 | 1 | d10 | NA | NA |
weather %>%
group_by(id, year, month, element) %>%
pivot_longer(
names_to = "variable", # name of the new column where names will be stored
values_to = "value", # name of the new column where values will be stored
5:35 # the columns to be flipped into rows
) %>%
pivot_wider(
names_from = element, # column whose values are used to make new columns
values_from = value # where to get the values for the new columns
)
## # A tibble: 341 x 6
## # Groups: id, year, month [11]
## id year month variable TMAX TMIN
## <chr> <int> <int> <chr> <int> <int>
## 1 MX000017004 2010 1 d1 NA NA
## 2 MX000017004 2010 1 d2 NA NA
## 3 MX000017004 2010 1 d3 NA NA
## 4 MX000017004 2010 1 d4 NA NA
## 5 MX000017004 2010 1 d5 NA NA
## 6 MX000017004 2010 1 d6 NA NA
## 7 MX000017004 2010 1 d7 NA NA
## 8 MX000017004 2010 1 d8 NA NA
## 9 MX000017004 2010 1 d9 NA NA
## 10 MX000017004 2010 1 d10 NA NA
## # … with 331 more rows