The data below come from tidytuesday and provide information on accidents at theme parks. You can see more of these data available here. The data give you some details of where and when the accident occurred, and something about the injured party as well.
library(tidyverse)
library(tidylog)
safer_parks <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-10/saferparks.csv")
safer_parks.csv
{-}
variable | class | description |
---|---|---|
acc_id | double | Unique ID |
acc_date | character | Accident Date |
acc_state | character | Accident State |
acc_city | character | Accident City |
fix_port | character | . |
source | character | Source of injury report |
bus_type | character | Business type |
industry_sector | character | Industry sector |
device_category | character | Device category |
device_type | character | Device type |
tradename_or_generic | character | Common name of the device |
manufacturer | character | Manufacturer of device |
num_injured | double | Num injured |
age_youngest | double | Youngest individual injured |
gender | character | Gender of individual injured |
acc_desc | character | Description of accident |
injury_desc | character | Injury description |
report | character | Report URL |
category | character | Category of accident |
mechanical | double | Mechanical failure (binary NA/1) |
op_error | double | Operator error (binary NA/1) |
employee | double | Employee error (binary NA/1) |
notes | character | Additional notes |
Working with the safer_parks
data, complete the following tasks.
Using acc_date
, create a new date variable called idate
that is a proper date column generated via {lubridate}.
Now create new columns for (i) the month of the accident, and (ii) the day of the week. These should not be abbreviated (i.e., we should see the values as ‘Monday’ instead of ‘Mon’, “July” instead of “Jul”). What month had the highest number of accidents? What day of the week had the highest number of accidents?
safer_parks %>%
mutate(
injury_month = month(idate, label = TRUE, abbr = FALSE),
injury_dow = wday(idate, label = TRUE, abbr = FALSE)
) -> safer_parks
safer_parks %>%
filter(!is.na(injury_month)) %>%
count(injury_month, sort = TRUE)
## # A tibble: 12 x 2
## injury_month n
## <ord> <int>
## 1 July 1702
## 2 August 1448
## 3 June 1322
## 4 May 706
## 5 September 575
## 6 March 493
## 7 April 484
## 8 October 461
## 9 December 320
## 10 November 302
## 11 February 281
## 12 January 257
## # A tibble: 7 x 2
## injury_dow n
## <ord> <int>
## 1 Saturday 2070
## 2 Sunday 1633
## 3 Friday 1128
## 4 Monday 985
## 5 Thursday 926
## 6 Wednesday 830
## 7 Tuesday 779
Looks like the month of July, and Saturdays, respectively.
What if you look at days of the week by month? Does the same day of the week show up with the most accidents regardless of month or do we see some variation?
## # A tibble: 84 x 3
## injury_month injury_dow n
## <ord> <ord> <int>
## 1 July Saturday 330
## 2 August Saturday 328
## 3 June Saturday 316
## 4 July Sunday 287
## 5 August Sunday 259
## 6 July Friday 251
## 7 July Thursday 218
## 8 June Sunday 217
## 9 July Monday 216
## 10 July Wednesday 203
## # … with 74 more rows
Well, Saturdays for the most part.
What were the five
dates with the most number of accidents?
## # A tibble: 1,845 x 2
## idate n
## <date> <int>
## 1 2014-08-16 23
## 2 2012-07-21 22
## 3 2015-07-11 21
## 4 2015-06-13 20
## 5 2015-06-14 20
## 6 2015-07-19 20
## 7 2016-08-06 20
## 8 2014-06-14 19
## 9 2015-07-13 19
## 10 2016-08-27 19
## # … with 1,835 more rows
Using the Texas injury data, answer the following question: What ride was the safest? [Hint: For each ride (ride_name
) you will need to calculate the number of days between accidents. The ride with the highest number of days is the safest.]
read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-10/tx_injuries.csv"
) -> tx_injuries
tx_injuries.csv
variable | class | description |
---|---|---|
injury_report_rec | double | Unique Record ID |
name_of_operation | character | Company name |
city | character | City |
st | character | State (all TX) |
injury_date | character | Injury date - note there are some different formats |
ride_name | character | Ride Name |
serial_no | character | Serial number of ride |
gender | character | Gender of the injured individual |
age | character | Age of the injured individual |
body_part | character | Body part injured |
alleged_injury | character | Alleged injury - type of injury |
cause_of_injury | character | Approximate cause of the injury (free text) |
other | character | Anecdotal information in addition to cause of injury |
tx_injuries %>%
mutate(date = mdy(injury_date)) %>%
group_by(ride_name) %>%
arrange(date) %>%
mutate(
tspan = interval(lag(date, order_by = ride_name), date),
tspan.days = as.duration(tspan)/ddays(1)
) %>%
select(date, ride_name, tspan, tspan.days) %>%
arrange(-tspan.days)
## # A tibble: 542 x 4
## # Groups: ride_name [252]
## date ride_name tspan tspan.days
## <date> <chr> <dbl> <dbl>
## 1 2015-04-16 iFly Austin 50025600 579
## 2 2014-07-26 Batman 36720000 425
## 3 2014-08-25 Coyote Cannon 35769600 414
## 4 2014-09-07 Tsunami Surge 35078400 406
## 5 2014-07-25 Howlin' Tornado 33696000 390
## 6 2014-08-31 Stingray Falls 33350400 386
## 7 2014-07-04 Cliffhanger 31276800 362
## 8 2014-05-16 Go Karts 29894400 346
## 9 2014-06-17 Mr. Freeze 28857600 334
## 10 2014-07-22 Dragon Blaster 28080000 325
## # … with 532 more rows
You should note that this assumes each ride was in operation for the same amount of time. If this is not true then our estimates will be unreliable.
The
injury_date
problem fix …
tx_injuries %>%
mutate(
new_date = parse_date_time(
tx_injuries$injury_date,
orders = c("d m y", "d B Y", "m/d/y", "ymd"))
)
## # A tibble: 542 x 14
## injury_report_r… name_of_operati… city st injury_date ride_name serial_no
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2032 Skygroup Invest… Aust… TX 2/12/2013 I Fly SV024
## 2 1897 Willie G's Post… Galv… TX 3/2/2013 Gulf Gli… GS-11-10…
## 3 837 Great Wolf Lodge Grap… TX 3/3/2013 Howlin T… 0643-C1-…
## 4 99 Six Flags Fiest… San … TX 3/3/2013 Scooby D… n/a
## 5 55 Ray Cammack Sho… Lave… AZ 3/11/2013 Alien Ab… 862162
## 6 780 ZDT's Amusement… Segu… TX 3/12/2013 Go Karts n/a
## 7 253 Six Flags Over … Arli… TX 3/15/2013 Gold Riv… 01-0511
## 8 253 Six Flags Over … Arli… TX 3/15/2013 Titan COCO62
## 9 55 Ray Cammack Sho… Lave… AZ 3/16/2013 Wild Riv… 97-05-00…
## 10 55 Ray Cammack Sho… Lave… AZ 3/16/2013 Sky Ride… A-878-B
## # … with 532 more rows, and 7 more variables: gender <chr>, age <chr>,
## # body_part <chr>, alleged_injury <chr>, cause_of_injury <chr>, other <chr>,
## # new_date <dttm>
These data (see below) come from this story: The next generation: The space race is dominated by new contenders. You have data on space missions over time, with dates of the launch, the launching agency/country, type of launch vehicle, and so on.
launches
{-}
variable | definition |
---|---|
tag | Harvard or [COSPAR][cospar] id of launch |
JD | [Julian Date][jd] of launch |
launch_date | date of launch |
launch_year | year of launch |
type | type of launch vehicle |
variant | variant of launch vehicle |
mission | space mission |
agency | launching agency |
state_code | launching agency’s state |
category | success (O) or failure (F) |
agency_type | type of agency |
read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-15/launches.csv"
) -> launches
Create a new column called date
that stores launch_date
as a proper data field in ymd format from {lubridate}.
Creating columns as needed, calculate and show the number of launches first by year, then by month, and then by day of the week. The result should be arranged in descending order of the number of launches.
lau.df %>%
mutate(
year = year(date),
month = month(date, abbr = FALSE, label = TRUE),
day = day(date),
dow = wday(date, abbr = FALSE, label = TRUE)
) -> lau.df
lau.df %>%
filter(!is.na(year)) %>%
count(year, sort = TRUE)
## # A tibble: 63 x 2
## year n
## <dbl> <int>
## 1 1967 139
## 2 1971 134
## 3 1975 132
## 4 1966 131
## 5 1976 131
## 6 1977 130
## 7 1982 129
## 8 1983 129
## 9 1984 129
## 10 1968 128
## # … with 53 more rows
## # A tibble: 12 x 2
## month n
## <ord> <int>
## 1 December 579
## 2 June 524
## 3 October 523
## 4 September 511
## 5 April 504
## 6 August 495
## 7 July 466
## 8 May 460
## 9 March 447
## 10 November 435
## 11 February 429
## 12 January 340
## # A tibble: 31 x 2
## day n
## <int> <int>
## 1 28 248
## 2 25 227
## 3 24 217
## 4 14 213
## 5 21 211
## 6 15 210
## 7 26 208
## 8 16 206
## 9 20 205
## 10 18 204
## # … with 21 more rows
## # A tibble: 7 x 2
## dow n
## <ord> <int>
## 1 Thursday 1112
## 2 Wednesday 1102
## 3 Friday 1030
## 4 Tuesday 971
## 5 Saturday 618
## 6 Monday 534
## 7 Sunday 346
So 1967, the month of December, the 28th of a month, and Thursday have the most launches, respectively.
How many launches were successful (O)
versus failed (F)
by country and year? The countries of interest will be state_code values of “CN”, “F”, “J”, “RU”, “SU”, “US”. You do not need to arrange your results in any order.
lau.df %>%
filter(
!is.na(date),
state_code %in% c("CN", "F", "J", "RU", "SU", "US")
) %>%
count(state_code, year, category)
## # A tibble: 370 x 4
## state_code year category n
## <chr> <dbl> <chr> <int>
## 1 CN 1970 O 1
## 2 CN 1971 O 1
## 3 CN 1973 F 1
## 4 CN 1974 F 2
## 5 CN 1975 O 3
## 6 CN 1976 F 1
## 7 CN 1976 O 2
## 8 CN 1978 O 1
## 9 CN 1979 F 1
## 10 CN 1981 O 1
## # … with 360 more rows