Exercise 1
Why are our best and most experienced employees leaving prematurely?
The data available here includes information on several current and former employees of an anonymous organization.** Fields in the data-set include:
- satisfaction_level = Level of satisfaction (numeric; 0-1)
- last_evaluation = Evaluation score of the employee (numeric; 0-1)
- number_project = Number of projects completed while at work (numeric)
- average_monthly_hours = Average monthly hours spent at the workplace (numeric)
- time_spend_company = Number of years spent in the company (numeric)
- Work_accident = Whether the employee had a workplace accident (categorical; 1 = yes or 0 = no)
- left = Whether the employee left the workplace or not (categorical; 1 = left or 0 = stayed)
- promotion_last_5years = Whether the employee was promoted in the last five years (categorical; 1 = yes or 0 = no)
- sales = Department in which they work (categorical)
- salary = Relative level of salary (categorical; low, med, and high)
- Read in the csv-format data-set, naming it
hrdata
and save it in RData format as hrdata.RData
- Create new variables that add labels to Work_accident, left, promotion_last_5years, and create a new variable that orders salary from low to high, and add these to
hrdata
.
hrdata %>%
mutate(
had_accident = case_when(
Work_accident == 0 ~ "No",
Work_accident == 1 ~ "Yes"),
left_company = case_when(
left == 0 ~ "No",
left == 1 ~ "Yes"),
recently_promoted = case_when(
promotion_last_5years == 0 ~ "No",
promotion_last_5years == 1 ~ "Yes"),
salary_group = ordered(
salary,
levels = c("low", "medium", "high"),
labels = c("Low", "Medium", "High"))
) -> hrdata
- Convert satisfaction_level from a 0-1 scale to a 0-100 scale, making sure to create a new variable of course.
- Now retain only employees who left the company, and had not been promoted in the last five years. Save this result as
hr01
- In this
hr01
data-set, how many employees do you have per sales department? What sales department has the most number of employees?
## # A tibble: 10 x 2
## sales n
## <chr> <int>
## 1 sales 1007
## 2 technical 694
## 3 support 552
## 4 IT 270
## 5 hr 215
## 6 accounting 204
## 7 marketing 203
## 8 product_mng 198
## 9 RandD 121
## 10 management 88
- By sales department, calculate mean and standard deviation of (i) satisfaction_level, and (ii) last_evaluation.
## # A tibble: 10 x 5
## sales mean.satisfaction sd.satisfaction mean.evaluation sd.evaluation
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 accounting 0.403 0.257 0.695 0.199
## 2 hr 0.433 0.243 0.680 0.197
## 3 IT 0.411 0.273 0.733 0.193
## 4 management 0.410 0.266 0.732 0.204
## 5 marketing 0.453 0.249 0.692 0.200
## 6 product_mng 0.482 0.264 0.727 0.203
## 7 RandD 0.433 0.282 0.745 0.194
## 8 sales 0.447 0.259 0.712 0.198
## 9 support 0.451 0.264 0.728 0.192
## 10 technical 0.434 0.276 0.734 0.198
- What department has the lowest mean satisfaction? How much difference is there in mean satisfaction between departments?
## # A tibble: 10 x 2
## sales mean.satisfaction
## <chr> <dbl>
## 1 accounting 0.403
## 2 management 0.410
## 3 IT 0.411
## 4 RandD 0.433
## 5 hr 0.433
## 6 technical 0.434
## 7 sales 0.447
## 8 support 0.451
## 9 marketing 0.453
## 10 product_mng 0.482
Accounting has the lowest mean satisfaction level. All departments have mean satisfaction in the 0.403 to 0.482 range, so some difference but a huge one.
- Create a new variable that groups average monthly hours into 4 groups. You can let the group cut-points be chosen automatically with
chop_evenly()
. Then show the frequencies of each group.
## # A tibble: 4 x 2
## grouped_hours n
## <fct> <int>
## 1 [126, 172) 1594
## 2 [172, 218) 88
## 3 [218, 264) 1037
## 4 [264, 310] 833
Exercise 2
Thanks to the frenetic work of many individuals, the global spread of the Novel Coronavirus (COVID-19) has been tracked and the data made available for analysis. Yanchang Zhao is one such individual and for this exercise we will use a spcific version of his data that I have named cvdata.RData
and made available via Slack. Make sure to upload that data-set to your RStudio Cloud data
folder, and then to read it in via the load()
command. We can then answer a few questions. Note the contents:
country =
name of the country
date =
date of indidents as recorded
confirmed =
cumulative count of the number of people who tested positive
deaths =
cumulative count of the number of people lost to Covid-19
deaths =
cumulative count of the number of people recovered
- Filter the data-set so that we have only one row per country, the data from March 10, 2020 and call it
cv0310
.
- How many countries have lost
at least one
person to this tragedy? “Others” should not show up as one of the countries.
## # A tibble: 1 x 2
## ncountries n
## <int> <int>
## 1 24 24
- What 10 countries have had the most number of confirmed cases? “Others” should not show up as one of the countries. Also ensure the result is organized in descending order of the number of confirmed cases.
## # A tibble: 10 x 2
## country confirmed
## <fct> <int>
## 1 Mainland China 80757
## 2 Italy 10149
## 3 Iran (Islamic Republic of) 8042
## 4 Republic of Korea 7513
## 5 France 1784
## 6 Spain 1695
## 7 US 1670
## 8 Germany 1457
## 9 Japan 581
## 10 Switzerland 491
- Calculate the
fatality_rate
, defined for our purposes as the percent of deaths. excluding “Others”, and only keeping countried that have had at least 10
confirmed cases, arrange the result to show the top-10 countries in descending order of fatality_rate
.
## # A tibble: 10 x 2
## country fatality_rate
## <fct> <dbl>
## 1 Iraq 9.86
## 2 Italy 6.22
## 3 Argentina 5.88
## 4 San Marino 3.92
## 5 Mainland China 3.88
## 6 Iran (Islamic Republic of) 3.62
## 7 US 3.35
## 8 Philippines 3.03
## 9 Australia 2.80
## 10 Hong Kong SAR 2.5
Say we only want to focus on the Baltic countries (Estonia, Latvia, and Lithuania) as a unified group and compare this group to the ASEAN nations (Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand, and Vietnam). Use cv0310
to complete the followng tasks:
- Create a new variable called
region
that only takes on two values – “Baltic” if the country is a Baltic country and “Asean” if the country is an ASEAN country.
- Use this variable to calculate the total number of confirmed cases in each region.
cv0310 %>%
mutate(
region = case_when(
country %in% c("Estonia", "Latvia", "Lithuania") ~ "Baltic",
country %in% c("Brunei", "Cambodia", "Indonesia", "Laos",
"Malaysia", "Myanmar", "Philippines", "Singapore",
"Thailand", "Vietnam") ~ "ASEAN")
) %>%
group_by(region) %>%
summarise(total.confirmed = sum(confirmed))
## # A tibble: 3 x 2
## region total.confirmed
## <chr> <int>
## 1 ASEAN 405
## 2 Baltic 21
## 3 <NA> 118877
---
title: "MPA 5830 - Module 02 Exercises"
subtitle: "Spring 2020"
author: "Professor Ruhil"
date: "Updated on `r Sys.Date()`"
output: 
  html_document: 
    code_download: yes
    code_folding: hide
    fig_caption: yes
    highlight: zenburn
    number_sections: yes
    theme: flatly
    toc: yes
    toc_float: yes
editor_options: 
  chunk_output_type: console
---

<style type="text/css">

body{ /* Normal  */
/*    font-family: Lato, sans-serif;  
      font-family: Mukta, sans-serif; 
      font-family: 'Nunito Sans', sans-serif;
      font-family: Karla, sans-serif;  */
      font-family: 'Merriweather Sans', sans-serif; 
      font-size: 18px;
  }

h1.title {
  font-size: 38px;
  color: DarkRed;
}

h1 { /* Header 1 */
  font-size: 28px;
  color: DarkBlue;
}

h2 { /* Header 2 */
    font-size: 22px;
  color: DarkBlue;
}

h3 { /* Header 3 */
  font-size: 18px;
  color: DarkBlue;
}

code.r{ /* Code block */
    font-family: Mukta, sans-serif; 
    font-weight: 600;  
    font-size: 18px;
}

/* pre { /* Code block - determines code spacing between lines */
    font-size: 16px;
} */
</style>


```{r klippy, echo = FALSE, include = TRUE}
klippy::klippy(tooltip_message = 'Click to copy', tooltip_success = 'Done', color = 'cornflowerblue', position = c('top', 'right'))
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, dpi = 300, cache = TRUE, fig.align = "center", fig.width = 10, fig.height = 8, out.width = "100%", highlight = TRUE) 
```

# Exercise 1
Why are our best and most experienced employees leaving prematurely? 

The data [available here](https://aniruhil.github.io/avsr/teaching/dataviz/HR_comma_sep.csv) includes information on several current and former employees of an anonymous organization.** Fields in the data-set include: 

+ satisfaction_level = Level of satisfaction (numeric; 0-1) 
+ last_evaluation = Evaluation score of the employee (numeric; 0-1) 
+ number_project = Number of projects completed while at work (numeric) 
+ average_monthly_hours = Average monthly hours spent at the workplace (numeric)  
+ time_spend_company = Number of years spent in the company (numeric) 
+ Work_accident = Whether the employee had a workplace accident (categorical; 1 = yes or 0 = no) 
+ left = Whether the employee left the workplace or not (categorical; 1 = left or 0 = stayed)  
+ promotion_last_5years = Whether the employee was promoted in the last five years (categorical; 1 = yes or 0 = no) 
+ sales = Department in which they work (categorical) 
+ salary = Relative level of salary (categorical; low, med, and high) 

(a) Read in the csv-format data-set, naming it `hrdata` and save it in RData format as `hrdata.RData` 

```{r hrdata-in}
library(tidyverse)
library(tidylog)
library(here)
read_csv(
  "https://aniruhil.github.io/avsr/teaching/dataviz/HR_comma_sep.csv"
  ) -> hrdata
```

(b) Create new variables that add labels to Work_accident, left, promotion_last_5years, and create a new variable that orders salary from low to high, and add these to `hrdata`.  

```{r mut01}
hrdata %>%
  mutate(
    had_accident = case_when(
      Work_accident == 0 ~ "No",
      Work_accident == 1 ~ "Yes"),
    left_company = case_when(
      left == 0 ~ "No",
      left == 1 ~ "Yes"),
    recently_promoted = case_when(
      promotion_last_5years == 0 ~ "No",
      promotion_last_5years == 1 ~ "Yes"),
    salary_group = ordered(
      salary, 
      levels = c("low", "medium", "high"),
      labels = c("Low", "Medium", "High"))
    ) -> hrdata 
```

(c) Convert satisfaction_level from a 0-1 scale to a 0-100 scale, making sure to create a new variable of course.  

```{r mut02}
hrdata %>%
  mutate(
    satlevel = satisfaction_level * 100
    ) -> hrdata
```

(d) Now retain only employees who left the company, and had not been promoted in the last five years. Save this result as `hr01`

```{r mut03}
hrdata %>%
  filter(
    left_company == "Yes",
    recently_promoted == "No"
    ) -> hr01
```

(e) In this `hr01` data-set, how many employees do you have per sales department? What sales department has the most number of employees? 

```{r mut04}
hr01 %>%
  group_by(sales) %>%
  tally() %>%
  arrange(-n)
```

(f) By sales department, calculate mean and standard deviation of (i) satisfaction_level, and (ii) last_evaluation. 

```{r mut05}
hr01 %>%
  group_by(sales) %>%
  summarise(
    mean.satisfaction = mean(satisfaction_level, na.rm = TRUE),
    sd.satisfaction = sd(satisfaction_level, na.rm = TRUE),
    mean.evaluation = mean(last_evaluation, na.rm = TRUE),
    sd.evaluation = sd(last_evaluation, na.rm = TRUE)
  )
```

(g) What department has the lowest mean satisfaction? How much difference is there in mean satisfaction between departments?  

```{r mut06}
hr01 %>%
  group_by(sales) %>%
  summarise(mean.satisfaction = mean(satisfaction_level, na.rm = TRUE)) %>%
  arrange(mean.satisfaction)
```

Accounting has the lowest mean satisfaction level. All departments have mean satisfaction in the 0.403 to 0.482 range, so some difference but a huge one. 

(h) Create a new variable that groups average monthly hours into 4 groups. You can let the group cut-points be chosen automatically with `chop_evenly()`. Then show the frequencies of each group.

```{r chop}
library(santoku)
hr01 %>%
  mutate(
    grouped_hours = chop_evenly(average_montly_hours, groups = 4)
    ) -> hr01

hr01 %>%
  group_by(grouped_hours) %>%
  tally()
```

# Exercise 2
Thanks to the frenetic work of many individuals, the global spread of the Novel Coronavirus (COVID-19) has been tracked and the data made available for analysis. [Yanchang Zhao](https://rdatamining.wordpress.com/2020/03/10/coronavirus-data-analysis-with-r-tidyverse-and-ggplot2/) is one such individual and for this exercise we will use a spcific version of his data that I have named `cvdata.RData` and made available via Slack. Make sure to upload that data-set to your RStudio Cloud `data` folder, and then to read it in via the `load()` command. We can then answer a few questions. Note the contents: 

+ `country =` name of the country 
+ `date =` date of indidents as recorded 
+ `confirmed =` cumulative count of the number of people who tested positive  
+ `deaths =` cumulative count of the number of people lost to Covid-19 
+ `deaths =` cumulative count of the number of people recovered  

```{r cv01}
load(here("data", "cvdata.RData"))
```

(a) Filter the data-set so that we have only one row per country, the data from March 10, 2020 and call it `cv0310`. 

```{r cv02}
cvdata %>%
  filter(date == "2020-03-10") -> cv0310
```

(b) How many countries have lost `at least one` person to this tragedy? "Others" should not show up as one of the countries.  

```{r cv03}
cv0310 %>%
  filter(deaths >= 1, country != "Others") %>%
  count(ncountries = n_distinct(country))
```

(c) What 10 countries have had the most number of confirmed cases? "Others" should not show up as one of the countries. Also ensure the result is organized in descending order of the number of confirmed cases. 

```{r cv04}
cv0310 %>%
  filter(deaths >= 1, country != "Others") %>%
  group_by(country) %>%
  arrange(-confirmed) %>%
  select(country, confirmed) %>%
  ungroup() %>%
  top_n(10)
```

(d) Calculate the `fatality_rate`, defined for our purposes as the percent of deaths. excluding "Others", and only keeping countried that have had `at least 10` confirmed cases, arrange the result to show the top-10 countries in descending order of `fatality_rate`.  

```{r cv05}
cv0310 %>%
  filter(country != "Others", confirmed >= 10) %>%
  mutate(fatality_rate = (deaths / confirmed)*100) %>%
  group_by(country) %>%
  arrange(-fatality_rate) %>%
  select(country, fatality_rate) %>%
  ungroup() %>%
  top_n(10)
```

(e) Say we only want to focus on the Baltic countries (Estonia, Latvia, and Lithuania) as a unified group and compare this group to the ASEAN nations (Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand, and Vietnam). Use `cv0310` to complete the followng tasks: 

(i) Create a new variable called `region` that only takes on two values -- "Baltic" if the country is a Baltic country and "Asean" if the country is an ASEAN country. 
(ii) Use this variable to calculate the total number of confirmed cases in each region. 

```{r cv06}
cv0310 %>%
  mutate(
    region = case_when(
      country %in% c("Estonia", "Latvia", "Lithuania") ~ "Baltic",
      country %in% c("Brunei", "Cambodia", "Indonesia", "Laos",
                   "Malaysia", "Myanmar", "Philippines", "Singapore",
                   "Thailand", "Vietnam") ~ "ASEAN")
    ) %>%
  group_by(region) %>%
  summarise(total.confirmed = sum(confirmed))
```
