class: title-slide, center, middle, inverse

# .large[.fancy[Module 01: Introduction to Data and Statistics]]

# .medium[.fancy[MPA 6010]]

## .fancy[Ani Ruhil]

---

## .fat[.fancy[Agenda]]

1. Samples, populations, and a statistic
2. The composition of a data-set
3. Types of data
4. Measurement -- just the basics
5. Descriptive Statistics
   (a) `Central tendency` -- mean, median, mode, quartiles
   (b) `Dispersion` -- variance, standard deviation, interquartile range
6. Visualizing data

---

class: inverse, center, middle

## .heat[.fancy[Introduction]]

---

`Statistics` involves methods for describing and analyzing data and for drawing inferences about phenomena represented by the data. For example, research looking to understand

* Inflation forecasts for Western Europe and immigrant flows
* Links between unemployment and money supply
* Variations in Nielsen Weekly U.S. Television ratings
* The relationship between per capita income and obesity

A `statistic`, on the other hand, is the result of applying a computational algorithm to a set of data. For example, calculating the

* Average height of adult males in the U.S.
* Median household income
* Percent living below the poverty line
* Modal number of stories devoted to covering an electoral candidate
* Mean hours spent on social media

---

A `population` is the universe (or set) of all elements of interest `\((N)\)` in a particular study

A `sample` is the subset of cases `\((n)\)` drawn for analysis from the population

**Example 1**

+ `Population:` All first-time enrollee Freshmen at Ohio University in 2009-2010
+ `Sample:` Freshmen selected for study from the OU Registrar's list of all first-time enrollee Freshmen at Ohio University in 2009-2010

**Example 2**

+ `Population:` All National Public Radio (NPR) members
+ `Sample:` NPR members selected for telephone survey from NPR's list

**Example 3**

--

+ `Population:` All Productivity Apps in Apple's App Store
+ `Sample:` 100 Productivity Apps drawn at random from Apple's App Store

---

## Some common sampling methods ...

<p align="center"> <iframe src="https://www.youtube.com/embed/pTuj57uXWlk" width="520" height="415" frameborder="0" allowfullscreen> </iframe> </p>

---

## What is in a data-set?
---

.pull-left[
Six variables ...

(1) country

(2) continent

(3) year

(4) lifeExp (life expectancy)

(5) pop (population)

(6) gdpPercap (GDP per capita)
]

.pull-right[
* each row is an `observation` ... a country shows up once per row
* each column is a `variable` ...
* each variable is of a particular type ...
    - `categorical` versus `numerical`
    - if categorical, then could be `nominal` or `ordinal`
    - if numerical, then could be `interval` or `ratio` (aka `scale`)
]

---

## Variable Types & Levels of Measurement

**Categorical** variables measure some qualitative attribute

1. `Nominal` variables have no hierarchy: male/female, Democrat/Independent/Republican, someone's nationality, race/ethnicity
2. `Ordinal` variables have some hierarchy: diamonds rated as Poor/Fair/Good/Very Good/Premium/Ideal, First-year/Second-year students, survey responses Dislike/Neutral/Like

**Numerical** variables measure some quantitative attribute

1. `Interval` variables have no naturally occurring 0 values that reflect absence of that attribute:
    - Celsius and Fahrenheit temperature scales,
    - GRE/SAT/ACT scores,
    - scaled scores on a standardized test
2. `Ratio` variables have naturally occurring 0 values that reflect absence of that attribute:
    - the Kelvin temperature scale,
    - income,
    - no. of cars in a household, no. of children in a household, no. of visits to the Emergency Room in a year

---

## Data Types

(a) Panel data (measurements for several units for several time periods)

Multiple years of data per country

---

(b) Cross-sectional data (measurements for multiple units at a single point in time)

---

(c) Time-series data (measurements for a single unit for several time periods)
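---

The three data types differ only in which units and time periods are kept. A minimal Python sketch, assuming a tiny made-up panel (the countries `A` and `B` and every value below are hypothetical, not taken from the course data):

```python
# Hypothetical values for illustration only; not taken from the course data.
panel = [
    {"country": "A", "year": 2002, "lifeExp": 70.1},
    {"country": "A", "year": 2007, "lifeExp": 71.3},
    {"country": "B", "year": 2002, "lifeExp": 55.4},
    {"country": "B", "year": 2007, "lifeExp": 57.0},
]

# Cross-sectional data: several units, a single point in time.
cross_section = [row for row in panel if row["year"] == 2007]

# Time-series data: a single unit, several time periods.
time_series = [row for row in panel if row["country"] == "A"]

print(len(panel), len(cross_section), len(time_series))
```

The panel keeps every country-year row; the other two are simply slices of it.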
--- name: yourturn class: inverse .left-column[ <svg viewBox="0 0 576 512" style="position:relative;display:inline-block;top:.1em;fill:yellow;height:2em;" xmlns="http://www.w3.org/2000/svg"> <path d="M402.6 83.2l90.2 90.2c3.8 3.8 3.8 10 0 13.8L274.4 405.6l-92.8 10.3c-12.4 1.4-22.9-9.1-21.5-21.5l10.3-92.8L388.8 83.2c3.8-3.8 10-3.8 13.8 0zm162-22.9l-48.8-48.8c-15.2-15.2-39.9-15.2-55.2 0l-35.4 35.4c-3.8 3.8-3.8 10 0 13.8l90.2 90.2c3.8 3.8 10 3.8 13.8 0l35.4-35.4c15.2-15.3 15.2-40 0-55.2zM384 346.2V448H64V128h229.8c3.2 0 6.2-1.3 8.5-3.5l40-40c7.6-7.6 2.2-20.5-8.5-20.5H48C21.5 64 0 85.5 0 112v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V306.2c0-10.7-12.9-16-20.5-8.5l-40 40c-2.2 2.3-3.5 5.3-3.5 8.5z"></path></svg> #.fancy[Your turn] ] --- name: yourturn1 template: yourturn .right-column[.medium[Open up the `vehicles.csv.zip` data ---- Identify whether the following variables are numeric or categorical, and their level of measurement. Read the documentation about the variables [available here](https://www.fueleconomy.gov/feg/ws/index.shtml) .pull-left[ (a) fuelType1 (b) charge120 (c) cylinders (d) drive (e) year ] .pull-right[ (f) ghgScore (g) highway08 (h) model (i) trany (j) youSaveSpend ] ] ]
--- class: inverse, middle, center # .heat[.fancy[Descriptive Statistics]] --- class: center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png) background-position: center background-size: contain --- ## Frequency Tables Used with `(i) categorical` variables, and `(ii) grouped continuous` variables .pull-left[ <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Continent </th> <th style="text-align:right;"> Frequency </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Africa </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 52 </td> </tr> <tr> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 33 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 30 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 142 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Continent </th> <th style="text-align:right;"> Proportion </th> <th style="text-align:right;"> Percent </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Africa </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.37 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 36.62 </td> 
</tr> <tr> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 0.18 </td> <td style="text-align:right;"> 17.61 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 0.23 </td> <td style="text-align:right;"> 23.24 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 0.21 </td> <td style="text-align:right;"> 21.13 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 1.41 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> 100.00 </td> </tr> </tbody> </table> ] --- <img src="module01_files/figure-html/unnamed-chunk-5-1.svg" title="A chart that displays the number of countries in a continent as a vertical bar, with each continent represented by a different color. Africa has the most and Oceania the least number of countries." alt="A chart that displays the number of countries in a continent as a vertical bar, with each continent represented by a different color. Africa has the most and Oceania the least number of countries." width="60%" style="display: block; margin: auto;" /> * frequencies on the `y-axis` and categories on the `x-axis` * title, sub-title, labeled x-axis and y-axis, mention data source --- <img src="module01_files/figure-html/unnamed-chunk-6-1.svg" title="Two charts that display the proportion and the percent of countries in each continent as a vertical bar, with each continent represented by a different color. Africa has the most and Oceania the least number of countries." alt="Two charts that display the proportion and the percent of countries in each continent as a vertical bar, with each continent represented by a different color. Africa has the most and Oceania the least number of countries." 
width="70%" style="display: block; margin: auto;" />

---

name: yourturn2
template: yourturn

.right-column[.medium[With the `vehicles` data, construct (i) frequency tables and (ii) bar-charts for the following variables.

`Note:` Make sure you have set the variable type correctly and have labeled these variables as well. Also, make sure the x-axis/y-axis are labeled, and the chart has a title.

.pull-left[
(1) trany

(2) cylinders
]

.pull-right[
(3) drive

(4) fuelType1
]

]]
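---

The frequency, proportion, and percent columns shown earlier for `Continent` can be sketched in a few lines of Python. This is only one possible approach: the counts are copied from the slide's table, while the helper name `frequency_table` is an assumption made for illustration.

```python
from collections import Counter

# One continent label per country, rebuilt from the counts in the slide's table.
counts_from_slide = {"Africa": 52, "Americas": 25, "Asia": 33, "Europe": 30, "Oceania": 2}
continents = [name for name, k in counts_from_slide.items() for _ in range(k)]

freq = Counter(continents)   # frequency of each category
n = sum(freq.values())       # total number of observations


def frequency_table(freq, n):
    """Return (category, frequency, proportion, percent) rows (hypothetical helper)."""
    return [(cat, k, round(k / n, 2), round(100 * k / n, 2)) for cat, k in freq.items()]


for row in frequency_table(freq, n):
    print(row)
```

Proportions are frequencies divided by the total; percents are the same proportions multiplied by 100.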
--- ## Grouped frequency tables .pull-left[ * special frequency tables used with numerical variables * constructed by first grouping the variable's values into "reasonable number" of equal-width groups * overlapping class-limits possible * consistent inclusion/exclusion rule * can flip `Freq` into proportions/percentages ] .pull-right[ <table class="table table-striped" style="font-size: 15px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Grouped Frequency Table</caption> <thead> <tr> <th style="text-align:left;"> Life Expectancy groups </th> <th style="text-align:right;"> Freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 39-49 </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> 49-59 </td> <td style="text-align:right;"> 22 </td> </tr> <tr> <td style="text-align:left;"> 59-69 </td> <td style="text-align:right;"> 19 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 69-79 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 63 </td> </tr> <tr> <td style="text-align:left;"> 79-89 </td> <td style="text-align:right;"> 21 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:right;"> 142 </td> </tr> </tbody> </table> ] --- ## Histograms <img src="module01_files/figure-html/unnamed-chunk-9-1.svg" title="A chart that displays life expectancies by creating groups of lifespans, and shows most countries have life expectancies in the 70-80 range." alt="A chart that displays life expectancies by creating groups of lifespans, and shows most countries have life expectancies in the 70-80 range." 
width="60%" style="display: block; margin: auto;" /> --- ## Choose `bins` carefully <img src="module01_files/figure-html/unnamed-chunk-10-1.svg" title="A chart that displays life expectancies by creating groups with wider versus narrower lifespans to show that the width of the groups and the number of groups used change the narrative." alt="A chart that displays life expectancies by creating groups with wider versus narrower lifespans to show that the width of the groups and the number of groups used change the narrative." width="60%" style="display: block; margin: auto;" /> --- ## Symmetric versus skewed distributions <img src="module01_files/figure-html/unnamed-chunk-11-1.svg" title="Different distributions are shown, symmetric, skewed right versus left, and long- versus short-tailed" alt="Different distributions are shown, symmetric, skewed right versus left, and long- versus short-tailed" width="65%" style="display: block; margin: auto;" /> --- name: yourturn1 template: yourturn .right-column[.medium[Construct a grouped frequency table for the following variables: 1. city08 2. highway08 Now construct histograms for both variables `Question:` Does either variable exhibit a skewed distribution? Which one? In what direction? ]]
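---

A grouped frequency table needs equal-width groups and one consistent inclusion/exclusion rule, so that a value sitting on a shared class limit is counted exactly once. A rough sketch, assuming the rule "lower limit included, upper limit excluded"; the function name `grouped_frequencies` and the life expectancies below are hypothetical:

```python
def grouped_frequencies(values, start, width, n_groups):
    """Count values in equal-width groups [lower, upper): lower included, upper excluded."""
    counts = [0] * n_groups
    for v in values:
        idx = int((v - start) // width)   # which group the value falls in
        if 0 <= idx < n_groups:           # values outside all groups are ignored
            counts[idx] += 1
    labels = [f"{start + i * width}-{start + (i + 1) * width}" for i in range(n_groups)]
    return list(zip(labels, counts))


# Hypothetical life expectancies, for illustration only.
life_exp = [42.6, 49.0, 52.3, 58.9, 59.0, 64.7, 71.2, 72.5, 78.9, 80.7]

table = grouped_frequencies(life_exp, start=39, width=10, n_groups=5)
for label, count in table:
    print(label, count)
```

Under this rule 49.0 is counted in 49-59 and 59.0 in 59-69. Changing `width` or `n_groups` changes the picture, which is why bins should be chosen carefully.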
--- class: inverse, center, middle # .large[.fancy[ Crosstabulations ]] --- ## Tables for two categorical variables <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>A Crosstabulation</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> D </th> <th style="text-align:right;"> E </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> G </th> <th style="text-align:right;"> H </th> <th style="text-align:right;"> I </th> <th style="text-align:right;"> J </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 163 </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 312 </td> <td style="text-align:right;"> 314 </td> <td style="text-align:right;"> 303 </td> <td style="text-align:right;"> 175 </td> <td style="text-align:right;"> 119 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1610 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 662 </td> <td style="text-align:right;"> 933 </td> <td style="text-align:right;"> 909 </td> <td style="text-align:right;"> 871 </td> <td style="text-align:right;"> 702 </td> <td style="text-align:right;"> 522 </td> <td style="text-align:right;"> 307 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 4906 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 1513 </td> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 2164 </td> <td style="text-align:right;"> 2299 </td> <td style="text-align:right;"> 1824 </td> <td style="text-align:right;"> 1204 </td> <td style="text-align:right;"> 678 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: 
#D7261E !important;"> 12082 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 1603 </td> <td style="text-align:right;"> 2337 </td> <td style="text-align:right;"> 2331 </td> <td style="text-align:right;"> 2924 </td> <td style="text-align:right;"> 2360 </td> <td style="text-align:right;"> 1428 </td> <td style="text-align:right;"> 808 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 13791 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:right;"> 2834 </td> <td style="text-align:right;"> 3903 </td> <td style="text-align:right;"> 3826 </td> <td style="text-align:right;"> 4884 </td> <td style="text-align:right;"> 3115 </td> <td style="text-align:right;"> 2093 </td> <td style="text-align:right;"> 896 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 21551 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Sum </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 6775 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 9797 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 9542 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 11292 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 8304 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 5422 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 2808 </td> <td style="text-align:right;font-weight: 
bold;color: white !important;background-color: #D7261E !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 53940 </td> </tr> </tbody> </table> * one variable's categories make up the `rows`, the other variable's categories make up the `columns` * also have `row totals` and `column totals` * have to be careful about proportions/percentages here since they could be based on the (i) total, (ii) row totals, or (iii) column totals --- ## Proportions based on total sample size <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>A Crosstabulation</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> D </th> <th style="text-align:right;"> E </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> G </th> <th style="text-align:right;"> H </th> <th style="text-align:right;"> I </th> <th style="text-align:right;"> J </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.03 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.09 </td> </tr> <tr> <td 
style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.22 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.26 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> 0.07 </td> <td style="text-align:right;"> 0.07 </td> <td style="text-align:right;"> 0.09 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.40 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Sum </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.13 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.18 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.18 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 
0.21 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.15 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.10 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0.05 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1.00 </td> </tr> </tbody> </table> --- ## Proportions based on row totals <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>A Crosstabulation</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> D </th> <th style="text-align:right;"> E </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> G </th> <th style="text-align:right;"> H </th> <th style="text-align:right;"> I </th> <th style="text-align:right;"> J </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.20 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> 0.07 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.18 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> 0.06 </td> <td 
style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.20 </td> <td style="text-align:right;"> 0.18 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 0.17 </td> <td style="text-align:right;"> 0.17 </td> <td style="text-align:right;"> 0.21 </td> <td style="text-align:right;"> 0.17 </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.18 </td> <td style="text-align:right;"> 0.18 </td> <td style="text-align:right;"> 0.23 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1 </td> </tr> </tbody> </table> --- ## Proportions based on column totals <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>A Crosstabulation</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> D </th> <th style="text-align:right;"> E </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> G </th> <th style="text-align:right;"> H </th> <th style="text-align:right;"> 
I </th> <th style="text-align:right;"> J </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.04 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:right;"> 0.11 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 0.22 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.23 </td> <td style="text-align:right;"> 0.20 </td> <td style="text-align:right;"> 0.22 </td> <td style="text-align:right;"> 0.22 </td> <td style="text-align:right;"> 0.24 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.28 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.29 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> 0.40 </td> <td style="text-align:right;"> 0.40 </td> <td style="text-align:right;"> 0.43 </td> <td style="text-align:right;"> 0.38 </td> <td style="text-align:right;"> 0.39 </td> <td style="text-align:right;"> 0.32 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Sum </td> <td style="text-align:right;font-weight: bold;color: white 
!important;background-color: #D7261E !important;"> 1.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 1.00 </td> </tr> </tbody> </table> --- ## Row and Column percentages .pull-left[ <table class="table table-striped" style="font-size: 15px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Row %</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> D </th> <th style="text-align:right;"> E </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> G </th> <th style="text-align:right;"> H </th> <th style="text-align:right;"> I </th> <th style="text-align:right;"> J </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 10.12 </td> <td style="text-align:right;"> 13.91 </td> <td style="text-align:right;"> 19.38 </td> <td style="text-align:right;"> 19.50 </td> <td style="text-align:right;"> 18.82 </td> <td style="text-align:right;"> 10.87 </td> <td style="text-align:right;"> 7.39 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 13.49 </td> <td 
style="text-align:right;"> 19.02 </td> <td style="text-align:right;"> 18.53 </td> <td style="text-align:right;"> 17.75 </td> <td style="text-align:right;"> 14.31 </td> <td style="text-align:right;"> 10.64 </td> <td style="text-align:right;"> 6.26 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 12.52 </td> <td style="text-align:right;"> 19.86 </td> <td style="text-align:right;"> 17.91 </td> <td style="text-align:right;"> 19.03 </td> <td style="text-align:right;"> 15.10 </td> <td style="text-align:right;"> 9.97 </td> <td style="text-align:right;"> 5.61 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 11.62 </td> <td style="text-align:right;"> 16.95 </td> <td style="text-align:right;"> 16.90 </td> <td style="text-align:right;"> 21.20 </td> <td style="text-align:right;"> 17.11 </td> <td style="text-align:right;"> 10.35 </td> <td style="text-align:right;"> 5.86 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:right;"> 13.15 </td> <td style="text-align:right;"> 18.11 </td> <td style="text-align:right;"> 17.75 </td> <td style="text-align:right;"> 22.66 </td> <td style="text-align:right;"> 14.45 </td> <td style="text-align:right;"> 9.71 </td> <td style="text-align:right;"> 4.16 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-striped" style="font-size: 15px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial 
!important;">Column %</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> D </th> <th style="text-align:right;"> E </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> G </th> <th style="text-align:right;"> H </th> <th style="text-align:right;"> I </th> <th style="text-align:right;"> J </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 2.41 </td> <td style="text-align:right;"> 2.29 </td> <td style="text-align:right;"> 3.27 </td> <td style="text-align:right;"> 2.78 </td> <td style="text-align:right;"> 3.65 </td> <td style="text-align:right;"> 3.23 </td> <td style="text-align:right;"> 4.24 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 9.77 </td> <td style="text-align:right;"> 9.52 </td> <td style="text-align:right;"> 9.53 </td> <td style="text-align:right;"> 7.71 </td> <td style="text-align:right;"> 8.45 </td> <td style="text-align:right;"> 9.63 </td> <td style="text-align:right;"> 10.93 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 22.33 </td> <td style="text-align:right;"> 24.50 </td> <td style="text-align:right;"> 22.68 </td> <td style="text-align:right;"> 20.36 </td> <td style="text-align:right;"> 21.97 </td> <td style="text-align:right;"> 22.21 </td> <td style="text-align:right;"> 24.15 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 23.66 </td> <td style="text-align:right;"> 23.85 </td> <td style="text-align:right;"> 24.43 </td> <td style="text-align:right;"> 25.89 </td> <td style="text-align:right;"> 28.42 </td> <td style="text-align:right;"> 26.34 </td> <td style="text-align:right;"> 28.77 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:right;"> 41.83 </td> <td style="text-align:right;"> 39.84 </td> <td style="text-align:right;"> 40.10 </td> <td style="text-align:right;"> 43.25 
</td> <td style="text-align:right;"> 37.51 </td> <td style="text-align:right;"> 38.60 </td> <td style="text-align:right;"> 31.91 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Sum </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100.00 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 100.00 </td> </tr> </tbody> </table> ] --- class: inverse, center, top ## .heat[.fancy[ Visualizing data with graphs ]] --- ## Visualizing two categorical variables <img src="module01_files/figure-html/unnamed-chunk-19-1.svg" title="A stacked bar-chart that shows two variables in a common vertical bar" alt="A stacked bar-chart that shows two variables in a common vertical bar" width="60%" style="display: block; margin: auto;" /> --- <img src="module01_files/figure-html/unnamed-chunk-20-1.svg" title="A dodged bar-chart that shows two variables by creating multiple bars of one variable for the same value of the second variable." alt="A dodged bar-chart that shows two variables by creating multiple bars of one variable for the same value of the second variable." 
width="60%" style="display: block; margin: auto;" /> --- ## Adjusting for unequal group sizes <img src="module01_files/figure-html/unnamed-chunk-21-1.svg" title="The same chart as before but replacing the number of diamonds with percentages." alt="The same chart as before but replacing the number of diamonds with percentages." width="60%" style="display: block; margin: auto;" /> --- ### a more readable version ... <img src="module01_files/figure-html/unnamed-chunk-22-1.svg" title="And this time by dodging the bars as percentages." alt="And this time by dodging the bars as percentages." width="60%" style="display: block; margin: auto;" /> --- name: yourturn3 template: yourturn .right-column[.medium[ (1) Create a cross-tabulation and stacked bar-chart of drive by `fuelType1` (2) What drive seems most popular for diesel cars? What percent is this? (3) What drive is most common for electric cars? What percent is this? (4) What percent of the cars in the dataset are electric cars? ]]
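---

## Row and column percentages: a minimal sketch

The Row % and Column % tables shown earlier differ only in which total is used as the denominator. The Python sketch below shows both calculations on a handful of made-up observations (not the diamonds data). The `records` list and the helper names `row_percent` and `col_percent` are illustrative assumptions, not the code used to build these slides.

```python
from collections import Counter

# Hypothetical observations (made up for illustration): (cut, color) pairs
records = [
    ("Fair", "D"), ("Fair", "E"),
    ("Good", "D"), ("Good", "D"), ("Good", "E"),
    ("Ideal", "D"), ("Ideal", "D"), ("Ideal", "E"), ("Ideal", "E"), ("Ideal", "E"),
]

counts = Counter(records)                          # cell counts of the cross-tabulation
cuts = sorted({cut for cut, _ in records})         # row categories
colors = sorted({color for _, color in records})   # column categories


def row_percent(counts, rows, cols):
    """Percent of each row total: every row sums to 100."""
    table = {}
    for r in rows:
        row_total = sum(counts[(r, c)] for c in cols)
        table[r] = {c: 100 * counts[(r, c)] / row_total for c in cols}
    return table


def col_percent(counts, rows, cols):
    """Percent of each column total: every column sums to 100."""
    table = {}
    for c in cols:
        col_total = sum(counts[(r, c)] for r in rows)
        table[c] = {r: 100 * counts[(r, c)] / col_total for r in rows}
    return table


row_pct = row_percent(counts, cuts, colors)
col_pct = col_percent(counts, cuts, colors)

for cut in cuts:
    print(cut, {color: round(pct, 2) for color, pct in row_pct[cut].items()})
```

Row percentages answer "within this cut, how are the colors distributed?", while column percentages answer "within this color, how are the cuts distributed?". Which one to report depends on which variable defines the groups being compared.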
---

## Scatterplots

Useful for exploring how two quantitative variables are related

<img src="module01_files/figure-html/unnamed-chunk-24-1.svg" width="50%" style="display: block; margin: auto;" />

---
name: yourturn4
template: yourturn

.right-column[.medium[

(1) Construct a scatterplot of `highway08` versus `city08`

(2) What patterns are evident here?

(3) Do these patterns differ by drive? How so?

(4) Do these patterns differ by `fuelType1`? How so?

]]
---

## Line-plots

Useful for visualizing changes over time

<img src="module01_files/figure-html/unnamed-chunk-26-1.svg" width="50%" style="display: block; margin: auto;" />

---
name: yourturn5
template: yourturn

.right-column[.medium[

(1) Use the `Great Lakes` dataset and draw line-plots that show the water levels for each of the Great Lakes.

(2) Use the `opioid overdose` data (overdoses per 10,000 persons) and visualize each county's trend over time.

* Which counties seem to be experiencing the worst of the crisis?
* Is there a particular year that seems to be pivotal?

]]
--- ## Recap of Key Points so far * Tabular and graphical descriptions of data are very useful * With qualitative variables (i.e., Nominal/Ordinal) use bar charts and frequency tables * With quantitative variables (i.e., Interval or Ratio -- aka scale) use histograms, scatterplots, trend-lines, and grouped frequency distributions * Cross-tabulations are useful with two Nominal/Ordinal variables, and so are stacked bar-charts * Symmetric distributions are easier to work with than are skewed distributions --- class: inverse, center, middle ## .large[.fancy[.heat[ Central Tendency ]]] ### All about the average, typical value, most likely value ### A statistical measure that defines the center of a distribution and is most representative of the values that comprise the distribution of the variable of interest --- ## (a) The Mean The mean is commonly known as the `arithmetic average`, and is computed by adding up the scores in the distribution and dividing this sum by the sample size `Sample Mean` is denoted by `\(\bar{x}\)` where `\(\bar{x} = \dfrac{\Sigma{x_{i}}}{n}\)` ... add all values of `\(x\)` and divide by the total number of observations with a non-missing value of `\(x\)` `Population Mean` is denoted by `\(\mu\)` where `\(\mu = \dfrac{\Sigma{x_{i}}}{N}\)` ... 
add all values of `\(x\)` and divide by the total number of observations with a non-missing value of `\(x\)` --- .pull-left[ <table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Starting Salaries</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Bi-weekly Salary </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 2710 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 2755 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 2850 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 2880 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 2880 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 2890 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:right;"> 2920 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:right;"> 2940 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:right;"> 2950 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:right;"> 3050 </td> </tr> <tr> <td style="text-align:left;"> 11 </td> <td style="text-align:right;"> 3130 </td> </tr> <tr> <td style="text-align:left;"> 12 </td> <td style="text-align:right;"> 3325 </td> </tr> </tbody> </table> ] .pull-right[ `\(\bar{x} = \dfrac{\Sigma{x_{i}}}{n}\)` `\(= \dfrac{x_{1} + x_{2} + \cdots + x_{12}}{n}\)` `\(= \dfrac{2,850 + 2,950 + \cdots + 2,880}{12}\)` `\(= \dfrac{35,280}{12}\)` `\(\bar{x} = 2,940\)` ] --- ## Properties of the mean .pull-left[ <table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Properties 
of the Arithmetic Mean</caption> <thead> <tr> <th style="text-align:left;"> Observation </th> <th style="text-align:right;"> x </th> <th style="text-align:right;"> (x - 2) </th> <th style="text-align:right;"> (2 * x) </th> <th style="text-align:right;"> (x / 2) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 3.0 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1.5 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 2.5 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1.5 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 2.0 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 2.5 </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 13.0 </td> </tr> </tbody> </table> ] .pull-right[ * adding/subtracting a constant from each value and recalculating the mean is the same as adding/subtracting the same constant from the original mean * multiplying/dividing each value by a constant and recalculating 
the mean is akin to multiplying/dividing the original mean by the same constant
* constant used in the table is `2`

]

---

## (b) The Median

`median:` the middle value when the data are arranged in ascending or descending order, commonly denoted by the symbol `\(Md\)`

(1) Arrange the values of `\(x\)` in either ascending or descending order.

(2) If the number of data points in the population or sample is an `odd` number, the median is the observation in position `\(\dfrac{N + 1}{2}\)` for a population and `\(\dfrac{n + 1}{2}\)` for a sample.

(3) If the number of data points in the population or sample is an `even` number, the median is the average of the middle two values of `\(x\)`.

---

## Odd-numbered data-set

.pull-left[
<table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">n or N is odd</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Bi-weekly Salary </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 2710 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 2755 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 2850 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 2880 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 2890 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 6 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 2920 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:right;"> 2940 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:right;">
2950 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:right;"> 3050 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:right;"> 3130 </td> </tr> <tr> <td style="text-align:left;"> 11 </td> <td style="text-align:right;"> 3325 </td> </tr> </tbody> </table> ] .pull-right[ * middle position is the `\(6^{th}\)` one; 5 are below it and 5 are above it * so Median salary is `2920` ] --- ## Even-numbered data-set .pull-left[ <table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">n or N is even</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Bi-weekly Salary </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 2710 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 2755 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 2850 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 2880 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 2880 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 6 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 2890 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 7 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 2920 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:right;"> 2940 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:right;"> 2950 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td 
style="text-align:right;"> 3050 </td> </tr> <tr> <td style="text-align:left;"> 11 </td> <td style="text-align:right;"> 3130 </td> </tr> <tr> <td style="text-align:left;"> 12 </td> <td style="text-align:right;"> 3325 </td> </tr> </tbody> </table>
]

.pull-right[

* middle value will be between the `\(6^{th}\)` and `\(7^{th}\)` observations
* `\(\dfrac{2890 + 2920}{2} = 2905\)`

]

---

## Quartiles

`Quartiles` divide the data into four equal parts and are denoted as `\(Q_{1}, Q_{2}, Q_{3}\)`

.pull-left[

`\(Q_{1}\)` is the first quartile or the `\(25^{th}\)` percentile

`\(Q_{2}\)` is the second quartile or the `\(50^{th}\)` percentile, i.e., the Median

`\(Q_{3}\)` is the third quartile or the `\(75^{th}\)` percentile

For `\(Q_{1}\)`, `\(i = \left(\frac{p}{100}\right) \times n = \left(\frac{25}{100}\right) \times 11 = 2.75\)`, rounded up to `\(3\)`

For `\(Q_{3}\)`, `\(i = \left(\frac{p}{100}\right) \times n = \left(\frac{75}{100}\right) \times 11 = 8.25\)`, rounded up to `\(9\)`

Therefore, `\(Q_{1}\)` is the `\(3^{rd}\)` value (`2850`) ... and `\(Q_{3}\)` is the `\(9^{th}\)` value (`3050`)

]

.pull-right[
<table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">n or N is odd</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Bi-weekly Salary </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 2710 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 2755 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 3 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 2850 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 2880 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 2890 </td> </tr> <tr> <td style="text-align:left;font-weight:
bold;color: white !important;background-color: #D7261E !important;"> 6 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 2920 </td> </tr> <tr> <td style="text-align:left;"> 7 </td> <td style="text-align:right;"> 2940 </td> </tr> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:right;"> 2950 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 9 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 3050 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:right;"> 3130 </td> </tr> <tr> <td style="text-align:left;"> 11 </td> <td style="text-align:right;"> 3325 </td> </tr> </tbody> </table> ] --- ## (c) The Mode `Mode` used with categorical variables since it taps the value/attribute that occurs most often in the data-set Mode of little practical use with quantitative variables .pull-left[ <table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Modal Frequency of Continents</caption> <thead> <tr> <th style="text-align:left;"> Continent </th> <th style="text-align:right;"> Frequency </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Africa </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 52 </td> </tr> <tr> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 33 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 30 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 2 
</td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-striped" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Modal Frequency of Diamond Cuts</caption> <thead> <tr> <th style="text-align:left;"> Cut </th> <th style="text-align:right;"> Frequency </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 1610 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 4906 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 12082 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 13791 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> Ideal </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 21551 </td> </tr> </tbody> </table> ] --- ## Choosing a Measure of Central Tendency * Mode is the `only measure to be used with categorical variables` * Mean usually the `default measure for quantitative variables` because ... * it is used in almost all statistical tests and models * is intuitive for most folks * Median to be preferred, at least in descriptive statistics, when the data * are `skewed` (because the mean will be distorted by extreme values) * have `open-ended responses` (typical survey-based income groups, for example) * have `undetermined values` at one end of the distribution (time on task, for example) --- name: yourturn6 template: yourturn .right-column[.medium[ Calculate the following statistics for `city08`, `highway08`, and `youSaveSpend` 1. Mean 2. Median What measure -- Mean or Median -- would you choose for `youSaveSpend` and why? Be sure to justify your decision. 
What measure would be an ideal descriptor of `the average` if the variable in question were `drive` or `cylinders`? Why? ]]
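---

## Central tendency: a minimal sketch

The mean, median, quartile, and mode calculations from the preceding slides can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the helper `percentile_by_position` is a hypothetical name, and its rule for a whole-number position (average that value and the next one) is an assumption, since the slides only show fractional positions being rounded up. Statistical software often uses other quartile conventions and may return slightly different values.

```python
import math
from statistics import mean, median

# Bi-weekly salaries from the slides: the n = 12 and the n = 11 versions
salaries_12 = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
salaries_11 = [2710, 2755, 2850, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]


def percentile_by_position(values, p):
    """Return the p-th percentile (0 < p < 100) using i = (p / 100) * n.

    Assumed convention: a fractional i is rounded UP to the next position;
    a whole-number i averages the values in positions i and i + 1.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i != int(i):
        return ordered[math.ceil(i) - 1]      # positions are 1-based
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2


mean_12 = mean(salaries_12)                     # sum of values / n
median_12 = median(salaries_12)                 # even n: average of middle two
median_11 = median(salaries_11)                 # odd n: the middle value
q1 = percentile_by_position(salaries_11, 25)
q3 = percentile_by_position(salaries_11, 75)

# Properties of the mean, using the x column from the slides (constant = 2)
x = [6, 3, 5, 3, 4, 5]
shift_ok = math.isclose(mean([v - 2 for v in x]), mean(x) - 2)
scale_ok = math.isclose(mean([2 * v for v in x]), 2 * mean(x))

# Mode: the category with the highest frequency (continent table from the slides)
continents = {"Africa": 52, "Americas": 25, "Asia": 33, "Europe": 30, "Oceania": 2}
modal_continent = max(continents, key=continents.get)

print(mean_12, median_12, median_11, q1, q3, modal_continent)
```

The same helper applied at `p = 50` to the 12 salaries averages the 6th and 7th values, which matches the even-numbered median rule.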
---
class: inverse, center, middle

# .large[.fancy[ Dispersion/Variability ]]

---

## Why focus on variability?

Knowing the "typical" value is a good start because you can guess/predict what it will be

But knowing how much things vary around the typical value allows you to determine how reliable this guess/prediction will be

<img src="module01_files/figure-html/unnamed-chunk-33-1.svg" width="55%" style="display: block; margin: auto;" />

---

## Range and Interquartile Range (IQR)

`Range` is the simplest measure of variability, and is computed as `\(Range = x_{max} - x_{min}\)` `\(\ldots\)` small range implies little variability

`Interquartile Range` is the difference between the third and the first quartiles; `\(IQR = Q_{3} - Q_{1}\)`

- `\(\ldots\)` smaller IQR implies scores do not vary much in the middle 50% of the distribution
- `\(\ldots\)` larger IQR implies scores vary a lot in the middle 50% of the distribution

`\(Q_3 = 3050; Q_1 = 2850\)` and hence `\(IQR = Q_3 - Q_1 = 3050 - 2850 = 200\)`

Range, for the same data-set, is `\(x_{max} - x_{min} = 3325 - 2710 = 615\)`

---

## Variance and Standard Deviation

`The variance` is a measure of variability constructed using all values in a distribution, and its square root is the `standard deviation`

`Population Variance`: `\(\sigma^{2} = \dfrac{\sum(x_{i} - \mu)^{2}}{N}\)`

`Population Standard Deviation`: `\(\sigma = \sqrt{\sigma^{2}} = \sqrt{\dfrac{\sum(x_{i} - \mu)^{2}}{N}}\)`

`Sample Variance`: `\(s^{2} = \dfrac{\sum(x_{i} - \bar{x})^{2}}{(n-1)}\)`

`Sample Standard Deviation`: `\(s = \sqrt{s^{2}} = \sqrt{\dfrac{\sum(x_{i} - \bar{x})^{2}}{(n-1)}}\)`

---

<table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Distance from the Mean</caption> <thead> <tr> <th style="text-align:left;"> x </th> <th style="text-align:right;"> Mean of x </th> <th style="text-align:right;"> (x - Mean of x) </th> <th style="text-align:right;">
Squared-difference </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 81 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> -7 </td> <td style="text-align:right;"> 49 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> -6 </td> <td style="text-align:right;"> 36 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> -4 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> 22 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 144 </td> </tr> <tr> <td style="text-align:left;"> 24 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 196 </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 522 </td> </tr> </tbody> </table>

* Average distance would be `\(\dfrac{522}{6} = 87\)` but this is average distance in `squared units`
* Take the `square root` to get average distance in meaningful units of the original metric ...
    - `\(\sqrt{\dfrac{522}{6}} = \sqrt{87} = 9.327379\)`
    - and thus average variability is `\(9.327379\)`

---

## Why use `n - 1` for samples?
<table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> x </th> <th style="text-align:right;"> (x - Popn Mean) </th> <th style="text-align:right;"> (x - Popn Mean) Squared </th> <th style="text-align:right;"> (x - Sample Mean) </th> <th style="text-align:right;"> (x - Sample Mean) Squared </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table>

Variance when I am given `\(\mu = 3\)`: `\(\dfrac{35}{3} = 11.66667\)`

Variance using the sample mean `\(\bar{x} = 6\)` without adjusting the denominator gives `\(\dfrac{8}{3} = 2.666667\)`

If I make the adjustment we are asked to make: `\(\dfrac{8}{3-1} = \dfrac{8}{2}=4\)`

The sum of squared deviations is smallest when taken around `\(\bar{x}\)`, so dividing it by `\(n\)` tends to understate the population variance; dividing by `\(n - 1\)` offsets this

---

### Variance and Standard Deviation for the Salary Data

.pull-left[
<table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Variance and Standard Deviation for Salary Data</caption> <thead> <tr> <th style="text-align:left;"> Salary </th> <th style="text-align:right;"> (Salary - Mean) </th> <th style="text-align:right;"> (Salary - Mean) Squared </th> </tr>
</thead> <tbody> <tr> <td style="text-align:left;"> 2710 </td> <td style="text-align:right;"> -230 </td> <td style="text-align:right;"> 52900 </td> </tr> <tr> <td style="text-align:left;"> 2755 </td> <td style="text-align:right;"> -185 </td> <td style="text-align:right;"> 34225 </td> </tr> <tr> <td style="text-align:left;"> 2850 </td> <td style="text-align:right;"> -90 </td> <td style="text-align:right;"> 8100 </td> </tr> <tr> <td style="text-align:left;"> 2880 </td> <td style="text-align:right;"> -60 </td> <td style="text-align:right;"> 3600 </td> </tr> <tr> <td style="text-align:left;"> 2880 </td> <td style="text-align:right;"> -60 </td> <td style="text-align:right;"> 3600 </td> </tr> <tr> <td style="text-align:left;"> 2890 </td> <td style="text-align:right;"> -50 </td> <td style="text-align:right;"> 2500 </td> </tr> <tr> <td style="text-align:left;"> 2920 </td> <td style="text-align:right;"> -20 </td> <td style="text-align:right;"> 400 </td> </tr> <tr> <td style="text-align:left;"> 2940 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 2950 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 100 </td> </tr> <tr> <td style="text-align:left;"> 3050 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 12100 </td> </tr> <tr> <td style="text-align:left;"> 3130 </td> <td style="text-align:right;"> 190 </td> <td style="text-align:right;"> 36100 </td> </tr> <tr> <td style="text-align:left;"> 3325 </td> <td style="text-align:right;"> 385 </td> <td style="text-align:right;"> 148225 </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 301850 </td> </tr> </tbody> </table> ] .pull-right[ `\(\text{Sample mean: }\bar{x}=2940\)` `\(\Sigma(x_{i} - \bar{x}) = 0\)` `\(\Sigma(x_{i} - \bar{x})^{2} = 301850\)` Sample Variance: `\(s^{2} = \dfrac{301850}{(12 - 1)} = 27440.91\)` Sample 
Standard Deviation: `\(s = \sqrt{27440.91} = 165.65\)`

]

---

## The five-number summary

The `five-number summary` comprises the

1. Minimum
2. `\(Q_1\)`
3. Median
4. `\(Q_3\)`
5. Maximum

* The distances between consecutive values tell us something about the center and the shape of the distribution
* Any data point below `\(Q_1 - 1.5 \times IQR\)` or above `\(Q_3 + 1.5 \times IQR\)` is considered an `outlier` and shows up as a dot in a box-plot
* It is easy to figure out what direction a distribution is skewed in if these outliers only show up on one side of the box

---

## Box-plots

<img src="module01_files/figure-html/unnamed-chunk-34-1.svg" title="A box-plot that shows for each cut of diamond, the distribution of price, revealing that price is positively skewed for all diamond cuts, and each cut has plenty of outliers." alt="A box-plot that shows for each cut of diamond, the distribution of price, revealing that price is positively skewed for all diamond cuts, and each cut has plenty of outliers." width="60%" style="display: block; margin: auto;" />

---

<img src="module01_files/figure-html/unnamed-chunk-35-1.svg" title="Boxplots of life-expectancy by continent, showing varying skewness and presence of outliers." alt="Boxplots of life-expectancy by continent, showing varying skewness and presence of outliers." width="70%" style="display: block; margin: auto;" />

---
name: yourturn7
template: yourturn

.right-column[.medium[

Calculate, for `city08`, `highway08`, and `youSaveSpend`, the following:

.pull-left[
1. Variance
2. Standard Deviation
3. Five-number summary
]

.pull-right[
4. Range
5. Interquartile Range
]

Construct boxplots of `youSaveSpend` overall, and then for each `drive`

* How does the overall distribution look: skewed or not skewed?
* If skewed, in what direction?
* What about when you break the overall distribution down into separate box-plots for each `drive`?

]]
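---

## Dispersion: a minimal sketch

The variance, standard deviation, range, IQR, and outlier calculations from this section can be checked with a short Python script. This is a sketch, not the code behind the slides: variable names such as `lower_fence` and `upper_fence` are illustrative assumptions, and the quartiles `2850` and `3050` are taken directly from the quartile slide rather than recomputed.

```python
import math
from statistics import mean, stdev, variance

# Salary data from the slides (n = 12)
salaries = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]

x_bar = mean(salaries)
sum_sq = sum((x - x_bar) ** 2 for x in salaries)   # sum of squared deviations
sample_var = sum_sq / (len(salaries) - 1)          # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_var)

# Cross-check the hand calculation against the standard library
assert math.isclose(sample_var, variance(salaries))
assert math.isclose(sample_sd, stdev(salaries))

# Why n - 1? The three-value example from the slides, with known mu = 3
sample = [8, 4, 6]
mu = 3
sample_mean = mean(sample)
ss_around_mean = sum((x - sample_mean) ** 2 for x in sample)
var_known_mu = sum((x - mu) ** 2 for x in sample) / len(sample)
var_divide_n = ss_around_mean / len(sample)             # understates the spread
var_divide_n_minus_1 = ss_around_mean / (len(sample) - 1)

# Range, IQR, and the 1.5 x IQR outlier fences for the n = 11 salary data
salaries_11 = [2710, 2755, 2850, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
q1, q3 = 2850, 3050                                # quartiles from the slides
data_range = max(salaries_11) - min(salaries_11)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries_11 if x < lower_fence or x > upper_fence]

print(round(sample_var, 2), round(sample_sd, 2))
print(var_known_mu, var_divide_n, var_divide_n_minus_1)
print(data_range, iqr, lower_fence, upper_fence, outliers)
```

With fences at `2550` and `3350`, none of the 11 salaries counts as an outlier, so a box-plot of these data would show whiskers only and no dots.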