Common Statistical Methods in Program Evaluation

Ani Ruhil
2020-07-07

In the short survey that some of you participated in when we were setting up this workshop, a number of you said you were interested in learning how to conduct the usual battery of statistical tests in the course of evaluating a program. This is a quick overview of the necessary commands and, where feasible, tests for the assumptions underlying those tests. I use a very specific data-set that was part of a pretty famous experiment – Tennessee's Project STAR. Basically, it was a study of the effects of small class-size on student achievement in kindergarten through 3rd grade. In time, students' test records for grades 4 through 8, and then their high-school data, were also gathered. More details of the study and the data files are available here. The data we will use are for grades KG through 3 and include only a subset of the variables originally gathered.
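
If you do not have the workshop copy of the data handy, note that a version of the STAR data ships with the AER package under the same variable names used below; loading it that way is one option (this assumes you have AER installed).

library(AER) # bundles a version of the STAR data
data("STAR", package = "AER")
# peek at the variables used in this overview
str(STAR[, c("readk", "read1", "read2", "read3",
             "mathk", "math3", "stark", "gender", "ethnicity")])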

The Correlation

We may be interested in correlations between reading/mathematics scores in successive grades, as well as between reading and mathematics scores within a grade. Let us test some of these on a piecemeal basis. First up: reading scores.

cor.test(STAR$readk, STAR$read1,
         method = "pearson", alternative = "two.sided", conf.level = 0.95)

    Pearson's product-moment correlation

data:  STAR$readk and STAR$read1
t = 47.376, df = 4009, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5788848 0.6185798
sample estimates:
      cor 
0.5991003 
cor.test(STAR$readk, STAR$mathk,
         method = "pearson", alternative = "greater", conf.level = 0.95)

    Pearson's product-moment correlation

data:  STAR$readk and STAR$mathk
t = 77.426, df = 5784, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.7026181 1.0000000
sample estimates:
      cor 
0.7134043 
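
An aside: Pearson's r can be distorted by skewed scores or outliers. If that is a worry, a rank-based Spearman correlation is a common fallback; a minimal sketch (exact = FALSE sidesteps the exact p-value computation, which R will decline anyway when there are tied ranks):

cor.test(STAR$readk, STAR$mathk,
         method = "spearman", exact = FALSE) # rank-based, robust to outliers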

Say you have quite a few such variables and want to check all the pairwise correlations. What would be a good way to do that? By using the GGally package.

library(GGally)
ggpairs(STAR[, c(8:11)]) # columns 8 through 11 are the reading scores, KG through grade 3

Note that I am only using the reading scores here to keep the plot manageable. The plot shows you the pairwise correlations with the usual asterisks denoting significance levels, the unconditional kernel density of each score, and the pairwise scatter-plots. I like ggpairs(...) for exploratory data analysis because it gives you a lot of good information in one place. If you want something fancier but less detailed, like a correlogram, you can play around with the code below; this might be good for inclusion in the final report to the client.

library(corrplot)
STAR[, c(8:15)] -> mydf # keep only reading and math scores, KG through grade 3
na.omit(mydf) -> mydf # drop any observation with missing data
cor(mydf) -> M

# helper: compute the matrix of p-values for all pairwise correlations
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of p-values for the pairwise correlations
cor.mtest(mydf) -> p.mat 

colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA")) -> col 
corrplot(M, method = "color", col = col(200),  
         type = "upper", order = "hclust", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 45, #Text label color and rotation
         # Combine with significance
         p.mat = p.mat, sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag = FALSE 
         )

The Omnipresent t-tests

You may be looking to run t-tests, perhaps a one-sample test, a two-sample (aka independent samples) test with/without the assumption of equal variances, or even a paired t-test. These are all very straightforward to run in R. I will focus on 3rd grade mathematics scores.

One-sample t-test

Say our null hypothesis is that the mean math3 score is 650, i.e., \(\mu_0 = 650\), and we are running a two-tailed test with \(\alpha = 0.05\).

t.test(STAR$math3, mu = 650, alternative = "two.sided", conf.level = 0.95)

    One Sample t-test

data:  STAR$math3
t = -62.68, df = 6076, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 650
95 percent confidence interval:
 616.9663 618.9699
sample estimates:
mean of x 
 617.9681 
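
If you want to see what the test is doing under the hood, the t statistic is just the standardized distance of the sample mean from the hypothesized 650. A quick by-hand check (a sketch; it simply reproduces the numbers above):

x <- na.omit(STAR$math3) # the test silently drops missing scores
(mean(x) - 650) / (sd(x) / sqrt(length(x))) # should match t = -62.68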

Two-sample t-test

Instead, we may be curious about differences between male and female students. Let us stick with a two-tailed test of no difference, i.e., \(\mu_{male} - \mu_{female} = 0\), and with \(\alpha = 0.05\).

t.test(STAR$math3 ~ STAR$gender, alternative = "two.sided",
       conf.level = 0.95, paired = FALSE, var.equal = FALSE)

    Welch Two Sample t-test

data:  STAR$math3 by STAR$gender
t = -0.43691, df = 6074.7, p-value = 0.6622
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.448014  1.555696
sample estimates:
  mean in group male mean in group female 
            617.7512             618.1974 
t.test(STAR$math3 ~ STAR$gender, alternative = "two.sided",
       conf.level = 0.95, paired = FALSE, var.equal = TRUE)

    Two Sample t-test

data:  STAR$math3 by STAR$gender
t = -0.43632, df = 6075, p-value = 0.6626
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.450706  1.558389
sample estimates:
  mean in group male mean in group female 
            617.7512             618.1974 

Paired t-test

What about a paired test of mathk versus math3?

t.test(STAR$math3, STAR$mathk, alternative = "two.sided",
       conf.level = 0.95, paired = TRUE) # var.equal plays no role in a paired test

    Paired t-test

data:  STAR$math3 and STAR$mathk
t = 158.42, df = 2889, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 126.4175 129.5860
sample estimates:
mean of the differences 
               128.0017 
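
A quick sanity check: the paired test is just a one-sample test on the within-student gains, so you can reproduce the pieces directly. A minimal sketch (note that R keeps only the complete pairs, which is why the degrees of freedom are smaller here than in the earlier tests):

gain <- STAR$math3 - STAR$mathk # NA wherever either score is missing
mean(gain, na.rm = TRUE)        # should match the mean of the differences above
sum(!is.na(gain)) - 1           # complete pairs minus one = df = 2889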

Testing for the Usual Assumptions

If you are approaching these tests with an eye on the underlying assumptions of normality and/or equal variances, R makes the standard diagnostic checks easy to run as well.

  1. Normality tests with small versus large sample sizes.
library(dplyr) # for sample_frac() and the pipe

STAR %>%
  sample_frac(0.1) -> starsm # a random 10% sample; results will vary unless you set a seed

shapiro.test(starsm$read3) # shapiro.test() requires 3 <= n <= 5000

    Shapiro-Wilk normality test

data:  starsm$read3
W = 0.99065, p-value = 0.0007556
library(nortest)
ad.test(STAR$read3) # The Anderson-Darling test for large n 

    Anderson-Darling normality test

data:  STAR$read3
A = 4.9323, p-value = 3.37e-12
  2. Quantile-quantile plots
library(car)
qqPlot(STAR$read3) # prints the row indices of the most extreme observations

[1] 3949 4878
  3. Equal variances with Levene's test
library(car)
leveneTest(STAR$read3 ~ STAR$gender, center = "mean")
Levene's Test for Homogeneity of Variance (center = "mean")
        Df F value Pr(>F)
group    1  1.4111 0.2349
      5998               
leveneTest(STAR$read3 ~ STAR$gender, center = "median")
Levene's Test for Homogeneity of Variance (center = "median")
        Df F value Pr(>F)
group    1  1.4025 0.2364
      5998               
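
Levene's test is popular because it does not lean heavily on normality. If you instead want the classical F test of equal variances (which does assume normal data), base R's var.test() is the route; a minimal sketch:

var.test(STAR$read3 ~ STAR$gender) # F test; sensitive to departures from normality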

Chi-Square \((\chi^2)\) Tests

These tests are easy to set up and execute. Say we are curious about the independence (or lack thereof) of ethnicity and class-size in kindergarten. To run the test you can either create a basic table that you then feed to the test, or use the original factors as they are in the data-set.

table(STAR$ethnicity, STAR$stark) -> tab1

chisq.test(tab1)
Warning in chisq.test(tab1): Chi-squared approximation may be
incorrect

    Pearson's Chi-squared test

data:  tab1
X-squared = 15.786, df = 10, p-value = 0.1059
chisq.test(STAR$ethnicity, STAR$stark)
Warning in chisq.test(STAR$ethnicity, STAR$stark): Chi-squared
approximation may be incorrect

    Pearson's Chi-squared test

data:  STAR$ethnicity and STAR$stark
X-squared = 15.786, df = 10, p-value = 0.1059

Note the warning, triggered by cells with very small expected frequencies:

chisq.test(tab1)$exp
          
                regular        small regular+aide
  cauc     1468.0366972 1271.8073394 1494.1559633
  afam      713.5615312  618.1812717  726.2571971
  asian       4.8541601    4.2053148    4.9405252
  hispanic    1.7336286    1.5018981    1.7644733
  amindian    0.6934514    0.6007593    0.7057893
  other       3.1205315    2.7034166    3.1760519

This is a common problem as well, and we could invoke the usual solution of collapsing the sparse cells.

STAR %>%
  mutate(new_ethnicity = case_when(
    ethnicity %in% c("asian", "hispanic", "amindian", "other") ~ "other",
    ethnicity == "cauc" ~ "caucasian",
    ethnicity == "afam" ~ "african-american")
    ) -> STAR

chisq.test(STAR$new_ethnicity, STAR$stark)

    Pearson's Chi-squared test

data:  STAR$new_ethnicity and STAR$stark
X-squared = 4.6757, df = 4, p-value = 0.3222
chisq.test(STAR$new_ethnicity, STAR$stark)$exp
                  STAR$stark
STAR$new_ethnicity    regular       small regular+aide
  african-american  713.56153  618.181272    726.25720
  caucasian        1468.03670 1271.807339   1494.15596
  other              10.40177    9.011389     10.58684
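
Collapsing categories is not the only fix. chisq.test() can also compute a Monte Carlo p-value via simulate.p.value, which does not rely on the large-sample approximation at all; a minimal sketch on the original, un-collapsed table (B is the number of simulated tables):

chisq.test(tab1, simulate.p.value = TRUE, B = 9999) # no expected-frequency requirement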