Common Statistical Methods in Program Evaluation

Ani Ruhil
2020-07-07

In the short survey that some of you participated in when we were setting up this workshop, a number of you said you were interested in learning how to conduct the usual battery of statistical tests in the course of evaluating a program. This is a quick overview of the necessary commands and, where feasible, of tests for the assumptions underlying those commands. I use a specific data set from a well-known experiment, Tennessee’s Project STAR, a study of the effects of small class size on student achievement in kindergarten through grade 3. In time, students’ test records for grades 4 through 8 were gathered, followed by their high-school data. More details of the study and the data files are available here. The data we will use are for grades KG through 3 and include only a subset of the variables originally gathered.
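
If you want to follow along, the variable names used below (readk through read3, mathk through math3, gender, and so on) match the copy of the STAR data that ships with the AER package; that is an assumption on my part, so point the loading step at wherever your copy of the data lives.

library(AER)                   # assumption: using the STAR data bundled with AER
data("STAR", package = "AER")  # loads a data frame named STAR
str(STAR[, c("readk", "read3", "mathk", "math3")]) # quick peek at a few of the score variables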

The Correlation

We may be interested in correlations between reading/mathematics scores in successive grades, as well as between reading and mathematics scores within a grade. Let us test some of these on a piecemeal basis. First up: reading scores in kindergarten and grade 1 (two-sided), followed by reading versus mathematics in kindergarten, where the natural one-sided alternative is a positive correlation.

cor.test(STAR$readk, STAR$read1,
         method = "pearson", alternative = "two.sided", conf.level = 0.95)

    Pearson's product-moment correlation

data:  STAR$readk and STAR$read1
t = 47.376, df = 4009, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5788848 0.6185798
sample estimates:
      cor 
0.5991003 
cor.test(STAR$readk, STAR$mathk,
         method = "pearson", alternative = "greater", conf.level = 0.95)

    Pearson's product-moment correlation

data:  STAR$readk and STAR$mathk
t = 77.426, df = 5784, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.7026181 1.0000000
sample estimates:
      cor 
0.7134043 

Say you have quite a few such variables and want to check all the pairwise correlations at once. What would be a good way to do that? One option is the GGally package.

library(GGally)
ggpairs(STAR[, c(8:11)])

Note that I am only using reading scores here to keep the plot manageable. Note also that the plot shows the pairwise correlations, with the usual asterisks denoting significance levels, along with the unconditional kernel density of each score and the pairwise scatter-plots. I like ggpairs(...) for exploratory data analysis because it packs a lot of useful information into a single display. If you want something fancier but less detailed, like a correlogram, you can play around with the code below; this might be good for inclusion in the final report to the client.

library(corrplot)
STAR[, c(8:15)] -> mydf # keep only reading and math scores, KG through grade 3
na.omit(mydf) -> mydf # drop any observation with missing data
cor(mydf) -> M

# helper: build a symmetric matrix of p-values for every pairwise correlation
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
cor.mtest(mydf) -> p.mat 

colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA")) -> col 
corrplot(M, method = "color", col = col(200),  
         type = "upper", order = "hclust", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 45, #Text label color and rotation
         # Combine with significance
         p.mat = p.mat, sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag = FALSE 
         )
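
Since corrplot() draws with base graphics, dropping the correlogram into that report is just a matter of wrapping the call in a graphics device. A minimal sketch, with a file name and dimensions of my own choosing:

png("star_correlogram.png", width = 7, height = 7, units = "in", res = 300) # hypothetical file name
corrplot(M, method = "color", col = col(200), type = "upper", order = "hclust",
         addCoef.col = "black", tl.col = "black", tl.srt = 45,
         p.mat = p.mat, sig.level = 0.01, insig = "blank", diag = FALSE)
dev.off()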

The Omnipresent t-tests

You may be looking to run t-tests, perhaps a one-sample test, a two-sample (aka independent samples) test with or without the assumption of equal variances, or even a paired t-test. These are all straightforward to run in R. I will focus on 3rd grade mathematics scores (math3).

One-sample t-test

Say our null hypothesis is that the mean math3 score is 650, i.e., \(\mu_0 = 650\), and we are running a two-tailed test with \(\alpha = 0.05\).

t.test(STAR$math3, mu = 650, alternative = "two.sided", conf.level = 0.95)

    One Sample t-test

data:  STAR$math3
t = -62.68, df = 6076, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 650
95 percent confidence interval:
 616.9663 618.9699
sample estimates:
mean of x 
 617.9681 

Two-sample t-test

Instead, we may be curious about differences between male and female students. Let us stick with a two-tailed test of no difference, i.e., \(\mu_{male} - \mu_{female} = 0\), and with \(\alpha = 0.05\).

t.test(STAR$math3 ~ STAR$gender, alternative = "two.sided",
       conf.level = 0.95, paired = FALSE, var.equal = FALSE)

    Welch Two Sample t-test

data:  STAR$math3 by STAR$gender
t = -0.43691, df = 6074.7, p-value = 0.6622
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.448014  1.555696
sample estimates:
  mean in group male mean in group female 
            617.7512             618.1974 
t.test(STAR$math3 ~ STAR$gender, alternative = "two.sided",
       conf.level = 0.95, paired = FALSE, var.equal = TRUE)

    Two Sample t-test

data:  STAR$math3 by STAR$gender
t = -0.43632, df = 6075, p-value = 0.6626
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.450706  1.558389
sample estimates:
  mean in group male mean in group female 
            617.7512             618.1974 
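
Which of the two versions to report, pooled or Welch, can be informed by a test of equal variances. A quick sketch using base R's var.test(), whose F test itself assumes normality (output omitted):

var.test(math3 ~ gender, data = STAR) # H0: math3 has equal variance in the two gender groups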

Paired t-test

What about a paired test of mathk versus math3? Keep in mind that the paired test only uses students with a math score in both kindergarten and grade 3, which is why the degrees of freedom drop.

t.test(STAR$math3, STAR$mathk, alternative = "two.sided",
       conf.level = 0.95, paired = TRUE)

    Paired t-test

data:  STAR$math3 and STAR$mathk
t = 158.42, df = 2889, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 126.4175 129.5860
sample estimates:
mean of the differences 
               128.0017 

Testing for the Usual Assumptions

If you are approaching these tests with an eye on the underlying assumptions of normality and/or equal variances, R has some easy ways to run these checks as well; normality checks come first, with an equal-variance check sketched at the end of this section.

  1. Normality tests with small versus large sample sizes.
library(dplyr) # needed for the pipe and sample_frac()

STAR %>%
  sample_frac(0.1) -> starsm # draw a 10% sample for the Shapiro-Wilk test

shapiro.test(starsm$read3) # shapiro.test() only accepts 3 to 5,000 observations

    Shapiro-Wilk normality test

data:  starsm$read3
W = 0.99065, p-value = 0.0007556
library(nortest)
ad.test(STAR$read3) # The Anderson-Darling test for large n 

    Anderson-Darling normality test

data:  STAR$read3
A = 4.9323, p-value = 3.37e-12
  2. Quantile-quantile plots
library(car)
qqPlot(STAR$read3)
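
  3. Equal-variance tests. Since this section also mentions equal variances, here is a sketch using Levene's test from the car package loaded above, a check that is more robust to non-normality than the F test sketched earlier.
leveneTest(math3 ~ gender, data = STAR) # H0: math3 has equal variance across gender groups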