19  Sampling Distribution of PRE

Here, we explore the concept of the proportional reduction in error, or PRE. This activity will help to illustrate several ideas simultaneously, including simulation, sampling distributions, and (most importantly) the basic foundation of statistical inference. Not too bad for a few lines of R.

Relevant Background

This chapter uses a framework and notation taken from Judd, McClelland, and Ryan (2017). Please refer to that work for additional context, explanation, and rationale.

Assume that we have a data set and that we have calculated its mean. Let’s imagine that our question is whether this mean is equal to some specific value. For example, we might wish to know whether the mean of a set of IQs is equal to 100 or whether the mean of several temperature measurements of boiling water is 212 degrees Fahrenheit. Formalizing our hypothesis, we propose that \(\beta_0 = B_0\), where \(B_0\) is the hypothesized value of the mean. The mean of our particular sample of data, \(b_0=\bar{Y}\), will almost never be exactly equal to the hypothesized value, \(B_0\). So we need a statistical approach to decide whether to accept our hypothesis. To do so, we construct two models: one that embodies our hypothesis (the “compact” model) and a second that embodies the idea that the true mean of the population is some unknown value (the “augmented” model). This latter model estimates \(\beta_0\) as \(b_0\) and is thus more flexible than the compact model.

When we calculate the sum of squared errors (SSE) for each model, we will find that the SSE associated with the augmented model is never larger than the SSE associated with the compact model. Thus, the real question becomes whether the augmented model reduces the SSE enough for us to favor it over the compact model. This is a statistical question. Indeed, this is arguably the statistical question and is certainly the essence of statistical testing.
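This inequality is easy to demonstrate: the sample mean minimizes the sum of squared errors, so the augmented model’s SSE can never exceed the compact model’s. A minimal base-R sketch (the data vector here is invented purely for illustration):

```r
# invented toy data; any sample would do
y <- c(48, 53, 51, 49, 55)

# sum of squared errors for a given single-value prediction
sse <- function(prediction, values) sum((values - prediction)^2)

sse_compact   <- sse(50, y)       # predictions fixed at the hypothesized value
sse_augmented <- sse(mean(y), y)  # predictions set to the sample mean

sse_augmented <= sse_compact      # TRUE for any sample, by construction
```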

To make this exercise concrete, let’s generate a sample in which the null hypothesis is true. That is, \(\beta_0=50\).

# make example reproducible
set.seed(1)

# set the sample size
sample_size = 20

# draw a sample from the population distribution
df.sample = tibble(
    # enumerate the observations in the sample
    observation = 1:sample_size,
    # randomly sample the observed values
    value = rnorm(sample_size, mean = 50, sd = 5)
)
df.sample
# A tibble: 20 × 2
   observation value
         <int> <dbl>
 1           1  46.9
 2           2  50.9
 3           3  45.8
 4           4  58.0
 5           5  51.6
 6           6  45.9
 7           7  52.4
 8           8  53.7
 9           9  52.9
10          10  48.5
11          11  57.6
12          12  51.9
13          13  46.9
14          14  38.9
15          15  55.6
16          16  49.8
17          17  49.9
18          18  54.7
19          19  54.1
20          20  53.0

Now let’s calculate the squared error for each observation under each model, and then the PRE based on those squared errors (note that we label these individual squared differences “SSE”s, even though we haven’t summed anything yet).

df.sample <- df.sample %>%
    # these are the "predictions" of each model
    mutate(
        compact = 50,
        augmented = mean(value),
        # SSE associated with the compact model
        sse_compact = (value - compact) ^ 2,
        # SSE associated with the augmented model
        sse_augmented = (value - augmented) ^ 2,
        # sum of squares reduced (SSR)
        ssr = sse_compact - sse_augmented
    )
df.sample
# A tibble: 20 × 7
   observation value compact augmented sse_compact sse_augmented     ssr
         <int> <dbl>   <dbl>     <dbl>       <dbl>         <dbl>   <dbl>
 1           1  46.9      50      51.0     9.81         16.7      -6.88 
 2           2  50.9      50      51.0     0.843         0.00118   0.842
 3           3  45.8      50      51.0    17.5          26.3      -8.87 
 4           4  58.0      50      51.0    63.6          49.3      14.3  
 5           5  51.6      50      51.0     2.71          0.483     2.23 
 6           6  45.9      50      51.0    16.8          25.6      -8.72 
 7           7  52.4      50      51.0     5.94          2.20      3.74 
 8           8  53.7      50      51.0    13.6           7.50      6.13 
 9           9  52.9      50      51.0     8.29          3.71      4.58 
10          10  48.5      50      51.0     2.33          6.15     -3.82 
11          11  57.6      50      51.0    57.1          43.6      13.5  
12          12  51.9      50      51.0     3.80          0.993     2.81 
13          13  46.9      50      51.0     9.65         16.5      -6.83 
14          14  38.9      50      51.0   123.          145.      -22.0  
15          15  55.6      50      51.0    31.6          21.8       9.81 
16          16  49.8      50      51.0     0.0505        1.39     -1.34 
17          17  49.9      50      51.0     0.00655       1.07     -1.06 
18          18  54.7      50      51.0    22.3          14.2       8.08 
19          19  54.1      50      51.0    16.9           9.94      6.92 
20          20  53.0      50      51.0     8.82          4.07      4.75 

Now let’s sum each column across observations and calculate PRE from the summed squared errors.

df.summary <- df.sample %>%
    summarise(across(everything(), sum)) %>%
    mutate(pre = ssr / sse_compact)
df.summary
# A tibble: 1 × 8
  observation value compact augmented sse_compact sse_augmented   ssr    pre
        <int> <dbl>   <dbl>     <dbl>       <dbl>         <dbl> <dbl>  <dbl>
1         210 1019.    1000     1019.        414.          396.  18.1 0.0438

So the compact model (which represents the null hypothesis) predicts that each observation should be 50, whereas the augmented model (in which \(\beta_0\) is estimated from the data) predicts that each observation should be 50.953.

As a result, PRE is 0.044. Is this sufficient to convince us that the augmented model is preferable to the compact model? It’s not clear. Because \(SSE(A)\) (i.e., sse_augmented) will never be larger than \(SSE(C)\) (i.e., sse_compact), PRE will never be negative, so we cannot, for example, simply check whether PRE exceeds zero. Even when the compact model is correct, \(b_0=\bar{Y}\) will always be slightly less than or greater than the value of 50 proposed by the null hypothesis. So when the null hypothesis is true, we cannot expect PRE to be zero. What should we expect it to be, then?
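One analytic benchmark (an aside not in the original text): for a test of a single parameter, classical theory relates PRE to the F statistic via \(F = \frac{PRE}{(1-PRE)/(n-1)}\), which implies that under the null hypothesis PRE follows a Beta(1/2, (n−1)/2) distribution with mean \(1/n\). With n = 20, that expectation is 0.05. A quick check by transforming F draws:

```r
n <- 20
expected_pre <- 1 / n    # 0.05 under the null, per the Beta(1/2, (n-1)/2) result

# transform draws from F(1, n - 1) into PRE values and compare the average
set.seed(2)
f_draws <- rf(100000, df1 = 1, df2 = n - 1)
pre_draws <- f_draws / (f_draws + (n - 1))
mean(pre_draws)          # close to 0.05
```

The simulation that follows reaches the same answer empirically, without leaning on distributional theory.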

One way to answer this question is to construct a sampling distribution of PRE under the null hypothesis. We can do this by using the steps we conducted above (in which the null hypothesis was true), but performing them over and over again. Let’s do that.

n_samples = 100000
# true mean of the distribution
mu = 50
# true standard deviation of the errors
sigma = 5

# function to draw samples from the population distribution
fun.draw_sample = function(sample_size, mu, sigma) {
    sample = mu + rnorm(sample_size,
                        mean = 0,
                        sd = sigma)
    return(sample)
}

# draw samples
samples = n_samples %>%
    replicate(fun.draw_sample(sample_size, mu, sigma)) %>%
    # transpose the resulting matrix
    t()

# put samples in data frame and compute PRE
df.samples = samples %>%
    as_tibble(.name_repair = ~ str_c(1:ncol(samples))) %>%
    mutate(sample = 1:n()) %>%
    pivot_longer(cols = -sample,
                 names_to = "index",
                 values_to = "value") %>%
    mutate(compact = mu) %>%
    group_by(sample) %>%
    mutate(augmented = mean(value)) %>%
    summarize(
        # SSE associated with the compact model
        sse_compact = sum((value - compact) ^ 2),
        # SSE associated with the augmented model
        sse_augmented = sum((value - augmented) ^ 2),
        # sum of squares reduced (SSR)
        ssr = sse_compact - sse_augmented,
        # PRE calculation
        pre = ssr / sse_compact
    )

Let’s take a look at a few of the samples to see what we’re working with:

samples %>%
    as_tibble(.name_repair = ~ str_c(1:ncol(samples))) %>%
    mutate(sample = 1:n()) %>%
    pivot_longer(cols = -sample,
                 names_to = "index",
                 values_to = "value") %>%
    mutate(compact = mu) %>%
    group_by(sample) %>%
    mutate(augmented = mean(value)) %>%
    ungroup() %>%
    mutate(index = as.numeric(index)) %>%
    arrange(sample, index) %>%
    slice_head(n=25) %>%
    kable(digits = 2) %>%
    kable_styling(bootstrap_options = "striped", full_width = F)
sample index value compact augmented
1 1 54.59 50 49.97
1 2 53.91 50 49.97
1 3 50.37 50 49.97
1 4 40.05 50 49.97
1 5 53.10 50 49.97
1 6 49.72 50 49.97
1 7 49.22 50 49.97
1 8 42.65 50 49.97
1 9 47.61 50 49.97
1 10 52.09 50 49.97
1 11 56.79 50 49.97
1 12 49.49 50 49.97
1 13 51.94 50 49.97
1 14 49.73 50 49.97
1 15 43.11 50 49.97
1 16 47.93 50 49.97
1 17 48.03 50 49.97
1 18 49.70 50 49.97
1 19 55.50 50 49.97
1 20 53.82 50 49.97
2 1 49.18 50 50.69
2 2 48.73 50 50.69
2 3 53.48 50 50.69
2 4 52.78 50 50.69
2 5 46.56 50 50.69

Here, sample indicates which of the 100000 samples each row is associated with, index indicates which of the 20 observations within that sample each row is associated with, compact represents the value predicted by the compact model, and augmented represents the value predicted by the augmented model. Note that the value predicted by the augmented model differs between the first sample and the second, because the augmented model estimates \(\beta_0\) as \(b_0=\bar{Y}\) separately within each sample. Finally, value represents the value that was actually observed. \(SSE(C)\) is now easy to calculate, as it is simply the sum of the squared differences between the value column and the compact column. Similarly, \(SSE(A)\) is the sum of the squared differences between the value column and the augmented column. PRE is then \(\frac{SSR}{SSE(C)}=\frac{SSE(C)-SSE(A)}{SSE(C)}\).

Now let’s plot the distribution of the PRE values we just generated:

ggplot(data = df.samples,
       mapping = aes(x = pre)) +
    stat_density(geom = "line") +
    labs(x = "Proportional Reduction in Error") +
    lims(x = c(0, .4), y = c(0, 20)) +
    theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

We can also visualize the cumulative version of this distribution which will come in handy later:

# plot the cumulative sampling distribution for PRE
ggplot(data = df.samples,
       mapping = aes(x = pre)) +
    stat_ecdf(geom = "line") +
    labs(x = "Proportional reduction in error") +
    lims(x = c(0, .5), y = c(0, 1))
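To preview how this cumulative view will be used: the p-value we eventually want is simply one minus the ECDF evaluated at the observed PRE. Here is a self-contained sketch that re-simulates the null distribution on a smaller scale (rather than reusing df.samples):

```r
set.seed(1)
n <- 20
# simulate PRE under the null hypothesis, mirroring the steps above
pre_null <- replicate(10000, {
    y <- rnorm(n, mean = 50, sd = 5)
    sse_c <- sum((y - 50)^2)
    sse_a <- sum((y - mean(y))^2)
    (sse_c - sse_a) / sse_c
})

# proportion of null PREs at or above the observed value
p_value <- 1 - ecdf(pre_null)(0.0438)
p_value    # roughly 0.36
```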

We can also look at a tabular version of this same data:

df.samples %>%
    select(pre) %>%
    group_by(ints = cut_width(pre, width = .01, boundary = 0)) %>%
    summarise("prop" = n() / n_samples) %>%
    mutate("cum. prop" = cumsum(prop)) %>%
    ungroup() %>%
    slice_head(n = 40) %>%
    kable(digits = 2) %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
ints prop cum. prop
[0,0.01] 0.34 0.34
(0.01,0.02] 0.13 0.46
(0.02,0.03] 0.09 0.55
(0.03,0.04] 0.07 0.62
(0.04,0.05] 0.05 0.67
(0.05,0.06] 0.05 0.72
(0.06,0.07] 0.04 0.75
(0.07,0.08] 0.03 0.79
(0.08,0.09] 0.03 0.81
(0.09,0.1] 0.02 0.84
(0.1,0.11] 0.02 0.86
(0.11,0.12] 0.02 0.88
(0.12,0.13] 0.02 0.89
(0.13,0.14] 0.01 0.91
(0.14,0.15] 0.01 0.92
(0.15,0.16] 0.01 0.93
(0.16,0.17] 0.01 0.94
(0.17,0.18] 0.01 0.94
(0.18,0.19] 0.01 0.95
(0.19,0.2] 0.01 0.96
(0.2,0.21] 0.01 0.96
(0.21,0.22] 0.00 0.97
(0.22,0.23] 0.00 0.97
(0.23,0.24] 0.00 0.98
(0.24,0.25] 0.00 0.98
(0.25,0.26] 0.00 0.98
(0.26,0.27] 0.00 0.98
(0.27,0.28] 0.00 0.99
(0.28,0.29] 0.00 0.99
(0.29,0.3] 0.00 0.99
(0.3,0.31] 0.00 0.99
(0.31,0.32] 0.00 0.99
(0.32,0.33] 0.00 0.99
(0.33,0.34] 0.00 0.99
(0.34,0.35] 0.00 1.00
(0.35,0.36] 0.00 1.00
(0.36,0.37] 0.00 1.00
(0.37,0.38] 0.00 1.00
(0.38,0.39] 0.00 1.00
(0.39,0.4] 0.00 1.00

So let’s return to the question of interest. The value of PRE observed in our original sample was 0.044. We wondered whether this was sufficiently large to convince us to prefer the augmented model. We now know what values of PRE to expect when the compact model (i.e., the null hypothesis) is true. So we can compare our observed value of PRE to the distribution of PRE values under the null hypothesis to see whether our observed value is “large enough”. We do so by asking how often PRE takes on a value equal to or larger than our observed value. Mechanically, we can simply count how many of our 100000 samples (all of which were generated under the compact model) produced a PRE equal to or larger than the observed value. It turns out that 36220 of our 100000 samples (or 36.22%) yield a PRE greater than or equal to 0.044.

df.samples %>%
    summarize(p_value = sum(pre >= df.summary$pre) / n())
# A tibble: 1 × 1
  p_value
    <dbl>
1   0.362
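As a cross-check (an addition to the original analysis): because PRE for a single-parameter test maps onto the F statistic, the simulated p-value can be compared against the analytic F distribution. Assuming the standard conversion with 1 and n − 1 degrees of freedom:

```r
n <- 20
pre_obs <- 0.0438    # the observed PRE from df.summary

# convert PRE to an F statistic with 1 and n - 1 degrees of freedom
f_obs <- pre_obs / ((1 - pre_obs) / (n - 1))
p_analytic <- 1 - pf(f_obs, df1 = 1, df2 = n - 1)
round(p_analytic, 2)    # about 0.36, in line with the simulated 0.362
```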

So what do we ultimately conclude about the compact and augmented models?