Election Forecasting Simplified — Part 1
In this reading, we’ll see how we can use Probability Theory to construct a simple sampling model for opinion polls and find which political party has higher odds of winning.
For simplicity, let’s say there are two parties, Republican and Democratic. This can be thought of as sampling red and blue beads, respectively, from an urn containing a total of N beads (a sample from a population). We can sample them at random (so as to avoid bias) like so:
library(dslabs)
library(tidyverse)
take_poll(25) #N=25
We want to predict the proportion of blue beads (p) in the urn which in turn tells us the proportion of red beads (1-p) and the spread (p-(1-p) = 2p-1). The task of statistical inference is to predict the parameter p using observed data in the sample.
What we do now is construct an estimate (a summary of the observed data that we think is informative about the parameter of interest) of p using only the information we observe. Note that the sample proportion is a random variable. By describing the distribution of this random variable (RV), we’ll be able to gain insight into how good this estimate is and how we can make it better.
Let’s define an RV X to be 1 if we pick a blue bead at random and 0 if we pick a red one. This implies that we’re treating the population, the beads in the urn, as a list of 0s and 1s whose average is p. If we sample N beads, the average is X̅ = (X1+X2+…+XN)/N = (number of blue beads)/N, the sample proportion, which is our estimate of p.
We also know that the sum of draws of an RV (NX̅) has a distribution that is approximately normal, according to the Central Limit Theorem (CLT).
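We can see this empirically with a quick simulation (a minimal sketch; the values of p, N and the seed below are arbitrary, chosen only for illustration):
set.seed(1)  # arbitrary seed, for reproducibility
p <- 0.51    # assumed value of p, for illustration only
N <- 1000
# repeat the poll many times and look at the distribution of the sum of draws
sums <- replicate(10000, sum(sample(c(0, 1), N, replace = TRUE, prob = c(1 - p, p))))
hist(sums)   # roughly bell-shaped and centered near N*p, as the CLT predicts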
Note that an estimate of p calculated a few months before the election might differ from one calculated on election night, since people’s opinions fluctuate over time. The latter tends to be the most accurate.
1. Expected Value and Standard Error of the Estimate
To understand how good the estimate of p is, we describe the statistical properties of the RV X̅, the sample proportion. We know that NX̅ is the sum of independent draws, so the expected value of NX̅ is E(NX̅) = N*(average of the urn) = N*p. Dividing by the non-random constant N gives us E(X̅) = p. We also know that the standard error of the sum is SE(NX̅) = √N * [standard deviation of the values in the urn] = √N * (1–0)*√[p*(1-p)] = √[N*p*(1-p)]. Because we are dividing by N, we get SE(X̅) = √[p*(1-p)/N].
The law of large numbers tells us that with a large enough poll, our estimate converges to p. But how large does the poll have to be for the standard error to be as small as 0.01 (in which case we could be fairly certain of the winner)? We don’t know p in the SE formula, so for illustrative purposes let’s assume p = 0.51 and plot SE against N.
library(ggplot2)
# Set the values of p and N
p <- 0.51
N <- seq(10, 11000, by = 500)
# Calculate the standard error for each value of N
SE <- sqrt(p*(1-p)/N)
# Create a dataframe to store the data
df <- data.frame(N, SE)
# Plot the graph
ggplot(df, aes(x=N, y=SE)) +
  geom_line() +
  labs(title="Standard Error vs. Sample Size", x="Sample Size (N)", y="Standard Error (SE)")
We see that we would need a poll of over 10k people to get standard error as low as 0.01 . We rarely see polls of this size, due in part to cost. For a sample size of 1000, if we set p to be 0.51, the standard error is about 1.5%. So even with large polls for close elections, X̅ can lead us astray if we don’t realize it’s a RV.
As per the CLT, NX̅ is approximately normal. Dividing a normally distributed RV by a non-random constant (here, N) gives another normally distributed RV, i.e. if X~N(μ,σ) then X/a~N(μ/a,σ/a). This implies that the distribution of X̅ is also approximately normal. Now, how does this help our case? Suppose we want to know the probability that our estimate is within 1% of p, i.e. P(|X̅ — p| ≤ 0.01), meaning we made a very good estimate. This is the same as asking for
P(X̅ ≤ p + 0.01) — P(X̅ ≤ p — 0.01).
We calculate this probability with the usual trick of subtracting the expected value and dividing by the standard error, which turns X̅ into a standard normal RV Z:
P(Z ≤ 0.01/SE(X̅)) — P(Z ≤ —0.01/SE(X̅))
Substituting E(X̅) = p and SE(X̅) = √[p*(1-p)/N], this becomes
P(Z ≤ 0.01/√[p*(1-p)/N]) — P(Z ≤ —0.01/√[p*(1-p)/N]).
We still cannot compute this probability using just the data because we don’t know p. But the CLT still works if, in the standard error, we use X̅ in place of p. This is called the plug-in estimate: SÊ(X̅) = √[X̅*(1-X̅)/N].
So now, instead of dividing by SE(X̅), we divide by its estimate. Say our first sample of 25 beads gave 12 blue and 13 red beads; in this case X̅ = 12/25 = 0.48:
X_hat <- 0.48
se_est <- sqrt(X_hat*(1-X_hat)/25)
se_est
[1] 0.09991997
pnorm(0.01/se_est) - pnorm(-0.01/se_est)
[1] 0.07971926 #probability that we are within 1% point of p is ~0.08 i.e. 8%
#So there's a very small chance that we'll be as close as 1% point
#to the actual proportion
This wasn’t very useful but with CLT we’ll be able to determine what sample sizes are better (poll of only 25 people is not really useful, at least for a close election). And once we have those larger sample sizes, we’ll be able to provide a very good estimate and some very informative probabilities.
2. Margin of Error
It is 2 times the standard error, which in our case would be 2*se_est = 2*0.09991997 ≈ 0.2. Why multiply by 2? Because if we ask what the probability is that we are within 2 standard errors of p, the same standardization as before gives
P(|X̅ — p| ≤ 2*SÊ(X̅)) = P(—2 ≤ Z ≤ 2).
This simplifies things: we’re simply asking what the probability is that a standard normal RV (expected value 0 and standard error 1) falls within 2 units of 0, and we know that this is about 95%.
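A quick check in R:
pnorm(2) - pnorm(-2)
[1] 0.9544997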
So there’s about a 95% chance that X̅ will be within 2 standard errors of p. But why 95%? Traditionally, that’s what is used; it is the most common value used to define the margin of error and provides a good balance between precision and confidence.
If we use a lower confidence level, such as 90%, the margin of error will be smaller, but there is a greater chance that the true population proportion falls outside the interval. If we use a higher confidence level, such as 99%, the margin of error will be larger, but there is a smaller chance that the true proportion falls outside the interval.
In summary, CLT tells us that our poll based on a sample of just 25 is not very useful. We don’t really learn much when the margin of error is this large. All we can say is that the popular vote will not be won by a large margin. This is why pollsters tend to use larger sample sizes. To see how this gives us a much more practical result, note that if we had obtained X̅= 0.48, but with a sample size of N=2000, the estimated standard error would have been about 0.01 . So our result is an estimate of 48% blue beads with a margin of error of 2%. In this case, the result is much more informative and would make us think that there are more red beads than blue beads.
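A quick check of that calculation with the plug-in standard error:
X_hat <- 0.48
N <- 2000
se_est <- sqrt(X_hat*(1-X_hat)/N)
se_est      # about 0.011
2*se_est    # margin of error, about 0.022, i.e. roughly 2%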
Let’s corroborate the estimates and margins of error obtained from probability theory with a Monte Carlo simulation.
p <- 0.45  # we must pick a value of p to run the simulation (see the discussion below)
B <- 10000 # number of replicates
N <- 1000  # sample size per replicate
x_hat <- replicate(B, {
  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
  mean(x)
})
But we don’t know p. We could run an analog simulation instead — build an actual urn where we know the proportion of beads, repeatedly take_poll(10000), and keep track of the results — but that is time consuming. So instead we pick a value of p, or several values of p, and run the simulations.
To review, the theory tells us that X̅ has an approximately normal distribution with an expected value of 0.45 and a standard error of about 1.5%
p<-0.45
N<-1000
sqrt(p*(1-p)/N)
[1] 0.01573213
The simulation confirms this:
mean(x_hat)
[1] 0.4498773
sd(x_hat)
[1] 0.01569382
In real life, we would never be able to run such an experiment because we don’t know p. But we could run it for various values of p and sample sizes N and see that the theory does indeed work well for most values.
The actual task is to predict the spread (2p-1), not p. Once we have our estimate X̅ and SÊ(X̅), we estimate the spread by just plugging in X̅ for p, i.e. 2X̅-1. Since we’re multiplying an RV by 2, its standard error is also multiplied by 2, so the standard error of this new RV is 2*SÊ(X̅). (Subtracting the constant 1 adds no variability, so it does not affect the standard error.) So, for our example with just the 25 beads, our estimate of p was X̅ = 0.48 with a margin of error of 0.2. This means our estimate of the spread is 2*0.48–1 = -0.04, i.e. -4 percentage points, with a margin of error of 0.4, i.e. 40 percentage points.
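In code, for the 25-bead sample:
X_hat <- 0.48
N <- 25
2*X_hat - 1                    # estimated spread
[1] -0.04
2 * 2*sqrt(X_hat*(1-X_hat)/N)  # margin of error of the spread
[1] 0.3996799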
Why not run a very large poll? For realistic values of p, say between 0.35 and 0.65 for the popular vote, if we run a large poll with say 100000 people, theoretically we could predict the elections almost perfectly since the largest possible margin of error is about 0.3% (peak of the graph below).
library(tidyverse)
N <- 100000
p <- seq(0.35, 0.65, length = 100)
SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
data.frame(p = p, SE = SE) %>%
ggplot(aes(p, SE)) +
geom_line()
So why are there no pollsters conducting polls this large? One reason is that running polls with a sample size of 100000 is very expensive. But perhaps a more important reason is that the theory has its limitations: polling is much more complicated than picking beads from an urn. For example, while the beads are either red or blue and you can see their color, people might lie to you when you ask them. Also, because polls are usually conducted by phone, you might miss people who don’t have phones, and they might vote differently from those who do. But perhaps the most important way an actual poll differs from our urn model is that we don’t actually know for sure who is in our population and who is not. How do we know who is going to vote? Are we reaching all possible voters? So even if our margin of error is very small, it may not be exactly right that our expected value is p. We call this bias. Historically, we observe that polls are indeed biased, although not by that much: the typical bias appears to be between 1% and 2%.
3. Confidence Interval (CI)
When pollsters report an estimate and a margin of error, they are usually reporting a 95% CI. Let’s see how this works mathematically. We want to know the probability that the interval [ X̅ — 2SÊ(X̅), X̅ + 2SÊ(X̅) ] contains the actual proportion p. Note that the start and end of this interval are random: every time we take a sample, they change. So if we keep sampling with p=0.45 and N=1000 (a Monte Carlo simulation), we get a different interval each time. To determine the probability that the interval includes p, we need to compute
P(X̅ — 2SÊ(X̅) ≤ p ≤ X̅ + 2SÊ(X̅)).
Subtracting X̅ throughout, dividing by SÊ(X̅), and using the fact that (X̅ — p)/SÊ(X̅) is approximately standard normal, this becomes P(—2 ≤ Z ≤ 2).
So what we are asking is: what is the probability that a standard normal variable falls between -2 and 2? This is about 95%, so we have a 95% CI. Note that if we want a larger probability, say 99% (a 99% CI), we need to multiply by whatever value z satisfies P(-z ≤ Z ≤ z) = 0.99:
z <- qnorm(0.995)
z
[1] 2.575829
##This works because by definition,
#pnorm(qnorm(0.995))
#[1] 0.995
##And by symmetry,
#pnorm(qnorm(1-0.995))
#[1] 0.005
##So now we compute
pnorm(z)-pnorm(-z)
[1] 0.99 #This is what we wanted
We can use this approach for any confidence level q by taking z = qnorm(1-(1-q)/2). For example, to find a 95% CI, we use
z <- qnorm(.975)
which is actually slightly smaller than 2 (qnorm(0.975) = 1.959964 ≈ 1.96). The upper limit of this 95% CI will be X̅ + qnorm(.975)*SÊ(X̅), which cuts off the 2.5% highest observations, and the lower limit will be X̅ — qnorm(.975)*SÊ(X̅), which cuts off the 2.5% lowest observations.
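For instance, continuing with the first sample (X̅ = 0.48, N = 25), the 95% CI for p itself is:
X_hat <- 0.48
N <- 25
se_hat <- sqrt(X_hat*(1-X_hat)/N)
c(X_hat - qnorm(0.975)*se_hat, X_hat + qnorm(0.975)*se_hat)
[1] 0.2841605 0.6758395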
We can run a Monte Carlo simulation to confirm that, in fact, a 95% CI includes p 95% of the time.
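A minimal sketch of that simulation, assuming p = 0.45 and N = 1000 as before (between() comes from dplyr, loaded with the tidyverse):
p <- 0.45
N <- 1000
B <- 10000
inside <- replicate(B, {
  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
  x_hat <- mean(x)
  se_hat <- sqrt(x_hat*(1-x_hat)/N)
  # check whether the 95% CI covers the true p
  between(p, x_hat - qnorm(0.975)*se_hat, x_hat + qnorm(0.975)*se_hat)
})
mean(inside) # proportion of intervals containing p; should be close to 0.95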
Pollsters do not become successful for providing correct CIs, but rather for predicting who will win. When we took a sample of size 25, the CI for the spread was [-0.4396799, 0.3596799].
N <- 25
X_hat <- 0.48
(2*X_hat - 1) + c(-2, 2)*2*sqrt(X_hat*(1-X_hat)/N)
[1] -0.4396799 0.3596799
This includes 0. If we were pollsters and were forced to make a declaration about the election, we would have no choice but to say it’s a toss-up. A problem with our poll result is that, given the sample size and the value of p, we would have to sacrifice on the probability of an incorrect call to create an interval that does not include 0, i.e. an interval that makes a call about who is going to win. The fact that our interval includes 0 does not mean this election is close; it only means that we have a small sample size. In statistics, this is called lack of power. In the context of polls, power can be thought of as the probability of detecting a spread different from 0. By increasing our sample size, we lower our standard error and therefore have a much better chance of detecting the direction of the spread, as the sketch below illustrates.
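A minimal sketch of this idea, assuming (for illustration only) a true proportion p = 0.52, i.e. a 4-point spread: we estimate, for different sample sizes, how often the 95% CI for the spread excludes 0.
p <- 0.52  # assumed true proportion, for illustration only
B <- 10000
detects_spread <- function(N) {
  excludes_zero <- replicate(B, {
    x_hat <- mean(sample(c(0,1), N, replace = TRUE, prob = c(1-p, p)))
    se_hat <- sqrt(x_hat*(1-x_hat)/N)
    lower <- (2*x_hat - 1) - 2*qnorm(0.975)*se_hat
    upper <- (2*x_hat - 1) + 2*qnorm(0.975)*se_hat
    lower > 0 | upper < 0  # CI for the spread does not include 0
  })
  mean(excludes_zero)
}
detects_spread(25)   # with 25 beads we rarely detect the spread
detects_spread(2000) # with 2000 beads we detect its direction far more often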
4. p-values
Suppose we want to know, not the proportion, but only whether there are more blue beads than red ones. This is asking whether 2p-1 > 0, i.e. p > 0.5. Suppose we sample 100 beads and get 52 blue beads. This gives a spread of 4%, which is greater than 0 and so suggests there are more blue beads than red. But we know there is chance involved in this process: we could get 52 blue beads even when the actual spread is zero. So the null hypothesis (the skeptic’s hypothesis) is 2p-1 = 0, i.e. p = 0.5.
We have observed the RV 2X̅-1 to be 4% here. The p-value answers the question: how likely is it to see a value this large when the null hypothesis is true? Here this is Pr(|X̅-0.5| > 0.02), i.e. the chance that the spread is 4 percentage points or more in absolute value. Under the null hypothesis, √N*(X̅-0.5)/√(0.5*(1–0.5)) is approximately standard normal, so the p-value is
Pr( √N*|X̅-0.5|/√(0.5*(1–0.5)) > √N*0.02/√(0.5*(1–0.5)) ) = 1 — P(-z ≤ Z ≤ z), where z = √N*0.02/0.5.
This is about 69%:
N <- 100 # sample size
z <- sqrt(N) * 0.02/0.5 # spread of 0.02
1 - (pnorm(z) - pnorm(-z))
[1] 0.6891565
In this case, there’s actually a large chance of seeing 52 blue beads or more under the null hypothesis that there are as many blue beads as red beads. So observing 52 blue beads is not very strong evidence if we want to make the case that there are more blue beads than red. Note that there’s a close connection between p-values and CIs: if a 95% CI of the spread does not include 0, a little math shows that the p-value must be smaller than 1-95%, or 0.05. However, in general we prefer reporting confidence intervals over p-values, since a CI gives us an idea of the size of the estimate. A p-value simply reports a probability and says nothing about the significance of the finding in the context of the problem.
CASE STUDY: In the 2012 presidential election, Barack Obama won the electoral college and won the popular vote by a margin of 3.9%. A week before the election, Nate Silver was giving Obama a 90% chance of winning, yet none of the individual polls were nearly that sure. Using a Monte Carlo simulation, we’ll illustrate what Nate Silver saw that other pundits did not: let’s generate results for 12 polls taken a week before the election, mimicking the sample sizes of actual polls and generating the data using the actual outcome, a spread of 3.9%.
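A sketch of that simulation: the sample sizes in Ns below mimic real polls from that week (they sum to 11269, the figure used below), the seed is arbitrary, and map_df() comes from purrr (loaded with the tidyverse).
d <- 0.039       # actual spread on election night
p <- (d + 1)/2   # corresponding proportion voting for Obama
Ns <- c(1298, 533, 1342, 897, 774, 254,
        812, 324, 1291, 1056, 2172, 516)   # poll sample sizes (sum to 11269)
set.seed(3)      # arbitrary seed, for reproducibility

# one simulated poll per sample size: estimated spread and its 95% CI
polls <- map_df(Ns, function(N) {
  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
  x_hat <- mean(x)
  se_hat <- sqrt(x_hat*(1-x_hat)/N)
  data.frame(estimate = 2*x_hat - 1,
             low = 2*(x_hat - 1.96*se_hat) - 1,
             high = 2*(x_hat + 1.96*se_hat) - 1,
             sample_size = N)
}) %>% mutate(poll = seq_along(Ns))

# plot the 12 CIs along with 0 (solid line) and the actual result d (dashed line)
polls %>% ggplot(aes(poll, estimate, ymin = low, ymax = high)) +
  geom_hline(yintercept = 0) +
  geom_point() +
  geom_errorbar() +
  geom_hline(yintercept = d, lty = 2) +
  coord_flip()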
Not surprisingly, all 12 polls report CIs that include the election night result (dashed line) because these are 95% CIs. However, all 12 polls also include 0 (solid black line) as well. Therefore, if asked individually for a prediction, the pollsters would have to say: it’s a toss-up. Poll aggregators, such as Nate, realized that by combining the results of different polls, you could greatly improve precision and report a smaller 95% CI.
Although as aggregators we do not have access to the raw poll data, we can use mathematics to reconstruct what we would have obtained had we conducted one large poll with, in this case, 11269 participants. Basically, we construct an estimate of the spread — let’s call it d — with a weighted average: each poll’s estimated spread is weighted by its sample size, d̂ = Σ(estimateᵢ * Nᵢ) / ΣNᵢ, and the standard error is that of a single poll with ΣNᵢ participants.
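Continuing from the simulated polls data frame above, a sketch of the aggregation:
# weighted average of the 12 spread estimates, weighted by sample size
d_hat <- polls %>%
  summarize(avg = sum(estimate * sample_size) / sum(sample_size)) %>%
  pull(avg)

# margin of error as if this were a single poll of 11269 people
p_hat <- (1 + d_hat)/2
moe <- 2 * 1.96 * sqrt(p_hat*(1 - p_hat)/sum(polls$sample_size))
d_hat # estimated spread
moe   # roughly 0.018, much smaller than any single poll's margin of error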
Note that this is only a simulation to illustrate the idea; actual election forecasting is much more complex and involves statistical modeling, which is explained in Part 2 of this reading.
Reference: HarvardX