How to use statistics to forecast an election

Considering this topic's importance and interest to a wide range of readers, I am going to aim to make this article accessible to as broad an audience as possible. As such, whenever something needs an in-depth explanation, I will put it inside a box that interested readers can expand for further details, like the one below.

Details Han shot first.
The Head-to-Head

Imagine a big box full of red and blue balls. We don't know how many of the balls are red or blue, or even how many balls are in the box total, we just know that the box is full of red and blue balls. If we want to get an idea of how many red and blue balls are in the box, what we would probably do is start scooping out samples of balls and counting them up.

This is effectively what an election is. In Pennsylvania, there are somewhere between 6 and 7 million red and blue balls. When we take a poll, ideally we are just pulling out a few hundred balls and trying to get an idea of what the remaining millions are doing.

Of course, the matter is complicated somewhat because some of the balls we pull are neither red nor blue, but rather a third color, say white. These balls represent third-party voters. Hopefully everyone reading this can agree that there is no reasonable chance of any state being won by a third party. If that's the case, then the question becomes: "Will Trump or Harris get more of the non-third-party vote?" While this is not strictly necessary as a first step, it does simplify the math quite a bit.

With this in mind, if we pull 1000 balls from the box and we get, for example, 50 white balls, we can freely toss them back in the box, and just think of the poll as having polled 950 of the non-third-party voters. For the more mathematically inclined readers, this step does require some justification, which I'll leave in the box below.

Details Let the number of red, blue, and white balls in the box be $R$, $B$, and $W$ respectively, and let $N=R+B+W$. Suppose we draw $n$ balls at random, finding $n_R$ red balls, $n_B$ blue balls, and $n_W$ white balls. $$ \begin{align*} P(n_R, n_B, n_W) = \dfrac{\binom{R}{n_R}\binom{B}{n_B}\binom{W}{n_W}}{\binom{R+B+W}{n_R+n_B+n_W}} \end{align*} $$ Now, since we are only interested in looking at the distribution of red and blue balls in a sample of $n - n_W$ balls, we find $$ \begin{align*} P(n_R, n_B, n_W \mid n_W) &= \dfrac{P(n_R, n_B, n_W)}{P(n_W)} \\ &= \dfrac{\left[\dfrac{\binom{R}{n_R}\binom{B}{n_B}\binom{W}{n_W}}{\binom{R+B+W}{n_R+n_B+n_W}}\right]}{\left[\dfrac{\binom{W}{n_W}\binom{N-W}{n-n_W}}{\binom{N}{n}}\right]}\\ &= \dfrac{\binom{R}{n_R}\binom{B}{n_B}}{\binom{N-W}{n-n_W}}\\ &= \dfrac{\binom{R}{n_R}\binom{B}{n_B}}{\binom{R+B}{n_R + n_B}} \end{align*} $$ which describes the distribution of the $R+B$ red and blue balls in a sample of $n-n_W = n_R + n_B$ balls, as expected.
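For readers who would like to sanity-check this identity numerically, here is a minimal sketch in Python. The box counts are made-up toy numbers; the code compares the conditional probability computed from the joint distribution against the bivariate hypergeometric formula directly:

```python
from math import comb

# Toy box (made-up counts): R red, B blue, W white balls; draw n balls.
R, B, W, n = 8, 6, 4, 7
N = R + B + W

def p_joint(nr, nb, nw):
    """P(n_R, n_B, n_W): multivariate hypergeometric probability."""
    return comb(R, nr) * comb(B, nb) * comb(W, nw) / comb(N, n)

def p_nw(nw):
    """Marginal P(n_W): hypergeometric on white vs. non-white balls."""
    return comb(W, nw) * comb(N - W, n - nw) / comb(N, n)

# Fix the number of white balls drawn and compare, for every split of the
# remaining draws, the conditional against the two-color hypergeometric.
nw = 2
for nr in range(n - nw + 1):
    nb = n - nw - nr
    conditional = p_joint(nr, nb, nw) / p_nw(nw)
    direct = comb(R, nr) * comb(B, nb) / comb(R + B, nr + nb)
    assert abs(conditional - direct) < 1e-12
```

The assertion passing for every split confirms that tossing the white balls back and treating the poll as a smaller two-color sample is legitimate.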

With that out of the way, we can simplify the situation down to the head-to-head vote of Harris vs Trump after removing third party votes. I should clarify here that this is not to say that third-party votes should be ignored in the election. They are a very important part of the system. However, for the sake of the math model, it is convenient to not consider them.

Bayesian Statistics

Bayesian statistics is a similar but slightly different collection of tools from the typical statistics most people are familiar with. In the typical approach, we start with a model and then test it against the data we collect to see whether the evidence supports or rejects the model. By contrast, in Bayesian statistics, we begin with an initial belief about the model, and as we collect new evidence, we update that belief to improve our understanding of the model.

We do this by using Bayes' theorem. Although this is one of the more technical components of the article, it's so central that I think it's important to leave it out in the open rather than in a box. Bayes' theorem is as follows.

$$ P(\text{model}\mid\text{evidence}) = \dfrac{P(\text{evidence} \mid \text{model})\cdot P(\text{model})}{P(\text{evidence})} $$

$P(\text{model}\mid\text{evidence})$ is the thing we're ultimately after. It tells us how much we should believe a particular claim about the election given the evidence we have already observed, in this case the polls.

$P(\text{evidence}\mid\text{model})$ is the opposite. It asks, assuming that a particular claim about the election is true, how likely were we to see what we saw in the polls?

$P(\text{model})$ is called the prior distribution. This is a measure of how likely we were to believe a particular claim about the election before collecting our evidence. Among other things, this is useful because we can use it to inject our qualitative beliefs about the election external to the polls, such as impact from the presidential debate, and so on. With that said, for the purposes of this analysis, I do not attempt to meddle with prior distributions in this way.
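To make these three pieces concrete, here is a minimal toy calculation in Python. Everything in it is a made-up illustration: two candidate vote shares, one invented poll, and a binomial likelihood used as a simple stand-in for the draw:

```python
from math import comb

# Two toy "models": the Democratic share of the two-party vote is 48% or 52%.
models = [0.48, 0.52]
prior = {p: 0.5 for p in models}          # P(model): uniform before any polls

# Evidence: one hypothetical poll of 100 respondents finds 55 Democratic voters.
n, b = 100, 55

def likelihood(p):
    """P(evidence | model), using a binomial approximation to the draw."""
    return comb(n, b) * p**b * (1 - p)**(n - b)

# P(evidence) = sum over models of P(evidence | model) * P(model)
p_evidence = sum(likelihood(p) * prior[p] for p in models)

# Bayes' theorem: P(model | evidence)
posterior = {p: likelihood(p) * prior[p] / p_evidence for p in models}
```

Since the poll leaned 55%, the posterior shifts belief toward the 52% model, exactly the kind of updating described below.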

Intuitively, Bayesian updating is very similar to how we update our beliefs in real life when presented with new evidence. For example, if you have plans for a hike tomorrow, you might check the weather forecast. If the forecast says it will be sunny, you'll wake up with the prior belief that rain is unlikely. However, when you walk outside and see clouds, you will try to reconcile that prior belief with the new evidence pointing away from sun and toward rain. Afterward, you'll probably end up somewhere in the middle, believing that the clouds might pass soon, but being more open than before to the chance of rain.

An important advantage of Bayesian statistics is that it only requires the polls to be fair. The typical frequentist approach requires the polls to be something called independent and identically distributed, which is a fancy way of saying that the results of one poll don't affect the results of another, and that they all have the same chance of producing any particular outcome.

In the case of polls, the last requirement translates roughly to saying that the polls are all of the same size. Here, we run into issues, since polls might range from a few hundred respondents to well over 1000.

Bayesian statistics, on the other hand, have no such requirement. As long as we know how a particular model affects our chances of seeing a particular set of polls, we're good to go.

The Election Model

Going back to our model of an election as a large box with an unknown number of red and blue balls, it follows naturally that the outcome of a poll is described by a hypergeometric distribution.

The goal then becomes to determine the size of the voting population, as well as the size of the population voting for one of the parties (for the sake of this analysis, let's say Democrats). Once we have these things, we can combine them into a distribution of overall vote percentage, which will tell us the likelihood of a particular candidate winning in a particular state.

To find the probability of a particular electorate size and Democratic electorate size, which I will start referring to as "the model" for short, we need to calculate the pieces described in the last section.

$P(\text{evidence}\mid\text{model})$ is pretty straightforward, as this is just the aforementioned hypergeometric distribution.

The prior distribution, $P(\text{model})$, is a little more interesting. Since we have not collected any evidence yet, we get to choose our starting beliefs about the election.

For the sake of this model, which is meant to forecast the outcome of swing states, I believe that overall voter turnout could be anywhere between the 2016 turnout and the 2020 turnout. Since these are swing states, I expect the total number of Democratic voters to be about 50% of the electorate, but to be safe let's say anywhere between 45% and 55%.

$P(\text{evidence})$ is a little more involved, but it amounts to repeating the above steps for every model we believe could be viable and adding up the results.

The details are in the box below, for curious readers.

Details Suppose the model suggests a voter turnout of $N$ with $B$ total Democrat voters. A poll is taken of size $n$, finding $b$ Democrat voters. Then \begin{align*} P(\text{evidence}\mid\text{model}) &= \dfrac{\binom{B}{b}\binom{N-B}{n-b}}{\binom{N}{n}} \end{align*} Since polls are, at least ostensibly, independent, if we have multiple polls we can simply multiply these probabilities together. More formally, suppose a series of polls of size $n_i$ each finds $b_i$ Democrat voters, then \begin{align*} P(\text{evidence}\mid\text{model}) &= \prod\limits_{i} \dfrac{\binom{B}{b_i}\binom{N-B}{n_i-b_i}}{\binom{N}{n_i}} \end{align*} Since we are starting with a prior belief of any $N$ and any $B$ within a reasonable range being equally likely, $$P(\text{model}) = \frac{1}{|M|}$$ Where $M$ is the set of models being considered, and therefore \begin{align*} P(\text{evidence}) &= \sum\limits_{\text{model}\in M} P(\text{evidence}\mid \text{model}) P(\text{model}) \\ &= \sum\limits_{N,B} \prod\limits_i \dfrac{\binom{B}{b_i}\binom{N-B}{n_i-b_i}}{\binom{N}{n_i}} \cdot \frac{1}{|M|} \end{align*}
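As a rough illustration of how these formulas turn into a computation, here is a sketch in Python. The poll numbers and the model grid are made up, and log-space arithmetic stands in for the raw binomial coefficients, which would be unwieldy at electorate sizes in the millions:

```python
from math import lgamma, exp

# Hypothetical polls: (sample size, Democratic respondents), third party removed.
polls = [(800, 410), (600, 290), (1000, 515)]

def log_comb(a, k):
    """Log of the binomial coefficient, safe for populations in the millions."""
    return lgamma(a + 1) - lgamma(k + 1) - lgamma(a - k + 1)

def log_likelihood(N, B):
    """log P(evidence | model): independent polls multiply, so their logs add."""
    return sum(
        log_comb(B, b) + log_comb(N - B, n - b) - log_comb(N, n)
        for n, b in polls
    )

# Uniform prior over a coarse grid of models (turnout N, Democratic voters B).
turnouts = [6_000_000 + 250_000 * i for i in range(5)]     # 6.0M .. 7.0M
fractions = [0.45 + 0.01 * j for j in range(11)]           # 45% .. 55% Democratic
models = [(N, round(f * N)) for N in turnouts for f in fractions]

log_lik = {m: log_likelihood(*m) for m in models}
max_ll = max(log_lik.values())                 # subtract for numerical stability
weights = {m: exp(ll - max_ll) for m, ll in log_lik.items()}
total = sum(weights.values())
posterior = {m: w / total for m, w in weights.items()}

# P(Democrats win the state) = posterior mass on models where B > N / 2.
p_dem_win = sum(p for (N, B), p in posterior.items() if B > N / 2)
```

A real run would use a much finer grid; the coarse one here just keeps the sketch readable.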

It's important to remember that we should only consider all the polls at once if we believe that the underlying distribution has remained relatively static across the time when the polls were taken. When we believe that the electorate's opinion has changed, we can start a new cycle using the last obtained distribution as the prior distribution for the new cycle. This could happen at some regular fixed interval, or when a major event occurs that likely had a large impact on public opinion (such as the presidential debate).
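A minimal sketch of this cycling, again with made-up poll numbers and a binomial stand-in for the hypergeometric likelihood; the posterior from one polling period simply becomes the prior for the next:

```python
from math import comb

# A coarse grid of possible Democratic vote shares, with a uniform prior.
shares = [0.45 + 0.005 * i for i in range(21)]     # 45% .. 55%
prior = {p: 1 / len(shares) for p in shares}

def update(belief, n, b):
    """One Bayesian update: a poll of size n finds b Democratic voters."""
    likelihood = {p: comb(n, b) * p**b * (1 - p)**(n - b) for p in belief}
    evidence = sum(likelihood[p] * belief[p] for p in belief)
    return {p: likelihood[p] * belief[p] / evidence for p in belief}

# Cycle 1: polling before a major event, starting from the uniform prior.
after_cycle1 = update(prior, 900, 468)
# Cycle 2: opinion may have shifted, so the old posterior becomes the new prior.
after_cycle2 = update(after_cycle1, 800, 392)
```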

Election Outcome

Of course, determining the popular vote is not the be-all and end-all of determining the winner of the election. In the U.S. presidential election, there is also the matter of the electoral college.

In order to win the presidency, a candidate must obtain at least 270 votes in the electoral college. Other models, to my knowledge, typically rely on simulations in order to determine the likelihood of each candidate winning the election.

This is another benefit of the Bayesian approach. Since it spits out a full distribution, that is, how likely one candidate is to get 50.1% of the vote in a state, versus 50.2%, and so on, we can directly calculate the probability that a candidate wins each state simply by adding up the posterior probabilities of each model in which they win.

For example, say you flip three coins, two nickels and a dime, and you want to know the probability that at least 10 cents' worth of coins came up heads. You can simply ask which combinations of coins lead to that outcome. Writing the two nickels first and the dime last, there are 5 such possibilities:

  • HHH
  • HHT
  • HTH
  • THH
  • TTH

Each of these outcomes has a probability of $0.5 \times 0.5 \times 0.5 = 0.125$. Adding these up, we find that the probability of at least 10 cents coming up heads is 62.5%.
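The same enumeration can be checked mechanically; this short Python snippet brute-forces all eight equally likely outcomes:

```python
from itertools import product

# Coins: two nickels and a dime, with values in cents.
coins = [5, 5, 10]

favorable = 0
for flips in product([True, False], repeat=3):   # True = heads
    heads_value = sum(v for v, h in zip(coins, flips) if h)
    if heads_value >= 10:                        # at least 10 cents showing heads
        favorable += 1

probability = favorable / 2**3                   # each outcome has probability 1/8
print(probability)                               # 0.625
```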

In essence we will be doing exactly this, but instead of coins we have states and instead of 10 cents we are hoping to add up to 270 electoral college votes.
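Sketching that computation in code: the electoral vote counts below match the 2024 allocation, but the per-state win probabilities and the 226-vote block of "safe" states are made-up placeholders for illustration:

```python
# Hypothetical per-state win probabilities for one candidate.
states = [
    ("Safe states", 226, 1.00),
    ("Pennsylvania", 19, 0.55),
    ("Georgia", 16, 0.45),
    ("Michigan", 15, 0.58),
    ("Arizona", 11, 0.40),
    ("Wisconsin", 10, 0.57),
    ("Nevada", 6, 0.52),
]

# dist[v] = probability of holding exactly v electoral votes so far.
dist = {0: 1.0}
for _, votes, p_win in states:
    new = {}
    for v, p in dist.items():
        new[v + votes] = new.get(v + votes, 0.0) + p * p_win   # win this state
        new[v] = new.get(v, 0.0) + p * (1 - p_win)             # lose this state
    dist = new

# Exact win probability: total mass on outcomes with at least 270 votes.
p_presidency = sum(p for v, p in dist.items() if v >= 270)
```

Because the states are assumed independent, this exhaustive convolution gives the exact answer that a simulation would only approximate.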

The results of this are laid out in the main election page.

Motivation and Differences

Recently, polls have been taking a very rapid turn away from Kamala Harris and toward Donald Trump. At the time of writing, the 538 forecast has Donald Trump listed as the favourite to win the election, with a 53% chance of victory.

Trump overtakes Harris in the 538 election forecast.

Now, 538 does great work to be sure, but I spot a few things that give me pause about their overall approach, as well as that of other popular election forecasters:

  • At the moment, they and other aggregators have been overrun by low-quality, right-wing partisan pollsters, which skew the numbers toward Republicans.
  • Even absent this, the most popular models appear to combine techniques from signal processing, such as IIR filters or simple rolling averages, with more subjective adjustments, like weighting pollsters according to their perceived trustworthiness, or nudging the averages one way or another to account for shifts in public opinion after notable events like the debate.
  • While I know that the popular vote is somewhat of a moving target, as more and more polls are taken, uncertainty about the thing being measured should decrease. This is especially true in a period like the stretch since the debate, in which nothing of particular public note has happened that would sway the vote one way or the other.

My approach takes two main steps differently in order to address these.

  • Rather than trying to account for low quality or high bias by down-weighting a poll relative to higher-quality ones, low-quality polls should simply be ignored. There is a plethora of polls to choose from in swing states, so if a pollster can't be trusted to remain unbiased, their results should be discarded entirely. For this project, I am using 538's pollster ratings, and I have cut off anyone who falls below 2.3 in their 3-star grading.
  • As shown, by using Bayesian statistics we can overcome the limitations of the more common frequentist statistics, and directly use well established mathematical foundations to build a projection of the election outcome.

In Summary

Ultimately, only time will tell how well this method holds up. As we can see from a simulated example, if the polls are representative and fair, then it performs extremely well. However, polls do often have some average bias one way or the other, and in an election as tight as this one, even a 1% bias in either direction can sway the entire election.

I realize that I don't have the name or brand recognition that some of the more popular election forecasters have. In lieu of that, I have elected to be as transparent about my model as I can be. I encourage you to check my work, either by reading the methodology above or by checking out the git repo where I keep the code I wrote to perform these calculations (apologies in advance, the code is a little messy as I've put this project together in the last 4 days or so).

In the end, whether the model performs well or not, I will continue trying to update and refine this methodology to squeeze the most out of it that I can. Until then, check out the main project page to see my predictions, and I'll see everyone on November 6th.