Better Canadian poll aggregation
On Tuesday, I wrote about how Canadian poll aggregators suck — in particular, pointing out the common ways that their methodologies fail. At the end of the post, I said that we could do better; here are the details of how.

The fundamental realization is that the goal is not to compute a polling average; rather, it is to use the available data to compute a best estimate of voting intentions. Using noisy data to compute an estimate of a true value? This sounds like a problem for regression analysis! So let's start writing down the things which we know are approximately true:
- Published opinion polls are the average of the support levels observed by the pollster in question across the days when the poll was conducted, to within the level of rounding used by the pollster. For example, Abacus Data conducted a poll between October 8th and October 10th for which they reported 32% support for the Liberal Party of Canada (hereafter "LPC" for succinctness); this tells us that the average of "Abacus Data LPC 2019-10-08", "Abacus Data LPC 2019-10-09", and "Abacus Data LPC 2019-10-10" is somewhere between 31.5 and 32.5. One equation with three unknowns; not very useful by itself, but it's a start. (A code sketch showing how these equations are assembled appears after this list.)

  For ease of manipulation, I treat "somewhere between 31.5 and 32.5" as "32 with a standard deviation of 1/sqrt(12)"; this replaces a uniform likelihood with a bell curve, which is technically wrong but for practical purposes works just fine. (The value 1/sqrt(12) is the standard deviation of the uniform distribution in question.)

  For pollsters who conduct "rolling" polls (Nanos and Mainstreet), we apply the same method — but unlike with other pollsters, this provides us with multiple equations involving the same days of polling.
- The observed level of support by a pollster for a particular party on a particular day is equal to the actual level of support for that party on that day, plus the pollster's house effect for that party, to within a margin of error determined by the party's level of support, the number of people polled, and the pollster's non-sampling noise. I make two assumptions here: pollster house effects are constant (i.e., pollsters don't dramatically change their methodologies), and multi-day polls query the same number of people per day.

  For the aforementioned case of the Abacus Data October 8-10 poll of 3000 people, this tells me that "Abacus Data LPC 2019-10-08" is approximately equal to "LPC 2019-10-08" + "Abacus Data LPC", plus or minus an error based on how accurately Abacus Data can determine support for a party with ~32% support by polling 1000 people per day. (Why does it matter that they had approximately 32% support? Because margins of error shrink as you move away from 50% support: the standard deviation of a measurement of something with probability p is sqrt(p * (1-p) / N).)
- The true level of support for a particular party on a particular day is approximately the same as the support for that party on the previous day. Obviously not absolutely constant — if nobody ever changed their opinions, politics would be very boring! But that's why we have a margin of error on this equality; and here I make an editorial judgment call, based on recent Canadian political history, about how much public support moves from one day to the next: for each day, I assign a minimum daily standard deviation of 0.05%, but sometimes considerably more depending on how much is going on at the time.

  What do I mean by "how much is going on"? Canadians tend to pay more attention to politics — and be more likely to change their intended votes — during election campaigns, but there are also other times when large shifts happen. For example, in February and March of this year, when it became clear that the Prime Minister had pressured the Attorney General to negotiate a deferred prosecution agreement with SNC-Lavalin (and fired her when she did not comply), politics was likewise at the forefront of many Canadians' minds, and voting intention changed more rapidly during this period than in the surrounding months.

  So how do I measure this? I rely on what I have: polls. When news organizations think that voting intentions are likely to be changing rapidly, they commission more polls. Consequently, I count how many polling companies are "in the field" within 3 days of each date (this window allows time for polls to be commissioned after news breaks), and if there are N pollsters in the field, I use a standard deviation of (N + 1) * 0.05% for the daily change in each party's support.

  It's a judgment call, but it seems to work well; larger or smaller margins of error would simply make the graphs more or less noisy.
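To make this concrete, here's a rough sketch in Python of how all three kinds of approximate equations could be encoded as rows of a weighted linear system. This is purely illustrative rather than my actual implementation, and names like `System` and `model` are invented:

```python
# A sketch of how the three kinds of approximate equations become rows of a
# weighted linear system A x = b.  The unknowns are one "true support" value
# per (party, day), one "observed support" value per (pollster, party, day),
# and one house effect per (pollster, party).
import math

ROUND_SD = 1 / math.sqrt(12)  # stddev of a uniform distribution of width 1

class System:
    def __init__(self):
        self.index = {}   # unknown name -> column number
        self.rows = []    # list of ({column: coefficient}, right-hand side)

    def col(self, name):
        return self.index.setdefault(name, len(self.index))

    def equation(self, coeffs, rhs, sd):
        # Divide each equation through by its standard deviation, so that
        # least squares trusts precise equations more than vague ones.
        self.rows.append(({self.col(k): v / sd for k, v in coeffs.items()},
                          rhs / sd))

model = System()

# Type 1: a published number is the average of the per-day observations,
# to within rounding (Abacus Data, LPC, October 8-10, reported 32%).
days = ["2019-10-08", "2019-10-09", "2019-10-10"]
model.equation({f"Abacus LPC {d}": 1 / len(days) for d in days}, 32.0, ROUND_SD)

# Type 2: per-day observation = true support + house effect, to within the
# sampling error for ~32% support and ~1000 respondents per day.
sampling_sd = 100 * math.sqrt(0.32 * (1 - 0.32) / 1000)  # percentage points
for d in days:
    model.equation({f"Abacus LPC {d}": 1, f"LPC {d}": -1, "Abacus LPC house": -1},
                   0.0, sampling_sd)

# Type 3: true support moves little from day to day, with the allowed daily
# movement growing with the number of pollsters in the field (here N = 5).
n_in_field = 5
model.equation({"LPC 2019-10-09": 1, "LPC 2019-10-08": -1},
               0.0, (n_in_field + 1) * 0.05)
```

Dividing every row by its standard deviation means that an ordinary least-squares fit automatically weights each equation by how much we trust it.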
Using my current corpus of polls — taken from Wikipedia's list of Canadian Federal opinion polls since the 2015 election — this gives me 24208 (approximate) linear equations in 21687 unknowns. This is obviously impractical to try to solve... just kidding! Computers are fast. It takes under a minute to compute a best-fit solution to these, and it would be even faster if I spent a few hours rewriting my solver to take full advantage of the sparsity of the system.
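For illustration, a system built as in the sketch above can be handed to an off-the-shelf sparse least-squares routine; something along these lines (again, a sketch using scipy rather than my own solver):

```python
# Hypothetical continuation of the sketch above: pack the rows into a sparse
# matrix and hand the whole thing to a sparse least-squares solver.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def solve(system):
    a = lil_matrix((len(system.rows), len(system.index)))
    b = np.zeros(len(system.rows))
    for i, (row, rhs) in enumerate(system.rows):
        for j, coeff in row.items():
            a[i, j] = coeff
        b[i] = rhs
    x = lsqr(a.tocsr(), b)[0]
    return {name: x[j] for name, j in system.index.items()}

estimates = solve(model)  # e.g. estimates["LPC 2019-10-09"]
```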
There are two more things we need to do. First, I mentioned above that I compute polling margins of error based on the party's level of support, the number of people polled, and the pollster's non-sampling noise. We need to compute that noise — or as I refer to it, "excess variance". To do this, I take the regression output and feed it back to compute — including house effects — the expected polling results for each poll in my database. Then I calculate how far off the polls were, and compare that against the expected margins of error from a theoretical pollster with perfect random sampling. I average these over all the polls conducted by a firm to produce a "pollster excess variance" value; taking a middle-of-the-pack pollster as an example, Leger Marketing has a +/- 2.5% error on top of the unavoidable statistical errors. These values computed from one run get fed back for use in the next run; fortunately this converges very quickly, so even starting without any advance knowledge of how accurate pollsters are, I get consistent results after a few runs.
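As a toy illustration of the "how far off, beyond sampling error" computation for a single pollster (the function and all the numbers here are invented, not from my pipeline):

```python
# Toy version of the excess-variance calculation for one pollster: how much
# worse were its misses than perfect random sampling can explain?
import numpy as np

def excess_variance(observed, predicted, n_respondents):
    """observed/predicted support in percent; returns variance as a fraction."""
    obs = np.asarray(observed) / 100.0
    pred = np.asarray(predicted) / 100.0
    sampling_var = pred * (1 - pred) / np.asarray(n_respondents)
    return max(0.0, float(np.mean((obs - pred) ** 2 - sampling_var)))

# Made-up polls: published numbers, model predictions, and sample sizes.
ev = excess_variance([32, 29, 35], [31.0, 30.5, 32.5], [1000, 1000, 1500])
print(f"excess error: +/- {100 * ev ** 0.5:.1f} points")
```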
Finally, we need to decide on a polling consensus. Here we run into another judgment call: taking the famous Shy Tory effect as an example, it's possible that every pollster is reporting biased values due to uncooperative poll respondents. The best we can do is to hope that true voting intentions fall somewhere between what the different pollsters would measure; so I compute a "consensus" such that a weighted average of pollsters' house effects is zero. Those weights? Again a judgment call, but I weight pollsters according to the inverse of their average polling margin of error (including the aforementioned "excess" variance) — in short, I assume that polling firms which are more precise are also generally more accurate. (See Wikipedia for an explanation of these terms.) Note, however, that the choice of consensus shifts all of the estimates uniformly; it does not change the shape of how each party's support moves over time.
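One detail worth spelling out: the equations above only ever constrain the sum of true support and house effect, so for any party, adding a constant to every daily support value and subtracting it from every house effect fits the data exactly as well. Choosing the consensus just picks one point in that family by shifting the solution; roughly like this, with made-up values throughout:

```python
# Sketch of fixing the consensus: shift true support up and house effects
# down by the same amount until the weighted average of house effects is
# zero.  The pollster names, house effects, and weights are all invented.
def apply_consensus(support_by_day, house_effects, weights):
    shift = (sum(weights[p] * house_effects[p] for p in house_effects)
             / sum(weights.values()))
    return ({d: v + shift for d, v in support_by_day.items()},
            {p: h - shift for p, h in house_effects.items()})

support, houses = apply_consensus(
    {"2019-10-08": 31.0, "2019-10-09": 31.2},
    {"Abacus": 0.8, "Leger": -0.4},
    {"Abacus": 1 / 2.1, "Leger": 1 / 2.5},  # 1 / (average margin of error)
)
```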
So where are we at right now? My latest run tells me that, as of October 15th, the Conservative Party of Canada is leading with 31.82% of the vote; the Liberal Party of Canada is in second place with 30.50%; the New Democratic Party has 18.65% support; the Green Party has 8.26%; the Bloc Quebecois is at 7.37% (nationally — they only run candidates in Quebec, so this translates to roughly 30% in Quebec); and the People's Party of Canada trails with 2.73%. This data and more is now available on a separate page, which I will keep updated between now and the election — and, depending on public interest, may continue updating afterwards.