Disclaimer: This is an apolitical post, aimed at explaining the math behind the sample count performed during the Singapore General Elections. No political parties will be mentioned, and the names of electoral divisions (e.g. SMCs/GRCs) are mentioned only to illustrate the math with actual numbers. Also, this is our own independent analysis of the math behind the sample count. We do not work at the Elections Department Singapore.

What is a sample count?

Implemented in the Singapore General Elections (GE), a sample count is performed at the start of the vote counting process. Briefly, for each electoral division, a small number of votes is sampled from each polling station to provide a statistical estimate of the eventual election result. The sample count helps prevent speculation and misinformation from unofficial sources while counting is underway, and it also serves as a check against the final election results.

In detail, 100 votes are randomly drawn from each polling station and the proportion of votes for each candidate (or group of candidates) is counted. Within each electoral division, the vote proportions are then averaged across polling stations, weighted by the number of votes cast at each polling station.
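The weighting step can be sketched in a few lines of Python. This is only our illustration of a vote-weighted average, not the Elections Department's actual procedure; the function name `sample_count` and the three polling stations below are hypothetical.

```python
SAMPLE_SIZE = 100  # ballots sampled per polling station, as described above


def sample_count(stations):
    """Vote-weighted average of the per-station sample proportions.

    stations: list of (votes_cast, sampled_votes_for_candidate) pairs,
    where the second number is out of SAMPLE_SIZE sampled ballots.
    """
    total_votes = sum(votes for votes, _ in stations)
    weighted_sum = sum(votes * (hits / SAMPLE_SIZE) for votes, hits in stations)
    return weighted_sum / total_votes


# Hypothetical division with three polling stations (made-up numbers).
stations = [(2000, 55), (3000, 60), (2500, 48)]
estimate = sample_count(stations)
unweighted = sum(hits / SAMPLE_SIZE for _, hits in stations) / len(stations)
print(f"weighted: {estimate:.4f}, unweighted: {unweighted:.4f}")
```

In this made-up example the weighted estimate sits slightly above the plain average because the largest station also has the highest sample proportion.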

Figure 1: Infographic illustrating the sample count process. Votes are sampled from each polling station and a weighted average of the samples (accounting for number of votes) is then calculated to derive the sample count. Image credit: Elections Department Singapore.

When the sample count was first made public in the 2015 GE, it was often quoted as having an error margin of ±4% (with 95% confidence). This means that, for about 95% of the estimates made, the sample count should differ from the actual vote proportion by no more than 4 percentage points. Here, we are going to explore how this number comes about.

Math behind sample count

First, let us restate the problem. The votes drawn for the sample count are essentially samples drawn from a population (i.e. all votes) and we are trying to use the sample proportion to estimate the true population proportion. As with all sample estimates, there is some degree of uncertainty involved. This uncertainty is often represented in the form of a confidence interval, a range of values that the true proportion is likely to lie in. The confidence interval also has an associated confidence level, often set to 95%. This means that if the sampling were repeated many times, about 95% of the resulting confidence intervals would contain the true proportion.

Next, we can derive the confidence interval from the standard deviation (SD) of the estimate. If we assume that the sample proportion follows an approximately normal (“bell curve”) distribution, which is reasonable for samples of this size, the two-sided 95% confidence interval is the estimate ±1.96 SD. So how do we find the SD of the sample count proportion?

Statistical approach to calculate SD

Fortunately, the standard deviation (SD) of the sample proportion is given by this simple formula:

\[\begin{equation*} \sigma_{p} = \sqrt{\dfrac{p(1-p)}{n}} \end{equation*}\]

where \(p\) is the true population proportion and \(n\) is the sample size, in this case the number of sample count votes. There are two points to note here. First, the SD is the largest when \(p\) is 0.5 due to the \(p(1-p)\) term in the numerator. Second and more importantly, the SD is large when the sample size \(n\) is small, which makes sense given that a smaller sample size introduces more uncertainty.
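Both points are easy to check numerically. The snippet below is a minimal sketch of the formula above; the helper name `sd_proportion` is ours.

```python
import math


def sd_proportion(p, n):
    """Standard deviation of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


# Point 1: for a fixed n, the SD peaks at p = 0.5.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p:.1f}, n=600  -> SD={sd_proportion(p, 600):.4f}")

# Point 2: for a fixed p, the SD shrinks as n grows.
# Quadrupling n halves the SD.
for n in (600, 2400, 9600):
    print(f"p=0.5, n={n:<5} -> SD={sd_proportion(0.5, n):.4f}")
```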

In the context of the GE, this implies that the smallest SMC, in terms of the number of votes cast, would have the largest error margin (i.e. the widest confidence interval) in its sample count. This is because it is likely to have the fewest polling stations and thus the fewest sample count votes included in the estimate.

In the 2015 GE, Potong Pasir SMC had the smallest number of votes cast, at 17,407 votes. Also, we estimate that there was one polling station for roughly every 2,600 voters, given that there were 2,304,331 votes cast and 880 polling stations. Thus, we estimate Potong Pasir SMC to have had about 6 to 7 polling stations (17,407 / 2,600 ≈ 6.7). Assuming that there were only six polling stations, \(n\) would be 600 since 100 votes are sampled randomly from each polling station. Plugging \(n=600\) and \(p=0.5\) into the formula above, the half-width of the 95% confidence interval (i.e. the error margin) would be:

\[\begin{equation*} 1.96 \times \sqrt{\dfrac{0.5 \times (1-0.5)}{600}} = 0.0400 \end{equation*}\]

which is exactly the 4% error margin reported! At this point, it is important to realise that we have calculated the worst possible error margin. In contrast, larger electoral divisions, e.g. the GRCs, have many more polling stations and thus more sample count votes, which reduces the error margin considerably. For instance, the largest GRC, Ang Mo Kio GRC, had 187,771 votes cast, corresponding to about 72 polling stations, so \(n\) goes up to 7,200. The corresponding error margin then drops to 1.15%. This explains why, in general, the actual voting proportions are well within ±4% of the sample count. Here, we have calculated the error margin of the sample count for all the electoral divisions in the 2015 GE.
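The two calculations above can be reproduced with a short script. It is a sketch under the same assumptions we made in the text: roughly 2,600 voters per polling station, the estimated station count rounded down, 100 sampled votes per station and the worst case \(p=0.5\). The function name `error_margin` is ours.

```python
import math


def error_margin(votes_cast, voters_per_station, sample_per_station=100,
                 p=0.5, z=1.96):
    """Return (estimated stations, 95% error margin) for one division.

    Assumption: the station count is estimated from the votes cast and
    rounded down, which matches the "six polling stations" choice above.
    """
    stations = int(votes_cast // voters_per_station)
    n = stations * sample_per_station
    return stations, z * math.sqrt(p * (1 - p) / n)


potong_pasir = error_margin(17_407, 2_600)
ang_mo_kio = error_margin(187_771, 2_600)

for name, (stations, margin) in [("Potong Pasir SMC", potong_pasir),
                                 ("Ang Mo Kio GRC", ang_mo_kio)]:
    print(f"{name}: ~{stations} stations, error margin = {margin:.2%}")
```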

Figure 2: Sample count and actual election results for the 2015 GE. The 95 percent CI (i.e. error margin) of each sample count is given in red while the actual results are shown as black dots. Note that we have used the sample count proportion for \(p\) instead of the worst case scenario \(p\) of 0.5. The actual results fall within the error margin for every electoral division except Jurong GRC. An occasional miss like this is expected, since the intervals only have 95 percent confidence.
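What a 95 percent confidence level means in practice can be checked with a small simulation. The sketch below repeatedly draws \(n=600\) votes from a hypothetical population with a true proportion of 0.6 and counts how often the ±1.96 SD interval (using the sample proportion for \(p\), as in the figure) contains the truth. It treats the votes as one simple random sample and ignores the per-station weighting, which is a simplification; the function name `coverage` and all parameter values are ours.

```python
import math
import random


def coverage(p_true, n, trials, seed=0, z=1.96):
    """Fraction of simulated 95% intervals that contain p_true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw n votes; each one is for the candidate with probability p_true.
        votes_for = sum(rng.random() < p_true for _ in range(n))
        p_hat = votes_for / n
        half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if abs(p_hat - p_true) <= half_width:
            hits += 1
    return hits / trials


rate = coverage(p_true=0.6, n=600, trials=2000)
print(f"Intervals containing the true proportion: {rate:.1%}")
```

The printed rate should land close to 95%, so roughly one interval in twenty misses purely by chance.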

Next, we would like to delve deeper into the difference between the actual election results and the sample count, namely the error in the estimates, to uncover any potential bias in the sample count. To this end, we calculated the error as the actual vote proportion minus the sample proportion (true minus predicted). Thus, a positive error implies that the sample count underestimated the vote proportion, since the actual proportion is greater than the sample proportion. Also, to account for the different uncertainties in the sample count estimates (for example, the SMCs having a higher error margin), we divided the error by the standard deviation (SD) of the proportion, which was explained earlier. In the 2015 GE, for all electoral divisions except Jurong GRC, the estimates do not differ from the actual results by more than 1.96 times the SD (corresponding to the 95% confidence interval), in agreement with what we observed in the previous plot. Also, these scaled errors approximately follow a normal (“bell curve”) distribution, which is what we expected.
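As a sketch, the scaled error described above can be computed as follows. The function name `scaled_error` and the two example divisions are hypothetical and are not actual election figures.

```python
import math


def scaled_error(actual, sample, n):
    """(actual - sample proportion) divided by the SD of the sample proportion.

    The SD uses the sample proportion for p, as in our plots.
    """
    sd = math.sqrt(sample * (1 - sample) / n)
    return (actual - sample) / sd


# Hypothetical small division: 6 stations -> n = 600 sampled votes.
z_small = scaled_error(actual=0.56, sample=0.55, n=600)

# Hypothetical large division: 72 stations -> n = 7,200 sampled votes.
z_large = scaled_error(actual=0.68, sample=0.70, n=7200)

for label, z in [("small", z_small), ("large", z_large)]:
    verdict = "outside" if abs(z) > 1.96 else "within"
    print(f"{label} division: scaled error = {z:+.2f} ({verdict} ±1.96 SD)")
```

In this made-up example, the larger division has a raw error of only two percentage points, yet it falls outside ±1.96 SD because its SD is much smaller, which is why we scale the errors before comparing divisions.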

Figure 3: Distribution of scaled error in the sample count estimates for 2015 GE. Here, we defined the scaled error to be the (actual - sample count proportions) divided by the SD. Overall, the scaled errors follow a normal distribution (black line).

Finally, for the numerically inclined, the actual numbers for 2015 GE are presented below:

What about 2020 GE?

Armed with this knowledge, we can calculate the new error margin of the sample count for the 2020 GE. Again, the smallest SMC (in terms of the number of voters) is Potong Pasir SMC with 19,740 voters. However, due to COVID-19, the number of polling stations was increased from 880 to 1,100. Thus, we now have an average of only about 2,400 voters per polling station (total electorate of 2,653,942 voters), and we estimate Potong Pasir SMC to have about eight polling stations. Putting all the numbers together, we have an error margin of ±3.5% for Potong Pasir SMC. At the other end of the spectrum, the largest electoral division, Ang Mo Kio GRC, has 185,465 voters and approximately 77 polling stations, which brings the error margin down to ±1.1%.
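Spelling out the arithmetic with the same worst case \(p=0.5\) as before: eight polling stations give \(n=800\) for Potong Pasir SMC, and 77 polling stations give \(n=7{,}700\) for Ang Mo Kio GRC, so

\[\begin{equation*} 1.96 \times \sqrt{\dfrac{0.5 \times (1-0.5)}{800}} = 0.0346 \end{equation*}\]

\[\begin{equation*} 1.96 \times \sqrt{\dfrac{0.5 \times (1-0.5)}{7700}} = 0.0112 \end{equation*}\]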

Figure 4: Sample count and actual election results for the 2020 GE. The 95 percent CI (i.e. error margin) of each sample count is given in red while the actual results are shown as black dots. Interestingly, the actual election results fall outside the error margin for quite a number of electoral divisions. This warrants further investigation, see next figure.

Despite the supposedly lower error margins (due to the increase in the number of polling stations), we observed that eight out of the 31 electoral divisions in the 2020 GE have their actual vote proportions lying outside of the sample count error margin (outside of ±1.96 SD). To investigate this anomaly, we plotted the scaled errors, given by (actual - sample count proportions) divided by the SD. Out of the eight “outlying” electoral divisions, six had negative scaled errors, indicating that the sample count overestimated the majority vote proportion. Notably, of the remaining two “outlying” electoral divisions with positive scaled errors, one was won by the opposition party. Overall, this suggests that the sample count estimates may be somewhat biased towards the ruling party, i.e. they tend to overestimate the ruling party’s vote proportion. One possible explanation is that the sample count process was not entirely random. Due to COVID-19, the Elections Department Singapore prescribed timebands to stagger the flow of voters into polling stations. This staggering could have reduced the randomness of the order in which votes were cast and possibly affected the randomness of the sample count.
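To gauge how unusual eight misses out of 31 would be, we can ask how often that happens by chance alone. The sketch below assumes each division independently has a 5% chance of falling outside its 95% interval, so the number of misses follows a binomial distribution. Both assumptions are simplifications on our part, and the helper name `prob_at_least` is ours.

```python
from math import comb


def prob_at_least(k, n, q):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k, n + 1))


DIVISIONS = 31     # electoral divisions in the 2020 GE
MISS_RATE = 0.05   # chance of a miss per division at 95% confidence

expected_misses = DIVISIONS * MISS_RATE
p_tail = prob_at_least(8, DIVISIONS, MISS_RATE)

print(f"Expected misses by chance: {expected_misses:.2f}")
print(f"P(at least 8 misses out of 31): {p_tail:.2e}")
```

Under these assumptions we would expect only one or two misses, and the chance of seeing eight or more is well below 0.1%, so the deviations are unlikely to be explained by random sampling error alone.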

Figure 5: Distribution of scaled error in the sample count estimates for 2020 GE. Here, we defined the scaled error to be the (actual - sample count proportions) divided by the SD. Notably, the scaled errors are skewed towards negative values and do not follow a normal distribution (black line).

For the numerically inclined, the actual numbers for 2020 GE are presented below:

Concluding Remarks

In conclusion, we hope this post has brought about a better understanding of the sample count process and the error margin involved in the estimates. It is also remarkable that we are able to make rather accurate estimates of the population proportion with a relatively small number of samples.