2.1 Plausible estimation
How much personal protective equipment (PPE) will be needed by health workers in your country over the course of 1 month?
If every citizen uses a disposable mask every day, what will be the weight and volume of the waste each week?
Suppose you have an infinite supply of an effective vaccine. How long would it take to inoculate everyone in your country?
In each case, one needs definitions and “facts” as starting points. Who is to be counted as a “health worker,” and how many people does this include? How many masks and pairs of gloves does one person need each day? How many people are available who are competent to give injections? Start with strong simplifying assumptions: for example, in the vaccine question, assume that everyone will be able to attend for vaccination, and that vaccination centers will function perfectly, 24 hours a day. Then add more realistic assumptions about personal mobility, geography, and the availability of competent staff. In 2001, Swan and Ridgway [14] created detailed lesson plans to support teaching about (and assessment of) plausible estimation.
2.2 Exponential growth
Here is a deal for you: on New Year’s Eve, I’m going to give you £1 billion! Yes, really! All I want back is £1 on January 1st, £2 on January 2nd, £4 on January 3rd, £8 on January 4th… until the end of the month. Waddaya say?
Time to reach for a spreadsheet… Students can be asked to create a table and graph displaying deterministic exponential growth over days for different growth factors. On screen, it is a simple matter to create a display where the user can change the growth factor and see the effects in both the table and the graph.
Table 1 shows deterministic exponential growth for different growth factors. The column headed £, where the payment doubles every day, shows that after 31 days of daily repayments the person offering the deal will be more than £1 billion ahead (the column total, £2 147 483 647, minus the original payment). So the response to the offer should be: I’ve got a better idea, let ME give YOU the billion…
| Day | Daily payment (£) | Daily cases: R_0 = 5 | Daily cases: R_0 = 1 | Daily cases: R_0 = 0.9 |
|---|---|---|---|---|
| 1 | 1 | 10 | 10 | 10 |
| 2 | 2 | 50 | 10 | 9 |
| 3 | 4 | 250 | 10 | 8 |
| 4 | 8 | 1250 | 10 | 7 |
| 5 | 16 | 6250 | 10 | 7 |
| 6 | 32 | 31 250 | 10 | 6 |
| 7 | 64 | 156 250 | 10 | 5 |
| 8 | 128 | 781 250 | 10 | 5 |
| 9 | 256 | 3 906 250 | 10 | 4 |
| 10 | 512 | 19 531 250 | 10 | 4 |
| 11 | 1024 | 97 656 250 | 10 | 3 |
| 12 | 2048 | 488 281 250 | 10 | 3 |
| 13 | 4096 | 2 441 406 250 | 10 | 3 |
| 14 | 8192 | 12 207 031 250 | 10 | 3 |
| 15 | 16 384 | 61 035 156 250 | 10 | 2 |
| 16 | 32 768 | | 10 | 2 |
| 17 | 65 536 | | 10 | 2 |
| 18 | 131 072 | | 10 | 2 |
| 19 | 262 144 | | 10 | 2 |
| 20 | 524 288 | | 10 | 1 |
| 21 | 1 048 576 | | 10 | 1 |
| 22 | 2 097 152 | | 10 | 1 |
| 23 | 4 194 304 | | 10 | 1 |
| 24 | 8 388 608 | | 10 | 1 |
| 25 | 16 777 216 | | 10 | 1 |
| 26 | 33 554 432 | | 10 | 1 |
| 27 | 67 108 864 | | 10 | 1 |
| 28 | 134 217 728 | | 10 | 1 |
| 29 | 268 435 456 | | 10 | 1 |
| 30 | 536 870 912 | | 10 | 0 |
| 31 | 1 073 741 824 | | 10 | 0 |
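Table 1 is easy to regenerate with a few lines of code, which also makes the "change the growth factor and watch what happens" exercise interactive. The sketch below (Python; the function and variable names are illustrative) rebuilds the four columns and confirms the column total of £2 147 483 647.

```python
# Sketch: regenerate the columns of Table 1 (deterministic growth).
def growth_column(start, factor, days):
    """Daily values when each day's value is the previous day's times `factor`."""
    values = [start]
    for _ in range(days - 1):
        values.append(values[-1] * factor)
    return values

days = 31
payments = growth_column(1, 2, days)        # £ column: the payment doubles daily
cases_r5 = growth_column(10, 5, days)       # R_0 = 5: cases multiply by 5 daily
cases_r1 = [10] * days                      # R_0 = 1: new cases stay constant
cases_r09 = [round(10 * 0.9 ** d) for d in range(days)]  # R_0 = 0.9, rounded

print(sum(payments))   # total repaid over 31 days: 2147483647 (= 2**31 - 1)
print(cases_r5[13])    # day 14: 12207031250, more than the world population
```

Changing the `factor` argument and re-running plays the same role as the on-screen slider described above.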
2.2.1 On to Covid
The £ column corresponds to a reproduction rate (R_0) of exactly 2; that is, each infected person infects 2 more people the next day. The next three columns in Table 1 introduce ideas of disease spread, under simplistic deterministic assumptions. Starting with 10 people, if there is perfect transmission and every infected person infects exactly 5 others (R_0 = 5) by the next day, day 14 would see more than 12 billion newly infected people; that is, the world population (about 8 billion people) would catch the disease within 14 days. There are some big Ifs here that will be explored later. If each infected person infects exactly one other person (R_0 = 1), then the total number of infected people increases linearly by 10 each day; if the infection rate is less than 1 (R_0 = 0.9), the number of new cases falls below 0.5 on day 30 (0.47) and rapidly fades away.
Apart from the absurdity of considering fractions of people, such a model for the spread of infection might appear too simplistic to be useful, but in the case of Covid it has its uses: the results above are a good approximation on average, with R_0 the average number of people infected by each infected person. A general model for epidemics must take into account many factors, including the virulence of the virus; the nature of contacts between people in terms of spatial distribution and frequency; the circumstances of meeting (indoors or outdoors, wearing or not wearing masks, etc.); and other chance factors, including “virus load”, individual immunity, and the natural variation of infection spread by droplets or aerosol. Models of infection usually need to consider the number of infected people, the number of those susceptible to infection, and the chance that a contact between an infected person and a non-infected person produces a new infection. However, in the case of Covid, everyone was susceptible and the virus is easily transmitted, so the simplest model for an epidemic provides some answers. This model assumes an unlimited number of susceptible people, with R_0 the average number infected by each infected person. Although the same R_0 can arise from different assignments of probabilities over the number of people infected by each infected person, it is certain that the epidemic will die out if R_0 < 1, and grow indefinitely if R_0 > 1. How quickly either happens depends on both the value of R_0 and the distribution of probabilities. Simple models of this kind are very easy for school students to explore and simulate, as explained, for example, by Helen MacGillivray [7]. Hence, we see how important it has been to try to estimate R_0, and to reduce it by good hygiene, reducing contact, and finding and applying vaccines.
Table 1 assumes that the number of newly infected people each day increases by a factor of exactly R_0. If, instead, we assume that R_0 is the mean number of new infections per day from each infected person, and consider R_0 = 0.9, but that the number of new daily infections from an infected person has a normal distribution with mean 0.9 and SD 0.3, we can examine the distribution of infected people after a certain period.
Figure 1 shows the results of 100 iterations of this process (starting with 100 cases) applied for 20 successive days. (Note that the values simulated from the normal have been rounded to whole numbers).
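A simulation of this kind is within reach of school students with a little programming. The sketch below follows the text (100 starting cases, 20 days, per-person draws from a normal with mean 0.9 and SD 0.3, rounded to whole numbers); the treatment of negative draws as zero is an added assumption, and the names are illustrative.

```python
import random

def simulate(start_cases=100, days=20, mean=0.9, sd=0.3, seed=None):
    """One run: each infected person produces a rounded Normal(mean, sd)
    number of new cases the next day (negative draws treated as zero)."""
    rng = random.Random(seed)
    cases = start_cases
    for _ in range(days):
        cases = sum(max(0, round(rng.gauss(mean, sd))) for _ in range(cases))
    return cases

# 100 iterations of the 20-day process, as in Figure 1
final_counts = [simulate(seed=s) for s in range(100)]
```

Plotting a histogram of `final_counts` reproduces the kind of spread shown in Figure 1: most runs decline, but chance produces considerable variation around the deterministic prediction.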
Of course, public health measures are designed to reduce R_{0}; so estimating R_{0} is an important challenge for planning during a pandemic. Figure 2 shows estimates of R_{0} for India, over the course of a year (sampling problems associated with all aspects of Covid are discussed later). It is clear that successes in curbing the virus in the autumn of 2020 were followed by a second wave of infection in 2021 (with disastrous human consequences).
2.3 Interpreting graphs
Graphs are being used increasingly to convey complex information in the media. A picture may be worth 1000 words, but sometimes a graph needs 1000 words of explanation. Figures 3 and 4 present graphs downloaded from the Our World in Data website [9] about the spread of the disease in the United States and the United Kingdom. Each graph uses data from the same data set, but the site offers several choices about what is displayed:
- Deaths or cases
- Daily deaths/cases or cumulative deaths/cases
- Raw numbers or numbers per million of population
- A linear or log scale
Students can be asked to describe to each other what they see in each graph, and to say whether they believe the graphs are based on the same data set. Then, tell students the graphs do show data from the same data set, and pose these challenges.
- The graphs look very different: explain why.
- What type of scale is being used in each graph?
- What conclusions can you draw from each graph?

In the context of Covid:

- When should you use a linear scale, and when should you use a log scale?
- When should you use raw numbers, and when should you use events per million of population?
- When should you use deaths, and when should you use cases?
- When should you use new cases, and when should you use cumulative cases?
Linking Table 1 and Figure 3 highlights the most obvious advantage of log scales: the slope of the curve at any point gives a direct indication of the speed of spread of the disease, and changes in the slope show changes in disease acceleration (positive or negative). For instance, epidemics at an early stage can be compared with those at a later stage, even though there are big differences in actual deaths. However, log scales can be hard to understand and can be visually misleading; for example, a change of one unit on the y-axis corresponds both to a rise from 100 to 1000 cases and to a rise from 10 000 to 100 000 cases. A linear scale gives a clearer impression of the size of the epidemic.
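The point about slopes can be checked directly: for growth by a constant daily factor, the log of the case count rises by the same amount every day, so the plotted line is straight with slope log10 of the growth factor. A minimal check:

```python
import math

# Daily cases growing by a constant factor of 2 (like the £ column of Table 1)
cases = [2 ** day for day in range(10)]
log_cases = [math.log10(c) for c in cases]

# Successive differences on the log scale are constant: log10(2), about 0.301
slopes = [b - a for a, b in zip(log_cases, log_cases[1:])]
```

A steeper line on a log-scale plot therefore means a larger growth factor, regardless of how many cases have accumulated so far, which is exactly why early- and late-stage epidemics can be compared on the same axes.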
Raw numbers are essential for planning; scaling numbers by the size of the population is not useful at the start of an epidemic, but later gives some indication of the success of eradication programs in different countries, and of the load on a country’s resources. Estimates based on small samples are usually less stable than those based on larger samples; this is true of countries, too: those with small populations often show very high “cases per million” (eg, Andorra and Montenegro) or very low “cases per million” (eg, Mauritius, Fiji).
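The instability of small-population rates is easy to demonstrate by simulation. The sketch below uses illustrative numbers (not real country data) and a normal approximation to the binomial case count, an added simplification; both “countries” have the same underlying infection rate, yet the small one produces far more variable “cases per million” figures.

```python
import math
import random
import statistics

rng = random.Random(1)
rate = 0.001  # the same underlying infection probability in both countries

def simulated_rates(population, runs=1000):
    """Simulated 'cases per million' figures over repeated epidemics,
    using a normal approximation to the binomial count of cases."""
    mean = population * rate
    sd = math.sqrt(population * rate * (1 - rate))
    return [max(0.0, rng.gauss(mean, sd)) / population * 1_000_000
            for _ in range(runs)]

small = simulated_rates(80_000)       # a small country (illustrative size)
large = simulated_rates(60_000_000)   # a large country (illustrative size)

# The small country's rate varies far more, despite the identical true rate
print(statistics.stdev(small), statistics.stdev(large))
```

This is the same phenomenon that puts small countries at both extremes of the “cases per million” league tables.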
2.4 Measurement issues
The discussion of cases vs deaths leads directly to questions about measurement. Clearly, identification of cases depends on the extent of testing; if testing is sparse, the count of cases will be too low. Even with comprehensive testing, a test may fail to detect Covid in the early stages of the disease. Further, some people get the virus and have mild symptoms or no symptoms at all. For example, Pollan et al [11] conducted a survey of 61 000 randomly selected people in Spain, completed in May 2020, to determine the prevalence of Covid. About 5% of the sample tested positive; of these, about 1 in 3 were asymptomatic. So, the true number of cases is probably much higher than the reported number.
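The survey figures make a quick classroom exercise in estimation. The sample size and the rounded 5% prevalence come from the text above; the 95% confidence interval below uses the standard normal approximation, which is a textbook simplification rather than the method of the original paper.

```python
import math

n = 61_000     # sample size in the Pollan et al survey
p_hat = 0.05   # about 5% of the sample tested positive

# Normal-approximation 95% confidence interval for the prevalence
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(low, high)   # a narrow interval, thanks to the large random sample
```

With such a large random sample, the interval is narrow (roughly 4.8% to 5.2%), which is why a well-designed survey of this kind is so much more informative than counts of reported cases.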
For everyone who dies, suppose we know exactly who did, and did not, have Covid. Is this a good measure of deaths attributable to Covid?
Use the (historical) English model for recording Covid deaths. Assume a constant rate of infection, and a constant true death rate. Sketch a graph of UK recorded Covid deaths over time.
Are there other ways to estimate Covid deaths?
Figure 5 shows weekly total deaths in the United Kingdom, together with weekly total deaths averaged over the previous 5 years, and deaths attributed to a number of causes. There are obvious peaks in April both in total deaths and deaths attributed to Covid-19.
We have data on total deaths each week over several years. So, it is easy to calculate excess deaths—the number of deaths that are higher than expected. Is this a good measure of deaths attributable to Covid?
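The arithmetic of excess deaths really is simple: subtract the historical weekly average from this year's weekly total. The sketch below uses made-up illustrative numbers, not real UK data.

```python
# Sketch: excess deaths = observed weekly deaths minus the 5-year average
# for the same week. All numbers below are illustrative, not real UK data.
five_year_average = [10500, 10400, 10600, 10300]   # weekly baseline deaths
observed          = [10450, 12100, 14800, 13200]   # weekly totals this year

excess = [obs - base for obs, base in zip(observed, five_year_average)]
print(excess)        # week-by-week excess (can be negative)
print(sum(excess))   # total excess over the four weeks
```

Note that excess deaths can be negative in a given week, which is one reason the measure needs careful interpretation.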
Excess mortality data (available for different countries on the Financial Times website [4]) need accurate historical data on deaths; these data are rare in middle-income and poor countries. Excess deaths might be underestimated if, for example, influenza deaths are lower in a particular year, or if there are fewer deaths from other causes, such as road traffic accidents, or deaths attributable to air pollution, because people work from home.
Excess deaths might be overestimated if a pandemic results in increased deaths from other causes, such as resources being directed away from treating diseases such as cancers or HIV/AIDS, or if people die because they were unwilling to go to hospitals (eg, for emergency care) out of fear of contracting the disease.
So confirmed deaths associated with Covid (assuming these are not the result of a statistical anomaly!) and excess deaths reflect similar but not identical things. Covid deaths do reflect the cause of death, but probably underestimate the death toll (unless Public Health England were counting). Excess deaths give an overall impression of the effect of the pandemic, but as an estimate of deaths directly attributable to the disease they might be too high (because of, say, more heart deaths associated with unwillingness to seek treatment) or too low (because of, say, fewer deaths associated with traffic accidents). Detailed lesson plans to support teaching about (and assessment of) inventing measures have been created by Swan and Ridgway [13].
2.5 Sampling
Sampling is one of the Big Ideas in statistics. Covid demonstrates clearly that this Big Idea has not been grasped (or acted upon) by very many decision makers world-wide. For planning and action, we need to be able to estimate a number of parameters—how many people in the population are susceptible? Infected? How long will infected people (as a function of age, obesity, severity of infection, and other co-morbidities) stay in hospital? What is the case fatality rate? What proportion of people who have recovered are immune? (and for how long?). To determine these parameters, one needs to take sampling very seriously. Too much of the early work on estimation was based on opportunistic sampling. There are very big local variations in all these parameters, and parameters change over time, so careful testing needs to be an on-going process.