Covid and data science: Understanding R0 could change your life – Ridgway – 2021 – Teaching Statistics

2.1 Plausible estimation

The essence of plausible estimation is to derive an estimate that is good enough for a decision to be made when necessary, such as under time or information constraints. Interval estimates are much better than point estimates, and should be accompanied by a description of the assumptions that have been made. Covid planning has involved a great deal of plausible estimation. For example:

How much personal protective equipment (PPE) will be needed by health workers in your country over the course of 1 month?

If every citizen uses a disposable mask every day, what will be the weight and volume of the waste each week?

Suppose you have an infinite supply of an effective vaccine. How long would it take to inoculate everyone in your country?

In each case, one needs definitions and “facts” as starting points. Who is to be counted as a “health worker,” and how many people does this include? How many masks and pairs of gloves does one person need each day? How many people are available who are competent to give injections? Start with strong simplifying assumptions—for example, in the vaccine question, assume that everyone will be able to attend for vaccination, and that vaccination centers will function perfectly, 24 hours a day. Then add more realistic assumptions—about personal mobility, geography, and the availability of competent staff. In 2001, Swan and Ridgway [14] created detailed lesson plans to support teaching about (and assessment of) plausible estimation.
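The vaccine question can be worked through as a back-of-the-envelope calculation. A minimal sketch in Python, where every number is an illustrative assumption (not real data), using the idealised "centers run perfectly, around the clock" starting point:

```python
# Plausible estimation sketch: how long to vaccinate everyone?
# Every figure below is an illustrative assumption, not real data.
population = 60_000_000      # assumed country population
vaccinators = 10_000         # assumed number of competent staff
jabs_per_hour = 10           # assumed injections per vaccinator per hour
hours_per_day = 24           # idealised: centres run around the clock

daily_capacity = vaccinators * jabs_per_hour * hours_per_day
days_needed = population / daily_capacity

print(f"Daily capacity: {daily_capacity:,} doses")   # 2,400,000
print(f"Days needed:    {days_needed:.0f}")          # 25
```

Relaxing the assumptions (shorter opening hours, missed appointments, staff shortages) stretches the answer, which is exactly the point of the exercise.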

2.2 Exponential growth

Here is a deal for you—On New Year’s Eve, I’m going to give you £1 billion! Yes, really! All I want back is £1 on January 1st, £2 on January 2nd, £4 on January 3rd, £8 on January 4th… until the end of the month. Waddaya say?

Time to reach for a spreadsheet… Students can be asked to create a table and graph to display deterministic exponential growth over days for different exponents. On screen, it is a simple matter to create a display where the user can change the exponent and see the effects in both the table and the graphic.

Table 1 shows deterministic exponential growth for different exponents. The column headed £, where the payment doubles every day, shows that after 31 days of daily repayments, the person offering the deal will be more than £1 billion ahead (the column total—which is £2 147 483 647—minus the original payment). So the response to the offer should be: I’ve got a better idea—let ME give YOU the billion…
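Students with a little Python can check the £ column directly rather than build the spreadsheet by hand; a minimal sketch:

```python
# The £ column of Table 1: the payment doubles every day for 31 days.
payments = [2 ** (day - 1) for day in range(1, 32)]   # £1, £2, £4, ...

print(f"Day 31 payment: £{payments[-1]:,}")   # £1,073,741,824
print(f"Total repaid:   £{sum(payments):,}")  # £2,147,483,647
```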

Table 1. Exponential growth for different parameters

                Daily payment            Daily cases
Day                 £          R0 = 5   R0 = 1   R0 = 0.9
  1                 1              10       10         10
  2                 2              50       10          9
  3                 4             250       10          8
  4                 8           1 250       10          7
  5                16           6 250       10          7
  6                32          31 250       10          6
  7                64         156 250       10          5
  8               128         781 250       10          5
  9               256       3 906 250       10          4
 10               512      19 531 250       10          4
 11             1 024      97 656 250       10          3
 12             2 048     488 281 250       10          3
 13             4 096   2 441 406 250       10          3
 14             8 192  12 207 031 250       10          3
 15            16 384  61 035 156 250       10          2
 16            32 768                       10          2
 17            65 536                       10          2
 18           131 072                       10          2
 19           262 144                       10          2
 20           524 288                       10          1
 21         1 048 576                       10          1
 22         2 097 152                       10          1
 23         4 194 304                       10          1
 24         8 388 608                       10          1
 25        16 777 216                       10          1
 26        33 554 432                       10          1
 27        67 108 864                       10          1
 28       134 217 728                       10          1
 29       268 435 456                       10          1
 30       536 870 912                       10          0
 31     1 073 741 824                       10          0

2.2.1 On to Covid

The £ column corresponds to a reproduction rate (R0) of exactly 2; that is, each infected person infects 2 more people the next day. The next three columns of Table 1 introduce ideas of disease spread, under simplistic deterministic assumptions. Starting with 10 people, if there is perfect transmission and every infected person infects exactly 5 others (R0 = 5) by the next day, day 14 would see more than 12 billion newly infected people; that is, the world population (about 8 billion people) would catch the disease within 14 days. There are some big Ifs here that will be explored later. If each infected person infects exactly one other person (R0 = 1), then the total number of infected people increases linearly by 10 each day; if the exact infection rate is less than 1 (R0 = 0.9), the number of new cases falls below 0.5 (0.47) on day 30 and continues to fade away rapidly.

Apart from the absurdity of considering fractions of people, such a model for the spread of infection might appear too simplistic to be useful. In the case of Covid, however, it has its uses: the above results are a good approximation on average, with R0 the average number of people infected by each infected person. A general model for epidemics must take into account many factors, including the virulence of the virus, the nature of the contacts between people in terms of spatial distribution and frequency, the nature of the contacts in terms of the circumstances of meeting (indoors or outdoors, wearing/not wearing masks, etc.), and other chance factors including “virus load”, individual immunity, and the natural variation of infection spread by droplets or aerosol.

Models of infection usually need to consider both the number of infected people and the number of those susceptible to infection, and the chance that a contact between an infected person and a non-infected person produces a new infection. However, in the case of Covid, everyone was susceptible and the virus is easily transmitted, so the simplest model for an epidemic provides some answers. This model assumes an unlimited number of susceptible people, with R0 the average number infected by each infected person. Although the same R0 can arise from different assignments of probabilities to the number of people infected by each infected person, it is certain that the epidemic will die out if R0 < 1, while if R0 > 1 it can grow indefinitely. How quickly either of these happens depends on both the value of R0 and the distribution of probabilities. Simple models of this are very easy for school students to explore and simulate, as explained, for example, by Helen MacGillivray [7]. Hence we see how important it has been to try to estimate R0, and to reduce it by good hygiene, reducing contact, and finding and applying vaccines.
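Such a branching-process model is easy to simulate. In the sketch below, the choice of a Poisson offspring distribution (and all the numbers) is illustrative, not taken from the article; it is simply one convenient distribution whose mean is R0:

```python
import math
import random

def poisson(rng, lam):
    """One Poisson(lam) draw, via Knuth's multiplication method."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def simulate(r0, start=10, days=30, seed=0):
    """Branching-process sketch: each case today generates a random
    Poisson(r0) number of new cases tomorrow, so r0 is the *average*
    number infected by each infected person."""
    rng = random.Random(seed)
    cases, history = start, [start]
    for _ in range(days):
        cases = sum(poisson(rng, r0) for _ in range(cases))
        history.append(cases)
    return history

# Subcritical epidemics (R0 < 1) fade; supercritical ones (R0 > 1) can explode
print(simulate(0.9, seed=1)[-1], simulate(1.5, days=15, seed=1)[-1])
```

Running it many times with R0 just below and just above 1 shows both the certainty of eventual extinction in the subcritical case and the variability of outcomes in the supercritical one.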

Table 1 assumes that the number of newly infected people each day increases by a factor of exactly R0. If, instead, we assume that R0 is the mean number of new infected people per day from each infected person, and consider R0 = 0.9 with the number of new daily infected people from each infected person following a normal distribution with mean 0.9 and SD 0.3, we can examine the distribution of infected people after a certain period.

Figure 1 shows the results of 100 iterations of this process (starting with 100 cases) applied for 20 successive days. (Note that the values simulated from the normal have been rounded to whole numbers).
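The process behind Figure 1 can be sketched in a few lines of Python. The N(0.9, 0.3) distribution, the rounding to whole numbers, the 100 starting cases, the 20 days, and the 100 repetitions all come from the description above; treating negative draws as zero is an added assumption needed to make the rounding well defined:

```python
import random

def step(cases, rng, r0=0.9, sd=0.3):
    """One day's step: each infected person generates round(N(r0, sd))
    new cases, with negative draws counted as zero (added assumption)."""
    return sum(max(0, round(rng.gauss(r0, sd))) for _ in range(cases))

def run(start=100, days=20, seed=0):
    """Apply the daily step for 20 successive days, starting with 100 cases."""
    rng = random.Random(seed)
    cases = start
    for _ in range(days):
        cases = step(cases, rng)
    return cases

# 100 repetitions of the 20-day process, as in Figure 1
finals = [run(seed=s) for s in range(100)]
print(min(finals), max(finals))
```

A histogram of `finals` reproduces the kind of display shown in Figure 1.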

Figure 1. One simulation of the number of new daily Covid cases after 20 days, when we assume the number of new daily cases arising from each infected person is N(0.9, 0.3)

Of course, public health measures are designed to reduce R0; so estimating R0 is an important challenge for planning during a pandemic. Figure 2 shows estimates of R0 for India, over the course of a year (sampling problems associated with all aspects of Covid are discussed later). It is clear that successes in curbing the virus in the autumn of 2020 were followed by a second wave of infection in 2021 (with disastrous human consequences).

Figure 2. Estimates of R0 for India over a 1-year period

2.3 Interpreting graphs

Graphs are being used increasingly to convey complex information in the media. A picture may be worth 1000 words, but sometimes a graph needs 1000 words of explanation. Figures 3 and 4 present graphs downloaded from the Our World in Data website [9] about the spread of the disease in the United States and the United Kingdom. Each graph uses data from the same data set.

Figure 3. Cumulative confirmed Covid-19 cases in the United Kingdom and the United States

Figure 4. Daily new confirmed cases per million people in the United Kingdom and the United States

The Our World in Data grapher offers a number of choices about how data are to be displayed. One can choose
  • Deaths or cases
  • Daily deaths/cases or cumulative deaths/cases
  • Raw numbers or numbers per million of population
  • A linear or log scale

Students can be asked to describe to each other what they see in each graph, and to say whether they believe the graphs are based on the same data set. Then tell students that the graphs do show data from the same data set, and pose these challenges.

The graphs look very different—explain why.

What type of scale is being used in each graph?

What conclusions can you draw from each graph?

In the context of Covid:

When should you use a linear scale, and when should you use a log scale?

When should you use raw numbers, and when should you use events per million of population?

When should you use deaths, and when should you use cases?

When should you use new cases, and when should you use cumulative cases?

Linking Table 1 and Figure 3 highlights the most obvious advantage of log scales—the slope of the curve at any point gives a direct indication of the speed of spread of the disease; changes in the slope show changes in disease acceleration (positive or negative). For instance, epidemics at an early stage can be compared with those at a later stage, even though there are big differences in actual deaths. However, log scales can be hard to understand, and can be visually misleading—for example, a change of one unit on the y-axis corresponds both to a rise from 100 to 1000 cases and to a rise from 10 000 to 100 000 cases. A linear scale gives a clearer impression of the size of the epidemic.
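A quick numerical check makes the point about equal distances on a base-10 log axis meaning equal ratios:

```python
import math

# On a log10 axis, equal vertical distances correspond to equal ratios:
# every tenfold rise spans exactly one unit of the scale.
one_unit = math.log10(1000) - math.log10(100)        # 100 -> 1000
also_one = math.log10(100_000) - math.log10(10_000)  # 10 000 -> 100 000
two_units = math.log10(100_000) - math.log10(1000)   # 1000 -> 100 000

print(one_unit, also_one, two_units)
```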

Raw numbers are essential for planning; scaling numbers by the size of the population is not useful at the start of an epidemic, but later gives some indication of the success of eradication programs in different countries, and the load on a country’s resources. Estimates based on small samples are usually less stable than those based on larger samples; this is true of countries, too—those with small populations often have very high “cases per million” (eg, Andorra and Montenegro), while others have very low rates (eg, Mauritius, Fiji).

2.4 Measurement issues

The discussion of cases vs deaths leads directly to questions about measurement. Clearly, identification of cases depends on the extent of testing; if testing is sparse, the count of cases will be too low. Even with comprehensive testing, a test may fail to detect Covid in the early stages of the disease. Further, some people get the virus and have mild symptoms or no symptoms at all. For example, Pollan et al [11] conducted a survey of 61 000 randomly selected people in Spain, completed in May 2020, to determine the prevalence of Covid. About 5% of the sample tested positive; of these, about 1 in 3 were asymptomatic. So, the true number of cases is probably much higher than the reported number.

Measuring deaths is not without its problems. Is everyone who dies tested for Covid? In hospital settings, the death count is likely to be reliable (because it is important to know which patients had Covid); in community settings, such as care homes for the elderly, or in people’s homes, data may be unreliable. Measures themselves can change; for example, moving from simply counting deaths in hospital to including deaths in prison, care homes, and the community produces major changes in the numbers reported. There are problems comparing numbers from different countries—different measures are used in different countries, and countries differ in the extent to which official statistics are independent of political diktat. The recording system itself may be unreliable (as in some poorer countries). All of this can be used to draw students’ attention to the critical importance in data science of understanding what is being measured and how, and to the importance of accessing and understanding metadata. It illustrates the reason that agencies concerned with cross-country comparisons (such as the Organisation for Economic Co-operation and Development, Eurostat, and the United Nations) place such emphasis on the development of measures, reaching international agreement on methods of measurement, along with their insistence on linking metadata descriptions to datasets. Let us consider some of these measurement problems in more detail.

For everyone who dies, suppose we know exactly who did, and did not, have Covid. Is this a good measure of deaths attributable to Covid?

Someone could have Covid and die from heart disease. This is problematic in care homes for the elderly, where the mortality rate is high, so Covid deaths may be exaggerated (conversely, one could argue that Covid made the heart attack far more likely). A bigger problem was discussed in a paper by Loke and Heneghan in 2020 [6] from the Centre for Evidence-Based Medicine in the United Kingdom. In England, there is a register of everyone who has ever been diagnosed with Covid-19. In July 2020, when someone died, if they were registered as having had Covid-19, they were recorded as a Covid death. So someone who recovered fully, but subsequently died in a traffic accident, was recorded as a Covid death (this recording method has now changed). Covid deaths were recorded differently in Scotland, making it difficult to draw comparisons between the two countries.

Use the (historical) English model for recording Covid deaths. Assume a constant rate of infection, and a constant true death rate. Sketch a graph of UK recorded Covid deaths over time.
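The mechanism behind this exercise can be simulated directly. All the numbers below are illustrative assumptions; the point is the shape of the curve, not the values:

```python
# Historical English rule: any death of a person ever diagnosed with
# Covid-19 counts as a Covid death. Assume (illustratively) a constant
# rate of new diagnoses, a constant true Covid death rate, and a
# constant daily all-cause death rate among those on the register.
daily_diagnoses = 1000     # assumed new diagnoses per day
true_death_rate = 0.01     # assumed true Covid deaths per diagnosis
background_rate = 0.00002  # assumed daily all-cause death rate

register = 0               # everyone ever diagnosed
recorded = []              # recorded "Covid deaths" per day
for day in range(365):
    register += daily_diagnoses
    true_deaths = daily_diagnoses * true_death_rate
    incidental = register * background_rate  # eg, later traffic accidents
    recorded.append(true_deaths + incidental)

# Recorded deaths climb steadily even though the true rate is constant,
# because the register of ever-infected people only ever grows.
print(round(recorded[0], 2), round(recorded[-1], 2))
```

So the sketched graph of recorded deaths rises without limit under constant true rates, which is exactly the anomaly Loke and Heneghan identified.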

Are there other ways to estimate Covid deaths?

Figure 5 shows weekly total deaths in the United Kingdom, together with weekly total deaths averaged over the previous 5 years, and deaths attributed to a number of causes. There are obvious peaks in April both in total deaths and deaths attributed to Covid-19.

Figure 5. Weekly deaths: data from the Office for National Statistics, CC BY-SA 4.0 [8]

We have data on total deaths each week over several years. So, it is easy to calculate excess deaths—the number of deaths that are higher than expected. Is this a good measure of deaths attributable to Covid?
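The calculation itself is simple subtraction; a sketch with made-up weekly figures (all numbers illustrative):

```python
# Excess deaths: observed weekly deaths minus the average for the same
# week over the previous five years. All numbers are illustrative.
observed = [11000, 12500, 16000, 18200]       # assumed weekly totals
baseline_5yr = [10800, 10900, 11000, 11100]   # assumed 5-year averages

excess = [obs - base for obs, base in zip(observed, baseline_5yr)]
print(excess)   # [200, 1600, 5000, 7100]
```

The hard part is not the arithmetic but whether the baseline and the observed counts measure what we think they measure, as the discussion below makes clear.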

Excess mortality data (available for different countries on the Financial Times website [4]) needs accurate historical data on deaths; these data are rare in middle-income and poor countries. Excess deaths might be underestimated if, for example, influenza deaths are lower in a particular year or if there are fewer deaths from other causes, such as road traffic accidents, or deaths attributable to air pollution, because people work from home.

Excess deaths might be overestimated if a pandemic results in increased deaths from other causes, such as resources being directed away from treating diseases such as cancers or HIV/AIDS, or if people die because they were unwilling to go to hospitals (eg, for emergency care) out of fear of contracting the disease.

So confirmed deaths associated with Covid (assuming these are not the result of a statistical anomaly!) and excess deaths reflect similar but not identical things. Covid deaths do reflect the cause of death, but probably underestimate the death toll (unless Public Health England were counting). Excess deaths give an overall impression of the effect of the pandemic, but as an estimate of deaths directly attributable to the disease they might be overestimated (because of, say, more heart deaths associated with unwillingness to seek treatment) or underestimated (because of, say, fewer deaths associated with traffic accidents). Detailed lesson plans to support teaching about (and assessment of) inventing measures have been created by Swan and Ridgway [13].

2.5 Sampling

Sampling is one of the Big Ideas in statistics. Covid demonstrates clearly that this Big Idea has not been grasped (or acted upon) by many decision makers world-wide. For planning and action, we need to be able to estimate a number of parameters: how many people in the population are susceptible? How many are infected? How long will infected people (as a function of age, obesity, severity of infection, and other co-morbidities) stay in hospital? What is the case fatality rate? What proportion of people who have recovered are immune, and for how long? To determine these parameters, one needs to take sampling very seriously. Too much of the early work on estimation was based on opportunistic sampling. There are very big local variations in all these parameters, and parameters change over time, so careful testing needs to be an on-going process.
