To understand and model complex phenomena such as the covid-19 pandemic, it is crucial to have sufficient, good-quality data. The maxims “measure what is measurable and make measurable what is not”, frequently attributed to Galileo Galilei, and “we only really know what we are talking about when we are able to measure it”, from Lord Kelvin, embody this principle of modern science, and they make even more sense, if possible, after what we have experienced in recent months. However, throughout this crisis we have witnessed numerous episodes of missing data, of definitions that changed over time or according to the source, and of incomplete data. Knowing at each moment what type of problem is occurring is essential in order to correct the resulting biases in the statistical analysis and to obtain good predictions.
In the initial months of the pandemic, one of the key elements for modeling its evolution was not available: reliable information on population mobility. For some months now, this has been obtained thanks to the agreement between the National Institute of Statistics (INE) and the main mobile phone companies in Spain. Specifically, aggregated data are produced on the daily flows of mobile phones that “stay overnight” in one cell and spend most of the day in another of the approximately 3,200 cells into which Spain has been divided for this purpose. As a consequence of the state of alarm, this valuable information was not available until the beginning of June.
During the first three months of the crisis, the main daily series on the evolution of the pandemic (number of confirmed cases, hospitalized, ICU, deceased) have been provided, both for Spain as a whole and by autonomous community. However, the quality of the data, the gaps in certain periods and the frequent lack of harmonization (that is, the application of different definitions depending on the origin of the data) have caused serious problems when analyzing them. For example, some autonomous communities reported the total number of covid-19 patients who had needed hospitalization from the start of the epidemic up to the day in question, while others reported the number of patients who were in hospital on that day. These series are not only different but, more seriously, one cannot be calculated from the other.
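A small sketch can make this last point concrete. The figures below are invented for illustration, and the function name `series` is hypothetical: two imaginary regions admit exactly the same number of patients each day but discharge them at different rates, so their cumulative series coincide while their daily occupancy does not.

```python
from itertools import accumulate

def series(admissions, discharges):
    """Return (cumulative admissions, patients in hospital) for each day."""
    cumulative = list(accumulate(admissions))
    discharged = list(accumulate(discharges))
    occupancy = [c - d for c, d in zip(cumulative, discharged)]
    return cumulative, occupancy

# Invented daily admissions, shared by two hypothetical regions
admissions = [5, 3, 4, 2]

# The regions differ only in how quickly patients leave hospital
cum_a, occ_a = series(admissions, discharges=[0, 1, 2, 3])
cum_b, occ_b = series(admissions, discharges=[0, 4, 3, 5])

print(cum_a, cum_b)  # identical cumulative series
print(occ_a, occ_b)  # different occupancy series
```

Since the same cumulative series is compatible with very different occupancy series, a region that publishes only the first gives no way of recovering the second without additional information on discharges.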
Many of these defects could be resolved if the definitions of the series were consistent across the autonomous communities and over time; others, such as incompleteness or the presence of certain biases, are inherent in the nature of the data. A first case is so-called censored data, which matter when modeling, for example, how long patients need hospital care. If individual patient data are available (suitably anonymized), it is possible to determine the time from diagnosis until the patient needs to be hospitalized (if that happens), the length of the hospital stay and, more importantly, the length of the stay in the ICU. At the height of the pandemic, this information was only partially known for some patients, since their medical care had not yet concluded; such an observation is called censored. In contrast, an uncensored observation would be that of a patient who, on the date the information is extracted, has already finished their stay in the ICU. Naturally, uncensored data give complete information on the quantity under study, but censored data also give very relevant information if treated appropriately.
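One standard way of treating censored data appropriately is the Kaplan-Meier estimator. The article does not name a specific method, so what follows is only an illustrative sketch, with made-up ICU stays and a hypothetical function name. Each patient contributes a duration in days and a flag saying whether the stay had finished (`True`) or was still ongoing when the data were extracted (`False`, censored).

```python
def kaplan_meier(durations, observed):
    """Estimate S(t) = P(stay lasts longer than t) from censored data.

    Returns a list of (t, S(t)) at each time with at least one completed stay.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)   # patients still in the ICU just before time t
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        completed = 0     # stays that ended exactly at time t
        leaving = 0       # all patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            completed += data[i][1]
            leaving += 1
            i += 1
        if completed:
            survival *= 1 - completed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Made-up ICU stays (days); False marks a patient still in the ICU
durations = [3, 5, 5, 8, 10, 12]
observed = [True, True, False, True, False, True]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"day {t}: estimated P(stay > {t}) = {s:.3f}")
```

The censored patients never produce a drop in the curve, but they remain in the denominator `at_risk` for as long as they were observed. That is how their partial information is used, instead of being discarded or mistaken for completed stays.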
Another bias occurs when analyzing the daily number of deaths from covid-19. Sometimes several days pass between a death and its notification. To estimate this delay, and thus approximate the number of deaths on a specific day from those that occurred on that day and have already been reported, the relevant information must be collected: the day and time of the death and of its notification. However, deaths with a long reporting delay are harder to observe, simply because not enough time has passed for that information to have been provided, while deaths with a short reporting delay are overrepresented. This produces a bias called truncation.
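As a rough illustration of how such a correction might work, suppose the distribution of the reporting delay were already known. That is a strong assumption: in practice it must itself be estimated from truncated data. The delay probabilities, the counts and the function names below are all invented. The deaths reported so far for each day are divided by the fraction of deaths expected to have been reported by now.

```python
# Hypothetical delay distribution: P(report arrives d days after death), d = 0..4
delay_pmf = [0.4, 0.3, 0.15, 0.1, 0.05]

def reported_fraction(days_elapsed):
    """Probability that a death has been reported within days_elapsed days."""
    return min(1.0, sum(delay_pmf[: days_elapsed + 1]))

def nowcast(reported_counts, today):
    """Scale up counts by day of death to allow for reports not yet received."""
    return {
        day: count / reported_fraction(today - day)
        for day, count in reported_counts.items()
    }

# Made-up deaths by day of death, as notified up to day 10
reported = {7: 38, 8: 34, 9: 28, 10: 16}

estimates = nowcast(reported, today=10)
for day, value in estimates.items():
    print(f"day {day}: reported {reported[day]}, estimated {value:.1f}")
```

With these invented numbers, the raw series appears to fall sharply in the last few days, while the corrected estimates all come out at about 40 deaths per day: the apparent decline is produced entirely by the reporting delay.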
To estimate properly with truncated or censored data, or with many other biases, we must know what type of problem is occurring and have some additional information with which to correct it (in the cases above, the notification delay, or whether a given ICU stay is censored or not). The idea is to express the characteristics of the (unobservable) variable of interest in terms of other quantities that depend on observable variables, which can then be estimated empirically. That is, to fight bias with more data and, as Galileo proposed, to make measurable what is not.
Ricardo Cao Abad is a professor of Statistics and Operations Research at the University of Coruña and president of the expert group of the “Mathematical Action Against Coronavirus” of the Spanish Mathematics Committee (CEMat), which on August 27 and 28 promoted the summer school “Mathematics vs COVID-19” together with Menéndez Pelayo International University.
Ágata A. Timón García-Longoria is the communication and outreach coordinator of the ICMAT.
Coffee and theorems is a section dedicated to mathematics and the environment in which it is created, coordinated by the Institute of Mathematical Sciences (ICMAT), in which researchers and members of the center describe the latest advances in this discipline, share meeting points between mathematics and other social and cultural expressions, and remember those who marked its development and knew how to transform coffee into theorems. The name evokes the definition given by the Hungarian mathematician Alfred Rényi: “A mathematician is a machine that transforms coffee into theorems.”
Editing and coordination: Ágata A. Timón García-Longoria (ICMAT)