Mathematics to create a cellular map of diseases

The human body is estimated to contain 30 trillion cells organized into tissues. Each human cell contains 6.4 billion DNA nucleotides, which are structured around 20,000 coding genes, and each gene can give rise to multiple proteins. An international consortium of scientists is trying to compile an atlas (the Human Cell Atlas) that characterizes, molecularly (DNA, genes, proteins) and morphologically, all the cells that make up the human body. This tremendous technical and economic effort has to incorporate mathematical methods that make it possible to extract all the relevant information and, at the same time, simplify it so that it can be interpreted. To meet this challenge, dimensionality reduction techniques for single-cell data analysis have been developed in recent years.

Today we can characterize each cell very exhaustively. On the one hand, thanks to complex molecular biology techniques, we can identify the mutations present in the DNA of a specific cell or quantify the expression of the catalog of genes and proteins specifically expressed in it. This information is stored in a matrix with more than 20,000 rows (the approximate number of genes expressed in an experiment) and as many columns as there are cells being analyzed, currently tens of thousands. On the other hand, imaging techniques of ever higher resolution are used to explore changes in the shape, size or structure of each cell.

Our ability to study this large amount of jointly generated data is very limited, owing to both its dimensionality and its heterogeneity. Dimensionality reduction techniques allow cell maps to be drawn in as few as two dimensions, chosen so that as much information as possible is preserved while it is condensed. This makes it easier to identify groups of similar cells, or clusters, and to visualize and interpret them. Thanks to these maps, it has been possible to identify and quantify new cell subtypes associated with the genesis and development of different complex diseases, from cancer to cardiovascular diseases.

More traditional dimensionality reduction techniques, such as principal component analysis, proposed by Karl Pearson more than a century ago, are based on the linear projection of information onto a hyperplane, much as a photograph projects the three-dimensional world onto the plane of focus. These techniques have the advantage of respecting real distances relatively well in the low-dimensional space, but they are often unable to capture all the complexity contained in the data, especially if the relationship between the system variables is nonlinear, as is the case with the molecular and phenotypic variables that can be measured in a cell.
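As a rough illustration of the linear projection described above, the following sketch finds the direction of greatest variance in a small synthetic data set and projects every point onto it. The toy data, the function names and the use of power iteration are assumptions made for this example, and only one component is kept for brevity; a real analysis would rely on an established numerical library and retain several components.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so the data cloud sits at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix (d x d) of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=200):
    """Power iteration: apply the covariance matrix repeatedly and renormalize.
    The vector converges to the direction of largest variance."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, axis):
    """Coordinate of every row along the chosen axis (a linear projection)."""
    return [sum(r[j] * axis[j] for j in range(len(axis))) for r in rows]


# Hypothetical toy data: 200 "cells", 3 measured variables.
# Almost all the variation lies along the direction (1, 2, 0).
random.seed(0)
cells = []
for _ in range(200):
    t = random.gauss(0, 3)
    cells.append([t + random.gauss(0, 0.1),
                  2 * t + random.gauss(0, 0.1),
                  random.gauss(0, 0.1)])

centered = center(cells)
axis = top_component(covariance(centered))
scores = project(centered, axis)  # one number per cell instead of three
print("first principal axis:", [round(a, 3) for a in axis])
```

Because the projection is a single straight axis, it can only summarize structure that is itself roughly straight, which is the limitation the paragraph above points out.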

For this reason, new non-linear dimensionality reduction techniques have been proposed in the last decade. The idea behind them is to identify a new two-dimensional space that summarizes as much information as possible, preserving distances locally at the cost of losing, to a certain extent, the global structure. This allows us to identify groups of similar elements, for example cells, in the two-dimensional representation, even though the distances between the different groups are distorted.

Their behavior is similar to that of the Mercator map projection, the one most often used for world maps, which increasingly distorts areas and distances as we get closer to the poles. At a local level, distances are maintained, that is, areas that are geographically close are also close on the map, but remote areas do not keep their true distances when they cross meridians, which does not prevent the map from being useful.
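The distortion mentioned above can be put into numbers. The sketch below uses the standard spherical Mercator formulas, in which lengths near latitude φ are stretched by a factor of 1/cos(φ) and areas by its square; the function names and the unit-radius sphere are choices made for this illustration.

```python
import math


def mercator(lat_deg, lon_deg, radius=1.0):
    """Spherical Mercator projection of a point given in degrees."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def linear_scale(lat_deg):
    """Factor by which the map stretches short distances at this latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


for lat in (0, 40, 60, 80):
    k = linear_scale(lat)
    print(f"latitude {lat:2d}: lengths x{k:.2f}, areas x{k * k:.2f}")
```

The stretch is the same in every direction around a given point, which is why small neighborhoods keep their shape while comparisons between distant regions become unreliable.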

To achieve their goal, these new methods use iterative algorithms based on directed graphs, built by computing distances between data neighborhoods, which generate attractive or repulsive forces in the new representation space depending on how similar the data points are. The way in which the neighborhood of each data point is defined, together with how and under what circumstances these forces are generated, is the key difference between the algorithms available, such as t-distributed stochastic neighbor embedding (t-SNE) or the more recent Uniform Manifold Approximation and Projection (UMAP).
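The general recipe of a neighbor graph plus attractive and repulsive forces can be sketched in a few lines. The code below is a deliberately simplified toy and is neither t-SNE nor UMAP: the k-nearest-neighbor rule, the spring-like attraction, the inverse-distance repulsion, every parameter value and the synthetic data are assumptions chosen only to make the idea concrete.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_edges(points, k):
    """Directed k-nearest-neighbor graph: an edge (i, j) means that j is one
    of the k points closest to i in the original high-dimensional space."""
    edges = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        edges.extend((i, j) for j in others[:k])
    return edges


def layout(points, k=5, epochs=300, lr=0.02, repulsion=1.0, seed=0):
    """Place every point in 2D by iterating attraction and repulsion."""
    rng = random.Random(seed)
    n = len(points)
    edges = knn_edges(points, k)
    neighbors = set(edges)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(epochs):
        # Attraction: pull the two ends of every neighbor edge together.
        for i, j in edges:
            for d in range(2):
                step = lr * (pos[j][d] - pos[i][d])
                pos[i][d] += step
                pos[j][d] -= step
        # Repulsion: push each point away from one randomly sampled point
        # that is not among its neighbors.
        for i in range(n):
            m = rng.randrange(n)
            if m == i or (i, m) in neighbors:
                continue
            delta = [pos[i][d] - pos[m][d] for d in range(2)]
            d2 = delta[0] ** 2 + delta[1] ** 2 + 0.01  # softening term
            for d in range(2):
                pos[i][d] += lr * repulsion * delta[d] / d2
    return pos


def mean_dist(group_a, group_b):
    pairs = [(a, b) for a in group_a for b in group_b if a is not b]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: two groups of 15 "cells" measured on 5 variables.
data_rng = random.Random(1)
cells = ([[0.0 + data_rng.gauss(0, 1) for _ in range(5)] for _ in range(15)]
         + [[10.0 + data_rng.gauss(0, 1) for _ in range(5)] for _ in range(15)])

embedding = layout(cells)
within = mean_dist(embedding[:15], embedding[:15])
between = mean_dist(embedding[:15], embedding[15:])
print(f"mean 2D distance within a group: {within:.2f}, between groups: {between:.2f}")
```

In this toy, points that are neighbors in the original space end up close together on the map, while the gap between the two groups is set by the repulsion and the number of iterations rather than by the original distance between them, which is the kind of global distortion the article describes.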

The mathematical theory behind the latter combines concepts from algebraic topology, Riemannian geometry, and fuzzy logic to generate a representation of the data in the form of a graph, and from probability theory, optimization, and mathematical programming to reproduce that graph as faithfully as possible in a space of lower dimension. The result is a powerful, fast, and scalable dimensionality reduction method that is highly useful in multidimensional data analysis and, in particular, in single-cell molecular data analysis. Despite its strengths, understanding the underlying mathematics is crucial to interpreting its results correctly.

These new dimensionality reduction algorithms are a good example of the type of methodology that we must continue to develop in order to analyze the large amounts of biomedical data being generated, the volume and complexity of which will continue to increase in the coming decades. Only with the right mathematics will we be able to keep advancing in the understanding of the causal mechanisms of complex diseases, from cancer to Alzheimer’s and cardiovascular diseases, and thus in the implementation of precision medicine.

Fatima Sanchez Cabo is director of the Bioinformatics Unit of the National Center for Cardiovascular Research (CNIC) and associate professor at the Autonomous University of Madrid.

Daniel Jimenez Carter is a senior technician at the Bioinformatics Unit of the CNIC.

Coffee and Theorems is a section dedicated to mathematics and the environment in which it is created, coordinated by the Institute of Mathematical Sciences (ICMAT), in which researchers and members of the center describe the latest advances in this discipline, share meeting points between mathematics and other social and cultural expressions, and remember those who marked its development and knew how to transform coffee into theorems. The name evokes the definition by the Hungarian mathematician Alfred Rényi: “A mathematician is a machine that transforms coffee into theorems”.

Editing and coordination: Agate A. Timón G Longoria (ICMAT).
