Statistics and medicine, by Manel Esteller

This past week we helped discover that a small group of patients with advanced cancers respond very well to chemotherapy, surviving far beyond what the cold statistics predict, and it has led me to rethink some concepts of probability and bioinformatics that we use daily in the lab. These are ideas and techniques that people outside the world of biomedical research rarely think about, and whose terminology they barely know, as if it were a secret language interpretable only with a hidden code.

As those of you who follow me know, I have always attached importance to the exception. I think it defines the rule far better than the rule’s most common elements do. The great responders I have mentioned, centenarians, children with progeria, rare diseases, twins … all these highly infrequent cases give us clues about the most common phenomena: the expected response to chemotherapy in the general population, the biological processes relevant to human aging, the cellular pathways involved in metabolism and neurological processes, or the relationship between genetics and the environment, respectively. And yet, exceptions are often removed from studies; they are called ‘outliers’ because they confuse the analyses. An understandable position, but sometimes we lose the value of what is unique. Do not pay too much attention to me: I have always been a ‘rare bird’ in more ways than this short article can cover.

The easiest way to study whether two events are related or associated (let’s not be too purist, please!) is to put them in two-by-two tables. These analyses are subjected to statistical tests such as Fisher’s exact test and the chi-square test, which give you a general idea of whether your hypothesis is on the right track. They have always seemed to me like a slightly more sophisticated rule of three, but they are a good first approximation for testing the concept we want to demonstrate. Then there are more sophisticated tests with names that seem inherited from the Victorian 19th century, like the Wilcoxon-Mann-Whitney test or Pearson’s coefficient, or from classical Russian literature, like Kruskal-Wallis.
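To make the two-by-two table idea concrete, here is a minimal sketch using invented counts (a hypothetical trial of treated versus untreated patients) and SciPy’s standard implementations of the two tests named above:

```python
from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical 2x2 contingency table (counts are invented for illustration):
# rows = treated / untreated, columns = responded / did not respond
table = [[12, 5],
         [6, 14]]

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"Fisher's exact test: odds ratio = {odds_ratio:.2f}, p = {p_fisher:.4f}")
print(f"Chi-square test:     chi2 = {chi2:.2f}, p = {p_chi2:.4f}")
```

Fisher’s exact test is usually preferred when some cell counts are small; the chi-square test relies on an approximation that works better with larger samples.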

From all these tests, as well as others such as ANOVA or regression analyses (wake up!), we can ultimately tell whether two variables are related or associated beyond pure chance. If so, the result is said to be “significant”. And here enters the god at whose altar researchers in every discipline pray: the P-value. It sounds like a small thing, a rather bland, modest consonant, if I may rhyme. But almost everyone bows before it. If your results end up with a value higher than P = 0.05, you are lost. And if we are being strict, it had better be lower than P = 0.01, or even lower than P = 0.001. The poor researcher will tear out his hair if the result is not significant, and jump like a little goat if it is. A pity, because a lot of information and data is sometimes lost to these conventions, and there are already groups that advocate setting aside the coveted P-values. As in everything, the truth is ‘probably’ halfway between one and the other.
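The ritual described above can be sketched in a few lines; the measurements below are invented for illustration, comparing three hypothetical treatment groups with a one-way ANOVA and its rank-based cousin, Kruskal-Wallis:

```python
from scipy.stats import f_oneway, kruskal

# Hypothetical response measurements from three treatment groups (invented data)
group_a = [4.1, 5.0, 4.8, 5.3, 4.6]
group_b = [5.9, 6.2, 5.7, 6.5, 6.0]
group_c = [4.3, 4.9, 5.1, 4.4, 4.7]

stat_f, p_anova = f_oneway(group_a, group_b, group_c)
stat_kw, p_kw = kruskal(group_a, group_b, group_c)

# The conventional (and much-debated) 0.05 threshold mentioned in the text
for label, p in [("ANOVA", p_anova), ("Kruskal-Wallis", p_kw)]:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: p = {p:.4f} ({verdict} at the 0.05 level)")
```

Note that the cutoff itself is pure convention: a p of 0.049 and a p of 0.051 describe almost identical evidence, which is exactly the objection raised by the groups who want to move beyond the P-value.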

And what is the data like? Put simply, there are two types: continuous and discrete. The first run along a scale, one value behind another without pause, such as height: 1.50 m, 1.51 m, 1.52 m … The second are categorical, often binary: being pregnant or not, having a driving licence or not. All this said very simply, because sometimes the boundaries between one type of data and the other are blurred, like the light of a winter afternoon in Baker Street. And how is the data represented? Oops, there are tons of ways to do it! In elections, it is done with bars or with the ‘little cheeses’ of pie charts that reflect percentages or total numbers of votes and seats. But there are more sophisticated ways, especially as the complexity of the system grows, such as the 6,000 million pieces of our DNA. One way to represent genomic data is with “heatmaps” derived from cluster trees, or with principal component analysis (PCA). These representations tell us graphically whether a sample belongs to one group or another, for example whether it is benign or malignant. Put simply: if a person entering a party heads for those raiding the fridge or joins those discussing politics, we will know which group they belong to.
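The party metaphor can be sketched with a toy PCA: below, two invented groups of samples (“benign” and “malignant”) differ on one simulated feature, and projecting onto the first principal component separates them. This is a minimal NumPy sketch via singular value decomposition, not a genomics pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: 10 "benign" and 10 "malignant" samples, 5 features each;
# the malignant group is shifted on the first feature
benign = rng.normal(loc=0.0, scale=1.0, size=(10, 5))
malignant = rng.normal(loc=0.0, scale=1.0, size=(10, 5))
malignant[:, 0] += 5.0  # the feature that separates the two "party groups"

X = np.vstack([benign, malignant])
X_centered = X - X.mean(axis=0)

# PCA via SVD: the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = X_centered @ Vt[0]  # projection of each sample onto the first component

# The two groups should land in two clusters along PC1
print("benign PC1 mean:   ", pc1[:10].mean().round(2))
print("malignant PC1 mean:", pc1[10:].mean().round(2))
```

In a real heatmap or PCA plot of tumour samples, the same principle applies: samples that share a molecular profile cluster together, and the plot tells you at a glance which “side of the party” each one is on.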

Today there are many computer and mobile programs that do all these analyses, and now artificial intelligence reading a histological section or a chest X-ray will also give you the diagnosis directly. And looking at Kaplan-Meier survival curves, it is perhaps easy to forget that each point on the graph represents a person. And perhaps we will keep the illusion that we are more than a piece of data in a statistic. Probably.
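The Kaplan-Meier curve mentioned above is simple enough to sketch by hand; here is a minimal, self-contained estimator with invented follow-up data (each entry is one person, as the text insists):

```python
def kaplan_meier(times, events):
    """Minimal Kaplan-Meier estimator (a sketch, not a clinical tool).

    times  : follow-up time for each patient
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, estimated survival probability) steps.
    """
    survival = 1.0
    at_risk = len(times)
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= sum(1 for ti in times if ti == t)  # drop everyone seen at t
    return curve

# Invented follow-up data in months; 0 = alive at last contact (censored)
times = [3, 5, 5, 8, 12, 12, 15, 20]
events = [1, 1, 0, 1, 1, 0, 1, 0]

for t, s in kaplan_meier(times, events):
    print(f"month {t:>2}: estimated survival = {s:.3f}")
```

Each downward step of the curve is one observed death; censored patients (the zeros) leave the at-risk pool without lowering the curve, which is exactly what makes the estimator honest about incomplete follow-up.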

