Date: 2020-09-23

The reproducibility of an experiment would seem to be a sign of its validity. With statistical methods, things are more complicated.

Perhaps the world’s most famous scientific experiment was conducted in Pisa in the late 16th century. There the astronomer Galileo Galilei showed that all objects, regardless of weight, fall at the same rate.

Most likely, Galileo did not drop objects from the Leaning Tower of Pisa but instead took his measurements from spheres of different weights rolled down sloping surfaces.

The experiment has since been repeated thousands and thousands of times. If the test is done correctly, the result has been the same every time.

By common sense, it is precisely reproducibility that proves a scientific experiment correct. When the same experiment is repeated, one may expect the same result.

But most of the time it doesn’t go that way, says the statistician Juha Lappi.

“In physics, that is how things go, but most science is not physics,” he points out. Lappi, who is retired, has just written a new statistics textbook with Lauri Mehtätalo, a professor at the University of Eastern Finland.

When research concerns human behavior, medicine, psychology, biology, or ecology, the subjects are complex phenomena that can have an enormous number of underlying factors. Even the best experiment yields only a small sample of the entire subject.

“Identical experiments may give the same results, but they may not,” Lappi summed up in his opinion piece in Helsingin Sanomat in July.

“Expecting the same result from the same experiment is like expecting a coin to always land the same side down. In reality, some experiments will always give different results, however correctly and identically they are performed.”

Lappi’s claim might sound strange, so we asked for an explanation. Following the argument requires a little understanding of statistics.

Take, for example, a blood pressure trial run to find out whether a test drug lowers blood pressure.

No medicine works every time. Patients differ in all kinds of ways in their circumstances and other health conditions. Someone’s blood pressure will always drop even without medication.

Therefore, the effectiveness of the drug is tested against the so-called null hypothesis. That is, it is assumed that blood pressure remains, on average, the same for those who receive the drug and those who do not.

The null hypothesis can be rejected if the test gives a result that would be unlikely if the drug has no effect.

The question remains, however, of how large a difference between the test result and the null hypothesis is needed to reject the null hypothesis.

This is where a statistical tool called the significance level comes in.

It refers to how often an experiment is allowed to produce a false positive result if the null hypothesis is true.

“In principle, the significance level can be chosen as low as desired,” says Lappi.

In a statistical experiment, random variation is always present. The lower the significance level, the stronger the evidence required to distinguish the result from random variation.

However, this comes at a price. Lowering the significance level also lowers the power of the experiment.

The power of an experiment is the probability that a false null hypothesis is rejected, i.e., for example, that a drug which really lowers blood pressure is found to do so.

In addition to the significance level, the power is affected by the sample size of the experiment, the magnitude of the effect tested, and the size of the random variation.

Thus, in a blood pressure drug trial, the power of the test is affected by how many people participate, how effective the drug being tested is, and how much people’s blood pressure happens to fluctuate regardless of the drug.

In addition to these, it is affected by the level of significance chosen.

At first thought, it would make sense to demand the strictest possible significance level, but in practice this would erode the power of the experiment.

Even the effect of an effective drug would be lost in the random variation.

If the effect of the drug is real but very small, the power of the test can be as low as 6%, for example. Then 94 percent of experiments give an erroneous result stating that there is no effect.
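How these four factors pull on power can be sketched numerically. The following is only an illustration, not a calculation from the article: it assumes a two-sided comparison of two equally sized groups with a known standard deviation (a normal approximation), and the function name `power_two_sample` and every number in it are hypothetical.

```python
from statistics import NormalDist

def power_two_sample(effect, sigma, n_per_group, alpha=0.05):
    """Approximate probability of rejecting the null hypothesis when the
    true difference in means is `effect` (two-sided z-test, known sigma)."""
    std_normal = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5          # standard error of the difference
    z_crit = std_normal.inv_cdf(1 - alpha / 2)     # rejection threshold
    shift = effect / se                            # true effect in standard errors
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical trial: drug lowers pressure by 5 mmHg, sigma 10 mmHg, 50 per group.
cases = {
    "baseline":           dict(effect=5.0, sigma=10.0, n_per_group=50),
    "stricter alpha":     dict(effect=5.0, sigma=10.0, n_per_group=50, alpha=0.001),
    "larger sample":      dict(effect=5.0, sigma=10.0, n_per_group=200),
    "noisier patients":   dict(effect=5.0, sigma=20.0, n_per_group=50),
    "very small effect":  dict(effect=0.5, sigma=10.0, n_per_group=50),
}
for label, kwargs in cases.items():
    print(f"{label:<18} power = {power_two_sample(**kwargs):.3f}")
```

Under these invented numbers, tightening the significance level, adding noise or shrinking the effect all reduce power, while a larger sample raises it. With the very small effect, power falls to roughly the same low range as the 6% example.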

A significance level of 5% has become a kind of standard in scientific experiments.

A result that meets this threshold is said to be statistically significant. The 5% level means that if, for example, the drug tested has no effect, this is correctly detected in 95% of experiments.

“It is a common misconception that a statistically significant result means that the result is strong or important,” Lappi points out.

That’s not what it means. It simply means that the result obtained would be unlikely if the null hypothesis were true.

That leaves the remaining five percent of experiments. In them, the significance level produces so-called false positives, i.e., results in which the null hypothesis is rejected due to mere random variation, even though there is no real effect.
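The five percent can be checked by simulation. This is a minimal sketch under stated assumptions: blood pressure changes are drawn from a normal distribution with a known standard deviation, the drug has no effect at all, and each simulated trial is judged with a two-sided z-test. The helper names, group size and seed are illustrative choices, not taken from the article.

```python
import random
from statistics import NormalDist, mean

def two_sample_z_pvalue(a, b, sigma):
    """Two-sided p-value for a difference in means, assuming known sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_experiments=10_000, n_per_group=30, sigma=10.0,
                        alpha=0.05, seed=1):
    """Share of simulated trials that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # The null hypothesis holds: both groups come from the same distribution.
        drug = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        placebo = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        if two_sample_z_pvalue(drug, placebo, sigma) < alpha:
            hits += 1
    return hits / n_experiments

rate = false_positive_rate()
print(f"Share of 'significant' results with no real effect: {rate:.3f}")
```

The printed share should hover around the chosen significance level, however carefully each simulated trial is run.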

The problem worsens if the significance level is raised.

A significance level set too loosely can make even an ineffective drug appear effective if the mean blood pressure in the test group happens to fall by chance.

But even a strict significance level has its problems. The situation is difficult when, for example, the effect tested is small: the medicine lowers blood pressure, but on average only slightly. Then the power of the experiment remains low and the null hypothesis cannot be rejected.

Even repeating the experiment does not directly help, as a new experiment is likely to produce a similar result suggesting that the drug does not work. However, the power of the experiment can be raised by increasing the sample size.
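How much larger the sample must be can be estimated with a standard textbook approximation. The sketch below assumes a two-sided comparison of two equal groups with known standard deviation; the function name `required_n_per_group`, the 80% power target and the blood pressure figures are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n_per_group(effect, sigma, alpha=0.05, target_power=0.8):
    """Approximate group size needed to detect `effect` with the target power
    (normal approximation, two-sided test, equal group sizes)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(target_power)
    return ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)

# Hypothetical effects in mmHg, with a standard deviation of 10 mmHg.
for effect in (5.0, 2.0, 1.0):
    n = required_n_per_group(effect, sigma=10.0)
    print(f"effect {effect:>3} mmHg -> about {n} patients per group")
```

Because the required size grows with the inverse square of the effect, halving the effect roughly quadruples the number of patients needed.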

“Even in scientific experiments, it would be better to talk about margins of error, as in opinion polls.”

Recently, the bigger problem has been seen to be the false positive results produced by null hypothesis testing.

There has been a lot of talk about experiments, for example in the field of psychology, in which some surprising phenomenon was observed, such as people walking more slowly after being shown words related to old age.

Such experiments are typically small and inexpensive, so it is easy to run many of them. This means that false positives will also occur.

As early as 2015, Science News called null hypothesis testing the doping of science, creating significance out of nothing even where none really existed.

The observation of slowed walking, for example, was apparently based only on random variation and could not be replicated later.

Data sets in which many variables are examined simultaneously are particularly susceptible to this.

If, say, the associations of different personality traits with everything possible, from sleep to diet, number of children and life expectancy, are mapped, purely random associations will also be found in the data with statistical inevitability.

Statistics has ways to address this, such as methods for testing multiple null hypotheses simultaneously. Science News has even demanded the abandonment of all null hypothesis testing.
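One widely used way of testing several null hypotheses at once is Holm’s step-down correction. The article does not say which method is meant, so it appears here only as an illustration, and the p-values are invented.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where Holm's step-down procedure
    rejects the corresponding null hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values from five associations tested on the same data.
p_values = [0.001, 0.04, 0.03, 0.20, 0.008]
naive = [p < 0.05 for p in p_values]
corrected = holm_reject(p_values)
print("significant without correction:", sum(naive))
print("significant after Holm:        ", sum(corrected))
```

The correction demands stronger evidence from each individual test so that the chance of any false positive across the whole family stays at or below the chosen level.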

Lappi, for his part, suggests that instead of the strict dichotomy used in null hypothesis testing, it would be better to speak of margins of error in scientific experiments, as is done in opinion polls. This would better highlight the uncertainties related to the results.
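Reporting a margin of error instead of a yes-or-no verdict might look like the following sketch. It assumes a normal approximation for the difference between two group means, which is rough for samples this small; the function name and the blood pressure changes (in mmHg) are made up for illustration.

```python
from statistics import NormalDist, mean, stdev

def difference_with_margin(treated, control, confidence=0.95):
    """Estimated difference in means and its margin of error
    (normal approximation, unequal variances allowed)."""
    diff = mean(treated) - mean(control)
    se = (stdev(treated) ** 2 / len(treated)
          + stdev(control) ** 2 / len(control)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, z * se

# Hypothetical changes in systolic blood pressure during a trial (mmHg).
treated = [-12, -5, -9, 2, -7, -11, -3, -8, 0, -6]
control = [-2, 3, -4, 1, -6, 2, -1, 0, -3, 4]

diff, margin = difference_with_margin(treated, control)
print(f"Estimated effect: {diff:.1f} mmHg, margin of error ±{margin:.1f} mmHg")
```

A reader then sees both the size of the estimated effect and how uncertain it is, rather than only whether a threshold was crossed.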

“It’s always good to keep in mind the uncertainties of the experiments.”

With false positives, the logic of scientific publishing also becomes a problem.

Because scientific publications are often interested in surprising and peculiar results, researchers are often under pressure to publish these peculiar findings.

If, say, twenty experiments are performed and one of them yields a surprising result, the researcher may be tempted to submit it for publication. The researcher may even believe that something exciting has been found.
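The arithmetic behind the twenty experiments is short. The sketch assumes that the experiments are independent and that no real effect exists in any of them; the function name is illustrative.

```python
def prob_at_least_one_false_positive(n_experiments, alpha=0.05):
    """Chance that at least one of n independent tests of true null
    hypotheses comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** n_experiments

for n in (1, 5, 20, 100):
    p = prob_at_least_one_false_positive(n)
    print(f"{n:>3} experiments: {p:.0%} chance of at least one false positive")
```

For twenty experiments at the 5% level this comes to about 64%, so a single surprising result among twenty is closer to the expected outcome than to a discovery.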

This results in a situation called publication bias, in which false positives that statistical methods inevitably produce are reported as new discoveries.

“Here I would hope for restraint from researchers. It is always worth thinking about the probability of finding false positives when analyzing the data. If a result seems too exciting, it may well be just that.”

Lappi urges us to remember that the chosen significance level also indicates how often false positives occur.

Publication bias has led to the reproducibility crisis, a phenomenon in which many experiments accepted into the research literature could not subsequently be replicated with the same results.

The most cynical estimates hold that up to half of the published results in, for example, psychology stand on such weak footing.

Lappi acknowledges the awkwardness of the situation but points out that a failed replication still does not necessarily mean that the original result was wrong. If the power of the test is low, a real effect often goes undetected.
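A small simulation can illustrate why a failed replication need not discredit the original. The sketch assumes a real but modest effect, a two-sided z-test with known standard deviation, and a replication that copies the original design exactly; all names and numbers are hypothetical.

```python
import random
from statistics import NormalDist

def replication_rate(effect=3.0, sigma=10.0, n_per_group=30, alpha=0.05,
                     n_pairs=20_000, seed=7):
    """Among 'significant' original studies of a real effect, the share whose
    independent replication is also significant."""
    rng = random.Random(seed)
    se = sigma * (2 / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    def significant():
        # Draw one study's observed mean difference from its sampling
        # distribution and apply the two-sided test to it.
        observed = rng.gauss(effect, se)
        return abs(observed / se) > z_crit

    originals = replications = 0
    for _ in range(n_pairs):
        if significant():        # the original study finds the effect
            originals += 1
            if significant():    # an identical, independent replication
                replications += 1
    return replications / originals

rate = replication_rate()
print(f"Replications that confirm a real effect: {rate:.1%}")
```

With these invented numbers the design has low power, so most replications fail even though the effect is real in every simulated study.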

Even detecting an effect in data where none actually exists does not mean that the experiment was badly conducted or misinterpreted.

It may simply be a matter of the uncertainty inherent in statistical methods.

“When drawing conclusions, it is always good to keep in mind the uncertainties of the experiments. Statistical methods sometimes produce false results, no matter how well the experiment is done.”