On Wednesday evening last week the psychology department hosted a guest lecture by Professor David Shanks from UCL titled “The Replication Crisis in Psychology”.
As the title suggests, the lecture discussed the lack of replication, and the inability to replicate a number of experiments, in psychology. Replication matters in science because it is how you check whether you have a real observation or not. The examples ranged from the intentional fabrication of data by Diederik Stapel to publication bias and key studies not being reproducible.
I am sure that I would not need to search for long to find references showing that this is not just a problem in psychology. I actually spent about 4 months last year looking into similar issues relating to clinical trials. While it would be foolish to rely on anecdotal evidence, I have spent the past few weeks failing to replicate published observations, despite replicating the experiments almost exactly.
He made a number of suggestions about how to attempt to correct these issues.
One of the issues he highlighted was the pressure put on psychologists (and I’ll include all scientists in this) to generate ‘positive’ results. A positive result is one that shows that something happens. An example from the lecture: participants walk slower after being ‘primed’ by hearing words associated with an elderly stereotype than those who were given a different stereotype. A paper that showed no effect would be a lot less ‘sexy’.
Research grants fund academic science and these generally go to the ‘best’ scientists. But how do you decide who the best scientists are? The one with the largest H-index of course! The H-index is a metric combining how many papers you have and how often they are cited: it is the largest number h such that h of your papers have each been cited at least h times. In theory, the higher it is, the more impact your science is having. However, there is a big but. ‘Positive’ papers are cited more than ‘negative’ papers, so you are more likely to have a higher H-index if you publish more positive papers. In addition, you are more likely to get your paper into a ‘good’ journal if you have an exhilarating result.
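To make the H-index concrete, here is a minimal sketch of how it can be calculated. The two citation lists are made up for illustration and do not belong to any real researcher.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for two made-up researchers.
steady = [10, 8, 5, 4, 3]      # several reasonably cited papers
one_hit = [250, 2, 1, 0, 0]    # one blockbuster and little else

print(h_index(steady))   # 4
print(h_index(one_hit))  # 2
```

Note that the single blockbuster paper barely moves the index, which is part of why a steady stream of citable papers is what gets rewarded.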
This puts pressure on people to find exciting and citable findings, whereas science usually chucks something back at you showing that there is no difference. This is still a result. It tells you something that you didn’t know before: that your intervention has no effect on what you were looking at. But this pressure to have cool results means that researchers engage in scientifically and ethically dubious practices to get them. A survey of psychologists found the following (among others):
- 0.6% of psychologists surveyed admitted to falsifying data
- 22% rounded off p-values (if you get a result of 0.054 you round to 0.05 to get significance)
- 63.4% said that they did not report all dependent measures
- 55.9% said they analysed their data and then decided whether to gather more
- 38.2% said that they decided whether to exclude data after looking at the effect of doing so.
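To see why analysing first and then deciding whether to gather more data is a problem, here is a rough simulation sketch. Everything in it is my own assumption rather than anything from the lecture or the survey: the data are random numbers with a true mean of zero (so any ‘significant’ result is a false positive), the test is a simple z-test that assumes a known standard deviation of 1, and the sample sizes are arbitrary.

```python
import random
from statistics import NormalDist, fmean

def p_value(sample):
    """Two-sided z-test of 'mean is 0', assuming a known standard deviation of 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(n_experiments=2000, start_n=20, step=10, max_n=60, alpha=0.05, seed=1):
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_experiments):
        # The null hypothesis is true by construction: the real mean is 0.
        sample = [rng.gauss(0, 1) for _ in range(start_n)]
        significant = p_value(sample) < alpha
        fixed_hits += significant  # honest analysis: test once at start_n
        # Peeking: keep adding data until it is significant or max_n is reached.
        while not significant and len(sample) < max_n:
            sample.extend(rng.gauss(0, 1) for _ in range(step))
            significant = p_value(sample) < alpha
        peeking_hits += significant
    return fixed_hits / n_experiments, peeking_hits / n_experiments

fixed_rate, peeking_rate = simulate()
print(f"false positives, fixed sample size: {fixed_rate:.3f}")
print(f"false positives, test then add data: {peeking_rate:.3f}")
```

The fixed-sample rate should sit near the nominal 5%, while the test-then-add-data rate comes out higher, because every extra look at the data is another chance to cross the 0.05 line by luck.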
Of course, not all of this can be pinned down to the pressure placed on scientists from above. Often we come up with our own hypothesis and we want it to be right – it can be disheartening when the data shows the opposite.
I have also observed students repeating an experiment a number of times and excluding the repeats that did not show what they wanted, even if this was the majority of them. If I tossed a coin 10 times and only showed you the tosses I had picked out, I could convince you that my coin only ever landed on heads.
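The coin example is easy to try out. This is only an illustrative sketch with numbers I have picked myself: a fair coin, 1000 repeats of a 10-toss experiment, and a made-up rule that a repeat ‘worked’ if it gave 7 or more heads.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def run_experiment(n_tosses=10):
    """Count heads in n_tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses))

repeats = [run_experiment() for _ in range(1000)]
# Selective reporting: keep only the repeats that 'worked' (7 or more heads out of 10).
reported = [heads for heads in repeats if heads >= 7]

all_rate = sum(repeats) / (10 * len(repeats))
reported_rate = sum(reported) / (10 * len(reported))
print(f"heads rate, all repeats: {all_rate:.2f}")
print(f"heads rate, reported repeats only: {reported_rate:.2f}")
print(f"repeats discarded: {len(repeats) - len(reported)} of {len(repeats)}")
```

The coin is fair, yet the reported repeats alone suggest a coin that lands on heads at least 70% of the time, and most of the repeats had to be thrown away to get there.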
The pressure placed on scientists to get positive results, and the fact that people are not perfect and would prefer their hypothesis to be true, explain some of this, but not enough for people to wilfully commit what is essentially scientific fraud. I personally apply Hanlon’s razor (Never attribute to malice that which is adequately explained by stupidity) and think that a number of these people did not realise how bad what they were doing is (although those who falsified their data have no excuse).
If I am honest, my statistical knowledge is woefully inadequate for the job that I am expected to do, yet I don’t think the majority of my researcher friends would disagree with me if I were to say that I am probably one of the best at statistics in my peer group. While these issues seem obvious, I only became acutely aware of the impact of some of these practices last year, when I had some time off my course and could read whatever I wanted. I wonder how many researchers understand the detrimental effects caused by their shoddy statistics.
David Shanks also attributed the failure to replicate some studies to the type of statistics commonly used in papers. We throw around P values in all our research, yet I don’t think many researchers know what the P value actually represents. Think you do? Go on then!
What is the P value?
- P is the probability that the results are due to chance, the probability that the null hypothesis is true.
- P is the probability that the results are not due to chance, the probability that the null hypothesis is false.
- P is the probability of observing results as extreme as (or more extreme than) those observed if the null hypothesis is true.
- P is the probability of observing results that would be replicated if the experiment was conducted a second time.
- None of these
Answer at the bottom of the post.
There seem to be a lot of mutterings about the fact we should be using Bayesian statistics, not Fisher’s statistics (P values). But I can see a problem with making the switch. If I generated a data set, analysed it and got a P value of 0.04, everyone (bar the statisticians) would be happy. I could give it to my supervisor, potentially get a compliment, include it in my thesis or paper and pass through my viva or the review stages without anyone commenting on the statistics. Perfect! If I were to attempt to use Bayesian statistics, I would first need to go it alone and work out how to do it (and hope that I had done it right) and then justify its use to everyone who came anywhere close to my data. With my knowledge in the area being rather ropey, it would be hard to defend its use. We are all a bit lazy and frequently take the easiest route. I would therefore like to take this opportunity to invite you to berate me, heavily, if you see shoddy statistics in my thesis.
From my observations, statistics is usually taught at degree level by someone who isn’t entirely sure about statistics themselves, thus creating more problems. In addition, even at postgraduate level statistics is viewed as a confusing subject. Yet nothing is done about this. When I worked in industry we had easy access to statisticians to advise us and we had regular stats lectures to try and keep us up to speed. I have not seen anything similar on campus. The normal method seems to be to put the data into GraphPad and see what buttons you can press to get ‘the right result’.
While I don’t know what things are like over the road in chemistry and physics I can’t help but feel that the biosciences (and according to this lecture the social sciences) need to up their game and improve their statistical knowledge.
The good old ‘filing cabinet effect’. As previously mentioned, scientists’ careers rely upon their ability to gain citations (other scientists publishing and talking about their work). As we said, the paper that gets cited is likely to be a ‘good’, ‘positive’ paper.
This means that if some data is generated that is not significant, it can suddenly seem very boring and the motivation is not there to write it up and (pay to) publish it. As a result there is a huge amount of information just sitting there, gathering dust in filing cabinets, or more likely in the depths of a hard drive.
In addition, David Shanks listed examples where people have attempted to reproduce a study, failed, and published the results anyway (as a good scientist should), only to be met with criticism from the original author.
Yet negative results are just as important as positive results. Replication is what science relies upon.
An idea that wasn’t covered in the talk relates to data sharing. It is one thing to falsify a couple of graphs, but the size of the task required to falsify a large data set might be enough to put off some forgers. However, someone who is intent on falsifying data will do this whatever barriers you put in place. What this would prevent (in my opinion) is so-called ‘p-hacking’ (such as rounding down P values) and spurious analysis.
Data sharing has critics. The first argument relates to funding and, as I will say below, science would probably be in a better place if we funded good science (data sharing, transparent protocols and complete reporting) and not good citations. The second relates to the effort required to share this data, but my opinion is that this effort is worth it if it means we do good science and stop churning out unreproducible rubbish.
How to solve?
The obvious answer is to put me in charge? As much as I would like to be dictator of the world, I feel that this might be ethically dubious and unobtainable.
I think one of the key things is to reward good science not good citations (although hopefully they will correlate). We need to stop doing things the way they have always been done, improve our statistical knowledge and try our best to not fall into the traps of data massaging, however appealing it may seem.
A P value is answer number 3. About a month ago I got this question wrong.
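For anyone who wants to see that definition in action (the probability of results at least as extreme as those observed, if the null hypothesis is true), here is a small sketch using a coin. The example of 9 heads in 10 tosses is my own, and the null hypothesis is that the coin is fair.

```python
from math import comb

def binomial_p_value(heads, tosses):
    """Two-sided exact P value under the null hypothesis of a fair coin."""
    observed_ways = comb(tosses, heads)
    # Add up every outcome that is at least as unlikely as the one observed.
    extreme_ways = sum(
        comb(tosses, k) for k in range(tosses + 1) if comb(tosses, k) <= observed_ways
    )
    return extreme_ways / 2 ** tosses

print(binomial_p_value(9, 10))  # 22/1024, about 0.021
```

The 22 comes from counting the outcomes as extreme as 9 heads: 0, 1, 9 or 10 heads, which can happen in 1 + 10 + 10 + 1 ways out of 1024. It says nothing directly about the probability that the coin is fair.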
By Harry Holkham