Statistical analysis divides biologists. There are those of us who like nothing more than to dive into a sparkly new method, and who feel a day has been wasted if it has not involved the interrogation of at least one little model. And then there are normal biologists, for whom stats are, at best, a necessary evil, an arduous process that has to be endured in order to convince others of what you already know to be true. The kind who got into biology to get away from maths, and who have been resentful of its intrusion into their working lives ever since. But, given that biological research without statistics is not really biological research, every tree-hugging, dolphin-bothering biologist must at some point roll up their sleeves and learn some stats. And the question that faces people like me, who have to teach them this stuff that they find difficult and pointless, is: what does a biologist need to know about statistics?
So I was interested to come across (indirectly, via the excellent R-bloggers site) a blog post by Ewan Birney entitled Five statistical things I wished I had been taught 20 years ago. Making lists like these is always going to set you up for a fall, and if I disagree with some of Ewan’s points, that’s not to say I don’t admire him for posting them. And some of what he has to say I agree with entirely.
Of his five, for example, I would stick a big tick next to nos. 2 (learn R) and 4 (understand the difference between a P value and an effect size, with reference – especially relevant for a bioinformatician – to sample size), and I also agree with no. 3 that permutation / randomisation can be a very effective way of making sense of large and badly-behaved datasets.
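To sketch the permutation idea (my own illustration, with made-up data, not an example from Ewan’s post): to compare two groups without leaning on distributional assumptions, shuffle the group labels many times and ask how often a mean difference at least as large as the observed one turns up by chance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up, skewed 'measurements' for two groups of 15
a = rng.lognormal(mean=0.0, sigma=1.0, size=15)
b = rng.lognormal(mean=0.5, sigma=1.0, size=15)

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

# Re-assign group labels at random and recompute the difference
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[len(a):].mean() - shuffled[:len(a)].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

# Two-sided permutation P value (with the usual add-one correction)
p_value = (extreme + 1) / (n_perm + 1)
```

The same recipe works for any test statistic you care to define, which is what makes it so handy for badly-behaved data.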
Regarding number 5, on learning linear models and Principal Components Analysis, I kind of agree. Inasmuch as, yes, properly understanding the basic linear model is probably the single most important piece of statistical knowledge that you can obtain. So much else flows from it: from General to Generalised Linear Models, on to Generalised Additive (Mixed) Models, mixed effects models of many different flavours, Generalised Least Squares, and most other things you will ever need. Along the way it cuts out a lot of confusing terminology, too, by conceptually uniting things like regression, ANOVA and t-tests, which most of us are taught as stand-alone tests. PCA, on the other hand, is a useful little tool for visualising multivariate data, and potentially then for feeding into more interesting analyses, but not much more.
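To make that unification concrete, here is a small sketch of mine (with simulated data) of the fact that the classic equal-variance two-sample t-test is just a linear model with a 0/1 group predictor: the t statistic for the slope is exactly the t-test statistic.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(10.0, 2.0, size=20)   # group A measurements
b = rng.normal(12.0, 2.0, size=20)   # group B measurements

# Classical two-sample t-test (equal variances), computed by hand
na, nb = len(a), len(b)
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
t_classic = (b.mean() - a.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

# The same test as a linear model: y = b0 + b1 * group
y = np.concatenate([a, b])
X = np.column_stack([np.ones(na + nb),
                     np.concatenate([np.zeros(na), np.ones(nb)])])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (len(y) - 2)                  # residual variance
se_b1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])  # std. error of the slope
t_lm = beta[1] / se_b1                              # identical to t_classic
```

The slope here is just the difference in group means, and its standard error works out to the pooled t-test formula, so the two t statistics agree to floating-point precision.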
But I can’t agree with number 1, on the importance of non-parametric statistics. In my experience, people tend to reach for these because they don’t really understand the assumptions of parametric methods like linear models. Almost always, those assumptions concern the residuals, rather than the raw data – so one of your variables not being ‘normal’ is not necessarily cause for alarm – and almost always, there is a parametric technique designed to cope with the specific odd residual distribution that you have, which will (incidentally) be considerably more powerful than its non-parametric (typically rank-based) equivalent. If not, randomisation will generally be more informative than rank-based statistics. In addition, people tend not to have a good understanding of the assumptions made by non-parametric tests themselves (often relating to the variance distribution, the very thing which may have driven you to non-parametrics in the first place!).
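A quick simulated sketch of the residuals point (again my own example): the response below is heavily skewed simply because the predictor is, yet the model is perfectly well specified, because it is the residuals, not the raw variables, that are assumed to be normal.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.lognormal(size=200)                # heavily skewed predictor
y = 2.0 + 3.0 * x + rng.normal(size=200)   # ...but the errors are normal

def skewness(v):
    """Simple moment-based sample skewness."""
    v = v - v.mean()
    return (v ** 3).mean() / (v ** 2).mean() ** 1.5

# y inherits the skew of x, so a normality check on y alone would 'fail'...
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
# ...but the residuals, which are what the linear model actually
# assumes to be normal, come out close to symmetric.
```

Checking `skewness(y)` against `skewness(resid)` makes the asymmetry obvious: the raw response is strongly skewed while the residuals are not, so transforming or rank-ifying `y` here would be solving a problem that doesn’t exist.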
Or, you can follow the advice I read in a ‘Statistics for Social Scientists’ book (and I don’t mean to bash the social sciences here, at all – I know they drive some excellent stats; that just happened to be the book I read this advice in): If the data are not unanimous in support of your hypothesis – change your hypothesis…