Reducing unknowns in marine biodiversity research

I have been interested for a while in trying to catalogue the state of our knowledge of marine biodiversity. This interest can take various forms – previous work, for example, has documented biases in where in the water column we tend to look for marine species (in a nutshell: we look at the surface, or on the sea bed, but hardly ever in the big blue bit in the middle!). But over the last couple of years some colleagues and I have been trying to document the extent of our biological knowledge of UK marine species. By ‘biological knowledge’ I mean the most basic natural history: How big does a species grow? What does it eat? How many offspring does it produce? And by ‘UK marine species’, I’m referring to a good subset (about 1000 species) of the more commonly encountered, large (comparatively speaking – bigger than a millimetre or so, anyway) fish and invertebrates which tend to live on or near the sea bed.

I’m delighted to say that this work¹ has now been accepted for publication in Global Ecology & Biogeography, and I may write more about it when it actually comes out. Suffice to say: for most invertebrate species (and a surprising number of fish), we know little more than how big they get (if that). Basic natural historical knowledge for most marine species in the UK is sorely lacking – and note that we have probably the most intensively studied marine fauna in the world.

The question then is, can we fill these yawning gaps in our knowledge? I can’t see any funder jumping to pour money into 19th Century-style descriptive natural history (although it may be possible to sneak some into more ‘impactful’ proposals…), so the first step is to ensure that we have properly mined the available data. For our study, we largely relied on data already entered into publicly available databases like FishBase and Biotic. But the suspicion remains that more data are out there, on dusty library shelves and in forgotten filing cabinets.

So, last summer my colleague Paul Somerfield from Plymouth Marine Lab and I set Calum Watt, a willing undergraduate student, loose in the MBA’s National Library, with 100 species names randomly chosen from our UK list, and the instruction to find as much biological information as he could – and to time himself doing it.

Now, there are various problems with the design of this ‘time trial’ – for instance, Calum’s species list was a random selection of all our species, but was not randomly ordered. So, because he only got through 25 species, he covered Annelids and most Arthropods, but didn’t get to Cnidarians, let alone Molluscs! His search strategy was deliberately rather haphazard too, and a more targeted approach might have proved more effective. Hell, there’s a reason why this is a blog post, not a paper!

Anyhow, his data have sat on my hard drive for a year, until I managed to find another willing student – Beth Mindel – to make some sense of all the “=NOW()” entries in his spreadsheet, and summarise what he found.

I guess the encouraging thing is that Calum did find some new trait data. He filled at least one gap (and often more) for 16 of the 25 species that he tackled, most frequently adding body size data, but also information on reproductive strategy, feeding method and diet. Also, when he found data on traits that we already thought we knew, in most cases the new information matched, or nearly matched, the existing data, which was reassuring!

A couple of other (not unexpected) patterns emerged: the more time he spent searching, the more information he found and the more gaps he filled. Now, obviously at some point even the MBA library will be exhausted as a source of information, and we wouldn’t expect the accumulation of new information to continue on a linear trajectory indefinitely. But given that we have no idea when it might start to plateau, and given that this is just a bit of fun, we can at least wildly extrapolate from the 26 hours that Calum spent on this project to give a ‘best case’ indication of how long it might take for us to fill all the gaps in our database.

So, assuming that the Annelids and Arthropods are broadly representative of the 825 invertebrate species in our database (and so ignoring the 148 fish species for now), what are the numbers?

We collected data on 8 broadly-defined biological traits, so there are 825 × 8 = 6600 entries to populate. In our study, we had filled in 1630 of these, leaving 4970 gaps to fill. Calum managed to fill in an average of one gap every 29 minutes. So, at this rate of productivity it would take 4970 × 29 = 144,130 minutes, or approximately 2400 hours, to fully populate our database. That is 100 days of round-the-clock work for a student (i.e. discounting sleep, days off, etc.). Or about 65 working weeks if we were to cost it properly, which implies a working week of roughly 37 hours.
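For anyone who wants to check the sums or plug in their own numbers, here is a minimal Python sketch of the extrapolation above. The species, trait, gap and minutes-per-gap figures are the ones quoted in this post; the 37-hour working week is my assumption (it is roughly what the 65-week figure implies), and the function and variable names are made up for illustration. Like the original estimate, it assumes the fill rate stays constant, which it almost certainly would not.

```python
# Back-of-envelope extrapolation of the library 'time trial'.
# Figures quoted in the post:
N_INVERTEBRATE_SPECIES = 825   # invertebrate species in the database
N_TRAITS = 8                   # broadly-defined biological traits per species
ENTRIES_ALREADY_FILLED = 1630  # trait entries filled in the original study
MINUTES_PER_GAP = 29           # average time taken to fill one gap

# Assumption (not stated in the post): length of a costed working week.
HOURS_PER_WORKING_WEEK = 37


def estimate_effort(n_species, n_traits, already_filled, minutes_per_gap,
                    hours_per_week=HOURS_PER_WORKING_WEEK):
    """Extrapolate linearly from the observed fill rate to the whole database."""
    total_entries = n_species * n_traits
    remaining_gaps = total_entries - already_filled
    minutes = remaining_gaps * minutes_per_gap
    hours = minutes / 60
    return {
        "total_entries": total_entries,
        "remaining_gaps": remaining_gaps,
        "minutes": minutes,
        "hours": hours,
        "round_the_clock_days": hours / 24,
        "working_weeks": hours / hours_per_week,
    }


estimate = estimate_effort(N_INVERTEBRATE_SPECIES, N_TRAITS,
                           ENTRIES_ALREADY_FILLED, MINUTES_PER_GAP)

for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Changing `MINUTES_PER_GAP` or the assumed working week shows quickly how sensitive the headline figure is to those two guesses.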

Just goes to show, I think, that regardless of all the online tools now available, and the vast digitising projects underway, making the best use of the work of past generations will still require hard slog for someone in a good old-fashioned library.

¹Full ref: Tyler, Somerfield, Vanden Berghe, Bremner, Jackson, Langmead, Palomares and Webb. Extensive gaps and biases in our knowledge of a well-known fauna: implications for integrating biological traits into macroecology. Glob Ecol Biogeog, in press.