Tom Webb

December 15, 2015

Trait databases: the desirable and the possible

Another major traits database has recently come online. This time, all you need to know about the life histories of 21,000+ amniotes (reptiles, birds, mammals), courtesy of Nathan Myhrvold, Morgan Ernest and colleagues. I’ve been working with ecological traits for a good while now, and this kind of thing excites me. It also demonstrates the kind of self-interested altruism that typifies the Open Science mentality. As Morgan puts it in her blog post on the paper:

The project started because my collaborator, Nathan Myhrvold, and I both had projects we were interested in that involved comparing life history traits of reptiles, mammals, and birds, and only mammals had easily accessible life history databases with broad taxonomic coverage. So, we decided to work together to fix this. To save others the hassle of redoing what we were doing, we decided to make the dataset available to the scientific community.

In other words, you start by fixing a problem that you yourself have, and then make your solution available to save others the bother. Practical and admirable. The same thing is happening elsewhere, with other kinds of ecological data - take the ‘data rescuing’ example of the PREDICTS project:

https://twitter.com/KatheMathBio/status/676448069890744321

(Compare and contrast with this fascinating, frustrating new book by John William Prothero, The Design of Mammals: A Scaling Approach, another monumental data compilation which includes a multitude of intriguing scaling relationships, calculated from 16,000 records for around 100 response variables, almost none of which are replicable from the subsets of data provided in the Resources section online.)

But Morgan’s blog post becomes really interesting as she muses on what the end game might be for traits databases. She proposes a centralised trait database, with a focus on individual records, that is easy to contribute to and where data are easily accessible. We had a short exchange on Twitter after reading this, but I’ve continued mulling it over and my thoughts have expanded past 140 characters. Hence, this.

Basically, I have been trying to imagine what this kind of meta-dataset might look like. And my difficulty in doing this in part boils down to how we define a ‘trait’.

The simplest definition is pretty broad, with a trait just any measurable property of an organism (noting that some ‘traits’ apply only to populations - e.g. abundance - or even entire species - e.g. range size). And my own work, like Morgan’s, has typically focused on life history and ecological traits - things like size, growth, reproduction, and feeding. In some respects these are some of the simplest traits to describe, but they can still be tough to measure, and (especially) to classify and record.

Part of this difficulty arises because much work on traits involves imposing categories on nature, and nature abhors a category. Then again, individuals of the same species can do quite different things, or the same individual might display different traits at different times or in different places. Some people have tried to get around this by using a ‘fuzzy coding’ approach - for instance, rather than having to classify me as ‘carnivore’ or ‘herbivore’, you could say that my diet is split, say, 15% carnivore to 85% herbivore. In many ways this seems a sensible solution, but it is of course rather subjective, still requires some rather arbitrary categorisation, and, in the context of this post, is very difficult to incorporate into a more generic database.
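To make the fuzzy coding idea concrete, here is a minimal sketch in Python of how such a record might be stored. The function names and the record layout (`taxon`, `trait`, `category`, `proportion`) are my own assumptions for illustration, not a format used by any existing trait database.

```python
def fuzzy_code(scores):
    """Turn raw affinity scores per category into proportions that sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one category needs a positive score")
    return {category: score / total for category, score in scores.items()}


def to_long_records(taxon, trait, proportions):
    """Flatten one fuzzy-coded trait into one row per category (hypothetical layout)."""
    return [
        {"taxon": taxon, "trait": trait, "category": category, "proportion": p}
        for category, p in proportions.items()
    ]


# The 15% carnivore to 85% herbivore split from the text.
diet = fuzzy_code({"carnivore": 15, "herbivore": 85})
records = to_long_records("Homo sapiens", "diet", diet)
for row in records:
    print(row)
```

Storing one row per category keeps the table shape fixed however many categories a trait has, but it does nothing about the subjectivity of the scores themselves, or the arbitrariness of the categories.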

Other traits may seem simpler. Body size, for example, the so-called ‘master trait’, is surely easy to measure? You just weigh your organism, right? Or perhaps you measure its length. And is that total length, or wingspan, or leg length, or standard length? Oh, you want dry weight? Or mgC? Equivalent Spherical Volume, you say? And so on and on…

Related to body size are other morphological traits. For instance, my colleague Gavin Thomas and his group are busy 3D scanning beaks of all species of bird (you can help, if you like!). Such sophisticated morphological measurements quickly generate individual-level ‘trait’ databases of many dimensions; how might these be incorporated into a more general database? Record each dimension? Or use some agreed (but somewhat arbitrary) composite measure of ‘shape’?

One more example of a different class of traits. On the Marine Ecosystems Research Programme, I’m working quite a bit with ecosystem modellers, and their lists of desired traits are terrifying: Michaelis-Menten half sat. uptake const.; Excreted ingested fraction; Respired fraction - all things that seem a long way from the sorts of life history databases I’m familiar with, or from things that can be easily observed in the field. Many of them will be body size and temperature dependent (at least). And so what might be most appropriate to record are the parameters from some fitted scaling relationship; but this means losing a lot of raw data, which surely we would like to retain?
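As a rough illustration of that trade-off, the Python sketch below fits a simple power law, rate = a × mass^b, by least squares on log-transformed values and returns the fitted parameters alongside the raw observations rather than instead of them. The power-law form, the function name and the numbers are all assumptions for illustration: the data are synthetic, generated from an exact relationship, and temperature dependence is ignored entirely.

```python
import math


def fit_power_law(masses, rates):
    """Least-squares fit of log10(rate) = log10(a) + b * log10(mass)."""
    xs = [math.log10(m) for m in masses]
    ys = [math.log10(r) for r in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {
        "log10_a": intercept,  # intercept on the log scale
        "b": slope,            # scaling exponent
        "n": n,                # number of observations behind the fit
        # Keep the raw observations with the summary so nothing is lost.
        "raw": list(zip(masses, rates)),
    }


# Synthetic values only: generated from rate = 2 * mass ** 0.75, not measured.
masses = [1.0, 10.0, 100.0, 1000.0]
rates = [2.0 * m ** 0.75 for m in masses]
fit = fit_power_law(masses, rates)
print(fit["b"], 10 ** fit["log10_a"])
```

A database that stored only `log10_a` and `b` would be compact, but nobody could refit the relationship with a different model later, which is exactly the worry.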

And so on.

So whilst of course I applaud the efforts of people like Morgan to make large trait databases more useful and accessible, and agree completely that we should both think big and think individual, a complementary angle of attack might be to make linking existing databases easier. Of course, they need to be available, well documented, and appropriately licensed as a first step, and straightforward programmatic access should be designed in. But we can also make more efforts to link to taxonomic standards and to include accurate geographical and contextual information with individual records, always ensuring our data are nice and tidy so that others can easily do more interesting things with them.
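Here is a minimal sketch of what that linking might look like in Python, assuming each dataset is already a tidy table with one record per row and a `taxon` column. Everything in it is hypothetical: the field names, the synonym lookup (which in practice would come from a taxonomic standard), the placeholder species names ‘Aus bus’ and ‘Aus cus’, and the values.

```python
def normalise_name(name):
    """Collapse whitespace and standardise capitalisation of a binomial name."""
    parts = name.split()
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def link_records(traits, occurrences, synonyms=None):
    """Join trait rows to occurrence rows on an accepted taxon name."""
    synonyms = synonyms or {}

    def accepted(name):
        clean = normalise_name(name)
        return synonyms.get(clean, clean)  # map synonym to accepted name if known

    traits_by_taxon = {}
    for row in traits:
        traits_by_taxon.setdefault(accepted(row["taxon"]), []).append(row)

    linked = []
    for occ in occurrences:
        key = accepted(occ["taxon"])
        for trait_row in traits_by_taxon.get(key, []):
            linked.append({
                "taxon": key,
                "lat": occ["lat"],
                "lon": occ["lon"],
                "trait": trait_row["trait"],
                "value": trait_row["value"],
                "unit": trait_row["unit"],  # units travel with every value
            })
    return linked


# Placeholder names and invented values, for illustration only.
traits = [{"taxon": "aus  bus", "trait": "body_mass", "value": 12.0, "unit": "g"}]
occurrences = [
    {"taxon": "Aus bus ", "lat": 50.0, "lon": -4.0},
    {"taxon": "Aus cus", "lat": 51.0, "lon": -3.0},
]
linked = link_records(traits, occurrences, synonyms={"Aus cus": "Aus bus"})
for row in linked:
    print(row)
```

The join itself is trivial; the work is in the name matching, which is why linking to taxonomic standards matters so much.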

Tom Webb

December 20, 2012

Big data for big ecology

As buzz words go, ‘big data’ is right up there just now. It seems that every question you care to think of, in every field from public policy to evolutionary biology, can be hit with the big data hammer. Add an ‘omics’ or two too, and you’re laughing. So I’m slightly ashamed that we decided to call our workshop at the British Ecological Society’s Annual Meeting ‘Big Data for Big Ecology’. But when I say ‘we’ I mean the BES Macroecology Special Interest Group, and Macroecology is – as its name suggests – ‘big ecology’, so it seemed natural to combine this with the buzz word du jour.

And as it turned out, I think we were vindicated. We held the first of two 1 hour workshops in a room that could comfortably seat 50. Over 100 squeezed in, and we had to turn some people away. So clearly the interest is there, perhaps at least partly because ecological ‘big data’ differ from the data collected in other fields, and we’re still feeling for how best to deal with issues of storage, access, and analysis. This contrasts with some other fields. For instance, sequence data take a pretty standard form, and it’s relatively straightforward to design a system to collate all sequence data – Genbank is testament to this. Ecological data are much more heterogeneous – people measure different things in different systems, there’s no universally agreed common unit of measurement, people work at different spatial scales, in different habitats and environments, and so on. There is also the matter of what we mean by ‘big’. Again, there’s a contrast here with genomics, where a million sequences is now almost a trivially small number. I think in ecology we’re much more likely to be dealing with records in the thousands or hundreds of thousands, so again the computational challenges are different: doing something clever with a large quantity of complex data, rather than with an absolutely huge amount of more simple (or at least, relatively standard) data.

The aim of this first workshop was to introduce a couple of major ecological datasets, then to discuss the issues associated with sharing data. Importantly, by involving figshare, we were able to present some solutions rather than simply rehashing the same old (perceived) problems. I posted a storify of this first hour here, but briefly we heard from Paula Lightfoot, data access officer for the UK’s National Biodiversity Network Trust. The NBN holds >80 million distribution records from around 150 data providers, consisting of almost 800 individual datasets. Data cover a very wide range of taxa, although birds, lepidoptera and flowering plants make up ¾ of the records. The NBN gateway has always been a fantastic public-facing portal to biodiversity data (go and have a play if you want to confirm this), but these data are underused in research. So for me it was particularly interesting to learn about recent improvements to the NBN’s data delivery system to try to address concerns such as those raised by a BES working group involving several of the Macroecology group (including myself and group chair Nick Isaac). Some of the data on the NBN are sensitive or otherwise restricted access, but now you can trigger a single request which goes to all relevant data owners. Likewise, you can download information from multiple datasets as a single text file – which, as ecological data analysts, is often all that we want.

Charly Griffiths from the Marine Biological Association data team then gave an overview of the data holdings in Plymouth, which was really valuable I think to raise awareness of some of these phenomenal datasets among the overwhelmingly terrestrial community of the BES. Things like the Continuous Plankton Recorder data held by SAHFOS, which at >80y is among the longest-running and most spatially extensive ecological time series in existence. Or the Western Channel Observatory data, which is one of the very few long-term datasets to collect information across an entire community (“from genes to fish, from seconds to centuries”).

Then we changed tack, from talking about where we might find data, to what we should do with our own. A quick show of hands revealed that almost everyone in the room had used other people’s data in their work; rather fewer had shared their own data. Mark Hahnel from figshare gave a quick demo to show how easy it can be to share all kinds of outputs – from static figures to code to very large datasets – on the figshare platform, where each output instantly gains a DOI, and thus becomes citable.

Given how easy this process is, why don’t more people share their data? Our discussion identified two main objections. First, people remain highly protective of their data, and suspicious that there are armies of people just waiting for it to become public so that they can do the same analyses (only faster) that the data owner had planned. I think this is understandable – ecological data are often collected in pretty extreme environments, involving a huge amount of hard work, and it is natural to want to get the full benefit of this toil before others are able to profit.

There are two counters to this. First, the idealistic one: in most cases you were paid to collect your data, very often with public money; the data are not yours to hoard; you were not funded to advance your career, but to advance science. Second, more pragmatically: it’s unlikely that many people are especially interested in what you do. Only a small fraction of those who are will have both the time to start to work on your data, and the expertise to do anything useful. Fewer still will be inclined to screw you over, especially (and this is important) if you have taken the step of laying out your stall in public (on figshare or wherever). And academic karma will sort them out soon anyway…

The second issue, that of data ownership, is harder to address, regardless of any mandate to make data available. This is a particular problem for someone like me, who uses other people’s data all the time. The value that I add lies in combining existing datasets and analysing them in novel ways. Often I have had to secure various permissions to use the data in the first place, and the extent to which what I have produced is an original data product is not clear. So while my inclination is to share everything, I do have to be very careful that I’m not sharing anything where I have previously signed an agreement to say that I won’t. Even in these cases though it is still possible to share extensive metadata and the code used to access and analyse the data.

Scott Chamberlain, who delivered the second workshop, touched on some of these kinds of issues, as well as potential solutions. Scott and the rest of the ROpenSci team use APIs to access large datasets, and it is perfectly possible for a data provider to restrict access to their data via this API route. In which case, one can publish a load of R code documenting how data were accessed, manipulated and analysed, which could be replicated by anyone having the same data access privileges that you do (often gained through personal contact with the data provider). This could be a really neat solution to accessing multiply-owned datasets. Scott’s presentation is online here, and if you have any interest in accessing data using R, it is a must read, and highly endorsed by all of the 100 or so of us who were at the workshop (see some of the comments in my second storify).

So where do we go from here? That’s a genuine question: we clearly hit a nerve and got a huge amount of interest, so we want to take it forward. But how? Should we be writing a set of standards for ecological data? A catalogue of existing datasets? A set of tutorials? I appreciate that we are far from the only people interested in this, and don’t want to replicate the efforts of others – so maybe a list of these other efforts would be a good place to start? Any thoughts gratefully received, either in the comments here or via Twitter (@besmacroecol, @tomjwebb, #besbigdata) or our Facebook group.

Tom Webb

August 1, 2012

Defining a Field

What does it take to have a real impact on the development of your field? Those charged with assessing UK research have taken the view that a small number of exceptional papers are a better indicator of quality than a mass of ‘lesser’ papers. Now we can quibble (indeed, I have done) about the way that ‘exceptional’ papers are identified (in particular by risk-averse departments and institutions). Furthermore, I’ve argued that setting out with the intention of writing a ‘high impact paper’ is often antithetical to doing good science. However, that’s beside the point. Regardless of the nuts and bolts of measurement, the idea that one should be judged on quality not quantity seems to be reasonably widely accepted. But if we’re taking a retrospective view, is it always the case that you can trace the development of a field back to one or two highly influential papers? I’ve been pondering this since the first meeting last month of the British Ecological Society’s Macroecology Special Interest Group. (Macroecology is ecology at large spatial scales, by the way, and is what I do. There’s a brief but useful Wikipedia page here.)

As is the nature of such inaugural meetings, first our committee chair Nick Isaac, then our opening keynote speaker, Ian Owens from the Natural History Museum, provided a potted history of the discipline. In so doing, it is common practice to pick a significant publication and trace its subsequent influence. But in macroecology, that’s tricky...

OK, you could pick the 1989 Science paper by James Brown and Brian Maurer which originally coined the term ‘macroecology’. This paper has accrued a satisfying-but-not-stellar 329 cites, but my suspicion is that it’s not actually been that widely read, and that many of the citations run something like “the term ‘macroecology’ was first coined by Brown & Maurer (1989)...”

Or we could focus on the pioneers of UK macroecology, Kevin Gaston and Tim Blackburn. They have forged a formidable partnership, coauthoring 87 papers over a 20 year period, with a phenomenal burst of productivity in the mid to late 1990s which saw them publish as coauthors around 10 papers a year, papers which provided the foundation for much subsequent macroecological work. (Both have been prolific independently of each other too. Sickening, isn’t it?) The point is, though, that it is difficult to pick a single one of these 50 or so papers as being suitable for the ‘what happened since the publication of x’ rhetorical device. Although their work in aggregate has been well cited – 8 papers from that period 1993-2000 have picked up >100 cites – the maximum number of cites for a single work is <300. Which is good, no doubt, but not spectacular.

So what Nick and Ian both did was to pick as their milestones three books, two by Kevin and Tim (Pattern & Process in Macroecology from 2000, which summarised much of their previous five years of work, and the edited volume Macroecology: Concepts & Consequences) and one by Brown (Macroecology, a single-word title which has amusingly been cited in 20 different ways according to ISI WoK, including the antonymous Microecology!). Rich Grenyer Tweeted at the time that this had interesting implications for the high-impact-paper-obsessed REF, but I think it also tells us something about the development of scientific fields more generally.

Of course there are occasions when one or two landmark publications define decades of subsequent research (think Einstein, or Crick & Watson, even the occasional book like Hubbell’s Unified Neutral Theory of Biodiversity). But often it is steady accumulation, the gradual assembly of a body of work which counts. This recognises that it is not always possible – or if possible, not desirable – to force everything of value that you have to say into the strict limits of some of the higher profile journals (and you may not wish to see what you consider to be important analyses buried, probably unread, in supplementary material). This view of science essentially treats the literature as a kind of open notebook – a record of thought processes and incremental progress, rather than a single statement of ultimate truth. And in the case of macroecology, this broad foundation has served us very well.

Perhaps I can make a (non-Olympic) sporting analogy. Cricket exists in several formats, with the extremes being the smash-bang-wallop of Twenty20 (matches last about 3 hours) and the rather more sedate 5 day test matches. A test match batsman will steadily accumulate, and won’t try to hit every ball out of the ground – although if a ball is tempting enough, of course he won’t turn down the opportunity for the big hit. This mix of accumulation and opportunism seems to me to be a much better strategy for ensuring that a field is built on solid foundations than the headline-grabbing, REF-driven, try-to-hit-everything-straight-into-Nature T20 style.

As any cricket fan will tell you, test matches are more substantial and ultimately far more satisfying than any limited overs jamboree. And the occasional 6 is made all the sweeter for its scarcity.