The Clunky Mechanics of Collaborative Writing

Forget the myth of the lone genius: science is a collaborative enterprise, requiring cooperation within large teams of people. Often, this is both a joy and a necessity: many hands make light work in the lab or the field, and collaborations massively extend the scale of research questions we’re able to address. More and more, however, the writing of papers, reports and funding applications is becoming collaborative too, and that’s seldom so pleasant; the best writing is personal, and writing by committee is difficult. There are counter-examples - the teams of writers beavering away on blockbuster US sitcoms spring to mind - but attempting to write something coherent when authors are scattered over multiple institutions, sometimes across many time zones, brings significant challenges.

Although this post has been triggered by a big collaborative writing project that’s been consuming my time recently, it would be inappropriate and hugely unfair to comment on the issues of content that plague all such enterprises; what I want to focus on instead is the actual mechanics of how we write. Some of the orthodox habits of academic collaborative writing seem incredibly frustrating, but I’m unsure how to address them in a way that is accessible to all.

The first issue feels like it should be a non-issue, given that content is platform-independent: what software do you use to write? For almost everyone I’ve ever worked with, the answer is MS Word. My view is that Word is fine for individual writing. Sure, ‘features’ like its habit of second-guessing your formatting choices, or its eccentric decisions about wrapping text around figures, are frustrating. But as a simple typewriter it’s OK, and although I’m trying to shift more of my own writing to Scrivener (I’m using it now), I do still use Word daily. The commenting system in Word - comments and tracked changes - also works fine as long as the number of commenters is small. But in a team of 20 or so, working on a heavily edited document, it becomes very unwieldy and, more importantly, very unstable - I’ve lost so many annotations through crashes.

What are the alternatives? I’ve never really got on with Google Docs, or at least haven’t seen anything that would place it substantially above Word. I know papers are now being written with full version control in GitHub, but even the advocates of this approach admit that, for now, the technical barriers to entry are too high to expect all collaborators to use it. And to be honest, a 20-author document is always going to be unwieldy, regardless of the platform. Perhaps we just have to learn to live with this.

Then there is the profusion of files which, again, inevitably results when lots of people are simultaneously working on different parts of a complex document. How do we track these? Emailing individual docs around is a recipe for disaster, so some kind of file-sharing system needs to be adopted. I’m a big fan of Dropbox, but not everyone is (some institutions actively block it, I’ve recently discovered). On this application we’ve been using MS SharePoint, which I have to admit I’m not a great fan of, perhaps because it’s just more of an effort than Dropbox. Perhaps if we used its full feature set - checking documents out for editing, say, rather than downloading and re-uploading everything - it might have won me over. The upshot, anyway, is that I have both a SharePoint site and a Dropbox folder, each containing close on 200 files with many combinations of initials and dates appended, which must be suboptimal… (Incidentally, sticking a date at the end of a filename is useful, but if you do, the yy-mm-dd format allows for the most logical sorting.)
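The point about yy-mm-dd suffixes is easy to demonstrate: a plain alphabetical sort of filenames is then also a chronological sort, whereas day-first dates jumble the order. A minimal sketch (the filenames are hypothetical, not from the project above):

```python
# Year-first date suffixes: alphabetical order == chronological order.
iso_style = [
    "proposal_13-02-28_AB.docx",
    "proposal_12-11-30_CD.docx",
    "proposal_13-01-17_AB.docx",
]

# Day-first date suffixes: alphabetical order sorts by day of month.
day_first = [
    "proposal_28-02-13_AB.docx",
    "proposal_30-11-12_CD.docx",
    "proposal_17-01-13_AB.docx",
]

print(sorted(iso_style))  # Nov 2012, Jan 2013, Feb 2013 - chronological
print(sorted(day_first))  # Jan 2013, Feb 2013, Nov 2012 - jumbled
```

The same logic is why any file browser, Dropbox folder or SharePoint listing sorted by name keeps year-first files in date order for free.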

That’s a lot of problems raised, with little in the way of solutions. So if you have any killer tips, do let me know! But I will finish with two more constructive points. First, I said I didn’t want to talk about content, but I think there is one thing worth emphasising: Nothing is Sacred. There is no place in a piece of collaborative writing for egos and intransigence. By all means argue your corner if you feel a coauthor has completely missed your point. But think: if your coauthor doesn’t get it, there’s no chance that a reviewer will, or any other reader. So don’t just dismiss their concern; take their criticism on board, re-work your text, and get back to them. In this sense, collaboration is the first stage of peer review. (A related aside on comments and tracked changes, in the context of multi-author documents: comments can be useful if they take the form of “X, please write two sentences on that thing you know about here…”. But before adding comments like “This needs more detail” or “This doesn’t work for me”, stop and think whether you could preempt your own comment by actually editing the document.)

Second, a couple of formatting tips. Every proposal I’ve ever been involved with has ended up bouncing off the page limit, so it’s useful to be able to make maximum use of the available space. One tip I got from Twitter (whoever it was, thanks!) is to turn on automatic hyphenation in Word (Tools > Hyphenation…), which gains a surprising amount of space. Apart from that, I’m a bit wary of pushing font size and margins right to the limit, simply because confronting your reviewers with massive blocks of dense text is not going to do you any favours. The key, then, is to edit, edit, and edit again (see also the point above about nothing being sacred).

I take a strategic approach to this: any paragraph that ends with a line less than half full is ripe for reduction by a line, easily, simply by cutting unnecessary words and rephrasing. Be really harsh on repetition, verbiage, impressive-but-meaningless words, and repetition. (See what I did there?) Use short words like ‘use’; there’s no need to utilise longer synonyms like ‘utilise’. This will help readability too. Funnily enough, this is an area where Twitter really helps, as it gets you used to fitting ideas into a strictly delimited space.

So anyway. I now open this to input from coauthors…

On endlings and singletons

There can be few words as poignant as ‘endling’, the name given to the last surviving individual of a species. Tell me you don’t find this image of Benjamin, the last Thylacine, heartbreaking? Or that you weren’t moved by the plight of Lonesome George? And what about Martha, the passenger pigeon? Doesn't her story make you weep at our limitless environmental profligacy? But what links all of these endlings is that we know they were once members of a thriving population - in Martha’s case, one of the most abundant vertebrate species on the planet. There are other species which are known only from a single individual. These species, perhaps lacking the poignancy of endlings but arguably more significant to students of biodiversity, are known as 'singletons'.

Singletons of course are to be expected when your survey area is small, or if you look for only a short period of time. If I counted the birds in my garden for an hour tomorrow morning, I’d expect to see multiple individuals of a few species - magpies, wood pigeons, house sparrows - but I’d be very surprised to see more than one sparrowhawk or wren. However, if I extended my search to my whole street, or to the whole of Sheffield, the number of singletons would drop off precipitously. And at the scale of the UK, over the course of an entire summer, any breeding bird species by definition must be represented by at least two individuals, so there should be no singletons at all in our core avifauna. Any that remain can be dismissed as shivering vagrants blown across the Atlantic, ornithological curiosities but ecologically insignificant.

Enter the sea, though, and the number of singletons remains stubbornly high, even when we enormously expand our study region. For instance, in an analysis of European benthic invertebrates I did a few years ago, I found that about 10% of the species in our very large database (2,300 species sampled from >15,000 locations throughout European seas) were singletons. Similar patterns appear when interrogating the Ocean Biogeographic Information System (OBIS) database I blogged about recently. For instance, OBIS contains records for >11,500 species occurring in the seas around Britain, yet over 45% of these are represented by a single record. At the global scale, 20% of the almost 80,000 marine animal species which occur in OBIS are singletons.
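The singleton tallies above boil down to a very simple count over occurrence records: tally records per species name, then keep the species seen exactly once. A minimal sketch (the records here are made up for illustration, not drawn from OBIS):

```python
from collections import Counter

# One entry per occurrence record: the (accepted) species name observed.
records = [
    "Asterias rubens", "Mytilus edulis", "Asterias rubens",
    "Arctica islandica", "Mytilus edulis", "Echinus esculentus",
]

# Tally records per species, then keep species recorded exactly once.
tally = Counter(records)
singletons = [species for species, n in tally.items() if n == 1]

print(len(singletons), "of", len(tally), "species are singletons")
# → 2 of 4 species are singletons
```

In practice the crucial step comes before the counting: standardising names against a register such as the World Register of Marine Species, so that a species recorded under two synonyms doesn’t masquerade as two singletons.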

What's happening here? Are these singletons simply very rare species? We expect most species to be rare, but do our surveys of marine habitats really cover so small an area that we never pick up their conspecifics? And if this is so, what does this mean for marine ecosystem functioning? Do these rare species play a role? Individually, maybe not - Kevin Gaston has written on the dominance of common species in terms of numbers, biomass, and probably ecosystem functioning, in most communities. But collectively the singletons in a sample can be abundant, and if there were particular biological characteristics associated with being a singleton, then this could be significant. Unfortunately, about the one generalisation we can make about rarely observed marine species is that we know little about their biology, so we’re not yet in a position to answer this question.

An alternative explanation is that many singletons are mistakes. When analysing diversity surveys, we can take steps to ensure that taxonomic names are consistent, for instance by using the World Register of Marine Species to ensure that we use the accepted name for each species and not one of its (often many) synonyms (I've done that in the cases mentioned above). But what if the person who sorted the sample simply got their identification wrong? There’s not much we can do about that kind of mistake, although one would hope that errors of identification are not so frequent as to explain the very high prevalence of singletons.

Probably we won’t know the answers to such questions until sampling of a few large marine ecosystems reaches a sufficient intensity that we can have confidence that surveys accurately reflect the composition and relative abundance of the species present. For now, we can at least use the presence of singletons to tell us something about how far away from such complete knowledge we are. As I suggested in my last post, in certain marine systems the answer to this is: a very long way indeed. In the meantime, we are becoming more and more aware of the threats facing many marine species. We must hope that the singletons we find in our surveys are only statistical loners, the first observed rather than the last remaining individual. If they do in fact represent the Benjamins, Marthas and Lonesome Georges of their kind, then marine biodiversity is in more trouble than we thought.

The big blue bit in the middle: still big, still blue

Last week, I had the dubious pleasure of revisiting some work I did over three years ago. Back then, as the Census of Marine Life was in its final stages, I got together with Edward Vanden Berghe, then managing the Ocean Biogeographic Information System (OBIS), to investigate the suspicion of CoML senior scientist Ron O’Dor that surveys of marine biodiversity largely overlooked ‘the big blue bit in the middle’ – the deep pelagic ocean, by far the largest habitat on Earth. The idea that Edward and I hit on was to use OBIS to produce a plot that would show whether Ron was right. OBIS contained at that time around 20 million records, with each record representing the occurrence of a specific species in a particular location. Only around 7 million of these also recorded the depth at which the species had been found, but by comparing the depths of these 7 million samples with a global map of ocean depth, we were able to place each of them at a position in the water column. As you can see (and as we showed more rigorously in the resulting paper), Ron was right: in all regions of the ocean, biodiversity records from midwater are far less common than those from the sea bed, or those from surface waters.
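The placement step can be sketched very simply: each record carries a sampling depth, and comparing it with the seafloor depth at that location tells you whether the sample came from near the surface, from midwater, or from near the bottom. The sketch below is only illustrative - the 200 m surface cutoff and the 90%-of-seafloor-depth bottom zone are assumptions for the example, not the scheme used in the published analysis:

```python
def water_column_zone(sample_depth_m, seafloor_depth_m,
                      surface_limit_m=200, bottom_fraction=0.9):
    """Classify an occurrence record's position in the water column.

    Thresholds are illustrative assumptions, not the published values.
    """
    if sample_depth_m <= surface_limit_m:
        return "surface"          # sunlit upper layers
    if sample_depth_m >= bottom_fraction * seafloor_depth_m:
        return "seabed"           # sampled at or near the bottom
    return "midwater"             # the big blue bit in the middle

# Three hypothetical records over a 4,000 m deep water column:
print(water_column_zone(50, 4000))    # surface
print(water_column_zone(1500, 4000))  # midwater
print(water_column_zone(3900, 4000))  # seabed
```

Applied to millions of records, a classification like this is what lets you ask how many observations fall into each zone - and hence how empty the midwater bin is.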

Fig 1. Global distribution within the water column of recorded marine biodiversity, using approximately 7 million occurrence records extracted from OBIS in 2009. The horizontal axis splits the oceans into five zones on the basis of depth, with the width of each zone on this axis proportional to its global surface area. The vertical axis is ocean depth, on a linear scale. The inset shows in greater detail the continental shelf and slope, where the majority of records are found. Note this is slightly different from the version previously published, as it is scaled to the 2013 range of data.

We discussed the implications of this chronic under-sampling of the world’s biggest ecosystem in our paper, but when talking about this work I prefer to quote from another paper, by Bruce Robison:

The largest living space on Earth lies between the ocean’s sunlit upper layers and the dark floor of the deep sea… Within this vast midwater habitat are the planet’s largest animal communities, composed of creatures adapted to a… world without solid boundaries. These animals probably outnumber all others on Earth, but they are so little known that their biodiversity has yet to be even estimated.

Since our paper came out, I have continued to use OBIS data in my research attempting to describe and explain the distribution of diversity in our oceans. At the same time, OBIS has changed too, both structurally - it’s moved from Rutgers in NJ to the IOC Project office for IODE in Ostend - and in terms of its content, now housing over 35 million records, including almost 19 million which recorded sample depth.

So back to last week, and an email from the current manager of OBIS, Ward Appeltans, asking if I might be able to update the figure from our 2010 paper with new OBIS data.

With some trepidation, I opened up the file of R code I’d used for the original analysis. And got a pleasant surprise: it was readable! Largely this was because I had submitted it as an appendix to our paper, and so had taken more care than usual to annotate it carefully. I think this demonstrates an under-appreciated virtue of sharing data and code: in preparing them so that they are comprehensible to others, you make them much more useful to your future self. This point is nicely made in a new paper by Ethan White and colleagues on making data easier to use and reuse.

So, rather than days of fiddling, I was able to get the code up and running with new data really quite quickly. Of course, there were a few minor bugs to sort out. One thing I always do with R code now, but didn’t at the time, is to insert the command rm(list = ls()) at the top, to clear my workspace. The fact my old code didn’t work immediately was, I think, down to my failure to do this: the code required an object that was clearly hanging around in my workspace at the time. But it was simply a matter of correcting a name inside a function and it all worked fine. (Actually, one thing still doesn’t work well, which is getting the figure from R into a satisfactory, scalable vector format that looks nice in other packages - the PDF looks OK (but not great) in Preview but awful viewed in Acrobat, for example - but that’s another story…)

What happens, then, to our view of the depth distribution of marine biodiversity knowledge when we increase the number of observations from 7 million to 19 million?

Fig 2. Figure 1 updated to use the c. 19 million suitable occurrence records available in OBIS in 2013.

Actually, rather little: the overall pattern is pretty much the same, with far more records from shallow than deep seas, and a paucity of midwater records at all depths. The big blue bit in the middle remains both big and blue.

Postscript: Ward at OBIS emailed me to suggest that this post comes across a bit on the negative side, which was certainly not my intention. Even back in 2009 OBIS was a phenomenal resource for marine biodiversity research; the fact that in under 4 years, the number of useful records for my analysis has increased >2.5x is amazing. My view is still that the big blue bit in the middle remains both big and blue, but it's very heartening to see the fingers of yellow extending further and further into the colossal deep pelagic ocean. It would be nice to think that our data visualisation exercise has had something to do with this!