A biologist playing the numbers game

A biologist playing the numbers game

Biologists generally dislike numbers, as a rule. Probably because numbers require you to do all the work before anything interesting happens. Numbers don’t metabolise, or synthesise, or secrete, or replicate. They don’t behave in different ways under the same conditions. They are, in a word, reliable. We like what they represent, but we don’t like that they are an abstraction of what actually interests us about biology.

But this is 2017. Gone are the days of Leidy, Manson and Darwin where a biologist could spend their life avoiding the numbers game and still rise to the top of their field. Biology means data, and data means statistics! At some point, every young biologist goes through the realisation that they have to bite the bullet and actually learn a bit of stats in R (or MatLab, if you would rather use something with a price tag), rather than subsist on the vague idea of statistical analysis all those modules you took furnished you with. After all, wouldn’t it be nice to be an author on one of those nice shiny papers with all that important looking multivariate analysis in it.

So what to do? Well go on a course of course. I did exactly that. I found myself an interesting and relevant looking course ran by PR Statistics, looking at analysis of population genomics data in R. The course took me through how to use various packages available in R, particularly Adegenet [1], to reveal structure in your data and was instructed by the developers behind Adegenet: Thibaut Jombart & Zhian N Kamvar, who were as knowledgeable and skilled instructors as you could encounter. If you perform statistical analysis on allelic frequencies in population data sets, then I would highly recommend this package, it contains everything that you could need to elucidate even the most subtle structure in data. Set in the almost idyllic location of Margam country park, east of Swansea, it was a week which did not leave me, or any of my course mates (many of whom had travelled from as far as the USA) wanting. I feel compelled to also mention the cake that the cooks set out for us every day, which resulted in me leaving Margam a few pounds heavier, as well as a week wiser. If this course is representative of all courses ran by PR Statistics, then I can highly recommend them.
So armed with my new, more informed view on statistical analysis in R, I can go forth and see what I can make of my own data sets and see if I can’t produce some of those oh so aesthetic graphs myself. As it happens, I quite like numbers now.

Many thanks to Oliver Hooker for organising the course, and to Thibaut and Zhian for their expert instruction.

By Arthur Morris

Home Page

[1] T. Jombart, “Adegenet: A R package for the multivariate analysis of genetic markers,” Bioinformatics, vol. 24, no. 11, pp. 1403–1405, 2008.

Workarounds, KISS and the dangers of overcomplicating things

Workarounds, KISS and the dangers of overcomplicating things


Around 6 months ago I was asked to look into the program PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) for a data set produced by an amplicon sequencing run on the Ion Torrent. Basically, PICRUSt takes a 16S OTU table that has been classified (taxonomically, not some top-secret government thing, unfortunately) and uses information from sequenced genomes of known or closely related organisms to predict the potential genomic and metabolic functionality of the community identified in the 16S dataset. As complex as the process sounds, in reality, it actually only consists of three steps: Normalisation of the dataset by 16S copy number, this corrects for any potential under/over representation of functionality due to variation in the number of copies of the 16S gene in different bacterial genomes; prediction of the metagenome, basically multiplying the occurrences of functional genes, in this case KOs (KEGG orthologs), within the known genomes by the corrected OTU table from the previous step; finally categorising by function, collapsing the table of thousands of functional KOs into hierarchical KEGG pathways:


Despite my background in sequencing, population genetics and phylogenetics, and having learned/taught myself many different analysis packages and programs over the course of my career, and having a solid, reliable, method of producing and analysing OTU tables from the data obtained from the ion torrent and other sequencing platforms, I’ve never considered myself as a bioinformatician. But four steps should be easy enough… right?

The little yellow arrow in the workflow now represents around 2-3 months of probably the steepest learning curve I have ever ventured on to.

OTU tables are simply a table of counts of OTUs (operational taxonomic units i.e. species/observations etc.) for each sample within your dataset. Despite their simplicity, the method used by myself and others in the research group to construct the OTU table was different to that in the online PICRUSt guide and the information contained therein, also different. I could already sense the increase in learning curve gradient, but carried on forward anyway not quite realising the dangers that lay ahead!

Operating system (UNIX/Windows) cross-compatibility, version control, system resources, version control, new programming languages, version control, more system resources, overloading system resources, complete system failure, starting from scratch again, installation-and-compilation-of-new-algebraic-libraries-for-your-systems-mathematical-calculations, version control, new programming languages, manual editing of enormous databases, scripts, packages and version control. These points are a hint at some of the processes I went through and the problems I had to deal with, in the creation of what I now call ‘The Workaround’. I still don’t consider myself a bioinformatician.

‘The Workaround’
The workaround consists of a small number of R scripts and processes for the reformatting and production of new files that are so simple they belie the amount of work that went into producing them. The steps are straightforward enough that anyone with any sort of experience of working with OTU tables or sequencing data should be able to complete them. The entire workflow is robust and repeatable and I have since worked with a few different ways of visualising and representing the data for publication.

pic 2
Using STAMP to identify SEED subsystems which are differentially abundant between Candidatus Accumulibacter phosphatis sequences obtained from a pair of enhanced biological phosphorus removal (EBPR) sludge metagenomes(data originally described in Parks and Beiko, 2010).

PICRUSt appears to becoming an ever more popular tool in the analysis of microbiomes and one that compliments many of the studies and analyses already performed by many of the members of our research group. I am currently in the process of writing up ‘The Workaround’ into a step-by-step guide to be placed on the bioinformatics wiki for anyone to access but in the meantime if anyone would like to speak to me about the possibilities of applying this type of analysis to any existing or future experiments I’m more than happy to help!

Post by Toby Wilkinson.
About Toby:
I am a postdoc and perpetual resident of Aberystwyth, having come here as an undergrad in 2002 to study Zoology, worked through a PhD in parasitology starting in 2005 and then holding various positions as technician/research assistant/PDRA since, I’ve never quite been able to bring myself to leave Aberystwyth. Over the last few years I’ve worked in various roles in the Herbivore Gut Environment group working on the microbiome of ruminants building up my experience in NGS and bioinformatics, and more recently with Sharon Huws on the further characteristaion of novel antimicrobial peptides, but also continuing work in NGS and the study of the dynamics of various bacterial communities in a number of environments.