Workarounds, KISS and the dangers of overcomplicating things


Around 6 months ago I was asked to look into the program PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) for a data set produced by an amplicon sequencing run on the Ion Torrent. Basically, PICRUSt takes a 16S OTU table that has been classified (taxonomically, not some top-secret government thing, unfortunately) and uses information from sequenced genomes of known or closely related organisms to predict the potential genomic and metabolic functionality of the community identified in the 16S dataset. As complex as the process sounds, in reality it consists of only three steps. First, normalisation of the dataset by 16S copy number, which corrects for any under- or over-representation of functionality caused by variation in the number of copies of the 16S gene in different bacterial genomes. Second, prediction of the metagenome, which basically multiplies the occurrences of functional genes, in this case KOs (KEGG orthologs), within the known genomes by the corrected OTU table from the previous step. Finally, categorisation by function, which collapses the table of thousands of functional KOs into hierarchical KEGG pathways.
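The three steps can be sketched in a few lines of Python. This is only a toy illustration of the arithmetic described above, not PICRUSt's own code: the OTU names, copy numbers, KO labels and pathway names are all invented, and the sketch assumes each KO belongs to exactly one pathway, which is a simplification.

```python
# Toy illustration of the three steps; every name and number here is invented.

# OTU table: counts of each OTU in each sample.
otu_table = {
    "OTU_1": {"sample_A": 10, "sample_B": 4},
    "OTU_2": {"sample_A": 6, "sample_B": 0},
}

# Hypothetical 16S copy number for the genome behind each OTU.
copy_number = {"OTU_1": 2, "OTU_2": 3}

# Hypothetical number of copies of each KO in each genome.
ko_per_genome = {
    "OTU_1": {"KO_a": 1, "KO_b": 2, "KO_c": 0},
    "OTU_2": {"KO_a": 0, "KO_b": 1, "KO_c": 3},
}

# Hypothetical KO-to-pathway lookup (one pathway per KO, for simplicity).
ko_to_pathway = {"KO_a": "Pathway_X", "KO_b": "Pathway_Y", "KO_c": "Pathway_Y"}


def normalise_by_copy_number(otu_table, copy_number):
    """Step 1: divide each OTU count by that OTU's 16S copy number."""
    return {
        otu: {sample: count / copy_number[otu] for sample, count in samples.items()}
        for otu, samples in otu_table.items()
    }


def predict_metagenome(normalised, ko_per_genome):
    """Step 2: multiply normalised abundances by KO counts and sum over OTUs."""
    predicted = {}
    for otu, samples in normalised.items():
        for ko, copies in ko_per_genome[otu].items():
            row = predicted.setdefault(ko, {})
            for sample, abundance in samples.items():
                row[sample] = row.get(sample, 0.0) + abundance * copies
    return predicted


def collapse_to_pathways(predicted, ko_to_pathway):
    """Step 3: sum the KO rows that belong to the same pathway."""
    collapsed = {}
    for ko, samples in predicted.items():
        row = collapsed.setdefault(ko_to_pathway[ko], {})
        for sample, value in samples.items():
            row[sample] = row.get(sample, 0.0) + value
    return collapsed


normalised = normalise_by_copy_number(otu_table, copy_number)
predicted = predict_metagenome(normalised, ko_per_genome)
pathways = collapse_to_pathways(predicted, ko_to_pathway)
print(pathways)
# {'Pathway_X': {'sample_A': 5.0, 'sample_B': 2.0}, 'Pathway_Y': {'sample_A': 18.0, 'sample_B': 4.0}}
```

In this toy case OTU_1 is halved and OTU_2 is divided by three before any functional counts are multiplied in, which is exactly the correction for copy number that the first step is there to make.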


Despite my background in sequencing, population genetics and phylogenetics, having taught myself many different analysis packages and programs over the course of my career, and having a solid, reliable method of producing and analysing OTU tables from the Ion Torrent and other sequencing platforms, I’ve never considered myself a bioinformatician. But four steps should be easy enough… right?

The little yellow arrow in the workflow now represents around 2-3 months of probably the steepest learning curve I have ever ventured on to.

OTU tables are simply tables of counts of OTUs (operational taxonomic units, i.e. species, observations etc.) for each sample within your dataset. Despite their simplicity, the method used by myself and others in the research group to construct the OTU table was different to that in the online PICRUSt guide, and the information contained in the resulting table was different too. I could already sense the learning curve steepening, but carried on regardless, not quite realising the dangers that lay ahead!
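To make the idea of an OTU table concrete, here is a minimal Python sketch that builds one from a list of per-read assignments. The sample and OTU names are invented, and this is not the method used by the research group or by the PICRUSt guide; it only shows the shape of the data.

```python
from collections import Counter

# Hypothetical per-read assignments as (sample, OTU) pairs, invented for illustration.
assignments = [
    ("sample_A", "OTU_1"),
    ("sample_A", "OTU_1"),
    ("sample_A", "OTU_2"),
    ("sample_B", "OTU_1"),
]


def build_otu_table(assignments):
    """Count how often each OTU was observed in each sample."""
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in assignments})
    otus = sorted({otu for _, otu in assignments})
    # A Counter returns 0 for pairs that were never observed.
    return {otu: {sample: counts[(sample, otu)] for sample in samples} for otu in otus}


table = build_otu_table(assignments)
for otu, row in table.items():
    print(otu, row)
# OTU_1 {'sample_A': 2, 'sample_B': 1}
# OTU_2 {'sample_A': 1, 'sample_B': 0}
```

Each row is an OTU and each column a sample, with zeros where an OTU was not seen in a sample.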

Operating system (UNIX/Windows) cross-compatibility, version control, system resources, version control, new programming languages, version control, more system resources, overloading system resources, complete system failure, starting from scratch again, installation-and-compilation-of-new-algebraic-libraries-for-your-system’s-mathematical-calculations, version control, new programming languages, manual editing of enormous databases, scripts, packages and version control. These points hint at some of the processes I went through, and the problems I had to deal with, in creating what I now call ‘The Workaround’. I still don’t consider myself a bioinformatician.

‘The Workaround’
The workaround consists of a small number of R scripts and processes for the reformatting and production of new files that are so simple they belie the amount of work that went into producing them. The steps are straightforward enough that anyone with any sort of experience of working with OTU tables or sequencing data should be able to complete them. The entire workflow is robust and repeatable and I have since worked with a few different ways of visualising and representing the data for publication.

Figure 2: Using STAMP to identify SEED subsystems which are differentially abundant between Candidatus Accumulibacter phosphatis sequences obtained from a pair of enhanced biological phosphorus removal (EBPR) sludge metagenomes (data originally described in Parks and Beiko, 2010).

PICRUSt appears to be becoming an ever more popular tool in the analysis of microbiomes, and one that complements many of the studies and analyses already performed by members of our research group. I am currently writing up ‘The Workaround’ as a step-by-step guide to be placed on the bioinformatics wiki for anyone to access. In the meantime, if anyone would like to speak to me about applying this type of analysis to any existing or future experiments, I’m more than happy to help!

Post by Toby Wilkinson.
About Toby:
I am a postdoc and perpetual resident of Aberystwyth. I came here as an undergrad in 2002 to study Zoology, worked through a PhD in parasitology starting in 2005, and have held various positions as technician, research assistant and PDRA since; I’ve never quite been able to bring myself to leave Aberystwyth. Over the last few years I’ve worked in various roles in the Herbivore Gut Environment group on the microbiome of ruminants, building up my experience in NGS and bioinformatics. More recently I have been working with Sharon Huws on the further characterisation of novel antimicrobial peptides, while continuing work in NGS and the study of the dynamics of various bacterial communities in a number of environments.