Workarounds, KISS and the dangers of overcomplicating things

Around 6 months ago I was asked to look into the program PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) for a data set produced by an amplicon sequencing run on the Ion Torrent. Basically, PICRUSt takes a 16S OTU table that has been classified (taxonomically, not some top-secret government thing, unfortunately) and uses information from sequenced genomes of known or closely related organisms to predict the potential genomic and metabolic functionality of the community identified in the 16S dataset. As complex as the process sounds, in reality it consists of only three steps:

1. Normalisation of the dataset by 16S copy number, which corrects for any potential under- or over-representation of functionality due to variation in the number of copies of the 16S gene in different bacterial genomes.
2. Prediction of the metagenome: essentially multiplying the occurrences of functional genes, in this case KOs (KEGG orthologs), within the known genomes by the corrected OTU table from the previous step.
3. Categorisation by function: collapsing the table of thousands of functional KOs into hierarchical KEGG pathways.
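At heart, those three steps are just table arithmetic. Here is a minimal sketch in Python with invented toy numbers (the OTU counts, copy numbers, per-genome KO contents and pathway assignments are all made up for illustration; PICRUSt itself works on full BIOM-format tables and the KEGG database):

```python
# Toy OTU table: rows = OTUs, columns = samples (raw counts); all numbers invented
otu_counts = [[100, 50],   # OTU1
              [ 30, 60]]   # OTU2

# Step 1: normalise by 16S copy number per OTU
copy_number = [4, 1]       # e.g. OTU1's genome carries four 16S copies
normalised = [[c / n for c in row] for row, n in zip(otu_counts, copy_number)]

# Step 2: predict the metagenome -- multiply the KO counts in each known
# genome (ko_per_genome[otu][ko]) into the corrected OTU table
ko_per_genome = [[2, 0, 1],             # KOs present in OTU1's genome
                 [0, 3, 1]]             # KOs present in OTU2's genome
n_samples, n_kos = len(otu_counts[0]), len(ko_per_genome[0])
metagenome = [[sum(ko_per_genome[otu][ko] * normalised[otu][s]
                   for otu in range(len(normalised)))
               for s in range(n_samples)]
              for ko in range(n_kos)]

# Step 3: categorise by function -- collapse KOs into (made-up) pathways
pathway_of_ko = [0, 0, 1]  # KO1, KO2 -> pathway A; KO3 -> pathway B
pathways = [[0.0] * n_samples for _ in range(max(pathway_of_ko) + 1)]
for ko, pw in enumerate(pathway_of_ko):
    for s in range(n_samples):
        pathways[pw][s] += metagenome[ko][s]

# pathways is now [[140.0, 205.0], [55.0, 72.5]]
```

In PICRUSt itself, these steps correspond to the normalize_by_copy_number.py, predict_metagenomes.py and categorize_by_function.py scripts.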


Despite my background in sequencing, population genetics and phylogenetics, having learned/taught myself many different analysis packages and programs over the course of my career, and having a solid, reliable method of producing and analysing OTU tables from data obtained from the Ion Torrent and other sequencing platforms, I’ve never considered myself a bioinformatician. But four steps should be easy enough… right?

The little yellow arrow in the workflow now represents around 2–3 months of probably the steepest learning curve I have ever ventured onto.

OTU tables are simply tables of counts of OTUs (operational taxonomic units, i.e. species/observations etc.) for each sample within your dataset. Despite their simplicity, the method used by myself and others in the research group to construct the OTU table was different to that in the online PICRUSt guide, and the information contained in the resulting tables was different too. I could already sense the learning curve gradient increasing, but carried on forward anyway, not quite realising the dangers that lay ahead!

Operating system (UNIX/Windows) cross-compatibility, version control, system resources, version control, new programming languages, version control, more system resources, overloading system resources, complete system failure, starting from scratch again, installation-and-compilation-of-new-algebraic-libraries-for-your-systems-mathematical-calculations, version control, new programming languages, manual editing of enormous databases, scripts, packages and version control. These points are a hint at some of the processes I went through and the problems I had to deal with, in the creation of what I now call ‘The Workaround’. I still don’t consider myself a bioinformatician.

‘The Workaround’
The workaround consists of a small number of R scripts and processes for the reformatting and production of new files that are so simple they belie the amount of work that went into producing them. The steps are straightforward enough that anyone with any sort of experience of working with OTU tables or sequencing data should be able to complete them. The entire workflow is robust and repeatable and I have since worked with a few different ways of visualising and representing the data for publication.

Using STAMP to identify SEED subsystems which are differentially abundant between Candidatus Accumulibacter phosphatis sequences obtained from a pair of enhanced biological phosphorus removal (EBPR) sludge metagenomes (data originally described in Parks and Beiko, 2010).

PICRUSt appears to be becoming an ever more popular tool in the analysis of microbiomes, and one that complements many of the studies and analyses already performed by members of our research group. I am currently in the process of writing up ‘The Workaround’ into a step-by-step guide to be placed on the bioinformatics wiki for anyone to access, but in the meantime, if anyone would like to speak to me about the possibilities of applying this type of analysis to any existing or future experiments, I’m more than happy to help!

Post by Toby Wilkinson.
About Toby:
I am a postdoc and perpetual resident of Aberystwyth, having come here as an undergrad in 2002 to study Zoology, worked through a PhD in parasitology starting in 2005, and then held various positions as technician/research assistant/PDRA since; I’ve never quite been able to bring myself to leave Aberystwyth. Over the last few years I’ve worked in various roles in the Herbivore Gut Environment group on the microbiome of ruminants, building up my experience in NGS and bioinformatics, and more recently with Sharon Huws on the further characterisation of novel antimicrobial peptides, while also continuing work in NGS and the study of the dynamics of various bacterial communities in a number of environments.

Why Are Some Virus Capsids So Geometric?

Keywords – viral capsid, assembly, symmetric, geometric, icosahedron, subunits, pentamers, hexamers

Upon doing some literature searching for phages, I came across a paper written in 1967 on the topic of ultrastructure of phages. Scrolling through, there were some subjectively pretty microscopy images of infecting phages and other diagrams. The diagram of phage head shapes in particular caught my eye. I began to think about how nice and satisfying the symmetry and geometry is in bacteriophages. I then wondered why phage heads have this characteristic and what advantages it has.

The general structure of a phage head:
A phage head is formed of either two or three parts. All freshly formed virions have a core of genomic material, which can be double- or single-stranded RNA or DNA. This is surrounded by the capsid: a proteinaceous coat formed of a number of identical subunits, which may themselves be formed of even smaller molecular subunits. These subunits are called capsomeres [1].

Why the patterns?
It seems I am not the only one to question this, and in fact the question was almost fully answered by Crick and Watson back in the 1950s. They were studying small viruses, and hypothesised that the virus requires the protein coat to protect its genomic material. The best and most efficient way to do that is to form the coat from lots of small identical molecular subunits. These are then easier to produce when inside the host cell than say, one or two large molecules. These subunits also have the added advantage that they can only arrange themselves in so many ways around the core to create a shell [2].

So, this explains why capsids tend to have such a regular shape. But what other advantages does this confer?
Perhaps this can be put down to evolutionary adaptation. The easier the molecule is to reproduce, the more virion offspring are produced. It also makes sense that identical subunits can only attach in so many ways, and that this results in a pattern seen in all those offspring. But this formation seems to work like puzzle pieces, in that the method does not require any energy [3]. This is ideal for spontaneous formation of virion capsids in the host cell, as the virus does not have to concern itself with sequestering energy from elsewhere.

Phage heads tend to take on the shape of a Platonic solid: one of five regular shapes, of which only octahedrons and icosahedrons have been seen using microscopy. Icosahedrons, as seen in the image below, are commonly seen in phages, and are 20-sided shapes with 12 vertices. This geometry offers stability and strength [4].
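There is a neat arithmetic check on that geometry: Euler’s polyhedron formula, V − E + F = 2, ties together the icosahedron’s 12 vertices, 20 triangular faces, and its edge count (a small sketch; the edge count is derived rather than quoted):

```python
# The icosahedron: 12 vertices, 20 triangular faces.
vertices, faces = 12, 20

# Each triangular face contributes 3 edges, and every edge is shared
# by exactly 2 faces, so:
edges = faces * 3 // 2
print(edges)  # 30

# Euler's polyhedron formula V - E + F = 2 holds, confirming the
# triangles close up into a single convex shell:
print(vertices - edges + faces)  # 2
```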

Another way is to look at this from the genetic material’s point of view. It needs to be protected from stray or attacking enzymes, and needs a way to organise itself neatly. These structures not only allow for neat organisation of the genome, but in doing so can also create secondary characteristics. For example, it is thought that some regions can take on translation and replication roles, brought about by the way the nucleic acids are stored within the capsid, allowing them to form double-stranded loops, as in Leviviridae viruses [5].

It seems that these phage heads are well adapted to their purpose. I’m sure we can all agree that they are very clever and the stuff of nightmares!!!


Post by Jessica Friedersdorff.

[1] Bradley DE. Ultrastructure of bacteriophage and bacteriocins. Bacteriol Rev 1967;31:230–314.
[2] Crick FH, Watson JD. Structure of small viruses. Nature 1956;177:473–5.
[3] Bruinsma RF, Gelbart WM, Reguera D, Rudnick J, Zandi R. Viral self-assembly as a thermodynamic process. Phys Rev Lett 2003;90:248101. doi:10.1103/PhysRevLett.90.248101.
[4] Mannige R V., Brooks CL. Periodic table of virus capsids: Implications for natural selection and design. PLoS One 2010;5:1–7. doi:10.1371/journal.pone.0009423.
[5] Morais MC. Breaking the symmetry of a viral capsid. Proc Natl Acad Sci 2016;113:201613612. doi:10.1073/pnas.1613612113.

But what does it mean?!?


During practical-based modules, I often ask undergraduates to start their practical reports with a statement of their hypothesis. This usually throws them into a mild panic, as class practicals are primarily about generating data rather than proving/disproving a hypothesis and they cannot easily negotiate that apparent disparity.

The relationship between data-generating and hypothesis-driven research is a troubled one. Twenty years ago, a loud and often-heard cry of the experimentalist after a ‘big data’ or ’-omics’ talk was ‘but what IS the hypothesis?’. Testing a hypothesis was the mantra of every bench scientist, and even today some funding agencies and scientific publishers still insist on placing hypotheses front and centre of all submissions.

But what was the hypothesis being tested when the E. coli genome was sequenced? Should we look down our noses disapprovingly at the humble genome, denigrated as a mere ‘fishing trip’, or ‘stamp collection’, because of its lack of a noble hypothesis? Do we emulate my poor undergraduates and struggle valiantly to find a hidden rationale behind the data-collecting exercise and justify its existence? Or should we celebrate the diversity, abundance and scale of the datasets that we can now generate, with or without accompanying hypothesis?

We can’t all be the ones to discover the next cure for cancer, or the novel antibiotic to which there is no possibility of resistance. However, we can all contribute resources to aid those explorers in their search. Those resources can be new knowledge, acquired through the steadfast testing of hypotheses, or they can be collections of datasets, alongside the tools and knowhow to interrogate those data.

The genome is the ultimate blueprint of an organism’s biology; however, we have barely begun learning how to look inside a genome and, from its sequence, deduce salient features of the host’s biology. Hypothesis-led experimentation is one way to improve our understanding of the sequence/function relationship, and increasingly we now find ourselves testing hypotheses that have themselves come directly from big datasets.

In essence, big datasets are trying to tell us everything we want to know, but to get there we need to find out what questions to ask, for which they are the answer.

Post by Dr. Dave Whitworth.