Finding Novel Antimicrobial Peptides with AMPLY

Rise of the superbugs
The rise of antibiotic-resistant strains of microbes (bacteria, parasites, viruses and fungi) is probably the leading threat facing humankind. Increasingly desperate warnings about the real-world implications of growing resistance are now front-page news. The problem is a multifaceted one. Antimicrobial resistance is not just about the loss of human life; it is inextricably intertwined with increased patient morbidity and massive economic consequences for global healthcare systems. There are two possible solutions: a socio-political/behavioural change or a technical/scientific response. Humankind has shown itself remarkably intransigent when faced with doom-laden prophecies that require behavioural modification to circumvent (see also Climate Change), so it is probably prudent to assume that a managed technical response is our best hope. But new antibiotics are unlikely to arise spontaneously. Mokyr highlights one of the issues with relying on the existing pharmaceutical industry to address the problem: “…few economies have ever left [decisions like these] entirely to the decentralized decision-making processes of competitive firms. The market test by itself is not always enough” (Mokyr, 1998).

Discovery of AMPs

Figure 1: A cationic, helical AMP (Taliecin-1)

The discovery of AMPs dates back to 1939, when Dubos extracted an antimicrobial agent from a soil Bacillus strain. The designation has since been extended to cover a broad group: anionic antimicrobial proteins/peptides, host defence peptides, cationic amphipathic peptides and cationic AMPs. In contrast to acquired immune mechanisms, these endogenous peptides provide a fast and effective means of defence against pathogens as part of the innate immune response. Antimicrobial peptides are evolutionarily ancient weapons, and their ubiquity throughout the animal and plant kingdoms supports the hypothesis that they have played a key role in the successful evolution of complex multi-cellular organisms. Such is their diversity that they can be found in locations as disparate as the skin secretions of a frog and the defensive arsenal of a protozoan.

Dolby Bioinformatics

Figure 2: The Dolby certification logo

One specific feature of AMPs that makes them difficult to find is that they’re small (often fewer than 20 amino acids in length, which is tiny compared to typical proteins). In a typical ‘omic dataset containing potentially millions and millions of datapoints, isolating interesting AMPs for synthesis and testing is a challenging task. For inspiration we can look to the music industry. In the mid-20th century recordings were made on magnetic tape and engineers wrestled with an ever-present low level of hissing noise in the background that threatened to drown out the music. Various ingenious solutions were devised to mitigate the persistent hiss: forms of “low-noise” tape which recorded more signal; running the tape at a higher speed; or using dynamic pre-emphasis during recording and a form of dynamic de-emphasis during playback. This last approach became the backbone of the Dolby noise reduction system, which became all-pervasive in home audio equipment from the late 60s onwards. The audio engineer’s struggle to maximise signal-to-noise is the same core problem that faces computational biologists in the ongoing analysis of ‘omic “big data” in the search for tiny novel AMPs. There is music there, but at the moment the hiss is tremendous.

The detection of AMPs in metagenomic data is, however, a tantalising low-hanging fruit for computational biologists. Post-computational wet-lab work is relatively cheap: spot synthesis of peptides up to around 25 amino acids long is available from a wide array of third-party companies, with prices from as low as £2.50 per amino acid. A well-organised screening program can screen in excess of 100 peptides a day, per person, against a model bacterial organism to test for activity. As a potential workflow, the rapid assessment of multiple ‘omic datasets, identification of homologues of pattern-matched AMPs, rapid synthesis and screening, and a rush to publication would appear to provide a grant-friendly drug-discovery goldmine! But to tap this rich vein, the identification of putative AMPs from ‘omic data needs to be streamlined and its hit rate improved.
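To put rough numbers on "relatively cheap", here is a minimal arithmetic sketch using only the figures quoted above (£2.50 per amino acid, around 100 peptides screened per person per day). The 20-residue sequence is a made-up placeholder, not a real AMP, and real quotes will vary with supplier, purity and scale.

```python
# Illustrative arithmetic only; price and throughput are the figures quoted in the text.
PRICE_PER_RESIDUE_GBP = 2.50   # lowest quoted price per amino acid
PEPTIDES_PER_DAY = 100         # screening throughput per person per day


def synthesis_cost(peptide: str, price_per_residue: float = PRICE_PER_RESIDUE_GBP) -> float:
    """Cost of spot-synthesising one peptide when priced per amino acid."""
    return len(peptide) * price_per_residue


# Hypothetical 20-residue candidate (placeholder sequence, not a real AMP).
candidate = "GLFKKLLKKAGKFLKKLLGA"

per_peptide = synthesis_cost(candidate)
per_day = per_peptide * PEPTIDES_PER_DAY

print(f"£{per_peptide:.2f} per peptide, £{per_day:.2f} to fill one day of screening")
```

At the quoted floor price, even the longest peptides mentioned (around 25 amino acids) come in at £62.50 each.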

The AMPLY Pipeline
Finding small sequences (you’re interested in) that often look a lot like other small sequences (you’re not interested in) in datafiles that can contain potentially gigabytes of data is a trickier task than it first appears. Annotation in metagenomics is an art and the determination of what’s real and what’s not often relies purely on defining mutually agreed thresholds. However, as the length of the aligned sequence starts to shorten, many of the assumptions behind PercentageID, BitScore and E-Value thresholds begin to fall away. It’s here we return to the Dolby signal-to-noise analogy: the “music” of the AMPs in metagenomic datasets is often drowned out by the sheer volume of background noise, and to find them we need to adopt a novel strategy of aggressive emphasis.
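One way to see why the usual thresholds break down for short sequences is the standard relationship between bit score and E-value, E = m × n × 2^(−S'), where m is the query length, n the database length and S' the bit score. The sketch below is a simplification under stated assumptions: it uses raw rather than effective lengths, assumes a perfect match yields roughly 2 bits per residue, and uses a hypothetical database of one billion residues and an illustrative cutoff of 1e-5. None of these values come from AMPLY.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


BITS_PER_RESIDUE = 2.0        # assumption: rough bit yield of a perfect match
DB_RESIDUES = 1_000_000_000   # assumption: hypothetical database size
E_CUTOFF = 1e-5               # illustrative threshold, not an AMPLY setting


def best_case_evalue(query_len: int) -> float:
    """E-value of a perfect full-length match for a query of this length."""
    return expected_hits(BITS_PER_RESIDUE * query_len, query_len, DB_RESIDUES)


for length in (15, 25, 50, 200):
    e = best_case_evalue(length)
    verdict = "passes" if e <= E_CUTOFF else "fails"
    print(f"{length:>4} aa: best-case E = {e:.3g} ({verdict} E <= {E_CUTOFF})")
```

Under these assumptions even a perfect match to a 15 or 25 residue peptide cannot reach the cutoff, while a 50 residue sequence clears it easily. The short hits are not worse matches; the statistic simply runs out of room.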

Designed by Ben Thomas at Aberystwyth University in the CreeveyLab and funded by Life Science Wales, AMPLY is a pipeline designed to plug this gap between the ‘omic data and lab work. AMPLY provides a basis for sifting out AMPs suitable as synthesis candidates, and for identifying potential regions for crude synthesis, by adopting a hyper-wide “balance of evidence” approach. AMPLY passes over the data with a series of detection methods, then gathers their combined results into a final tableau (known as the “bitpad”) where each potential AMP can be evaluated on the strength of hundreds of datapoints, rather than just a couple of numeric values.
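The general idea of a “balance of evidence” table can be sketched in a few lines. This is a toy illustration only: AMPLY's actual detection methods and the layout of the bitpad are not described in this post, so the three checks below (length, net charge, hydrophobic fraction), their cutoffs, and the example sequences are all hypothetical stand-ins chosen because AMPs are often short, cationic and amphipathic.

```python
# Toy "balance of evidence" table. Every detector and cutoff here is a
# hypothetical stand-in, not one of AMPLY's real detection methods.
POSITIVE = set("KR")
NEGATIVE = set("DE")
HYDROPHOBIC = set("AILMFVWY")


def net_charge(seq: str) -> int:
    """Crude net charge: count of K/R minus count of D/E."""
    return sum(r in POSITIVE for r in seq) - sum(r in NEGATIVE for r in seq)


def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues in a simple hydrophobic set."""
    return sum(r in HYDROPHOBIC for r in seq) / len(seq)


# Each detector returns True (evidence for) or False (no evidence).
DETECTORS = {
    "short": lambda s: 10 <= len(s) <= 25,
    "cationic": lambda s: net_charge(s) >= 2,
    "amphipathic_mix": lambda s: 0.3 <= hydrophobic_fraction(s) <= 0.6,
}


def evidence_row(seq: str) -> dict:
    """One row of the table: the outcome of every detector for a sequence."""
    return {name: bool(test(seq)) for name, test in DETECTORS.items()}


def rank(candidates: list) -> list:
    """Sort candidates by how many detectors support them, best first."""
    rows = [(seq, evidence_row(seq)) for seq in candidates]
    return sorted(rows, key=lambda row: sum(row[1].values()), reverse=True)


# Placeholder sequences, invented for this example.
candidates = [
    "DEDEDSGSGSDE",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "GLFKKLLKKAGKFLKKLLGA",
]

ranked = rank(candidates)
for seq, row in ranked:
    print(f"{sum(row.values())}/{len(row)}  {seq}  {row}")
```

The point of keeping the whole row, rather than a single score, is that a candidate can be judged on which lines of evidence support it as well as how many.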

Figure 3: The AMPLY workflow

To date, AMPLY has been used to find, characterise and synthesise over 800 potentially novel AMPs, which have been lab screened in partnership with Tika Diagnostics at St. George’s Hospital in London. Among the AMPs discovered by AMPLY, many are highly active against MRSA (a key superbug) and offer encouraging potential treatment avenues for future development. While there is still much work to be done, results so far have been extremely promising: AMPLY has been used to find bioactive AMPs in datasets as diverse as the skin of Peruvian poison dart frogs and the testicles of a salamander, so the only limitation in the AMPLY pipeline is the diversity of the stream of ‘omic data provided to it…

So, if you’re reading this blog and have interesting data and would like to be part of the drive to find new antimicrobials then get in touch for potential collaborations. We are always interested.

Contact Ben Thomas via Twitter @flwrs4algrnon

Mokyr, Joel. “The political economy of technological change.” Technological revolutions in Europe (1998): 39-64.

A biologist playing the numbers game

Biologists generally dislike numbers, as a rule. Probably because numbers require you to do all the work before anything interesting happens. Numbers don’t metabolise, or synthesise, or secrete, or replicate. They don’t behave in different ways under the same conditions. They are, in a word, reliable. We like what they represent, but we don’t like that they are an abstraction of what actually interests us about biology.

But this is 2017. Gone are the days of Leidy, Manson and Darwin, when a biologist could spend their life avoiding the numbers game and still rise to the top of their field. Biology means data, and data means statistics! At some point, every young biologist realises that they have to bite the bullet and actually learn a bit of stats in R (or MATLAB, if you would rather use something with a price tag), rather than subsist on the vague idea of statistical analysis that all those modules you took furnished you with. After all, wouldn’t it be nice to be an author on one of those nice shiny papers with all that important-looking multivariate analysis in it?

So what to do? Well, go on a course, of course. I did exactly that. I found myself an interesting and relevant-looking course run by PR Statistics, looking at analysis of population genomics data in R. The course took me through how to use various packages available in R, particularly Adegenet [1], to reveal structure in your data, and was taught by the developers behind Adegenet, Thibaut Jombart & Zhian N Kamvar, who were as knowledgeable and skilled instructors as you could encounter. If you perform statistical analysis on allelic frequencies in population data sets, then I would highly recommend this package; it contains everything you could need to elucidate even the most subtle structure in data. Set in the almost idyllic location of Margam Country Park, east of Swansea, it was a week which did not leave me, or any of my course mates (many of whom had travelled from as far as the USA), wanting. I feel compelled to also mention the cake that the cooks set out for us every day, which resulted in me leaving Margam a few pounds heavier, as well as a week wiser. If this course is representative of all courses run by PR Statistics, then I can highly recommend them.
So armed with my new, more informed view on statistical analysis in R, I can go forth and see what I can make of my own data sets and see if I can’t produce some of those oh so aesthetic graphs myself. As it happens, I quite like numbers now.

Many thanks to Oliver Hooker for organising the course, and to Thibaut and Zhian for their expert instruction.

By Arthur Morris

[1] T. Jombart, “Adegenet: A R package for the multivariate analysis of genetic markers,” Bioinformatics, vol. 24, no. 11, pp. 1403–1405, 2008.

Workarounds, KISS and the dangers of overcomplicating things

Around 6 months ago I was asked to look into the program PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) for a data set produced by an amplicon sequencing run on the Ion Torrent. Basically, PICRUSt takes a 16S OTU table that has been classified (taxonomically, not some top-secret government thing, unfortunately) and uses information from sequenced genomes of known or closely related organisms to predict the potential genomic and metabolic functionality of the community identified in the 16S dataset. As complex as the process sounds, in reality it consists of only three steps. First, normalisation of the dataset by 16S copy number, which corrects for any under- or over-representation of functionality due to variation in the number of copies of the 16S gene in different bacterial genomes. Second, prediction of the metagenome, basically multiplying the occurrences of functional genes, in this case KOs (KEGG orthologs), within the known genomes by the corrected OTU table from the previous step. Finally, categorisation by function, collapsing the table of thousands of functional KOs into hierarchical KEGG pathways.
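The arithmetic behind those three steps can be sketched with plain Python dictionaries. This is not PICRUSt's own code or file format: every OTU name, copy number, KO label and pathway label below is invented for illustration, and each KO is assumed to belong to exactly one pathway, which is a simplification.

```python
from collections import defaultdict

# All values are made up for illustration; none come from a real dataset.
otu_table = {                      # sample -> {OTU: read count}
    "sample_A": {"otu1": 100, "otu2": 40},
    "sample_B": {"otu1": 10, "otu2": 80},
}
copy_number = {"otu1": 5, "otu2": 2}           # 16S gene copies per genome
ko_per_genome = {                              # OTU -> {KO: gene copies per genome}
    "otu1": {"ko_a": 1, "ko_b": 2},
    "otu2": {"ko_b": 1, "ko_c": 3},
}
ko_to_pathway = {"ko_a": "pathway_X", "ko_b": "pathway_X", "ko_c": "pathway_Y"}


def normalise(counts: dict) -> dict:
    """Step 1: divide each OTU's read count by its 16S copy number."""
    return {otu: n / copy_number[otu] for otu, n in counts.items()}


def predict_metagenome(abundance: dict) -> dict:
    """Step 2: multiply corrected abundances by per-genome KO counts and sum."""
    kos = defaultdict(float)
    for otu, cells in abundance.items():
        for ko, copies in ko_per_genome[otu].items():
            kos[ko] += cells * copies
    return dict(kos)


def categorise(kos: dict) -> dict:
    """Step 3: collapse KO totals into their (single, assumed) pathway."""
    pathways = defaultdict(float)
    for ko, total in kos.items():
        pathways[ko_to_pathway[ko]] += total
    return dict(pathways)


for sample, counts in otu_table.items():
    print(sample, categorise(predict_metagenome(normalise(counts))))
```

Note how normalisation changes the picture for sample_A: otu1 has 2.5 times the reads of otu2, but after dividing by copy number both contribute equally to the predicted functions.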


Despite my background in sequencing, population genetics and phylogenetics, having learned/taught myself many different analysis packages and programs over the course of my career, and having a solid, reliable method of producing and analysing OTU tables from the data obtained from the Ion Torrent and other sequencing platforms, I’ve never considered myself a bioinformatician. But three steps should be easy enough… right?

The little yellow arrow in the workflow now represents around 2-3 months of probably the steepest learning curve I have ever ventured on to.

OTU tables are simply tables of counts of OTUs (operational taxonomic units, i.e. species/observations etc.) for each sample within your dataset. Despite their simplicity, the method used by myself and others in the research group to construct the OTU table was different to that in the online PICRUSt guide, and the information contained in the table was also different. I could already sense the learning curve steepening, but carried on regardless, not quite realising the dangers that lay ahead!

Operating system (UNIX/Windows) cross-compatibility, version control, system resources, version control, new programming languages, version control, more system resources, overloading system resources, complete system failure, starting from scratch again, installation-and-compilation-of-new-algebraic-libraries-for-your-systems-mathematical-calculations, version control, new programming languages, manual editing of enormous databases, scripts, packages and version control. These points are a hint at some of the processes I went through and the problems I had to deal with, in the creation of what I now call ‘The Workaround’. I still don’t consider myself a bioinformatician.

‘The Workaround’
The workaround consists of a small number of R scripts and processes for the reformatting and production of new files that are so simple they belie the amount of work that went into producing them. The steps are straightforward enough that anyone with any sort of experience of working with OTU tables or sequencing data should be able to complete them. The entire workflow is robust and repeatable and I have since worked with a few different ways of visualising and representing the data for publication.

Using STAMP to identify SEED subsystems which are differentially abundant between Candidatus Accumulibacter phosphatis sequences obtained from a pair of enhanced biological phosphorus removal (EBPR) sludge metagenomes (data originally described in Parks and Beiko, 2010).

PICRUSt appears to be becoming an ever more popular tool in the analysis of microbiomes, and one that complements many of the studies and analyses already performed by members of our research group. I am currently in the process of writing up ‘The Workaround’ into a step-by-step guide to be placed on the bioinformatics wiki for anyone to access, but in the meantime if anyone would like to speak to me about the possibilities of applying this type of analysis to any existing or future experiments I’m more than happy to help!

Post by Toby Wilkinson.
About Toby:
I am a postdoc and perpetual resident of Aberystwyth. I came here as an undergrad in 2002 to study Zoology, worked through a PhD in parasitology starting in 2005, and have held various positions as technician/research assistant/PDRA since; I’ve never quite been able to bring myself to leave. Over the last few years I’ve worked in various roles in the Herbivore Gut Environment group on the microbiome of ruminants, building up my experience in NGS and bioinformatics, and more recently with Sharon Huws on the further characterisation of novel antimicrobial peptides, while continuing work in NGS and the study of the dynamics of various bacterial communities in a number of environments.