Introduction

Welcome to the insect R package, a pipeline for analyzing next generation sequencing (NGS) amplicon libraries with informatic sequence classification trees. The pipeline employs a machine-learning approach that uses a set of training sequences from GenBank, BOLD and other databases to learn a classification tree, which is then used to assign taxonomic IDs to a set of query sequences generated from an NGS platform such as Illumina MiSeq. The package also includes a suite of functions for FASTQ/FASTA sequence parsing, de-multiplexing, paired-end read stitching, primer trimming, quality filtering, de-replication, re-replication, etc. The functions are designed for use with multiple processors, but smaller datasets can be run on standard personal computers if time is available. While not a prerequisite, insect is designed to be used in conjunction with the ape package (Paradis et al., 2004; Paradis, 2012), which features a memory-efficient binary format for DNA (the “DNAbin” object type) among many other useful functions. The insect package can be used to process environmental DNA (eDNA) meta-barcode libraries as well as single-source NGS/Sanger amplicon sequences.

The most time-consuming and memory-intensive stage of the insect work-flow generally involves training the DNA classifier. For example, the COI classifier used in this tutorial, which was built from a training dataset consisting of 150,000 unique ~310 bp COI barcode sequences, took around three days to run on 24 cores and used around 40 GB of memory. The insect classification trees are amplicon specific, so a unique tree is generally required for each primer set. However, trees are already available for some of the more commonly used barcoding primers here. The package also includes functions and instructions for downloading and filtering training data (including a “virtual PCR” tool and a taxonomic quality filter), building the taxonomy database and training the classifier; however, these methods are beyond the scope of this introductory tutorial. New classification trees and updates are frequently added to the collection, so please feel free to suggest a barcoding primer set with which to train a classifier and we will endeavor to add it to the list.

To produce a classification tree, the training data is first obtained from GenBank, BOLD and/or other databases where barcode sequences with accurate species-level IDs are available. These sequences are filtered to remove any with obvious taxonomic labeling issues, and trimmed to retain only the region of interest using the virtualPCR function (sequences that do not span the entire amplicon region are removed). the learn function then recursively partitions the training sequences as follows: The dataset is initially divided in two, and a profile hidden Markov model is derived for each subset (see Durbin et al. (1998) for a detailed description of these models). The lowest common taxon of each subset is also stored at each new node. The partitioning and model training procedure then continues recursively, splitting the training data into smaller and smaller subsets while adding new nodes, models and taxonomic information to the classification tree.

Once the classifier tree has been trained, query sequences obtained from the specified primer set can be analysed to assign taxonomic IDs along with confidence values. The classification algorithm works as follows: starting from the root node of the classification tree, the likelihood of the query sequence (the log-probability of the sequence given a particular model) is computed for each of the models at the child nodes using the forward algorithm (see Durbin et al. (1998)). The competing likelihood values are then compared by computing their Akaike weights (see Johnson and Omland, 2004). If one model is overwhelmingly more likely to have produced the sequence than the other, that child node is selected and the classification is updated to reflect the taxonomic information of the training sequences belonging to the node.

This procedure is repeated recursively, continuing down the tree until either an inconclusive result is returned from the model comparison test (i.e. the Akaike weight is lower than a pre-defined threshold, usually 0.9), or a terminal leaf node is reached, at which point a species-level ID is generally returned. The classify function outputs the taxon name, rank and serial number (or NCBI taxon ID, WORMS aphia ID, or other identifier depending on the taxonomy database used in the training step), along with the Akaike weight of the model at the final node. Note that the default behavior is for the Akaike weight to ‘decay’ as it moves down the tree, by computing the cumulative product of all preceding Akaike weight values. This is perhaps an overly conservative approach, but it minimizes the chance of producing a spurious/incorrect taxon ID.

In addition to the two key functions learn and classify, the package includes several tools to encode the work-flow from raw sequence data input to tabular output. At any stage during the process, users can export the sequence data using the writeFASTA or writeFASTQ functions.

A worked example

This tutorial demonstrates the insect work-flow using an example dataset of COI sequences derived from Autonomous Reef Monitoring Structures (ARMS) in Timor-Leste, amplified using the metazoan COI barcoding primers mlCOIintF and jgHCO2198 (GGWACWGGWTGAACWGTWTAYCCYCC and TAIACYTCIGGRTGICCRAARAAYCA, respectively; Leray et al. (2013)).

First install and load the package as follows:

install.packages("insect")
library(insect)

Next, download the latest marine eDNA classifier for the Leray et al. (2013) primer set from here and extract the contents of the zip archive to the current working directory.

This can alternatively be done from within R as follows:

URL <- "https://www.dropbox.com/s/aawh33hneqru6j9/metazoan_COI_marine_v3.zip?dl=1"
download.file(URL, destfile = "metazoan_COI_marine_v3.zip", mode = "wb")
unzip("metazoan_COI_marine_v3.zip")
file.remove("metazoan_COI_marine_v3.zip")

The zip archive contains the following files:

Non-stitched and pre-stitched example FASTQ files are provided in order to demonstrate the work-flow for both of these commonly-encountered input data formats. The FASTQ files can be read into R as either concatenated upper-case character strings or binary “DNAbin” objects, with “quality” attributes. In this example we will opt for the latter by leaving bin = TRUE (the default option). This is because subsetting or concatenating the character vector will result in the loss of the “quality” attributes, while this isn’t an issue with the DNAbin list objects, whose quality scores are attributed to each sequence individually.

S1R1 <- readFASTQ("COI_sample1_read1.fastq")
S1R2 <- readFASTQ("COI_sample1_read2.fastq")
S2 <- readFASTQ("COI_sample2.fastq")

The next step is to stitch the forward and reverse reads from sample 1 together to create a single vector of sequences, similar to sample 2. The stitch function performs this operation, as well as optionally removing any sequences that don’t contain the primer sequences in either direction, trimming the primers from those that do, and outputting all sequences in the 5’ -> 3’ orientation. The optional primer filter-trim and sequence orientation is activated by passing the primer sequences to the stitch function (again either as character strings or “DNAbin” objects).

S1 <- stitch(S1R1, S1R2, up = "GGWACWGGWTGAACWGTWTAYCCYCC", 
             down = "TAIACYTCIGGRTGICCRAARAAYCA")

Of the original 1000 sequences in sample 1, 299 were retained and stitched while the remainder didn’t contain both of the primer sequences in either direction and were discarded.

The sample 2 sequences were pre-stitched on the Illumina platform but still have their primers attached. Here, we will use the trim function to discard any sequences that don’t contain both primer sequences, orientate all sequences in the 5’ -> 3’ direction, and finally, trim the primer sequences from each end:

S2 <- trim(S2, up = "GGWACWGGWTGAACWGTWTAYCCYCC", 
             down = "TAIACYTCIGGRTGICCRAARAAYCA")

Next we will merge all samples together into one DNAbin list, while retaining the sample origins in the sequence names. In keeping with the style of the qiime pipeline (Caporaso et al., 2010), we append the sample names to the front of each sequence name, delimited with an underscore:

names(S1) <- paste0("Sample1_", names(S1))
names(S2) <- paste0("Sample2_", names(S2))
x <- c(S1, S2)

The sequences now need further filtering to remove low-quality reads, those with ambiguous base calls, singletons and overly short/long sequences. The function qfilter is a quality control function that can be used to apply any or all of these filters. By default, the function removes any sequences with a mean quality score less than 30, those that contain at least one ambiguous base call, those that appear only once in the dataset, and those with length outside the range of 50 - 500 nucleotides (inclusive). To disable any of the filters, simply set the parameter value to NULL. In this example we will stick with the default settings, except that we will change the acceptable length range to 250 - 350 bp.

x <- qfilter(x, minlength = 250, maxlength = 350)

The final step is to load the classification tree and run the classify function to assign taxonomic IDs and confidence scores. First, read in the classification_tree.rds file:

tree <- readRDS("classification_tree.rds")

This ‘insect’ class object is just a large dendrogram with additional attributes for classifying sequences including profile HMMs and taxonomic information. The classify function may take a minute or two to process these sequences, since it uses a computationally intensive dynamic programming algorithm to find the likelihood values of each (unique) sequence given the models at each node of the tree. The exception is when the argument ping is set to TRUE and there is an exact match between the query sequence and at least one of the sequences in the training dataset, in which case the function simply returns the common ancestor of the matching sequences. The classify function can also be run in parallel by setting the cores argument to 2 or more depending on the number available (you can run parallel::detectCores() if you are unsure). Classification times can vary, and depend on several factors including the number of unique sequences in the dataset, the size of the tree, the length of the input sequences, the processing speed, number of processors used, etc. The average time for classifying COI sequences using the tree above is approximately 3 - 4 seconds per unique sequence per processor. For example, a dataset containing 10,000 sequences of which 1000 are unique would take around an hour on a single processor, half an hour on two, etc.

longDF <- classify(x, tree, cores = 4)
representative taxID taxon rank phylum class order family genus species Sample1 Sample2 score
Sample1_M… 2 Animalia kingdom 4 0 0.9999
Sample1_M… 1839 Ascidiacea class Chordata Ascidiacea 30 0 0.9999
Sample1_M… 2 Animalia kingdom 16 0 0.9922
Sample1_M… 2 Animalia kingdom 23 2 0.9561
Sample1_M… 1839 Ascidiacea class Chordata Ascidiacea 2 0 0.9999
Sample1_M… 1839 Ascidiacea class Chordata Ascidiacea 27 0 0.9999
Sample1_M… 1839 Ascidiacea class Chordata Ascidiacea 24 0 0.9999
Sample1_M… 2 Animalia kingdom 6 0 1.0000
Sample1_M… 368670 Florideophyceae class Rhodophyta Florideophyceae 2 0 0.9599
Sample1_M… 2 Animalia kingdom 2 0 0.9537
Sample1_M… 1839 Ascidiacea class Chordata Ascidiacea 2 0 0.9999
Sample1_M… 1041930 Terpios gelatinosus species Porifera Demospongiae Suberitida Suberitidae Terpios Terpios gelatinosus 2 0 NA
Sample1_M… 2 Animalia kingdom 2 0 0.9984

The function produces a data frame with one row for each unique sequence, the first few rows of which are shown above. Assuming ping is set to TRUE, query sequences that have exact matches in the training dataset (and hence bypass the recursive classification procedure; e.g. Terpios gelatinosus above), are assigned a score of NA. For a more succinct output we can aggregate the table to show one row for each unique taxon as follows:

taxa <- aggregate(longDF[3:10], longDF["taxID"], head, 1)
counts <- aggregate(longDF[11:12], longDF["taxID"], sum)
shortDF <- merge(taxa, counts, by = "taxID")
taxID taxon rank phylum class order family genus species Sample1 Sample2
1 Biota 0 47
2 Animalia kingdom 55 88
1086 Eumalacostraca subclass Arthropoda Malacostraca 0 3
1090 Peracarida superorder Arthropoda Malacostraca 0 6
1102 Harpacticoida order Arthropoda Hexanauplia Harpacticoida 0 19
1135 Amphipoda order Arthropoda Malacostraca Amphipoda 0 4
1839 Ascidiacea class Chordata Ascidiacea 85 0
368670 Florideophyceae class Rhodophyta Florideophyceae 2 2
1041930 Terpios gelatinosus species Porifera Demospongiae Suberitida Suberitidae Terpios Terpios gelatinosus 2 0

As shown in the above example, many of the sequences return fairly uninformative taxon IDs (e.g. ‘Biota’). This is a fairly typical feature of eDNA datasets that can contain a large number of pseudo-genes, chimeras, other PCR artifacts, and novel sequences that are dissimilar to anything recorded in the reference database(s). Query sequences with high similarity to reference sequences can also occasionaly produce uninformative classifications due to inconclusive model comparison tests at top-level nodes. This may be circumvented by reducing the threshold parameter or setting decay = FALSE; however, users are advised against the excessive relaxation of these parameters since it may increase the chance of returning spurious classifications (these tend to be very rare when using the conservative default values of 0.9 and TRUE, respectively). Further testing and optimization may help to address some of these best-practice considerations, and will be a focus of future research.

This basic introduction to the insect package has outlined the steps involved in parsing paired-end NGS data, and filtering, trimming primers, de-replication, and taxonomic identification using a pre-built classification tree. The next tutorial in the series will deal with downloading and curating a primer-specific local sequence database and using it to build a classification tree.

The insect package is released under the GPL-3 license, and is free to distribute under certain conditions; however it comes with no warranty. Please direct bug reports to the GitHub issues page

Acknowledgements

This software was developed with funding from a Rutherford Foundation Postdoctoral Research Fellowship from the Royal Society of New Zealand. Thanks to Molly Timmers for helpful discussion and sharing COI data, and to Danyl McLauchlan and Dinindu Senanayake for assistance with high performance computing facilities.

References

Caporaso,J.G. et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335–336.

Durbin,R. et al. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.

Johnson,J.B. and Omland,K.S. (2004) Model selection in ecology and evolution. Trends in Ecology and Evolution, 19, 101–108.

Leray,M. et al. (2013) A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in Zoology, 10, 34.

Paradis,E. (2012) Analysis of Phylogenetics and Evolution with R. Second Edition. Springer, New York.

Paradis,E. et al. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289–290.