The ‘phylogram’ package for developing evolutionary trees with R

Shaun Wilkinson

2017-06-12


Abstract

phylogram is an R package for developing evolutionary trees as deeply-nested lists known as “dendrogram” objects. It provides functions for importing and exporting trees in the Newick parenthetic text format, as well as several functions for command-line tree manipulation. With an emphasis on speed and computational efficiency, phylogram also includes a suite of tools for rapidly computing distance matrices and building large trees using fast alignment-free k-mer counting and divisive clustering techniques. This package makes R’s powerful nested-list architecture more accessible to evolutionary biologists, and facilitates the analysis of very large biological sequence datasets.

Introduction

The R environment continues to gain popularity as a platform for working with evolutionary trees, due to the reproducible code-based workflow and the many powerful analytical tools available in a suite of open-source packages such as ape (E. Paradis, Claude, and Strimmer 2004), phangorn (Schliep 2011) and Phytools (Revell 2012). These packages typically employ a tree structure known as the “phylo” object, whose primary element is an integer matrix with one row for each edge in the graph, and two columns giving the indices of the connecting nodes.

An alternative tree structure is the “dendrogram” object, generated using the as.dendrogram function in the stats package (R Core Team 2017). Rather than a matrix of edges, a dendrogram is a hierarchical list. These ‘lists of lists’ can be deeply nested, with the limit depending on the C stack size (settable with options("expressions")). A useful feature of this representation is its modularity, whereby the sub tree of a tree is itself a tree - a dendrogram within a dendrogram. This means that dendrogram objects are subsettable in the same way that standard lists are, which in addition to the inbuilt editing functions such as cut and merge, facilitates intuitive command-line tree manipulation. An especially powerful feature of this object type is that tree editing operations can be carried out recursively using fast inbuilt functions in the “apply” family such as dendrapply and lapply.

Each node of a dendrogram object has the following mandatory attributes:

Rather than lists, terminal leaf nodes are length-1 integer vectors whose values correspond to the indices of the members in the set. The “members” attributes of leaf nodes is always 1, the “midpoint” attribute is 0, and they have two additional attributes:

Aside from those listed above, users may attach other objects as attributes to the dendrogram nodes. For example, “label” attributes can be attached to inner nodes, and users can specify plotting parameters for each node by setting the attributes “nodePar” and “edgePar”.

Example 1: Build a dendrogram object manually

Consider the simple example of a tree with three members named “A”, “B” and “C”, where “B” and “C” are more closely related to each other than they are to “A”. An unweighted Newick string for this tree would be (A,(B,C)); We can manually create a dendrogram object for this basic phylogeny and plot the tree as follows:

x <- list(1, list(2, 3))
## attach "leaf" and "label" attributes to leaf nodes
attr(x[[1]], "leaf") <- TRUE
attr(x[[2]][[1]], "leaf") <- attr(x[[2]][[2]], "leaf") <- TRUE
attr(x[[1]], "label") <- "A"
attr(x[[2]][[1]], "label") <- "B"
attr(x[[2]][[2]], "label") <- "C"
## set "height" attributes for all nodes
attr(x, "height") <- 1
attr(x[[1]], "height") <- 0
attr(x[[2]], "height") <- 0.5
attr(x[[2]][[1]], "height") <- attr(x[[2]][[2]], "height") <- 0
## set "midpoints" attributes for all nodes
attr(x, "midpoint") <- 0.75
attr(x[[1]], "midpoint") <- 0
attr(x[[2]], "midpoint") <- 0.5
attr(x[[2]][[1]], "midpoint") <- attr(x[[2]][[2]], "midpoint") <- 0
## set "members" attributes for all nodes
attr(x, "members") <- 3
attr(x[[1]], "members") <- 1
attr(x[[2]], "members") <- 2
attr(x[[2]][[1]], "members") <- attr(x[[2]][[2]], "members") <- 1
## set class as "dendrogram" 
## Note that setting the class for the root node
## automatically sets the class of all nested subnodes
class(x) <- "dendrogram"

### plot the dendrogram
plot(x, yaxt = "n")

Figure 1: A simple dendrogram with three terminal leaf nodes

As demonstrated in this example, manually setting attributes on dendrogram objects can be rather tedious, motivating the development of functions automating the generation and manipulation of these tree structures.

The ‘phylogram’ package

Here, we introduce phylogram, an R package for working with evolutionary trees as deeply-nested lists. The package contains functions for importing and exporting dendrogram objects to and from Newick-style parenthetic text, computing distance matrices, and assembling, visualizing, manipulating and rendering trees for publication. A primary focus of the phylogram package is to facilitate the assembly of very large trees as quickly and efficiently as possible. This can be achieved with or without a multiple sequence alignment, and with or without a matrix of pairwise distances. These functions are detailed below with examples of their utility.

Importing and exporting trees

The Newick (a.k.a. New Hampshire) parenthetic text format (Felsenstein et al. 1986) is a universal phylogenetic tree representation that is compatible with most tree-editing software. The phylogram package features the text parser read.dendrogram that reads a character string or text files in the Newick format and generates a dendrogram object. This function supports weighted edges, labels with special meta-characters (enclosed in single quotation marks), comments (enclosed in square brackets; ignored by the parser), multifuricating nodes, and both rooted and unrooted trees. Inner-node labels are ignored; however, the addition of “label” attributes for non-leaf dendrogram nodes will be available in a future version. Objects of class “dendrogram” can be exported as Newick-style parenthetic text using the function write.dendrogram.

Example 2: Import and export a tree from a Newick string

The simple Newick string in Example 1 can be imported as a dendrogram object using the read.dendrogram function as follows:

library(phylogram)
newick <- "(A,(B,C));"
x <- read.dendrogram(text = newick)
x
#> 'dendrogram' with 2 branches and 3 members total, at height 2

The following command writes the object back to the console in Newick format without edge weights:

write.dendrogram(x, edges = FALSE)
#> [1] "(A,(B,C));"

The syntax is similar when reading and writing text files, except that the text argument is replaced by file, and a valid file path is passed to the function.

Distance matrix computation

The phylogram package features the function kcount for counting k-mers within a sequence or set of sequences, and kdistance for computing a fast, alignment-free k-mer distance matrix. Briefly, all k-letter words are enumerated by sliding a window of length k along the sequences, and the abundance of each k-mer (e.g. the \(4^3\) possible DNA 3-mers AAA, AAC, AAG, …, TTT) in each sequence is used to calculate the pairwise distances. The default distance metric used by kdistance is the k-mer (k-tuple) distance measure outlined in Edgar (2004). For two DNA sequences \(a\) and \(b\) the fractional common k-mer count over the \(4^k\) possible words of length \(k\) is calculated as:

\[ \begin{equation} F = \sum\limits_{\tau} \frac{min (n_a(\tau), n_b (\tau))}{min (L_a , L_b ) - k + 1} \tag{1} \end{equation} \] where \(\tau\) represents each possible k-mer, \(n_a(\tau)\) and \(n_b(\tau)\) are the number of times \(\tau\) appears in each sequence, \(k\) is the k-mer length and \(L\) is the sequence length.

We then calculate the pairwise distance between \(a\) and \(b\) as:

\[ \begin{equation} d = \frac{log(0.1 + F) - log(1.1)}{log(0.1)} \tag{2} \end{equation} \]

For n sequences, the kdistance operation has time and memory complexity \(O(n^2)\) and thus can become computationally infeasible when the sequence set is large (i.e. more than ~ 10,000 sequences). As such, the phylogram package also offers the function mbed, that only computes the distances from each sequence to a smaller (or equal) sized subset of ‘seed’ sequences (Blackshields et al. 2010). As suggested by the authors, the default number of seeds is \(log_2(n)^2\).

DNA and amino acid sequences can be passed to kcount, kdistance and mbed either as a list of non-aligned sequences or a matrix of aligned sequences, preferably in either the “DNAbin” or “AAbin” raw-byte format (see the ape package documentation for more information on these S3 classes). Character sequences are supported; however ambiguity codes may not be recognized or treated appropriately, since raw ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer “ACRGT” would contribute 0.5 to the tally for “ACAGT” and 0.5 to that of “ACGGT”). This excludes the ambiguity code “N”, which is ignored.

Example 3: Compute k-mer distance matrices for the woodmouse dataset

The ape package contains a dataset of 15 aligned mitochondrial cytochrome b gene DNA sequences from the woodmouse Apodemus sylvaticus. This is a subset of those originally published in Michaux et al. (2003). While the phylogram distance functions do not require sequences to be aligned, this example will enable us to compare our k-mer distances with the traditional alignment-reliant distances produced by ape::dist.dna. First, load the dataset as follows:

library(ape)
data(woodmouse)
## view the first few rows and columns 
as.character.DNAbin(woodmouse[1:5, 1:5])
#>         [,1] [,2] [,3] [,4] [,5]
#> No305   "n"  "t"  "t"  "c"  "g" 
#> No304   "a"  "t"  "t"  "c"  "g" 
#> No306   "a"  "t"  "t"  "c"  "g" 
#> No0906S "a"  "t"  "t"  "c"  "g" 
#> No0908S "a"  "t"  "t"  "c"  "g"

This is a semi-global (‘glocal’) alignment featuring some incomplete sequences, with unknown characters represented by the ambiguity code “n” (e.g. No305). To avoid artificially inflating the distances between these partial sequences and the others, we first trim the gappy ends by subsetting the global alignment (note that the ape function dist.dna also removes columns with ambiguity codes prior to distance computation by default ).

woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]

The following code first computes the full \(n \times n\) distance matrix, and then the embedded distances of each sequence to three randomly selected seed sequences. In both cases the k-mer size is set to 5.

### Compute the full distance matrix and print the first few rows and columns
woodmouse.kdist <- kdistance(woodmouse, k = 5)
print(as.matrix(woodmouse.kdist)[1:7, 1:7], digits = 2)
#>         No305  No304  No306 No0906S No0908S No0909S No0910S
#> No305   0.000 0.0241 0.0215   0.026   0.027   0.028   0.029
#> No304   0.024 0.0000 0.0042   0.017   0.018   0.025   0.019
#> No306   0.021 0.0042 0.0000   0.014   0.014   0.020   0.014
#> No0906S 0.026 0.0171 0.0136   0.000   0.019   0.025   0.011
#> No0908S 0.027 0.0184 0.0136   0.019   0.000   0.024   0.019
#> No0909S 0.028 0.0246 0.0197   0.025   0.024   0.000   0.026
#> No0910S 0.029 0.0193 0.0145   0.011   0.019   0.026   0.000

### Compute and print the embedded distance matrix
set.seed(999)
seeds <- sample(1:15, size = 3)
woodmouse.mbed <- mbed(woodmouse, seeds = seeds, k = 5)
### remove the attributes for printing by subsetting the distance matrix
print(woodmouse.mbed[,], digits = 2)
#>         No0909S No0913S  No304
#> No305    0.0277  0.0295 0.0241
#> No304    0.0246  0.0085 0.0000
#> No306    0.0197  0.0080 0.0042
#> No0906S  0.0255  0.0175 0.0171
#> No0908S  0.0241  0.0215 0.0184
#> No0909S  0.0000  0.0282 0.0246
#> No0910S  0.0259  0.0145 0.0193
#> No0912S  0.0167  0.0241 0.0202
#> No0913S  0.0282  0.0000 0.0085
#> No1103S  0.0123  0.0193 0.0154
#> No1007S  0.0038  0.0286 0.0250
#> No1114S  0.0340  0.0345 0.0295
#> No1202S  0.0264  0.0145 0.0193
#> No1206S  0.0219  0.0197 0.0158
#> No1208S  0.0038  0.0309 0.0273

Building trees

To avoid excessive time and memory use when building large trees (e.g. n > 10,000), the phylogram package features the function topdown for divisive tree building, free of both alignment and distance matrix computation. This function first generates a matrix of k-mer counts, and then recursively partitions the sequences using successive k-means clustering (k = 2). While this method may not necessarily reconstruct sufficiently accurate phylogenetic trees for taxonomic purposes, it offers a fast and efficient means of producing large trees for a variety of other applications. These include tree-based sequence weighting (e.g. Gerstein et al. 1994), guide trees for progressive multiple sequence alignment (e.g. Sievers et al. 2011), and other recursive operations such as classification tree learning.

Example 4: Reconstruct a phylogenetic tree for the woodmouse data

In this example we will compare the fast top-down method with standard neighbor-joining trees constructed from full distance matrices computed using both the alignment-free k-mer distance and the Kimura (1980) distance as featured in the ape package examples.

## set out plotting panes
op <- par(no.readonly = TRUE)
par(mfrow = c(1, 3), mar = c(1, 2, 3, 3), cex = 0.3)

## (1) neighbor joining tree with Kimura 1980 distance
### compute the n x n K80 distance matrix 
woodmouse.dist <- ape::dist.dna(woodmouse, model = "K80") 
### build the neighbor-joining tree
tree1.phylo <- ape::nj(woodmouse.dist)
### export as Newick text
tree1.newick <- ape::write.tree(tree1.phylo)
### import as "dendrogram" object
tree1.dendro <- read.dendrogram(text = tree1.newick)
### sort nodes by size
tree1.dendro <- ladder(tree1.dendro)
### plot the nj tree
plot(tree1.dendro, horiz = TRUE, yaxt = "n", 
     main = "Neighbor-joining tree with\nK80 distance matrix")

## (2) neighbor joining tree with k-mer distance
### compute the n x n k-mer distance matrix 
woodmouse.kdist <- kdistance(woodmouse, k = 5) 
### build the neighbor-joining tree
tree2.phylo <- ape::nj(woodmouse.kdist)
### export as Newick text
tree2.newick <- ape::write.tree(tree2.phylo)
### import as "dendrogram" object
tree2.dendro <- read.dendrogram(text = tree2.newick)
### sort nodes by size
tree2.dendro <- ladder(tree2.dendro)
### plot the nj tree
plot(tree2.dendro, horiz = TRUE, yaxt = "n", 
     main = "Neighbor-joining tree with\nk-mer distance matrix (k=5)")

## (3) topdown tree without distance matrix
set.seed(999)
tree3 <- phylogram::topdown(woodmouse, k = 5, nstart = 20)
### sort nodes by size
tree3 <- phylogram::ladder(tree3)
### plot the topdown tree
plot(tree3, horiz = TRUE, yaxt = "n", 
     main = "Top-down tree without\ndistance matrix (k=5)")


## reset plotting parameters
par(op)

Figure 2: Comparison of tree-building methods for woodmouse sequences

The top-down and neighbor-joining trees show relatively congruent fine-scale topologies, regardless of the distance measure and method used. However, for large sequence sets, the top-down (divisive) method builds trees orders of magnitude faster than traditional alignment and distance matrix-based methods, since computation time and memory use increasing linearly rather than quadratically with n.

Tree editing/manipulation

The phylogram package features several additional functions to facilitate some of the more common manipulation operations. Leaf nodes and internal branching nodes can be removed using the function prune, which identifies and recursively deletes nodes based on regular expression pattern matching of node “label” attributes. To aid visualization, the function ladder rearranges the tree, sorting nodes by the number of members (analogous to the ladderize function in the ape package). Another function aiding in tree visualization is ultrametricize, which resets the “height” attributes of all terminal leaf nodes to zero. The function reposition scales the height of all nodes in a tree by a given constant (passed via the argument shift), and features the option to reset all node heights so that height of the farthest terminal leaf node from the root is zero (by specifying shift = "reset"). The function remidpoint recursively corrects all “midpoint”, “members” and “leaf” attributes following manual editing of a tree.

Tree vizualization

Publication-quality trees can be generated from dendrogram objects using the stats plotting function plot.dendrogram, and the extensive plotting functions available in a dendrogram-enhancing packages such as circlize (Z. Gu et al. 2014) and dendextend (Galili 2015). The latter also offers the facility to convert dendrograms to “ggdend” objects, for which many powerful ‘grammar of graphics’ plotting functions are available in the ggplot2 (Wickham 2009) and ggdendro (DeVries and Ripley 2016) packages. Moreover, there are several advanced plotting options for “phylo” objects in the ape package (E. Paradis, Claude, and Strimmer 2004), made accessible for dendrogram objects here via the Newick import/export functions read.dendrogram and write.dendrogram. Given the extensive tree visualization options already available, we elected not to include any additional plotting functions in the phylogram package.

Future direction

A primary motivation for the development of this package was the requirement to perform recursive tree operations in a straightforward and intuitive manner. Machine learning and data mining techniques such as classification and regression trees (CART) offer many potential benefits for the field of bioinformatics, particularly in the exploratory analysis of large datasets generated by high-throughput sequencing. This tool makes the nested-list architecture, which is ideally suited to such tasks, more accessible. We hope that it will help evolutionary biologists to further explore this appealing opportunity.

References

Blackshields, Gordon, Fabian Sievers, Weifeng Shi, Andreas Wilm, and Desmond G Higgins. 2010. “Sequence embedding for fast construction of guide trees for multiple sequence alignment.” Algorithms for Molecular Biology 5: 21. doi:10.1186/1748-7188-5-21.

DeVries, Andrie, and Brian D Ripley. 2016. Ggdendro: Create Dendrograms and Tree Diagrams Using ’Ggplot2’. https://cran.r-project.org/package=ggdendro.

Edgar, Robert C. 2004. “Local homology recognition and distance measures in linear time using compressed amino acid alphabets.” Nucleic Acids Research 32: 380–85. doi:10.1093/nar/gkh180.

Felsenstein, J, J Archie, WHE Day, W Maddison, C Meacham, FJ Rohlf, and D Swofford. 1986. “The newick tree format, 1986.” http://evolution.genetics.washington.edu/phylip/newicktree.html.

Galili, Tal. 2015. “dendextend: An R package for visualizing, adjusting and comparing trees of hierarchical clustering.” Bioinformatics 31 (22): 3718–20. doi:10.1093/bioinformatics/btv428.

Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “circlize implements and enhances circular visualization in R.” Bioinformatics 30 (19): 2811–2. doi:10.1093/bioinformatics/btu393.

Kimura, Motoo. 1980. “A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.” Journal of Molecular Evolution 16 (2): 111–20. doi:10.1007/BF01731581.

Michaux, J R, E Magnanou, E Paradis, C Nieberding, and R Libois. 2003. “Mitochondrial phylogeography of the woodmouse (Apodemus sylvaticus) in the Western Palearctic region.” Molecular Ecology 12 (3): 685–97. doi:10.1046/j.1365-294X.2003.01752.x.

Paradis, Emmanuel, Julien Claude, and Korbinian Strimmer. 2004. “APE: analyses of phylogenetics and evolution in R language.” Bioinformatics 20: 289–90. doi:10.1093/bioinformatics/btg412.

R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://cran.r-project.org/.

Revell, Liam J. 2012. “phytools: an R package for phylogenetic comparative biology (and other things).” Methods in Ecology and Evolution 3 (2): 217–23. doi:10.1111/j.2041-210X.2011.00169.x.

Schliep, Klaus Peter. 2011. “phangorn: phylogenetic analysis in R.” Bioinformatics 27 (4): 592–93. doi:10.1093/bioinformatics/btq706.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag. http://ggplot2.org.