1993: first release of R
1996: Ihaka & Gentleman, J. Comput. Graph. Statist., 5: 299–314
2000: R 1.0.0
2001: start the development of ape
2002: first release of ape
2008: first release of phangorn
2019: 260 packages depending on ape on CRAN, 16 on BioConductor (and 100’s elsewhere: R-Forge, GitHub, . . .)
So many functions, but how to find/remember them?
Operators are also functions:
`+` # or get("+")
## function (e1, e2) .Primitive("+")
`+`(1, 2)
## [1] 3
Find generic functions for a class
Common generic functions are print, plot, c, summary, anova, [, …
library(ape)
methods(class="phylo")
## [1] + addConfidences all.equal
## [4] as.evonet as.hclust as.matching
## [7] as.networx as.splits c
## [10] coalescent.intervals cophenetic degree
## [13] di2multi drop.tip identify
## [16] is.binary is.rooted is.ultrametric
## [19] keep.tip makeLabel midpoint
## [22] multi2di Nedge Nnode
## [25] Ntip plot print
## [28] reorder root simSeq
## [31] skyline summary unroot
## [34] updateLabel vcv
## see '?methods' for accessing help and source code
Run all examples of the help files!
example(plot.phylo)
However examples in (CRAN) R packages are not allowed to run longer than 5 seconds. So for inference this results frequently in toy examples. You need to change to higher number of bootstrap iterations, change stop criterion for optimisation et cetera.
Many packages on CRAN and all on Bioconductor come with vignettes.
vignette()
vignette(package="phangorn")
vignette("Trees", package="phangorn")
And there is google and stackoverflow.
phylo
classThe phylo
class is a tree structure in ape
are the standard for storing phylogenetic in R. It is also used in more
than 260 packages on CRAN.
This class represents phylogenetic trees and networks; (internal) nodes are of degree ≥ 2; the tips (terminal nodes) are of degree 1. The tip labels may be replicated (or can even missing). Idem for the node labels (which are optional).
An object of class phylo
is a list containing at least
the following objects:
tip.label
, can be
““.Nnode
with the number of internal
nodes.phylo
.library(ape)
tree <- read.tree(text = "(,(,));")
plot(tree)
nodelabels()
tiplabels()
str(tree)
## List of 3
## $ edge : int [1:4, 1:2] 4 4 5 5 1 5 2 3
## $ Nnode : int 2
## $ tip.label: chr [1:3] "" "" ""
## - attr(*, "class")= chr "phylo"
## - attr(*, "order")= chr "cladewise"
The storage mode is important. Saving a vector of integers saves
memory in comparison to vector of doubles. Also objects of class
phylo
often in C-Code so you need to take care of this.
Also comparing doubles is dangerous:
x <- 3
y <- sqrt(3)^2
x==y
## [1] FALSE
In the edge matrix each edge (branch) is coded by the nodes it connects.
tree$edge
## [,1] [,2]
## [1,] 4 1
## [2,] 4 5
## [3,] 5 2
## [4,] 5 3
Tips (terminal nodes, leaves, labels, …) are numbered from 1 to
n
, where n
is the number of tips. The root is
numbered n+1
. The ancestral node (closer to the root) is
always on the lefthand column.
Some important properties and observations: * the first column only
contains values greater than n
. * all nodes, but the root,
appear exactly once in the second column. * in a strictly bifurcating
tree, all nodes greater then n
appear twice and the root
node n+1
, in case of ‘twice’ for rooted and
trice
for unrooted trees.
So we can start and construct a minimal tree ourselves:
library(ape)
simple_tree <- structure(list(edge = matrix(c(3L, 3L, 1L, 2L), 2, 2),
tip.label = c("t1", "t2"), Nnode = 1L), class = "phylo")
plot(simple_tree)
A phylo
object always contains an edge
,
tip.label
and Nnode
. Often a phylo object
contains some of the following elements:
edge.length
. The length of
the vector is number of rows of the edge
matrix and the
order corresponds to the rows in edge
.node.labels
of length
Nnode
. The node labels are labeled from n+1
to
n+Nnode
Additionally most phylo
objects have an attribute
order
, which can be either “cladewise”, “postorder” or
“pruningwise”.
Tree traversal in ape is implemented iterative in contrast to the recursive implementation in most other phylogenetic programs.
Preorder tree traversal is travelling through the tree from the root to the tips visiting always the all nodes the right subtree before visiting the right subtree.
tree <- rtree(5)
plot(tree, show.tip.label = FALSE)
nodelabels()
tiplabels()
reorder(tree, "cladewise")$edge
## [,1] [,2]
## [1,] 6 7
## [2,] 7 1
## [3,] 7 2
## [4,] 6 8
## [5,] 8 3
## [6,] 8 9
## [7,] 9 4
## [8,] 9 5
Postorder tree traversal travels through the tree from the tips through the root. All the descendent nodes need to be visited before the ancestral node.
reorder(tree, "postorder")$edge
## [,1] [,2]
## [1,] 9 4
## [2,] 9 5
## [3,] 8 3
## [4,] 8 9
## [5,] 7 1
## [6,] 7 2
## [7,] 6 7
## [8,] 6 8
Defining the order of the traversal of the tree this way allows to
iterate just through the tree. Reorder not only changes the order of the
edge
matrix, but also of the edge.length
if it
exists.
###Excercise 1: Write a function which counts how many descendants each node has. The output should look like:
set.seed(42)
tree <- rtree(5)
plot(tree, show.tip.label = FALSE)
nodelabels()
tiplabels()
nr_desc(tree)
## [1] 1 1 1 1 1 5 4 3 2
My solutions
library(phangorn)
nr_desc_cheat <- function(x){
lengths(Descendants(x, 1:max(x$edge)))
}
nr_desc <- function(x){
x <- reorder(x, "postorder")
res <- numeric(max(x$edge))
res[1:Ntip(x)] = 1L
for(i in 1:nrow(x$edge)){
tmp = x$edge[i,1]
res[tmp] = res[tmp] + res[x$edge[i,2] ]
}
res
}
###Excercise 2: Write a function which computes the distance from each node to the root.
tree$edge.length[] <- 1
root_to_tip(tree)
## [1] 1 2 3 4 4 0 1 2 3
My solution:
root_to_tip <- function(x){
x <- reorder(x)
res <- numeric(max(x$edge))
for(i in 1:nrow(x$edge)){
pa <- x$edge[i,1]
ch <- x$edge[i,2]
res[ch] = res[pa] + x$edge.length[i]
}
res
}
An object of class ‘multiPhylo’ is a list of of several trees each of class ‘phylo’. If all trees share the same tip labels, as in bootstrap samples or from MCMC analyses, trees can be saved in a memory efficient way storing the labels only once in an attribute TipLabel.
trees <- rmtree(1000, 100)
trees
## 1000 phylogenetic trees
attr(trees, "TipLabel")
## NULL
trees_compact <- .compressTipLabel(trees)
attr(trees_compact, "TipLabel")
## [1] "t92" "t69" "t100" "t2" "t82" "t24" "t18" "t99" "t55" "t40"
## [11] "t21" "t57" "t42" "t94" "t13" "t53" "t54" "t83" "t32" "t80"
## [21] "t60" "t29" "t73" "t43" "t58" "t72" "t79" "t98" "t38" "t1"
## [31] "t86" "t5" "t78" "t16" "t77" "t88" "t65" "t56" "t33" "t59"
## [41] "t89" "t74" "t25" "t51" "t96" "t17" "t46" "t14" "t47" "t6"
## [51] "t52" "t66" "t37" "t67" "t31" "t34" "t30" "t63" "t81" "t22"
## [61] "t75" "t84" "t64" "t20" "t15" "t45" "t23" "t12" "t49" "t26"
## [71] "t9" "t71" "t97" "t95" "t27" "t28" "t11" "t85" "t76" "t41"
## [81] "t10" "t93" "t48" "t68" "t61" "t19" "t3" "t62" "t87" "t50"
## [91] "t36" "t35" "t91" "t90" "t70" "t44" "t4" "t7" "t8" "t39"
object.size(trees) / object.size(trees_compact)
## 2.5 bytes
Switching between the two representations can be done via ‘.compressTipLabel’ and ‘.uncompressTipLabel’.
Package ape
and phangorn
contains functions
to import trees and splits networks:
read.tree
Newick filesread.nexus
NEXUS files (only trees)read.evonet
networks in extended Newick formatand
write.tree
Newick (one or several trees)write.nexus
NEXUS (one or several trees with or without
translate block)write.evonet
extended Newick (.nhx)read.networx
phylogenetic splits networks in nexus
format (e.g. Splitstree)write.tree(tree, "example_tree.nex")
tree_read <- read.tree(file = "example_tree.nex")
plot(tree_read)
Any R object(s) can be saved on disk: this is much better if these data are used only in R. (*Faster to read/write, smaller files, and no loss of numerical accuracy.) Several objects:
save(X, a, y, file = "X.rda")
several objects with
their names
load("X.rda")
\(\rightarrow\) restores object names (i.e.,
possibly deletes previous X, a, y in the workspace) A single
object:
saveRDS(X, file = "X.rds")
Y <- readRDS("X.rds")
\(\rightarrow\) the name of the object can be
changed
phylo
objectsBefore you write a function look what’s already out there.
There are several convenience functions to achieve common tasks:
Operation | High level | Low level |
---|---|---|
How many tips? | Ntip(tr) | length(tr$tip.label) |
How many node? | Nnode(tr) | tr$Nnode |
How many edges? | Nedge(tr) | nrow(tr$edge) |
What are the ancestor of x | Ancestors(tr, x, “parent”) | tr$edge[tr$edge[,2]==x , 1] |
What are the descendants of x | Descendants(tr, x, “children”) | tr$edge[tr$edge[,1]==x , 2] |
Ntip, Nnode, Nedge are generic and work on phylo
and
multiPhylo
objects. Ancestors
and
Descendants
are vectorized , i.e. argument node can be a
vector.
root(phy, outgroup, ...)
, unroot(phy)
,
midpoint(tree)
, drop.tip(phy, tip, ...)
,
keep.tip(phy, tip)
,
extract.clade(phy, node, ...)
,
bind.tree(x, y, ...)
, c(...)
,
multi2di(phy, random = TRUE)
,
di2multi(phy, tol = 1e-8)
, rotate(phy, node)
,
rotateConstr(phy, X)
,
reorder(phy, order = "cladewise")
Some of the above functions have the option interactive (FALSE by default) and many are generic.
And some functions to explore the Baumraum (tree space)
nni
, rNNI
, rSPR
,
allTrees
.
We usually don’t simply stop when we have the tree. We often want to do some downstream analsyses using that phylogeny. This, in fact, is the whole point of the workshop. Broadly, phylogenetic comparative methods typically cluster into discrete and continuous trait methods. Continuous trait methods are for traits that can’t be broken into discrete units. This often includes weights, measurements, and ratios. Discrete traits are, as the name implies, able to be broken down into discrete units. Biogeography often falls into this classification.
Both continuous and discrete trait methods (including biogeography) will typically use a model of evolution to trace a character from the tips of a tree to the root. As an example, let’s a take a quick look at an ancestral state estimation.
library(geiger)
## Loading required package: phytools
## Loading required package: maps
q <- list(rbind(c(-.5, .5), c(.5, -.5)))
trait <- geiger::sim.char(tree, par = q, model = "discrete", n=1)
fit <- ace(trait, tree, type = "discrete", model = "ER")
fit
##
## Ancestral Character Estimation
##
## Call: ace(x = trait, phy = tree, type = "discrete", model = "ER")
##
## Log-likelihood: -2.772589
##
## Rate index matrix:
## 1 2
## 1 . 1
## 2 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 5.5115 6989.907
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## 1 2
## 0.5000082 0.4999918
Let’s walk through this output:
Log-likelihood is the likelihood of the data given the Equal-Rates model. The equal rates model can be seen right below - this model says that you are equally likely to go from a state 1 to a state 2 as vice versa - some of you may know this as the Mk model of Lewis (2001) or the Jukes-Cantor model (1969). Scaled likelihoods at the root is the likelihood of a state 1 or 2 at the root. We see these are nearly equivalent. Look at the tree - why should this be?
Before we visualize this, we’re going to install one more R Package that we didn’t install prior to the workshop. Many, if not most, R packages are available via CRAN, which is a central packaging repository. However, others are available on GitHub. There are also alternative R package repositories, such as BioConductor. This is where ggtree lives. We will now install it:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
## Bioconductor version '3.16' is out-of-date; the current release version '3.18'
## is available with R version '4.3'; see https://bioconductor.org/install
BiocManager::install("ggtree")
## Bioconductor version 3.16 (BiocManager 1.30.22), R 4.2.2 (2022-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'ggtree'
## Old packages: 'aplot', 'biomartr', 'bslib', 'checkmate', 'cluster', 'cpp11',
## 'crosstalk', 'dbplyr', 'deSolve', 'dotCall64', 'dplyr', 'DT', 'duckdb',
## 'e1071', 'evaluate', 'expm', 'foreign', 'FossilSim', 'gert', 'GGally',
## 'ggpp', 'ggrepel', 'ggthemes', 'gmp', 'haven', 'htmlTable', 'htmltools',
## 'htmlwidgets', 'httpuv', 'httr2', 'jsonlite', 'later', 'lattice', 'leaflet',
## 'lifecycle', 'magick', 'maps', 'markdown', 'Matrix', 'matrixStats',
## 'multicool', 'mvtnorm', 'nlme', 'philentropy', 'phytools', 'plotrix',
## 'PlotTools', 'polyclip', 'poorman', 'progress', 'R.utils', 'rbibutils',
## 'RcppArmadillo', 'RcppEigen', 'RCurl', 'Rdpack', 'reactR', 'restez',
## 'RevGadgets', 'Rfast', 'rgl', 'RJSONIO', 'Rogue', 'rpart', 'rprojroot',
## 'RSQLite', 'sass', 'scales', 'segmented', 'sets', 'shiny', 'sp', 'spam',
## 'stringdist', 'stringi', 'stringr', 'terra', 'testthat', 'tinytex',
## 'TreeDist', 'treemapify', 'units', 'vctrs', 'vroom', 'waldo', 'wk', 'xfun',
## 'XML', 'xml2'
With ggtree installed, let’s make some pretty graphics.
library(ggtree)
## Registered S3 methods overwritten by 'treeio':
## method from
## MRCA.phylo tidytree
## MRCA.treedata tidytree
## Nnode.treedata tidytree
## Ntip.treedata tidytree
## ancestor.phylo tidytree
## ancestor.treedata tidytree
## child.phylo tidytree
## child.treedata tidytree
## full_join.phylo tidytree
## full_join.treedata tidytree
## groupClade.phylo tidytree
## groupClade.treedata tidytree
## groupOTU.phylo tidytree
## groupOTU.treedata tidytree
## is.rooted.treedata tidytree
## nodeid.phylo tidytree
## nodeid.treedata tidytree
## nodelab.phylo tidytree
## nodelab.treedata tidytree
## offspring.phylo tidytree
## offspring.treedata tidytree
## parent.phylo tidytree
## parent.treedata tidytree
## root.treedata tidytree
## rootnode.phylo tidytree
## sibling.phylo tidytree
## ggtree v3.6.2 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
##
## G Yu. Data Integration, Manipulation and Visualization of Phylogenetic
## Trees (1st ed.). Chapman and Hall/CRC. 2022. ISBN: 9781032233574
##
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:ape':
##
## rotate
plot <- ace(trait, tree, type = "discrete", model = "ER")
plot(tree, cex = 0.5, adj = c(0.2), type = "fan",
no.margin = TRUE, show.tip.label = FALSE)
tiplabels(pch = 16, col = c( "lightblue",
"chocolate"))
nodelabels(pie = fit$lik.anc, cex = 0.5)
legend("bottomleft", pch = 15, bty = "n",
legend = c("0", "1"),
col = c( "lightblue",
"chocolate"))
Some of the earliest ways of mapping biogeography to trees were based on parsimony. In the time since, it became common to use methods like the above, while treating a landmas as a discrete state, finally, newer models, such as DEC, became more common to use for biogeography. Let’s grab a dataset and take a ramble through some of these methods.
First, we’ll simulate a tree:
big_tree <- rtree(45)
big_tree
##
## Phylogenetic tree with 45 tips and 44 internal nodes.
##
## Tip labels:
## t25, t26, t17, t2, t38, t27, ...
##
## Rooted; includes branch lengths.
rtree
as a function simulates a random tree. Now let’s
add some data to it. This time, we will simulate a tree using a method
called “All Rates Different.” Unlike the Mk model or the Jukes-Cantor,
this implies that the transition rates are all different between
possible character states.
library(geiger)
geo <- rTraitDisc(big_tree, "ARD", rate = c(0.1, 0.2))
fit <- ace(geo, big_tree, type = "discrete", model = "ER")
fit
##
## Ancestral Character Estimation
##
## Call: ace(x = geo, phy = big_tree, type = "discrete", model = "ER")
##
## Log-likelihood: -16.25669
##
## Rate index matrix:
## A B
## A . 1
## B 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 0.1276 0.0429
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## A B
## 0.98887515 0.01112485
## Explicitly biogeographic models
The above is useful. However, we may want models with parameters that actually carry biogeographic meaning. The most common and well-used of these is the DEC model, standing for Dispersal-Extirpation-Cladogenesis. The idea of this is that a lineage could be dispersing, being removed from an area, or speciating. Described by Ree in 2005, this has gone on to be one of the more popular approaches to understanding phylogeography in deep time. R methods for doing this have fallen apart in the past few years. If there is interest, I’m happy to run this in RevBayes later this week.
Jukes T.H., Cantor C.R. 1969. Evolution of Protein Molecules. Mammalian Protein Metabolism. 3:21–132. 10.1016/B978-1-4832-3211-9.50009-7 Lewis P.O. 2001. A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Systematic Biology. 50:913–925. 10.1080/106351501753462876
Paradis E. (2012) Definition of Formats for Coding Phylogenetic Trees in R url
Paradis E. (2012) Analysis of Phylogenetics and Evolution with R, 2nd ed., Springer, New York
Paradis E. & Schliep K. (2019) ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R, Bioinformatics 35 (3), 526-528
Ree R.H., Moore B.R., Webb C.O., Donoghue M.J., Crandall K. 2005. A likelihood framework for inferring the evolution of geographic range on phylogenetic trees. Evolution. 59:2299–2311. 10.1111/j.0014-3820.2005.tb00940.x
Schliep K.P. 2011. phangorn: phylogenetic analysis in R. Bioinformatics, 27(4) 592-593