Tree Statistics, Metrics, Summarizations, and Other Calculations

Some general tree metrics that are calculated without reference to any particular model or data and general report some tree metadata (e.g., tree length, node ages, etc.) are available as instance methods. More specialized tree statistics, however, are available through functions in various other modules:

  • The treemeasure module provides for calculation of statistics that are typically calculated on a single tree.
  • The treecompare module provides for calculation of statistics that are typically calculated between trees
  • The treescore module provides for statistics that typically score a tree under a model and with reference to some sort of data.
  • The coalescent module provides for calcuations on trees under the coalescent model.

In addition, see the PhylogeneticDistanceMatrix class for statistics, operations, and inferences based on phylogenetic (taxon-to-taxon) distances, including rapid calculation of MRCA’s, patristics distances, neighbor-joining (NJ) and Unweighted Pair Group with Mathematical Average (UPGMA) trees, phylogenetic community statistics (such as the Mean Pairwise Distance, MPD, or Mean Nearest Taxon Distance, MNTD), and more.

Native Tree Statistic and Metric Methods

Basic meta-information about tree structure are available as native Tree methods.

Tree Length

The length method returns the sum of edge lengths of a Tree object, with edges that do not have any length assigned being treated as edges with length 0. The following example shows how to identify the “critical” value for an Archie-Faith-Cranston or PTP test from a sample of Tree objects, i.e. a tree length equal to or greater than 95% of the trees in the sample:

#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList.get(
    path="pythonidae.random.bd0301.tre",
    schema="nexus")
tree_lengths = [tree.length() for tree in trees]
tree_lengths.sort()
crit_index_95 = int(0.95 * len(tree_lengths))
crit_index_99 = int(0.99 * len(tree_lengths))

print("95%% critical value: %s" % tree_lengths[crit_index_95])
print("99%% critical value: %s" % tree_lengths[crit_index_99])

Node Ages

The calc_node_ages method calculates the age of a node (i.e., the sum of edge lengths from the node to a tip) and assigns it to the age attribute. The following example iterates through the post-burnin of an MCMC sample of ultrametric trees, calculating the age of the MRCA of two taxa, and reports the mean age of the node.

#! /usr/bin/env python

import dendropy
from dendropy.calculate import treemeasure

trees = dendropy.TreeList.get(
        path="pythonidae.beast-mcmc.trees",
        schema="nexus",
        tree_offset=200)
maculosa_childreni_ages = []
for idx, tree in enumerate(trees):
    tree.calc_node_ages()
    taxon_labels = ["Antaresia maculosa","Antaresia childreni"]
    mrca = tree.mrca(taxon_labels=taxon_labels)
    maculosa_childreni_ages.append(mrca.age)
print("Mean age of MRCA of 'Antaresia maculosa' and 'Antaresia childreni': %s" \
    % (float(sum(maculosa_childreni_ages))/len(maculosa_childreni_ages)))


Number of Lineages at a Particular Time and Lineage Through Time Plots

The num_lineages_at method of the Tree class returns the number of lineages at a particular time given in terms of distance from the root. The following example extracts the number of lineages at fixed intervals along the length of the tree to use in an Lineage Through Time (LTT) plot:

import dendropy

tree = dendropy.Tree.get(
            path="hiv1.nexus",
            schema="nexus")

# Returns distance of node furthest from root, i.e., maximum time available on
# tree
total_time = tree.max_distance_from_root()

# Divide time span into 10 steps
step = float(total_time) / 10

# To store tuples of (time, number of lineages)
ltt = []

# Start at first time step
current_time = step
while current_time <= total_time:
    # Get number of lineages at current time
    num_lineages = tree.num_lineages_at(current_time)
    # Store it
    ltt.append( (current_time, num_lineages) )
    # Move to next time step
    current_time += step

# Get the final number of lineages
# Note: may not be the same as the number of tips if the tree has extinct
# tips/taxa; though, if this were the case, we would not be dealing with an
# ultrametric tree.
if current_time < total_time:
    ltt.append( (total_time, tree.num_lineages_at(total_time)) )

# Print results
for t, num_lineages in ltt:
    print("{:12.8f}\t{}".format(t, num_lineages))

Unary Tree Statistics and Metrics

Numerous specialized statistics and indexes of tree shape and structure (B1, Colless’ imbalance, Pybus-Harvey-Gamma, etc.) are available through the treemeasure module:

import collections
import dendropy
from dendropy.calculate import treemeasure
from dendropy.calculate import statistics

# Since we do not want to waste memory by keeping the actual trees around
# after we are done calculating the statistics, we use the tree yielder
# instead of:
#       dendropy.TreeList.get(
#           path="pythonidae.beast-mcmc.trees",
#           schema="nexus",
#           tree_offset=200)

tree_stats = collections.defaultdict(list)
for tree_idx, tree in enumerate(dendropy.Tree.yield_from_files(
            files=["pythonidae.beast-mcmc.trees"],
            schema="nexus")):
    if tree_idx < 200:
        continue # burnin
    tree_stats["B1"].append(treemeasure.B1(tree))
    tree_stats["colless"].append(treemeasure.colless_tree_imbalance(tree))
    tree_stats["PBH"].append(treemeasure.pybus_harvey_gamma(tree))
    tree_stats["sackin"].append(treemeasure.sackin_index(tree))
    tree_stats["treeness"].append(treemeasure.treeness(tree))

for key in tree_stats:
    values = tree_stats[key]
    mean, var = statistics.mean_and_sample_variance(values)
    hpd = statistics.empirical_hpd(values)
    print("{:15}: mean = {}, variance = {}, hpd = ({}, {})".format(key, mean, var, hpd[0], hpd[1]))

Pybus-Harvey Gamma

The Pybus-Harvey Gamma statistic is given by the pybus_harvey_gamma instance method. The following example iterates through the post-burn-in of an MCMC sample of trees, reporting the mean Pybus-Harvey Gamma statistic:

#! /usr/bin/env python

import dendropy
from dendropy.calculate import treemeasure

trees = dendropy.TreeList.get(
        path="pythonidae.beast-mcmc.trees",
        schema="nexus",
        tree_offset=200)
pbhg = []
for idx, tree in enumerate(trees):
    pbhg.append(treemeasure.pybus_harvey_gamma(tree))
print("Mean Pybus-Harvey-Gamma: %s" \
    % (float(sum(pbhg))/len(pbhg)))


Patristic Distances

The PhylogeneticDistanceMatrix is the most efficient way to calculate the patristic distances between taxa or leaves on a tree, when doing multiple such calculations. The easiest way to get an object of this class for a particular tree is to call phylogenetic_distance_matrix. The object is callable, taking two Taxon objects as arguments and returning the sum of edge lengths between the two. The following example reports the pairwise distances between all taxa on the input tree:

import dendropy

tree = dendropy.Tree.get(
    path="pythonidae.mle.nex",
    schema="nexus")
pdc = tree.phylogenetic_distance_matrix()
for i, t1 in enumerate(tree.taxon_namespace[:-1]):
    for t2 in tree.taxon_namespace[i+1:]:
        print("Distance between '%s' and '%s': %s" % (t1.label, t2.label, pdc(t1, t2)))

Note that the PhylogeneticDistanceMatrix object does not automatically update if the original Tree changes: it is essentially a snapshot of Tree at the point in which it is instantiated. If the original Tree changes, you should create a new instance of the corresponding PhylogeneticDistanceMatrix object.

Comparing and Summarizing Trees

Distances Between Trees

Unweighted Robinson-Foulds Distance

The unweighted Robinson-Foulds distance (often referred to as just the Robinson-Foulds distance) is given by the dendropy.calculate.treecompare.symmetric_difference function:

import dendropy
from dendropy.calculate import treecompare

s1 = "(a,(b,(c,d)));"
s2 = "(a,(d,(b,c)));"

# establish common taxon namespace
tns = dendropy.TaxonNamespace()

# ensure all trees loaded use common namespace
tree1 = dendropy.Tree.get(
        data=s1,
        schema='newick',
        taxon_namespace=tns)
tree2 = dendropy.Tree.get(
        data=s2,
        schema='newick',
        taxon_namespace=tns)

## Unweighted Robinson-Foulds distance
print(treecompare.symmetric_difference(tree1, tree2))

Note that the two trees must share the same TaxonNamespace reference, otherwise an error will be raised:

>> import dendropy
>> from dendropy.calculate import treecompare
>> s1 = "(a,(b,(c,d)));"
>> s2 = "(a,(d,(b,c)));"
>> tree1 = dendropy.Tree.get(data=s1, schema='newick')
>> tree2 = dendropy.Tree.get(data=s2, schema='newick')
>> print(treecompare.symmetric_difference(tree1, tree2))
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
        print(treecompare.symmetric_difference(tree1, tree2))
    File "/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy/calculate/treecompare.py", line 85, in symmetric_difference
        is_bipartitions_updated=is_bipartitions_updated)
    File "/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy/calculate/treecompare.py", line 221, in false_positives_and_negatives
        raise error.TaxonNamespaceIdentityError(reference_tree, comparison_tree)
    dendropy.utility.error.TaxonNamespaceIdentityError: Non-identical taxon namespace references: <TaxonNamespace object at 0x10052d310> is not <TaxonNamespace object at 0x101572210>

Note, too, that results very much depend on the rooting states of the tree:

import dendropy
from dendropy.calculate import treecompare

s1 = "(a,(b,(c,d)));"
s2 = "((a,b),(c,d));"

tns = dendropy.TaxonNamespace()

unrooted_tree1 = dendropy.Tree.get(
        data=s1,
        schema='newick',
        taxon_namespace=tns)
unrooted_tree2 = dendropy.Tree.get(
        data=s2,
        schema='newick',
        taxon_namespace=tns)

rooted_tree1 = dendropy.Tree.get(
        data=s1,
        schema='newick',
        rooting="force-rooted",
        taxon_namespace=tns)
rooted_tree2 = dendropy.Tree.get(
        data=s2,
        schema='newick',
        rooting="force-rooted",
        taxon_namespace=tns)

## Unweighted Robinson-Foulds distance (rooted) = 2
print(treecompare.symmetric_difference(rooted_tree1, rooted_tree2))
## Unweighted Robinson-Foulds distance (unrooted) = 0
print(treecompare.symmetric_difference(unrooted_tree1, unrooted_tree2))
## Unweighted Robinson-Foulds distance (rooted1 to unrooted2) = 3
print(treecompare.symmetric_difference(rooted_tree1, unrooted_tree2))
## Unweighted Robinson-Foulds distance (unrooted1 to rooted2) = 5
print(treecompare.symmetric_difference(unrooted_tree1, rooted_tree2))

Weighted Robinson-Foulds Distance

The weighted Robinson-Foulds distance takes edge lengths into account, and is given by the dendropy.calculate.treecompare.weighted_robinson_foulds_distance:

import dendropy
from dendropy.calculate import treecompare

s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247);"
s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247);"

tns = dendropy.TaxonNamespace()

tree1 = dendropy.Tree.get(
        data=s1,
        schema='newick',
        taxon_namespace=tns)
tree2 = dendropy.Tree.get(
        data=s2,
        schema='newick',
        taxon_namespace=tns)

## Weighted Robinson-Foulds distance = 2.971031
print(treecompare.weighted_robinson_foulds_distance(tree1, tree2))

## Compare to unweighted Robinson-Foulds distance: 0
print(treecompare.symmetric_difference(tree1, tree2))

Euclidean Distance

The Euclidean distance, like the weighted Robinson-Foulds distance takes edge lengths into account, but squares the edge lengths instead of taking the absolute distance, and is given by the dendropy.calculate.treecompare.euclidean_distance:

import dendropy
from dendropy.calculate import treecompare

s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247);"
s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247);"

tns = dendropy.TaxonNamespace()

tree1 = dendropy.Tree.get(
        data=s1,
        schema='newick',
        taxon_namespace=tns)
tree2 = dendropy.Tree.get(
        data=s2,
        schema='newick',
        taxon_namespace=tns)

## Euclidean distance = 2.22326363775
print(treecompare.euclidean_distance(tree1, tree2))

Majority-Rule Consensus Tree from a Collection of Trees

To get the majority-rule consensus tree of a TreeList object, you can call the consensus instance method. You can specify the frequency threshold for the consensus tree by the min_freq argument, which default to 0.5 (i.e., a 50% majority rule tree). The following example aggregates the post-burn-in trees from four MCMC samples into a single TreeList object, and prints the 95% majority-rule consensus as a Newick string:

#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
        'pythonidae.mb.run2.t',
        'pythonidae.mb.run3.t',
        'pythonidae.mb.run4.t']:
    trees.read(
            path=tree_file,
            schema='nexus',
            tree_offset=20)
con_tree = trees.consensus(min_freq=0.95)
print(con_tree.as_string(schema='newick'))

Frequency of a Split in a Collection of Trees

The frequency_of_split method of a TreeList object returns the frequency of occurrence of a single split across all the Tree objects in the TreeList. The split can be specified by passing a split bitmask directly using the split_bitmask keyword argument, as a list of Taxon objects using the taxa keyword argument, or as a list of taxon labels using the labels keyword argument. The following example shows how to calculate the frequency of a split defined by two taxa, “Morelia amethistina” and “Morelia tracyae”, from the post-burn-in trees aggregated across four MCMC samples:

#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
        'pythonidae.mb.run2.t',
        'pythonidae.mb.run3.t',
        'pythonidae.mb.run4.t']:
    trees.read(
            path=tree_file,
            schema='nexus',
            tree_offset=20)
split_leaves = ['Morelia amethistina', 'Morelia tracyae']
f = trees.frequency_of_bipartition(labels=split_leaves)
print('Frequency of split %s: %s' % (split_leaves, f))

The Maximum Clade Credibility Tree: The Tree that Maximizes the Product of Split Support

The Maximum Clade Credibility Tree (MCCT) is one that maximize the product of split support, and is returned for a collection of trees managed in a TreeList instance by the maximum_product_of_split_support_tree method:

#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
        'pythonidae.mb.run2.t',
        'pythonidae.mb.run3.t',
        'pythonidae.mb.run4.t']:
    trees.read(
            path=tree_file,
            schema='nexus',
            tree_offset=20)

mcct = trees.maximum_product_of_split_support_tree()
print("\nTree {} maximizes the product of split support (log product = {}): {}".format(trees.index(mcct), mcct.log_product_of_split_support, mcct))

msst = trees.maximum_sum_of_split_support_tree()
print("\nTree {} maximizes the sum of split support (sum = {}): {}".format(trees.index(msst), msst.sum_of_split_support, msst))

Unfortunately, terminology in usage and literature regarding this type of summary is very confusing, and sometimes the term “MCCT” is used to refer to the tree that maximizes the sum of split support and “MCT” to the tree that maximizes the product of split support. If the tree that maximizes the sum of split support is the criteria required, then the maximum_sum_of_split_support_tree method of the TreeList object should be used.

Scoring Trees Under the Coalescent

Probability Under the Coalescent Model

The coalescent module provides a range of methods for simulations and calculations under Kingman’s coalescent framework and related models. For example:

log_probability_of_coalescent_tree
Given a Tree object as the first argument, and the haploid population size as the second, returns the log probability of the Tree under the neutral coalescent.

Numbers of Deep Coalescences

reconciliation_discordance
Given two Tree objects sharing the same leaf-set, this returns the number of deep coalescences resulting from fitting the first tree (e.g., a gene tree) to the second (e.g., a species tree). This is based on the algorithm described Goodman, et al. (Goodman, et al., 1979. Fitting the gene lineage into its species lineage,a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 19: 99-113).
monophyletic_partition_discordance
Given a Tree object as the first argument, and a list of lists of Taxon objects representing the expected monophyletic partitioning of the TaxonNamespace of the Tree as the second argument, this returns the number of deep coalescences found in the relationships implied by the Tree object, conditional on the taxon groupings given by the second argument. This statistic corresponds to the Slatkin and Maddison (1989) s statistic, as described here.

Number of Deep Coalescences when Embedding One Tree in Another (e.g. Gene/Species Trees)

Imagine we wanted to generate the distribution of the number of deep coalescences under two scenarios: one in which a population underwent sequential or step-wise vicariance, and another when there was simultaneous fragmentation. In this case, the containing tree and the embedded trees have different leaf sets, and there is a many-to-one mapping of embedded tree taxa to containing tree taxa.

The ContainingTree class is designed to allow for counting deep coalescences in cases like this. It requires a TaxonNamespaceMapping object, which provides an association between the embedded taxa and the containing taxa. The easiest way to get a TaxonNamespaceMapping object is to call the special factory function create_contained_taxon_mapping. This will create a new TaxonNamespace to manage the gene taxa, and create the associations between the gene taxa and the containing tree taxa for you. It takes two arguments: the TaxonNamespace of the containing tree, and the number of genes you want sampled from each species. If the gene-species associations are more complex, e.g., different numbers of genes per species, we can pass in a list of values as the second argument to create_contained_taxon_mapping. This approach should be used with caution if we cannot be certain of the order of taxa (as is the case with data read in Newick formats). In these case, and in more complex cases, we might need to directly instantiate the TaxonNamespaceMapping object. The API to describe the associations when constructing this object is very similar to that of the TaxonNamespacePartition object: you can use a function, attribute or dictionary.

The ContainingTree class has its own native contained coalescent simulator, embed_contained_kingman, which simulates and embeds a contained coalescent tree at the same time.

#! /usr/bin/env python

import dendropy
from dendropy.simulate import treesim
from dendropy.model import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000;"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000;"

# taxa and trees
containing_taxa = dendropy.TaxonNamespace()
stepwise_tree = dendropy.Tree.get(
        data=stepwise_tree_str,
        schema="newick",
        taxon_namespace=containing_taxa)
frag_tree = dendropy.Tree.get(
        data=frag_tree_str,
        schema="newick",
        taxon_namespace=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonNamespaceMapping.create_contained_taxon_mapping(
        containing_taxon_namespace=containing_taxa,
        num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
            contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
            contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
            contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
            contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
    stepwise_tree.embed_contained_kingman(default_pop_size=40000)
    frag_tree.embed_contained_kingman(default_pop_size=40000)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
    stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
    frag_out.write("%d\n" % frag_deep_coals[tree])








If you have used some other method to simulate your trees, you can use embed_tree to embed the trees and count then number of deep coalescences.

#! /usr/bin/env python

import dendropy
from dendropy.simulate import treesim
from dendropy.model import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000;"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000;"

# taxa and trees
containing_taxa = dendropy.TaxonNamespace()
stepwise_tree = dendropy.Tree.get(
        data=stepwise_tree_str,
        schema="newick",
        taxon_namespace=containing_taxa)
frag_tree = dendropy.Tree.get(
        data=frag_tree_str,
        schema="newick",
        taxon_namespace=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonNamespaceMapping.create_contained_taxon_mapping(
        containing_taxon_namespace=containing_taxa,
        num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
            contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
            contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
            contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
            contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
    gene_tree1 = treesim.contained_coalescent_tree(containing_tree=stepwise_tree,
        gene_to_containing_taxon_map=genes_to_species,
        default_pop_size=40000)
    stepwise_tree.embed_tree(gene_tree1)
    gene_tree2 = treesim.contained_coalescent_tree(containing_tree=frag_tree,
        gene_to_containing_taxon_map=genes_to_species,
        default_pop_size=40000)
    frag_tree.embed_tree(gene_tree2)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
    stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
    frag_out.write("%d\n" % frag_deep_coals[tree])








For more details on simulating contained coalescent trees and counting numbers of deep coalescences on them, see “Multispecies Coalescent (“Contained Coalescent” or “Censored Coalescent”) Trees” or “Simulating the Distribution of Number Deep Coalescences Under Different Phylogeographic History Scenarios”.