# Tree Statistics, Metrics, Summarizations, and Other Calculations¶

Some general tree metrics that are calculated without reference to any particular model or data and general report some tree metadata (e.g., tree length, node ages, etc.) are available as instance methods. More specialized tree statistics, however, are available through functions in various other modules:

• The `treemeasure` module provides for calculation of statistics that are typically calculated on a single tree.

• The `treecompare` module provides for calculation of statistics that are typically calculated between trees

• The `treescore` module provides for statistics that typically score a tree under a model and with reference to some sort of data.

• The `coalescent` module provides for calcuations on trees under the coalescent model.

In addition, see the PhylogeneticDistanceMatrix class for statistics, operations, and inferences based on phylogenetic (taxon-to-taxon) distances, including rapid calculation of MRCA’s, patristics distances, neighbor-joining (NJ) and Unweighted Pair Group with Mathematical Average (UPGMA) trees, phylogenetic community statistics (such as the Mean Pairwise Distance, MPD, or Mean Nearest Taxon Distance, MNTD), and more.

## Native Tree Statistic and Metric Methods¶

Basic meta-information about tree structure are available as native `Tree` methods.

### Tree Length¶

The `length` method returns the sum of edge lengths of a `Tree` object, with edges that do not have any length assigned being treated as edges with length 0. The following example shows how to identify the “critical” value for an Archie-Faith-Cranston or PTP test from a sample of `Tree` objects, i.e. a tree length equal to or greater than 95% of the trees in the sample:

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy

trees = dendropy.TreeList.get(
path="pythonidae.random.bd0301.tre",
schema="nexus")
tree_lengths = [tree.length() for tree in trees]
tree_lengths.sort()
crit_index_95 = int(0.95 * len(tree_lengths))
crit_index_99 = int(0.99 * len(tree_lengths))

print("95%% critical value: %s" % tree_lengths[crit_index_95])
print("99%% critical value: %s" % tree_lengths[crit_index_99])
```

### Node Ages¶

The `calc_node_ages` method calculates the age of a node (i.e., the sum of edge lengths from the node to a tip) and assigns it to the `age` attribute. The following example iterates through the post-burnin of an MCMC sample of ultrametric trees, calculating the age of the MRCA of two taxa, and reports the mean age of the node.

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy
from dendropy.calculate import treemeasure

trees = dendropy.TreeList.get(
path="pythonidae.beast-mcmc.trees",
schema="nexus",
tree_offset=200)
maculosa_childreni_ages = []
for idx, tree in enumerate(trees):
tree.calc_node_ages()
taxon_labels = ["Antaresia maculosa","Antaresia childreni"]
mrca = tree.mrca(taxon_labels=taxon_labels)
maculosa_childreni_ages.append(mrca.age)
print("Mean age of MRCA of 'Antaresia maculosa' and 'Antaresia childreni': %s" \
% (float(sum(maculosa_childreni_ages))/len(maculosa_childreni_ages)))

```

### Number of Lineages at a Particular Time and Lineage Through Time Plots¶

The `num_lineages_at` method of the `Tree` class returns the number of lineages at a particular time given in terms of distance from the root. The following example extracts the number of lineages at fixed intervals along the length of the tree to use in an Lineage Through Time (LTT) plot:

```import dendropy

tree = dendropy.Tree.get(
path="hiv1.nexus",
schema="nexus")

# Returns distance of node furthest from root, i.e., maximum time available on
# tree
total_time = tree.max_distance_from_root()

# Divide time span into 10 steps
step = float(total_time) / 10

# To store tuples of (time, number of lineages)
ltt = []

# Start at first time step
current_time = step
while current_time <= total_time:
# Get number of lineages at current time
num_lineages = tree.num_lineages_at(current_time)
# Store it
ltt.append( (current_time, num_lineages) )
# Move to next time step
current_time += step

# Get the final number of lineages
# Note: may not be the same as the number of tips if the tree has extinct
# tips/taxa; though, if this were the case, we would not be dealing with an
# ultrametric tree.
if current_time < total_time:
ltt.append( (total_time, tree.num_lineages_at(total_time)) )

# Print results
for t, num_lineages in ltt:
print("{:12.8f}\t{}".format(t, num_lineages))
```

## Unary Tree Statistics and Metrics¶

Numerous specialized statistics and indexes of tree shape and structure (B1, Colless’ imbalance, Pybus-Harvey-Gamma, etc.) are available through the `treemeasure` module:

```import collections
import dendropy
from dendropy.calculate import treemeasure
from dendropy.calculate import statistics

# Since we do not want to waste memory by keeping the actual trees around
# after we are done calculating the statistics, we use the tree yielder
#       dendropy.TreeList.get(
#           path="pythonidae.beast-mcmc.trees",
#           schema="nexus",
#           tree_offset=200)

tree_stats = collections.defaultdict(list)
for tree_idx, tree in enumerate(dendropy.Tree.yield_from_files(
files=["pythonidae.beast-mcmc.trees"],
schema="nexus")):
if tree_idx < 200:
continue # burnin
tree_stats["B1"].append(treemeasure.B1(tree))
tree_stats["colless"].append(treemeasure.colless_tree_imbalance(tree))
tree_stats["PBH"].append(treemeasure.pybus_harvey_gamma(tree))
tree_stats["sackin"].append(treemeasure.sackin_index(tree))
tree_stats["treeness"].append(treemeasure.treeness(tree))

for key in tree_stats:
values = tree_stats[key]
mean, var = statistics.mean_and_sample_variance(values)
hpd = statistics.empirical_hpd(values)
print("{:15}: mean = {}, variance = {}, hpd = ({}, {})".format(key, mean, var, hpd[0], hpd[1]))
```

### Pybus-Harvey Gamma¶

The Pybus-Harvey Gamma statistic is given by the `pybus_harvey_gamma` instance method. The following example iterates through the post-burn-in of an MCMC sample of trees, reporting the mean Pybus-Harvey Gamma statistic:

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy
from dendropy.calculate import treemeasure

trees = dendropy.TreeList.get(
path="pythonidae.beast-mcmc.trees",
schema="nexus",
tree_offset=200)
pbhg = []
for idx, tree in enumerate(trees):
pbhg.append(treemeasure.pybus_harvey_gamma(tree))
print("Mean Pybus-Harvey-Gamma: %s" \
% (float(sum(pbhg))/len(pbhg)))

```

### Patristic Distances¶

The `PhylogeneticDistanceMatrix` is the most efficient way to calculate the patristic distances between taxa or leaves on a tree, when doing multiple such calculations. The easiest way to get an object of this class for a particular tree is to call `phylogenetic_distance_matrix`. The object is callable, taking two `Taxon` objects as arguments and returning the sum of edge lengths between the two. The following example reports the pairwise distances between all taxa on the input tree:

```import dendropy

tree = dendropy.Tree.get(
path="pythonidae.mle.nex",
schema="nexus")
pdc = tree.phylogenetic_distance_matrix()
for i, t1 in enumerate(tree.taxon_namespace[:-1]):
for t2 in tree.taxon_namespace[i+1:]:
print("Distance between '%s' and '%s': %s" % (t1.label, t2.label, pdc(t1, t2)))
```

Note that the `PhylogeneticDistanceMatrix` object does not automatically update if the original `Tree` changes: it is essentially a snapshot of `Tree` at the point in which it is instantiated. If the original `Tree` changes, you should create a new instance of the corresponding `PhylogeneticDistanceMatrix` object.

## Comparing and Summarizing Trees¶

### Distances Between Trees¶

#### Unweighted Robinson-Foulds Distance¶

The unweighted Robinson-Foulds distance (often referred to as just the Robinson-Foulds distance) is given by the `dendropy.calculate.treecompare.symmetric_difference` function:

```import dendropy
from dendropy.calculate import treecompare

s1 = "(a,(b,(c,d)));"
s2 = "(a,(d,(b,c)));"

# establish common taxon namespace
tns = dendropy.TaxonNamespace()

# ensure all trees loaded use common namespace
tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)

## Unweighted Robinson-Foulds distance
print(treecompare.symmetric_difference(tree1, tree2))
```

Note that the two trees must share the same `TaxonNamespace` reference, otherwise an error will be raised:

```>> import dendropy
>> from dendropy.calculate import treecompare
>> s1 = "(a,(b,(c,d)));"
>> s2 = "(a,(d,(b,c)));"
>> tree1 = dendropy.Tree.get(data=s1, schema='newick')
>> tree2 = dendropy.Tree.get(data=s2, schema='newick')
>> print(treecompare.symmetric_difference(tree1, tree2))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
print(treecompare.symmetric_difference(tree1, tree2))
File "/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy/calculate/treecompare.py", line 85, in symmetric_difference
is_bipartitions_updated=is_bipartitions_updated)
File "/Users/jeet/Documents/Projects/Phyloinformatics/DendroPy/dendropy/dendropy/calculate/treecompare.py", line 221, in false_positives_and_negatives
raise error.TaxonNamespaceIdentityError(reference_tree, comparison_tree)
dendropy.utility.error.TaxonNamespaceIdentityError: Non-identical taxon namespace references: <TaxonNamespace object at 0x10052d310> is not <TaxonNamespace object at 0x101572210>
```

Note, too, that results very much depend on the rooting states of the tree:

```import dendropy
from dendropy.calculate import treecompare

s1 = "(a,(b,(c,d)));"
s2 = "((a,b),(c,d));"

tns = dendropy.TaxonNamespace()

unrooted_tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
unrooted_tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)

rooted_tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
rooting="force-rooted",
taxon_namespace=tns)
rooted_tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
rooting="force-rooted",
taxon_namespace=tns)

## Unweighted Robinson-Foulds distance (rooted) = 2
print(treecompare.symmetric_difference(rooted_tree1, rooted_tree2))
## Unweighted Robinson-Foulds distance (unrooted) = 0
print(treecompare.symmetric_difference(unrooted_tree1, unrooted_tree2))
## Unweighted Robinson-Foulds distance (rooted1 to unrooted2) = 3
print(treecompare.symmetric_difference(rooted_tree1, unrooted_tree2))
## Unweighted Robinson-Foulds distance (unrooted1 to rooted2) = 5
print(treecompare.symmetric_difference(unrooted_tree1, rooted_tree2))

```

#### Weighted Robinson-Foulds Distance¶

The weighted Robinson-Foulds distance takes edge lengths into account, and is given by the `dendropy.calculate.treecompare.weighted_robinson_foulds_distance`:

```import dendropy
from dendropy.calculate import treecompare

s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247);"
s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247);"

tns = dendropy.TaxonNamespace()

tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)

## Weighted Robinson-Foulds distance = 2.971031
print(treecompare.weighted_robinson_foulds_distance(tree1, tree2))

## Compare to unweighted Robinson-Foulds distance: 0
print(treecompare.symmetric_difference(tree1, tree2))
```

#### Euclidean Distance¶

The Euclidean distance, like the weighted Robinson-Foulds distance takes edge lengths into account, but squares the edge lengths instead of taking the absolute distance, and is given by the `dendropy.calculate.treecompare.euclidean_distance`:

```import dendropy
from dendropy.calculate import treecompare

s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247);"
s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247);"

tns = dendropy.TaxonNamespace()

tree1 = dendropy.Tree.get(
data=s1,
schema='newick',
taxon_namespace=tns)
tree2 = dendropy.Tree.get(
data=s2,
schema='newick',
taxon_namespace=tns)

## Euclidean distance = 2.22326363775
print(treecompare.euclidean_distance(tree1, tree2))
```

### Majority-Rule Consensus Tree from a Collection of Trees¶

To get the majority-rule consensus tree of a `TreeList` object, you can call the `consensus` instance method. You can specify the frequency threshold for the consensus tree by the `min_freq` argument, which default to 0.5 (i.e., a 50% majority rule tree). The following example aggregates the post-burn-in trees from four MCMC samples into a single `TreeList` object, and prints the 95% majority-rule consensus as a Newick string:

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
'pythonidae.mb.run2.t',
'pythonidae.mb.run3.t',
'pythonidae.mb.run4.t']:
path=tree_file,
schema='nexus',
tree_offset=20)
con_tree = trees.consensus(min_freq=0.95)
print(con_tree.as_string(schema='newick'))
```

### Frequency of a Split in a Collection of Trees¶

The `frequency_of_split` method of a `TreeList` object returns the frequency of occurrence of a single split across all the `Tree` objects in the `TreeList`. The split can be specified by passing a split bitmask directly using the `split_bitmask` keyword argument, as a list of `Taxon` objects using the `taxa` keyword argument, or as a list of taxon labels using the `labels` keyword argument. The following example shows how to calculate the frequency of a split defined by two taxa, “Morelia amethistina” and “Morelia tracyae”, from the post-burn-in trees aggregated across four MCMC samples:

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
'pythonidae.mb.run2.t',
'pythonidae.mb.run3.t',
'pythonidae.mb.run4.t']:
path=tree_file,
schema='nexus',
tree_offset=20)
split_leaves = ['Morelia amethistina', 'Morelia tracyae']
f = trees.frequency_of_bipartition(labels=split_leaves)
print('Frequency of split %s: %s' % (split_leaves, f))
```

### The Maximum Clade Credibility Tree: The Tree that Maximizes the Product of Split Support¶

The Maximum Clade Credibility Tree (MCCT) is one that maximize the product of split support, and is returned for a collection of trees managed in a `TreeList` instance by the `maximum_product_of_split_support_tree` method:

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
'pythonidae.mb.run2.t',
'pythonidae.mb.run3.t',
'pythonidae.mb.run4.t']:
path=tree_file,
schema='nexus',
tree_offset=20)

mcct = trees.maximum_product_of_split_support_tree()
print("\nTree {} maximizes the product of split support (log product = {}): {}".format(trees.index(mcct), mcct.log_product_of_split_support, mcct))

msst = trees.maximum_sum_of_split_support_tree()
print("\nTree {} maximizes the sum of split support (sum = {}): {}".format(trees.index(msst), msst.sum_of_split_support, msst))
```

Unfortunately, terminology in usage and literature regarding this type of summary is very confusing, and sometimes the term “MCCT” is used to refer to the tree that maximizes the sum of split support and “MCT” to the tree that maximizes the product of split support. If the tree that maximizes the sum of split support is the criteria required, then the `maximum_sum_of_split_support_tree` method of the `TreeList` object should be used.

## Scoring Trees Under the Coalescent¶

### Probability Under the Coalescent Model¶

The `coalescent` module provides a range of methods for simulations and calculations under Kingman’s coalescent framework and related models. For example:

`log_probability_of_coalescent_tree`

Given a `Tree` object as the first argument, and the haploid population size as the second, returns the log probability of the `Tree` under the neutral coalescent.

### Numbers of Deep Coalescences¶

`reconciliation_discordance`

Given two `Tree` objects sharing the same leaf-set, this returns the number of deep coalescences resulting from fitting the first tree (e.g., a gene tree) to the second (e.g., a species tree). This is based on the algorithm described Goodman, et al. (Goodman, et al., 1979. Fitting the gene lineage into its species lineage,a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 19: 99-113).

`monophyletic_partition_discordance`

Given a `Tree` object as the first argument, and a list of lists of `Taxon` objects representing the expected monophyletic partitioning of the `TaxonNamespace` of the `Tree` as the second argument, this returns the number of deep coalescences found in the relationships implied by the `Tree` object, conditional on the taxon groupings given by the second argument. This statistic corresponds to the Slatkin and Maddison (1989) s statistic, as described here.

### Number of Deep Coalescences when Embedding One Tree in Another (e.g. Gene/Species Trees)¶

Imagine we wanted to generate the distribution of the number of deep coalescences under two scenarios: one in which a population underwent sequential or step-wise vicariance, and another when there was simultaneous fragmentation. In this case, the containing tree and the embedded trees have different leaf sets, and there is a many-to-one mapping of embedded tree taxa to containing tree taxa.

The `ContainingTree` class is designed to allow for counting deep coalescences in cases like this. It requires a `TaxonNamespaceMapping` object, which provides an association between the embedded taxa and the containing taxa. The easiest way to get a `TaxonNamespaceMapping` object is to call the special factory function `create_contained_taxon_mapping`. This will create a new `TaxonNamespace` to manage the gene taxa, and create the associations between the gene taxa and the containing tree taxa for you. It takes two arguments: the `TaxonNamespace` of the containing tree, and the number of genes you want sampled from each species. If the gene-species associations are more complex, e.g., different numbers of genes per species, we can pass in a list of values as the second argument to `create_contained_taxon_mapping`. This approach should be used with caution if we cannot be certain of the order of taxa (as is the case with data read in Newick formats). In these case, and in more complex cases, we might need to directly instantiate the `TaxonNamespaceMapping` object. The API to describe the associations when constructing this object is very similar to that of the `TaxonNamespacePartition` object: you can use a function, attribute or dictionary.

The `ContainingTree` class has its own native contained coalescent simulator, `embed_contained_kingman`, which simulates and embeds a contained coalescent tree at the same time.

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy
from dendropy.simulate import treesim
from dendropy.model import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000;"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000;"

# taxa and trees
containing_taxa = dendropy.TaxonNamespace()
stepwise_tree = dendropy.Tree.get(
data=stepwise_tree_str,
schema="newick",
taxon_namespace=containing_taxa)
frag_tree = dendropy.Tree.get(
data=frag_tree_str,
schema="newick",
taxon_namespace=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonNamespaceMapping.create_contained_taxon_mapping(
containing_taxon_namespace=containing_taxa,
num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
stepwise_tree.embed_contained_kingman(default_pop_size=40000)
frag_tree.embed_contained_kingman(default_pop_size=40000)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
frag_out.write("%d\n" % frag_deep_coals[tree])

```

If you have used some other method to simulate your trees, you can use `embed_tree` to embed the trees and count then number of deep coalescences.

```#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy
from dendropy.simulate import treesim
from dendropy.model import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000;"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000;"

# taxa and trees
containing_taxa = dendropy.TaxonNamespace()
stepwise_tree = dendropy.Tree.get(
data=stepwise_tree_str,
schema="newick",
taxon_namespace=containing_taxa)
frag_tree = dendropy.Tree.get(
data=frag_tree_str,
schema="newick",
taxon_namespace=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonNamespaceMapping.create_contained_taxon_mapping(
containing_taxon_namespace=containing_taxa,
num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
contained_taxon_namespace=genes_to_species.domain_taxon_namespace,
contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
gene_tree1 = treesim.contained_coalescent_tree(containing_tree=stepwise_tree,
gene_to_containing_taxon_map=genes_to_species,
default_pop_size=40000)
stepwise_tree.embed_tree(gene_tree1)
gene_tree2 = treesim.contained_coalescent_tree(containing_tree=frag_tree,
gene_to_containing_taxon_map=genes_to_species,
default_pop_size=40000)
frag_tree.embed_tree(gene_tree2)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
frag_out.write("%d\n" % frag_deep_coals[tree])

```

For more details on simulating contained coalescent trees and counting numbers of deep coalescences on them, see “Multispecies Coalescent (“Contained Coalescent” or “Censored Coalescent”) Trees” or “Simulating the Distribution of Number Deep Coalescences Under Different Phylogeographic History Scenarios”.