Collections of Trees

Collections of Trees: The TreeList Class

TreeList objects are collections of Tree objects constrained to share the same TaxonNamespace. Any Tree object added to a TreeList will have its taxon_namespace attribute assigned to the TaxonNamespace object of the TreeList, and all referenced Taxon objects will be mapped to the same or corresponding Taxon objects of this new TaxonNamespace, with new Taxon objects created if no suitable match is found. Objects of the TreeList class have an “annotations” attribute, which is an AnnotationSet object, i.e., a collection of Annotation instances tracking metadata. More information on working with metadata can be found in the “Working with Metadata Annotations” section.
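The remapping described above can be pictured with a small stand-alone model. The sketch below does not use DendroPy and is not how the library is implemented: ToyTaxon, ToyNamespace, ToyTree, and ToyTreeList are hypothetical stand-ins, and matching taxa by exact label is an assumption made purely for illustration.

```python
class ToyTaxon:
    """Hypothetical stand-in for a taxon: just a label."""
    def __init__(self, label):
        self.label = label


class ToyNamespace:
    """Hypothetical stand-in for a taxon namespace: one taxon per label."""
    def __init__(self):
        self._by_label = {}

    def require(self, label):
        # Return the existing taxon with this label, creating it if absent
        if label not in self._by_label:
            self._by_label[label] = ToyTaxon(label)
        return self._by_label[label]

    def __len__(self):
        return len(self._by_label)


class ToyTree:
    """Hypothetical stand-in for a tree: a namespace plus its leaf taxa."""
    def __init__(self, labels, namespace=None):
        self.taxon_namespace = namespace if namespace is not None else ToyNamespace()
        self.leaf_taxa = [self.taxon_namespace.require(lb) for lb in labels]


class ToyTreeList(list):
    """Hypothetical stand-in for a tree list that owns one namespace."""
    def __init__(self):
        super().__init__()
        self.taxon_namespace = ToyNamespace()

    def append(self, tree):
        # Migrate: rebind every taxon to the list's namespace, matching by label
        tree.leaf_taxa = [self.taxon_namespace.require(t.label) for t in tree.leaf_taxa]
        tree.taxon_namespace = self.taxon_namespace
        super().append(tree)


trees = ToyTreeList()
t1 = ToyTree(["A", "B", "C"])
t2 = ToyTree(["B", "C", "D"])   # starts with its own, separate namespace
trees.append(t1)
trees.append(t2)
```

After both appends, the two toy trees share a single namespace holding four taxa, and the "B" taxon referenced by t2 is the very same object as the one referenced by t1.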

Reading and Writing TreeList Instances

The TreeList class supports the “get” factory class method for simultaneously instantiating and populating TreeList instances. It takes a data source (given here with the “path” keyword argument) and a schema specification string (such as “nexus”, “newick”, or “nexml”):

import dendropy
treelist = dendropy.TreeList.get(path='pythonidae.mcmc.nex', schema='nexus')

The “read” instance method can be used to add trees from a data source to an existing TreeList instance:

import dendropy

trees = dendropy.TreeList()
trees.read(path="sometrees.nex", schema="nexus", tree_offset=10)
trees.read(data="(A,(B,C));((A,B),C);", schema="newick")

A TreeList object can be written to an external resource using the “write” method:

import dendropy
treelist = dendropy.TreeList.get(
    path="trees1.nex",
    schema="nexus",
    )
treelist.write(
    path="trees1.newick",
    schema="newick",
    )

It can also be represented as a string using the “as_string” method:

import dendropy
treelist = dendropy.TreeList.get(
    path="trees1.nex",
    schema="nexus",
    )
print(treelist.as_string(schema="newick"))

More information on reading and writing operations is available in the Reading and Writing Phylogenetic Data section.

Using and Managing the Collections of Trees

A TreeList behaves very much like a list, supporting iteration, indexing, slicing, removal, sorting, etc.:

import dendropy
from dendropy.calculate import treecompare

trees = dendropy.TreeList.get(
        path="pythonidae.random.bd0301.tre",
        schema="nexus")

for tree in trees:
    print(tree.as_string("newick"))

print(len(trees))

print(trees[4].as_string("nexus"))
print(treecompare.robinson_foulds_distance(trees[0], trees[1]))
print(treecompare.weighted_robinson_foulds_distance(trees[0], trees[1]))

first_10_trees = trees[:10]
last_10_trees = trees[-10:]

# Note that the TaxonNamespace is propagated to slices
assert first_10_trees.taxon_namespace is trees.taxon_namespace
assert last_10_trees.taxon_namespace is trees.taxon_namespace


print(id(trees[4]))
print(id(trees[5]))
trees[4] = trees[5]
print(id(trees[4]))
print(id(trees[5]))
print(trees[4] in trees)

trees.remove(trees[-1])
tx = trees.pop()
print(trees.index(trees[0]))

trees.sort(key=lambda t:t.label)
trees.reverse()
trees.clear()

The TreeList class supports the native Python list interface methods of adding individual Tree instances through append, extend, insert, and other methods, but with the added aspect of taxon namespace migration:

import dendropy
from dendropy.calculate import treecompare

trees = dendropy.TreeList.get(
        path="pythonidae.random.bd0301.tre",
        schema="nexus")

print(len(trees))

tree = dendropy.Tree.get(path="pythonidae.mle.nex", schema="nexus")

# As we did not specify a TaxonNamespace instance to use above, by default
# 'tree' will get its own, distinct TaxonNamespace
original_tree_taxon_namespace = tree.taxon_namespace
print(id(original_tree_taxon_namespace))
assert tree.taxon_namespace is not trees.taxon_namespace

# This operation adds the Tree, 'tree', to the TreeList, 'trees',
# *and* migrates the Taxon objects of the tree over to the TaxonNamespace
# of 'trees'. This will break things if the tree is contained in another
# TreeList with a different TaxonNamespace!
trees.append(tree)

# In contrast to before, the TaxonNamespace of 'tree' is now the same
# as the TaxonNamespace of 'trees'. The Taxon objects have been imported
# and/or remapped based on their label.
assert tree.taxon_namespace is trees.taxon_namespace
# The tree now reports the namespace of 'trees', not its original one
print(id(tree.taxon_namespace))

Cloning/Copying a TreeList

You can make a shallow copy of a TreeList by calling dendropy.datamodel.treecollectionmodel.TreeList.clone with a “depth” argument value of 0, or by slicing:

import dendropy

# original list
s1 = "(A,(B,C));(B,(A,C));(C,(A,B));"
treelist1 = dendropy.TreeList.get(
        data=s1,
        schema="newick")

# shallow copy by calling TreeList.clone(depth=0)
treelist2 = treelist1.clone(depth=0)

# shallow copy by slicing
treelist3 = treelist1[:]

# same tree instances are shared
for t1, t2 in zip(treelist1, treelist2):
    assert t1 is t2
for t1, t2 in zip(treelist1, treelist3):
    assert t1 is t2

# note: (necessarily) sharing same TaxonNamespace
assert treelist2.taxon_namespace is treelist1.taxon_namespace
assert treelist3.taxon_namespace is treelist1.taxon_namespace

With a shallow-copy, the actual Tree instances are shared between lists (as is the TaxonNamespace).

For a taxon namespace-scoped deep copy, on the other hand, i.e., one where the Tree instances are also cloned but the Taxon and TaxonNamespace references are preserved, you can call dendropy.datamodel.treecollectionmodel.TreeList.clone with a “depth” argument value of 1, or use copy construction:

import dendropy

# original list
s1 = "(A,(B,C));(B,(A,C));(C,(A,B));"
treelist1 = dendropy.TreeList.get(
        data=s1,
        schema="newick")

# taxon namespace-scoped deep copy by calling TreeList.clone(depth=1)
# I.e., everything cloned, but with Taxon and TaxonNamespace references shared
treelist2 = treelist1.clone(depth=1)

# taxon namespace-scoped deep copy by copy-construction
# I.e. Everything cloned, but with Taxon and TaxonNamespace references shared
treelist3 = dendropy.TreeList(treelist1)

# *different* tree instances
for t1, t2, t3 in zip(treelist1, treelist2, treelist3):
    assert t1 is not t2
    assert t1 is not t3
    assert t2 is not t3

# Note: TaxonNamespace is still shared
# I.e. Everything cloned, but with Taxon and TaxonNamespace references shared
assert treelist2.taxon_namespace is treelist1.taxon_namespace
assert treelist3.taxon_namespace is treelist1.taxon_namespace

Finally, for a true and complete deep-copy, where even the Taxon and TaxonNamespace references are copied, call copy.deepcopy:

import copy
import dendropy

# original list
s1 = "(A,(B,C));(B,(A,C));(C,(A,B));"
treelist1 = dendropy.TreeList.get(
        data=s1,
        schema="newick")

# Full deep copy by calling copy.deepcopy()
# I.e. Everything cloned including Taxon and TaxonNamespace instances
treelist2 = copy.deepcopy(treelist1)

# *different* tree instances
for t1, t2 in zip(treelist1, treelist2):
    assert t1 is not t2

# Note: TaxonNamespace is also different
assert treelist2.taxon_namespace is not treelist1.taxon_namespace
for tx1 in treelist1.taxon_namespace:
    assert tx1 not in treelist2.taxon_namespace
for tx2 in treelist2.taxon_namespace:
    assert tx2 not in treelist1.taxon_namespace

Efficiently Iterating Over Trees in a File

If you need to process a collection of trees defined in a file source, you can, of course, read the trees into a TreeList object and iterate over the resulting collection:

import dendropy
trees = dendropy.TreeList.get(path='pythonidae.beast-mcmc.trees', schema='nexus')
for tree in trees:
    print(tree.as_string('newick'))

In the above, the entire data source is parsed and stored in the trees object before being processed in the subsequent lines. In some cases, you might not need to keep all the trees in memory at the same time. For example, you might be interested in calculating the distribution of a statistic over a collection of trees, but have no need to refer to any of the trees after the statistic has been calculated. In this case, it will be more efficient to use the Tree.yield_from_files function. This takes a list or any other iterable of file-like objects or strings (giving filepaths) as the first argument (“files”) and a mandatory schema specification string as the second argument (“schema”). Additional keyword arguments to customize the parsing are the same as those for the general “get” and “read” methods. For example, the following script reads a model tree from a file, and then iterates over a collection of MCMC trees in a set of files, calculating and storing the symmetric distance between the model tree and each of the MCMC trees one at a time:

#! /usr/bin/env python

import dendropy
from dendropy.calculate import treecompare

distances = []
taxa = dendropy.TaxonNamespace()
mle_tree = dendropy.Tree.get(
    path='pythonidae.mle.nex',
    schema='nexus',
    taxon_namespace=taxa)
burnin = 20
source_files = [
        open("pythonidae.mcmc1.nex", "r"), # Note: for 'Tree.yield_from_files',
        open("pythonidae.mcmc2.nex", "r"), # sources can be specified as file
        "pythonidae.mcmc3.nex",            # objects or strings, with strings
        "pythonidae.mcmc4.nex",            # assumed to specify file paths
        ]
tree_yielder = dendropy.Tree.yield_from_files(
        files=source_files,
        schema='nexus',
        taxon_namespace=taxa,
        )
for tree_idx, mcmc_tree in enumerate(tree_yielder):
    if tree_idx < burnin:
        # skip burnin
        continue
    distances.append(treecompare.symmetric_difference(mle_tree, mcmc_tree))
print("Mean symmetric distance between MLE and MCMC trees: %.2f"
        % (float(sum(distances)) / len(distances)))

Note how a TaxonNamespace object is created and passed to both the get and the yield_from_files functions using the taxon_namespace keyword argument. This ensures that corresponding taxa in both sources are mapped to the same Taxon objects in DendroPy object space, which is what makes the trees comparable. If this were not done, each tree would have its own distinct TaxonNamespace object (and associated Taxon objects), making comparisons impossible.
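One point to watch in the script above: the burn-in counter runs over the combined stream of trees, so, assuming the files are read one after another, only the first trees of the first file are discarded. If each file holds an independent MCMC run, one option is to skip the burn-in per source. The following is a minimal sketch using only the standard library; skip_burnin_per_source is a hypothetical helper, and the integer ranges stand in for per-file tree iterators (for example, one Tree.yield_from_files call per file).

```python
import itertools


def skip_burnin_per_source(sources, burnin):
    """Yield items from each iterable in turn, dropping the first `burnin` of each."""
    for source in sources:
        # islice(source, burnin, None) lazily skips `burnin` items,
        # then passes the remaining items through unchanged
        for item in itertools.islice(source, burnin, None):
            yield item


# Stand-in data: two "runs" of five samples each (integers instead of trees)
run1 = iter(range(0, 5))
run2 = iter(range(100, 105))
kept = list(skip_burnin_per_source([run1, run2], burnin=2))
print(kept)
```

Because the helper is itself a generator, it keeps the one-item-at-a-time behavior of the underlying iterators.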

When the number of trees is large, or the trees themselves are large, or both, iterating over trees in files using yield_from_files is almost always going to give the best performance, sometimes by orders of magnitude. This is because only one tree needs to be held in memory at a time, which keeps the Python virtual machine from slowing down under heavy memory use.
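When only summary statistics are needed, even the list of per-tree values can be avoided by folding each value into a running accumulator as it is produced. The sketch below is one way to do this with Welford's online method; RunningStats and fake_distances are hypothetical names, and the hard-coded numbers stand in for distances that would be computed from each yielded tree.

```python
import math


class RunningStats:
    """Accumulate count, mean, and variance one value at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two values
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def fake_distances():
    # Hypothetical stand-in for distances computed from a tree iterator
    for d in [4, 6, 8, 10]:
        yield d


stats = RunningStats()
for d in fake_distances():
    stats.add(d)
print(stats.n, stats.mean, math.sqrt(stats.variance))
```

In a real script, the loop body would call stats.add on the distance computed for each tree, so that neither the trees nor the individual distances need to be retained.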