# Taxon Namespaces and Taxon Management

## Conceptual Background

Many elements of phylogenetic and, more generally, evo- or bioinformatic data are associated with some element in the real world. For example, a leaf node on a tree or a sequence in a character matrix is typically associated with an individual biological organism, a population or deme, a species or higher-level taxonomic group, and so on. In classical phylogenetic literature, this referent is termed an “operational taxonomic unit” or OTU. In DendroPy, we use the term “taxon”. Regardless of whether the referent of the data represents an individual organism (or even a less distinct subunit, e.g., a fragment from a shotgun assay) or an actual taxonomic group, we apply the term “taxon”. We assign a (string) label to the concept of the entity (individual, sub-individual, or group) represented by a “taxon”, which allows us to relate different elements of data to the same or different real-world referent. These collections of labels representing taxon concepts are organized into “taxon namespaces”.

The concept of a “taxon namespace” is fundamental to managing data in DendroPy. A taxon namespace represents a self-contained universe of names that map to operational taxonomic unit concepts, which are essentially names for groups of organisms in the “real world”. A taxon namespace is a self-contained and functionally complete collection of mutually distinct operational taxonomic unit concepts, and it provides the semantic context in which operational taxonomic units from data sources of different formats and provenances can be related through correct interpretation of their taxon labels.

## Management of Shared Taxon Namespaces

Operational taxonomic units in DendroPy are represented by Taxon objects, and distinct collections of operational taxonomic units are represented by TaxonNamespace objects. Two distinct Taxon objects are considered distinct entities, even if they share the same label. Understanding this is crucial to understanding management of data in DendroPy. Many operations in DendroPy are based on the identity of the Taxon objects (e.g., counting of splits on trees). Many errors by novices using DendroPy come from inadvertently creating and using multiple Taxon objects to refer to the same taxon concept.
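
The difference between sharing a label and being the same object can be shown in plain Python. The following is a minimal sketch that does not use DendroPy at all: `ToyTaxon` is a hypothetical stand-in invented for this illustration, and the dictionary simply mimics any bookkeeping that is keyed on object identity.

```python
class ToyTaxon:
    """Hypothetical stand-in for a taxon object; not DendroPy's Taxon class."""

    def __init__(self, label):
        self.label = label


first = ToyTaxon("Python regius")
second = ToyTaxon("Python regius")

same_label = first.label == second.label  # the labels are equal strings
same_object = first is second             # but these are two separate objects

# Bookkeeping keyed on the objects themselves sees two different taxa,
# because the default hash and equality of a Python object use identity.
leaf_count = {}
for taxon in (first, second, first):
    leaf_count[taxon] = leaf_count.get(taxon, 0) + 1

print(same_label, same_object, len(leaf_count))
```

Any tally that should treat both objects as one taxon would therefore be split in two, which is the kind of error described above.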

Every time a definition of taxa is encountered in a data source, for example, a “TAXA” block in a NEXUS file, a new TaxonNamespace object is created and populated with Taxon objects corresponding to the taxa defined in the data source. Some data formats do not have explicit definition of taxa, e.g. a Newick tree source. These nonetheless can be considered to have an implicit definition of a collection of operational taxonomic units given by the aggregate of all operational taxonomic units referenced in the data (i.e., the set of all distinct labels on trees in a Newick file).
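
To make the idea of an implicit taxon definition concrete, here is a rough sketch that collects the distinct leaf labels from a few Newick strings. It is not how DendroPy parses Newick: the function name `implicit_taxon_labels` is made up for this example, and the regular expression assumes simple, unquoted labels with no comments.

```python
import re

# An unquoted label that directly follows "(" or "," is a leaf label.
# Labels that follow ")" belong to internal nodes and are skipped.
_LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def implicit_taxon_labels(newick_strings):
    """Return the distinct leaf labels across all trees, in first-seen order."""
    seen = {}
    for newick in newick_strings:
        for label in _LEAF.findall(newick):
            seen.setdefault(label, None)
    return list(seen)


labels = implicit_taxon_labels(["((A,B),C);", "((A,C):0.5,D:1.2);"])
print(labels)
```

Here the two trees together imply a collection of four operational taxonomic units, even though neither tree lists them explicitly.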

Every time a reference to a taxon is encountered in a data source, such as a taxon label in a tree or matrix statement in a NEXUS file, the current TaxonNamespace object is searched for a corresponding Taxon object with a matching label (see below for details on how the match is made). If found, the Taxon object is used to represent the taxon. If not, a new Taxon object is created, added to the TaxonNamespace object, and used to represent the taxon.
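
The search-then-create behavior described above can be sketched with a small dictionary-backed class. This is an illustration only: `ToyTaxon`, `ToyNamespace`, and its `require` method are hypothetical names, and the sketch matches labels exactly rather than reproducing DendroPy's matching rules.

```python
class ToyTaxon:
    """Hypothetical stand-in for a taxon object."""

    def __init__(self, label):
        self.label = label


class ToyNamespace:
    """Hypothetical sketch of a namespace; not DendroPy's TaxonNamespace."""

    def __init__(self):
        self._by_label = {}

    def require(self, label):
        # Reuse the object already registered under this label,
        # or create and register a new one if there is none.
        taxon = self._by_label.get(label)
        if taxon is None:
            taxon = ToyTaxon(label)
            self._by_label[label] = taxon
        return taxon

    def __len__(self):
        return len(self._by_label)


shared = ToyNamespace()
first_read = [shared.require(label) for label in ("A", "B", "C")]
second_read = [shared.require(label) for label in ("A", "B", "C")]

# Same namespace: the second pass resolves to the objects made by the first.
shared_ok = all(t1 is t2 for t1, t2 in zip(first_read, second_read))

# A separate namespace yields unrelated objects for the same labels.
other = ToyNamespace()
other_read = [other.require(label) for label in ("A", "B", "C")]
separate_ok = any(t1 is t2 for t1, t2 in zip(first_read, other_read))

print(len(shared), shared_ok, separate_ok)
```

The point of the design is that identity is decided by the namespace, not by the label alone: the same label resolved through two namespaces gives two objects.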

When multiple data sources are read into a TreeList or TreeArray, the TaxonNamespace instance associated with the collection through its taxon_namespace attribute is always used to manage the Taxon objects, so labels are correctly associated with Taxon objects across multiple reads. So, for example, the following:

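import dendropy

tree_str1 = "((A,B),C);"

tree_list = dendropy.TreeList()
# Both reads go through the TreeList's own TaxonNamespace
tree_list.read(data=tree_str1, schema="newick")
print(tree_list.taxon_namespace)
tree_list.read(data=tree_str1, schema="newick")
print(tree_list.taxon_namespace)
for nd1, nd2 in zip(tree_list[0], tree_list[1]):
    assert nd1.taxon is nd2.taxon # OK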



results in:

['A', 'B', 'C']
['A', 'B', 'C']


Note how the total number of taxa is three, and there is full correspondence between the taxa. That is, the taxa referenced by “A”, “B”, and “C” in the second read operation were correctly mapped to the taxa from the first read operation.

With DataSet instances, however, each independent read operation will, by default, be managed under a new (i.e., independent and different) TaxonNamespace.

import dendropy
from dendropy.calculate import treecompare

tree_str1 = "((A,B),C);"
tree_str2 = "((A,B),C);"

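ds = dendropy.DataSet()
# Each read is managed under a new, independent TaxonNamespace
ds.read(data=tree_str1, schema="newick")
ds.read(data=tree_str2, schema="newick")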

print(len(ds.taxon_namespaces))
print(ds.tree_lists[0].taxon_namespace is ds.tree_lists[1].taxon_namespace)
print(ds.tree_lists[0].taxon_namespace[0] is ds.tree_lists[1].taxon_namespace[0])

# Results in:
# 2
# False
# False


So, if data read from multiple sources using a DataSet instance should all be managed under the same taxon namespace, then the TaxonNamespace instance to use should be explicitly passed in using the “taxon_namespace” keyword argument:

import dendropy
from dendropy.calculate import treecompare

tree_str1 = "((A,B),C);"
tree_str2 = "((A,B),C);"

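ds2 = dendropy.DataSet()
ds2.read(data=tree_str1, schema="newick")
# Reuse the namespace created by the first read
ds2.read(
        data=tree_str1,
        schema="newick",
        taxon_namespace=ds2.tree_lists[0].taxon_namespace)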

print(len(ds2.taxon_namespaces))
print(ds2.tree_lists[0].taxon_namespace is ds2.tree_lists[1].taxon_namespace)
print(ds2.tree_lists[0].taxon_namespace[0] is ds2.tree_lists[1].taxon_namespace[0])

# Results in:
# 1
# True
# True

# Alternatively, create the namespace first and pass it to every read
ds2 = dendropy.DataSet()
tns = ds2.new_taxon_namespace()
ds2.read(
        data=tree_str1,
        schema="newick",
        taxon_namespace=tns)
ds2.read(
        data=tree_str1,
        schema="newick",
        taxon_namespace=tns)

print(len(ds2.taxon_namespaces))
print(ds2.tree_lists[0].taxon_namespace is ds2.tree_lists[1].taxon_namespace)
print(ds2.tree_lists[0].taxon_namespace[0] is ds2.tree_lists[1].taxon_namespace[0])

# Results in:
# 1
# True
# True



While each TreeList manages all its member Tree objects under the same TaxonNamespace reference, if two different TreeList instances have different TaxonNamespace references, then the Taxon objects read/managed by them will be necessarily different from each other, even if the labels are the same.

import dendropy
from dendropy.calculate import treecompare

tree_str1 = "((A,B),C);"

tree_list1 = dendropy.TreeList()
tree_list1.read(data=tree_str1, schema="newick")
tree_list2 = dendropy.TreeList()
tree_list2.read(data=tree_str1, schema="newick")

for taxon in tree_list1.taxon_namespace:
    if taxon in tree_list2.taxon_namespace:
        # this branch is never visited
        print("Taxon '{}': found in both trees".format(taxon.label))

## Following will result in:
## dendropy.utility.error.TaxonNamespaceIdentityError: Non-identical taxon namespace references: ...
# print(treecompare.symmetric_difference(tree_list1[0], tree_list2[0]))


Again, this can be addressed by ensuring that the TaxonNamespace reference is the same for TreeList instances that need to interact:

import dendropy
from dendropy.calculate import treecompare

tree_str1 = "((A,B),C);"

tree_list1 = dendropy.TreeList()
tree_list1.read(data=tree_str1, schema="newick")
tree_list2 = dendropy.TreeList(taxon_namespace=tree_list1.taxon_namespace)
tree_list2.read(data=tree_str1, schema="newick")

# Results in: 0
print(treecompare.symmetric_difference(tree_list1[0], tree_list2[0]))


The same obtains for Tree and CharacterMatrix-derived instances: if the associated TaxonNamespace references are different, then the associated Taxon objects will be different, even if the labels are the same. This will make comparison or any operation between them impossible:

import dendropy

tree_str1 = "((A,B),C);"

tree1 = dendropy.Tree.get(data=tree_str1, schema="newick")
tree2 = dendropy.Tree.get(data=tree_str1, schema="newick")
print(tree1.taxon_namespace is tree2.taxon_namespace) # False
for nd1, nd2 in zip(tree1, tree2):
    assert nd1.taxon is nd2.taxon # AssertionError



So, if taxa are shared, then the TaxonNamespace to use should be passed in explicitly to ensure that each Tree or CharacterMatrix-derived instance also shares the same TaxonNamespace:

import dendropy

tree_str1 = "((A,B),C);"

tree1 = dendropy.Tree.get(data=tree_str1, schema="newick")
tree2 = dendropy.Tree.get(
        data=tree_str1,
        schema="newick",
        taxon_namespace=tree1.taxon_namespace)
print(tree1.taxon_namespace is tree2.taxon_namespace) # True
for nd1, nd2 in zip(tree1, tree2):
    assert nd1.taxon is nd2.taxon # OK



## Managing Taxon Name Mapping Within a Taxon Namespace

DendroPy maps taxon definitions encountered in a data source to Taxon objects by the taxon label. The labels have to match exactly for the taxa to be correctly mapped. By default, this matching is case-insensitive, though case-sensitivity can be set by specifying “case_sensitive_taxon_labels=True”.
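
As a rough illustration of the matching rule just described, the sketch below looks up a label with and without case sensitivity. The helper `find_label` and its `case_sensitive` parameter are hypothetical names chosen for this example, and folding case with `str.casefold()` is an assumption about how such a comparison could be done, not a description of DendroPy's internals.

```python
def find_label(known_labels, query, case_sensitive=False):
    """Return the stored label that matches ``query``, or None if none does."""
    if case_sensitive:
        return query if query in known_labels else None
    folded = query.casefold()
    for label in known_labels:
        # Apart from letter case, the labels must still match exactly.
        if label.casefold() == folded:
            return label
    return None


known = ["Python regius", "Python sebae"]

loose = find_label(known, "python REGIUS")
strict = find_label(known, "python REGIUS", case_sensitive=True)
extra_space = find_label(known, "Python  regius")  # two spaces: no match

print(loose, strict, extra_space)
```

Note that ignoring case does not relax anything else: a label with different spacing or punctuation is still a different label.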

Some quirks may arise from schema-specific idiosyncrasies. For example, the NEXUS standard dictates that an underscore (“_”) in an unquoted label is to be read as a space. Thus, when reading a NEXUS or Newick source, the taxon labels “Python_regius” (unquoted) and “Python regius” are exactly equivalent, and will be mapped to the same Taxon object.
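
The effect of this convention on labels can be sketched with two small helper functions. These are simplified illustrations with hypothetical names (`nexus_label`, `fasta_label`), not DendroPy's parsers; in particular, treating a FASTA label as the whole header line after ">" is an assumption made for the example.

```python
def nexus_label(token):
    """Interpret one NEXUS/Newick label token (simplified sketch)."""
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        # Quoted token: drop the quotes and keep underscores as they are.
        # A doubled quote inside a quoted token stands for a single quote.
        return token[1:-1].replace("''", "'")
    # Unquoted token: underscores stand for spaces.
    return token.replace("_", " ")


def fasta_label(header_line):
    """Take a FASTA header as the label, minus the leading '>'."""
    return header_line[1:].strip()


unquoted = nexus_label("Python_regius")
quoted = nexus_label("'Python_regius'")
from_fasta = fasta_label(">Python_regius")

print(repr(unquoted), repr(quoted), repr(from_fasta))
print(unquoted == from_fasta, quoted == from_fasta)
```

Under these rules the same characters in two files can yield two different labels, depending only on the format they were read from and on whether the NEXUS token was quoted.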

However, this underscore-to-space mapping does not take place when reading, for example, a FASTA schema file. Here, underscores are preserved, and thus “Python_regius” does not map to “Python regius”. This means that if you were to read a NEXUS file with the taxon label “Python_regius”, and later read a FASTA file with the same taxon label, i.e., “Python_regius”, these would map to different taxa! This is illustrated by the following:

#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
dimensions ntax=2;
taxlabels Python_regius Python_sebae;
end;

begin characters;
dimensions nchar=5;
format datatype=dna gap=- missing=? matchchar=.;
matrix
Python_regius ACGTA
Python_sebae   ACGTA
;
end;
"""

fasta1 = """
>Python_regius
AAAA
>Python_sebae
ACGT
"""

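d = dendropy.DataSet()
tns = d.new_taxon_namespace()
d.attach_taxon_namespace(tns)
# Read both sources into the same attached namespace
d.read(data=nexus1, schema="nexus")
d.read(data=fasta1, schema="fasta", data_type="dna")
print(d.taxon_namespaces[0].description(2))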


Which produces the following, almost certainly incorrect, result:

TaxonNamespace object at 0x43b4e0 (TaxonNamespace4437216): 4 Taxa
[0] Taxon object at 0x22867b0 (Taxon36202416): 'Python regius'
[1] Taxon object at 0x2286810 (Taxon36202512): 'Python sebae'
[2] Taxon object at 0x22867d0 (Taxon36202448): 'Python_regius'
[3] Taxon object at 0x2286830 (Taxon36202544): 'Python_sebae'


Even more confusingly, if this data set is written out in NEXUS schema, the space/underscore substitution would take place again, resulting in two pairs of taxa with the same labels.

If you plan on mixing sources from different formats, it is important to keep in mind the space/underscore substitution that takes place by default with NEXUS/Newick formats, but does not take place with other formats.
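
One way to catch this problem early is to check a collection of labels for entries that differ only in their use of spaces versus underscores. The helper below is a hypothetical sketch written for this purpose (the name `underscore_space_conflicts` is not part of DendroPy) and works on plain strings.

```python
from collections import defaultdict


def underscore_space_conflicts(labels):
    """Group labels that differ only in using '_' versus ' '."""
    groups = defaultdict(set)
    for label in labels:
        # Labels that collapse to the same key are variants of one name.
        groups[label.replace("_", " ")].add(label)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


labels = ["Python regius", "Python sebae", "Python_regius", "Python_sebae"]
conflicts = underscore_space_conflicts(labels)

for key, variants in conflicts.items():
    print(key, "->", variants)
```

With DendroPy, the input list could presumably be gathered from the `label` attribute of each Taxon in a namespace after all sources have been read; an empty result would then suggest that no taxon was duplicated through this particular substitution.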

You could simply avoid underscores and use only spaces instead:

#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
dimensions ntax=2;
taxlabels 'Python regius' 'Python sebae';
end;

begin characters;
dimensions nchar=5;
format datatype=dna gap=- missing=? matchchar=.;
matrix
'Python regius' ACGTA
'Python sebae'   ACGTA
;
end;
"""

fasta1 = """
>Python regius
AAAA
>Python sebae
ACGT
"""

d = dendropy.DataSet()
tns = d.new_taxon_namespace()
d.attach_taxon_namespace(tns)
d.read(data=nexus1, schema="nexus")
d.read(data=fasta1, schema="fasta", data_type="dna")
print(d.taxon_namespaces[0].description(2))


Which results in:

TaxonNamespace object at 0x43b4e0 (TaxonNamespace4437216): 2 Taxa
[0] Taxon object at 0x22867b0 (Taxon36202416): 'Python regius'
[1] Taxon object at 0x2286810 (Taxon36202512): 'Python sebae'


Or use underscores in the NEXUS-formatted data, but spaces in the non-NEXUS data:

#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
dimensions ntax=2;
taxlabels Python_regius Python_sebae;
end;

begin characters;
dimensions nchar=5;
format datatype=dna gap=- missing=? matchchar=.;
matrix
Python_regius ACGTA
Python_sebae   ACGTA
;
end;
"""

fasta1 = """
>Python regius
AAAA
>Python sebae
ACGT
"""

d = dendropy.DataSet()
tns = d.new_taxon_namespace()
d.attach_taxon_namespace(tns)
d.read(data=nexus1, schema="nexus")
d.read(data=fasta1, schema="fasta", data_type="dna")
print(d.taxon_namespaces[0].description(2))


Which results in the same as the preceding example:

TaxonNamespace object at 0x43b4e0 (TaxonNamespace4437216): 2 Taxa
[0] Taxon object at 0x22867b0 (Taxon36202416): 'Python regius'
[1] Taxon object at 0x2286810 (Taxon36202512): 'Python sebae'


You can also wrap the underscore-bearing labels in the NEXUS/Newick source in quotes, which protects the underscores from being converted to spaces:

#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
dimensions ntax=2;
taxlabels 'Python_regius' 'Python_sebae';
end;

begin characters;
dimensions nchar=5;
format datatype=dna gap=- missing=? matchchar=.;
matrix
'Python_regius' ACGTA
'Python_sebae'   ACGTA
;
end;
"""

fasta1 = """
>Python_regius
AAAA
>Python_sebae
ACGT
"""

d = dendropy.DataSet()
tns = d.new_taxon_namespace()
d.attach_taxon_namespace(tns)
d.read(data=nexus1, schema="nexus")
d.read(data=fasta1, schema="fasta", data_type="dna")
print(d.taxon_namespaces[0].description(2))


Which will result in:

TaxonNamespace object at 0x43c780 (TaxonNamespace4441984): 2 Taxa
[0] Taxon object at 0x2386770 (Taxon37250928): 'Python_regius'
[1] Taxon object at 0x2386790 (Taxon37250960): 'Python_sebae'


Finally, you can also override the default behavior of DendroPy’s NEXUS/Newick parser by passing the keyword argument preserve_underscores=True to any “read()” or “get()” method (or the older “read_from_*()” and “get_from_*()” forms). For example:

#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
dimensions ntax=2;
taxlabels Python_regius Python_sebae;
end;

begin characters;
dimensions nchar=5;
format datatype=dna gap=- missing=? matchchar=.;
matrix
Python_regius ACGTA
Python_sebae   ACGTA
;
end;
"""

fasta1 = """
>Python_regius
AAAA
>Python_sebae
ACGT
"""

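d = dendropy.DataSet()
tns = d.new_taxon_namespace()
d.attach_taxon_namespace(tns)
# Keep underscores in unquoted NEXUS labels instead of reading them as spaces
d.read(data=nexus1, schema="nexus", preserve_underscores=True)
d.read(data=fasta1, schema="fasta", data_type="dna")
print(d.taxon_namespaces[0].description(2))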



This results in:

TaxonNamespace object at 0x43c780 (TaxonNamespace4441984): 2 Taxa
[0] Taxon object at 0x2386770 (Taxon37250928): 'Python_regius'
[1] Taxon object at 0x2386790 (Taxon37250960): 'Python_sebae'


This may seem the simplest solution, insofar as you need not maintain lexically different taxon labels across files of different formats. There is a gotcha, however: when writing to NEXUS/Newick schema, any label with underscores will be automatically quoted to preserve the underscores (again, as dictated by the NEXUS standard). This means that (a) your output file will have quotes and, as a result, (b) the underscores in the labels will be “hard” underscores if the file is read by PAUP* or DendroPy. So, for example, continuing from the previous example, the NEXUS-formatted output would look like:

>>> print(d.as_string('nexus'))
#NEXUS

BEGIN TAXA;
TITLE TaxonNamespace5736800;
DIMENSIONS NTAX=2;
TAXLABELS
'Python_regius'
'Python_sebae'
;
END;

BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37505040;
DIMENSIONS  NCHAR=5;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
'Python_regius'    ACGTA
'Python_sebae'      ACGTA
;
END;

BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37504848;
DIMENSIONS  NCHAR=4;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
'Python_regius'    AAAA
'Python_sebae'      ACGT
;
END;


Note that the taxon labels have changed semantically between the input and the NEXUS output, as, according to the NEXUS standard, “Python_regius”, while equivalent to “Python regius”, is not equivalent to “‘Python_regius’”. To control this, you can pass the keyword argument quote_underscores=False to any write_to_* or as_string method, which will omit the quotes even if the labels contain underscores:

>>> print(d.as_string('nexus', quote_underscores=False))
#NEXUS

BEGIN TAXA;
TITLE TaxonNamespace5736800;
DIMENSIONS NTAX=2;
TAXLABELS
Python_regius
Python_sebae
;
END;

BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37505040;
DIMENSIONS  NCHAR=5;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
Python_regius    ACGTA
Python_sebae      ACGTA
;
END;

BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37504848;