Data Sets

DataSet objects let you manage multiple types of phylogenetic data together.

It has three primary attributes:

taxon_namespaces
A list of all TaxonNamespace objects in the DataSet, in the order that they were added or read, including TaxonNamespace objects added implicitly through being associated with added TreeList or CharacterMatrix objects.
tree_lists
A list of all TreeList objects in the DataSet, in the order that they were added or read.
char_matrices
A list of all CharacterMatrix objects in the DataSet, in the order that they were added or read.

DataSet Creation and Reading

Reading and Writing DataSet Objects

You can use the get factory class method to instantiate and populate a DataSet object in one step. It takes a data source as the first argument and a schema specification string (“nexus”, “newick”, “nexml”, “fasta”, “phylip”, etc.) as the second:

>>> import dendropy
>>> ds = dendropy.DataSet.get(
    path='pythonidae.nex',
    schema='nexus')

The read instance method, which reads additional data into an existing object, is also supported. It takes the same arguments (i.e., a data source, a schema specification string, and optional keyword arguments to customize the parse behavior):

import dendropy

# Create the DataSet to store data
ds = dendropy.DataSet()

# Set it up to manage all data under a single taxon namespace.
# HIGHLY RECOMMENDED!
taxon_namespace = dendropy.TaxonNamespace()
ds.attach_taxon_namespace(taxon_namespace)

# Read from multiple sources

# Add a collection of trees
ds.read(
    path='pythonidae.mle.nex',
    schema='nexus',)

# Add a collection of characters from a Nexus source
ds.read(
    path='pythonidae.chars.nexus',
    schema='nexus',)

# Add a collection of characters from a FASTA source
# Note that with this format, we have to explicitly provide the type of data
ds.read(
    path='pythonidae_cytb.fasta',
    schema='fasta',
    data_type="dna")

# Add a collection of characters from a PHYLIP source
# Note that with this format, we have to explicitly provide the type of data
ds.read(
    path='pythonidae.chars.phylip',
    schema='phylip',
    data_type="dna")

# Add a collection of continuous characters from a NeXML source
ds.read(
    path='pythonidae_continuous.chars.nexml',
    schema='nexml',)



Note

Note how the attach_taxon_namespace method is called before invoking any “read” statements, to ensure that all the taxon references in the data sources get mapped to the same TaxonNamespace instance. It is HIGHLY recommended that you do this, i.e., manage all data with the same DataSet instance under the same taxonomic namespace, unless you have a special reason to include multiple independent taxon “domains” in the same data set.

The “write” method allows you to write the data of a DataSet to a file-like object or a file path. The following example aggregates the post-burn-in MCMC samples from a series of NEXUS-formatted tree files into a single TreeList, then adds the TreeList and the original character data to a single DataSet object, which is then written out as a NEXUS-formatted file:

import dendropy
taxa = dendropy.TaxonNamespace()
trees = dendropy.TreeList(taxon_namespace=taxa)
trees.read(path='pythonidae.mb.run1.t', schema='nexus', tree_offset=10)
trees.read(path='pythonidae.mb.run2.t', schema='nexus', tree_offset=10)
trees.read(path='pythonidae.mb.run3.t', schema='nexus', tree_offset=10)
trees.read(path='pythonidae.mb.run4.t', schema='nexus', tree_offset=10)
ds = dendropy.DataSet([trees])
ds.read(path='pythonidae_cytb.fasta',
        schema='fasta',
        data_type='dna',
        )
ds.write(path='pythonidae_combined.nex', schema='nexus')

If you do not want to actually write to a file, but instead simply need a string representing the data in a particular format, you can call the instance method as_string, passing a schema specification string as the first argument:

import dendropy
ds = dendropy.DataSet()
ds.read(path='pythonidae.cytb.fasta', schema='fasta', data_type='dna')
s = ds.as_string('nexus')

or:

dna1 = dendropy.DataSet.get(file=open("pythonidae.nex"), schema="nexus")
s = dna1.as_string(schema="fasta")
print(s)

In addition, fine-grained control over the reading and writing of data is available through various keyword arguments. More information on reading operations is available in the Reading and Writing Phylogenetic Data section.

Creating a New DataSet from Existing TreeList and CharacterMatrix Objects

You can add independently created or parsed data objects to a DataSet by passing them to the constructor as a list:

import dendropy
treelist1 = dendropy.TreeList.get(
        path='pythonidae.mle.nex',
        schema='nexus')
cytb = dendropy.DnaCharacterMatrix.get(
    path='pythonidae_cytb.fasta',
    schema='fasta')
ds = dendropy.DataSet([cytb, treelist1])
ds.unify_taxon_namespaces()

Note how we call the instance method unify_taxon_namespaces after the creation of the DataSet object. This method will remove all existing TaxonNamespace objects from the DataSet, create and add a new one, and then map all taxon references in all contained TreeList and CharacterMatrix objects to this new, unified TaxonNamespace.

Adding Data to an Existing DataSet

You can add independently created or parsed data objects to a DataSet using the add method, as in the example script /examples/ds4.py.

Here, again, we call unify_taxon_namespaces to map all taxon references to the same, common, unified TaxonNamespace.

Taxon Management with Data Sets

The DataSet object, representing a meta-collection of phylogenetic data, differs in one important way from all the other phylogenetic data objects discussed so far with respect to taxon management, in that it is not associated with any particular TaxonNamespace object. Rather, it maintains a list (in the property taxon_namespaces) of all the TaxonNamespace objects referenced by its contained TreeList objects (in the property tree_lists) and CharacterMatrix objects (in the property char_matrices).

With respect to taxon management, DataSet objects operate in one of two modes: “detached taxon set” mode and “attached taxon set” mode.

Detached (Multiple) Taxon Set Mode

In the “detached taxon set” mode, which is the default, a DataSet object tracks all TaxonNamespace references of its data members in the property taxon_namespaces, but no effort is made at taxon management as such. Thus, every time a data source is read with a “detached taxon set” mode DataSet object, by default, a new TaxonNamespace object will be created and associated with the Tree, TreeList, or CharacterMatrix objects created from that data source, resulting in multiple independent TaxonNamespace references. As such, “detached taxon set” mode DataSet objects are suitable for handling data with multiple distinct sets of taxa.

For example:

>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read(path="primates.nex", schema="nexus")
>>> ds.read(path="snakes.nex", schema="nexus")

The dataset, ds, will now contain two distinct TaxonNamespace objects, one for the taxa defined in “primates.nex”, and the other for the taxa defined in “snakes.nex”. In this case, this behavior is correct, as the two files do indeed refer to different sets of taxa.

However, consider the following:

>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus")
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus")
>>> ds.read(path="pythonidae.mle.tre", schema="nexus")

Here, even though all the data files refer to the same set of taxa, the resulting DataSet object will actually have 4 distinct TaxonNamespace objects, one for each of the independent reads, and a taxon with a particular label in the first file (e.g., “Python regius” of “pythonidae_cytb.fasta”) will map to a completely distinct Taxon object from a taxon with the same label in the second file (e.g., “Python regius” of “pythonidae_aa.nex”). This is incorrect behavior, and to achieve the correct behavior with a multiple taxon set mode DataSet object, we need to explicitly pass a TaxonNamespace object to each of the subsequent read statements:

>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus", taxon_namespace=ds.taxon_namespaces[0])
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus", taxon_namespace=ds.taxon_namespaces[0])
>>> ds.read(path="pythonidae.mle.tre", schema="nexus", taxon_namespace=ds.taxon_namespaces[0])
>>> ds.write(path="pythonidae_combined.nex", schema="nexus")

In the previous example, the first read statement results in a new TaxonNamespace object, which is added to the taxon_namespaces property of the DataSet object ds. This TaxonNamespace object gets passed via the taxon_namespace keyword to subsequent read statements, and thus as each of the data sources is processed, the taxon references get mapped to Taxon objects in the same, single, TaxonNamespace object.

While this approach works to ensure correct taxon mapping across multiple data object reads and instantiation, in this context, it is probably more convenient to use the DataSet in “attached taxon set” mode. In fact, it is highly recommended that DataSet instances always use the “attached taxon set” mode, as conceptually there are very few cases where a collection of data should span multiple independent taxon namespaces.

Attached (Single) Taxon Set Mode

In the “attached taxon set” mode, DataSet objects ensure that the taxon references of all data objects that are added to them are mapped to the same TaxonNamespace object (rather than to a new one for each independent read or creation operation). The “attached taxon set” mode is activated by calling the attach_taxon_namespace method on a DataSet and passing in the TaxonNamespace to use:

>>> import dendropy
>>> ds = dendropy.DataSet()
>>> taxa = dendropy.TaxonNamespace(label="global")
>>> ds.attach_taxon_namespace(taxa)
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus")
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus")
>>> ds.read(path="pythonidae.mle.tre", schema="nexus")

Switching Between Attached and Detached Taxon Set Modes

As noted above, you can use the attach_taxon_namespace method to switch a DataSet object to attached taxon set mode. To restore it to detached (multiple) taxon set mode, use the detach_taxon_namespace method:

>>> import dendropy
>>> ds = dendropy.DataSet()
>>> taxa = dendropy.TaxonNamespace(label="global")
>>> ds.attach_taxon_namespace(taxa)
>>> ds.read(path="pythonidae_cytb.fasta", schema="fasta", data_type="dna")
>>> ds.read(path="pythonidae_aa.nex", schema="nexus")
>>> ds.read(path="pythonidae_morphological.nex", schema="nexus")
>>> ds.read(path="pythonidae.mle.tre", schema="nexus")
>>> ds.detach_taxon_namespace()
>>> ds.read(path="primates.nex", schema="nexus")

Here, the same TaxonNamespace object is used to manage taxon references for data parsed from the first four files, while the data from the fifth and final file gets its own, distinct, TaxonNamespace object and associated Taxon object references.