dendropy.datamodel.datasetmodel: Datasets – Aggregate Collections of Taxon, Character, and Tree Data

class dendropy.datamodel.datasetmodel.DataSet(*args, **kwargs)[source]

A phylogenetic data object that coordinates collections of TaxonNamespace, TreeList, and (various kinds of) CharacterMatrix objects.

A DataSet has three attributes:

taxon_namespaces
A list of TaxonNamespace objects, each representing a distinct namespace for operational taxonomic unit concept definitions.
tree_lists
A list of TreeList objects, each representing a collection of Tree objects.
char_matrices
A list of CharacterMatrix-derived objects (e.g. DnaCharacterMatrix).

Multiple TaxonNamespace objects within a DataSet are allowed so as to support reading/loading of data from external sources that have multiple independent taxon namespaces defined within the same source or document (e.g., a Mesquite file with multiple taxa blocks, or a NeXML file with multiple OTU sections). Ideally, however, this would not be how data is managed. Recommended idiomatic usage would be to use a DataSet to manage multiple types of data that all share and reference the same, single taxon namespace.

This convention can be enforced by setting the DataSet instance to “attached taxon namespace” mode:

ds = dendropy.DataSet()
tns = dendropy.TaxonNamespace()
ds.attach_taxon_namespace(tns)

After setting this mode, all subsequent data read or created will be coerced to use the same, common operational taxonomic unit concept namespace.

Note that unless there is a need to collect and serialize a collection of data to the same file or external source, it is probably better semantically to use more specific data structures (e.g., a TreeList object for trees or a DnaCharacterMatrix object for an alignment). Similarly, when deserializing an external data source, if just a single type or collection of data is needed (e.g., the collection of trees from a file that includes both trees and an alignment), then it is semantically cleaner to deserialize the data into a more specific structure (e.g., a TreeList to get all the trees). However, when deserializing a mixed external data source with, e.g. multiple alignments or trees and one or more alignments, and you need to access/use more than a single collection, it is more efficient to read the entire data source at once into a DataSet object and then independently extract the data objects as you need them from the various collections.

The constructor can take one argument. This can either be another DataSet instance or an iterable of TaxonNamespace, TreeList, or CharacterMatrix-derived instances.

In the former case, the newly-constructed DataSet will be a shallow-copy clone of the argument.

In the latter case, the newly-constructed DataSet will have the elements of the iterable added to the respective collections (taxon_namespaces, tree_lists, or char_matrices, as appropriate). This is essentially like calling DataSet.add on each element separately.

__delattr__

x.__delattr__('name') <==> del x.name

__format__()

default object formatter

__getattribute__

x.__getattribute__('name') <==> x.name

__reduce__()

helper for pickle

__reduce_ex__()

helper for pickle

__repr__
__setattr__

x.__setattr__('name', value) <==> x.name = value

__sizeof__() → int

size of object in memory, in bytes

__str__
add(data_object, **kwargs)[source]

Generic add for TaxonNamespace, TreeList or CharacterMatrix objects.

add_char_matrix(char_matrix)[source]

Adds a CharacterMatrix or CharacterMatrix-derived instance to this dataset if it is not already there.

Parameters: char_matrix (CharacterMatrix) – The CharacterMatrix object to be added.
add_taxon_namespace(taxon_namespace)[source]

Adds a taxonomic unit concept namespace represented by a TaxonNamespace instance to this dataset if it is not already there.

Parameters: taxon_namespace (TaxonNamespace) – The TaxonNamespace object to be added.
add_taxon_set(taxon_set)[source]

DEPRECATED: Use add_taxon_namespace() instead.

add_tree_list(tree_list)[source]

Adds a TreeList instance to this dataset if it is not already there.

Parameters: tree_list (TreeList) – The TreeList object to be added.
as_string(schema, **kwargs)

Composes and returns string representation of the data.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the format in which to compose the data (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.
attach_taxon_namespace(taxon_namespace=None)[source]

Forces all read() calls of this DataSet to use the same TaxonNamespace. If taxon_namespace is None, then a new TaxonNamespace will be created and added to self.taxon_namespaces, and that TaxonNamespace will be attached.

attach_taxon_set(taxon_set=None)[source]

DEPRECATED: Use attach_taxon_namespace() instead.

clone(depth=1)

Creates and returns a copy of self.

Parameters: depth (integer) –

The depth of the copy:

  • 0: shallow-copy: All member objects are references, except for the annotation_set of the top-level object and member Annotation objects: these are full, independent instances (though any complex objects in the value field of Annotation objects are also just references).
  • 1: taxon-namespace-scoped copy: All member objects are full independent instances, except for TaxonNamespace and Taxon instances: these are references.
  • 2: Exhaustive deep-copy: all objects are cloned.
copy_annotations_from(other, attribute_object_mapper=None)

Copies annotations from other, which must be of Annotable type.

Copies are deep-copies, in that the Annotation objects added to the annotation_set AnnotationSet collection of self are independent copies of those in the annotation_set collection of other. However, dynamic bound-attribute annotations retain references to the original objects as given in other, which may or may not be desirable. This is handled by updating the objects to which attributes are bound via mappings found in attribute_object_mapper. In dynamic bound-attribute annotations, the _value attribute of the annotation object (Annotation._value) is a tuple consisting of “(obj, attr_name)”, which instructs the Annotation object to return “getattr(obj, attr_name)” (via: “getattr(*self._value)”) when returning the value of the Annotation. “obj” is typically the object to which the AnnotationSet belongs (i.e., self). When a copy of Annotation is created, the object reference given in the first element of the _value tuple of dynamic bound-attribute annotations is unchanged, unless the id of the object reference is fo

Parameters:
  • other (Annotable) – Source of annotations to copy.
  • attribute_object_mapper (dict) – Like the memo of __deepcopy__, maps object id’s to objects. The purpose of this is to update the parent or owner objects of dynamic attribute annotations. If a dynamic attribute Annotation gives object x as the parent or owner of the attribute (that is, the first element of the Annotation._value tuple is other) and id(x) is found in attribute_object_mapper, then in the copy the owner of the attribute is changed to attribute_object_mapper[id(x)]. If attribute_object_mapper is None (default), then the following mapping is automatically inserted: id(other): self. That is, any references to other in any Annotation object will be remapped to self. If really no reattribution mappings are desired, then an empty dictionary should be passed instead.
deep_copy_annotations_from(other, memo=None)

Note that all references to other in any annotation value (and sub-annotation, and sub-sub-sub-annotation, etc.) will be replaced with references to self. This may not always make sense (i.e., a reference to a particular entity may be absolute regardless of context).

detach_taxon_namespace()[source]

Relaxes constraint forcing all

detach_taxon_set()[source]

DEPRECATED: Use detach_taxon_namespace() instead.

classmethod get(**kwargs)[source]

Instantiate and return a new DataSet object from a data source.

Mandatory Source-Specification Keyword Argument (Exactly One Required):

  • file (file) – File or file-like object of data opened for reading.
  • path (str) – Path to file of data.
  • url (str) – URL of data.
  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the format of the data in the source (e.g., “nexus”).

Optional General Keyword Arguments:

  • exclude_trees (bool) – If True, then all tree data in the data source will be skipped.
  • exclude_chars (bool) – If True, then all character data in the data source will be skipped.
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created.
  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

dataset1 = dendropy.DataSet.get(
        path="pythonidae.chars_and_trees.nex",
        schema="nexus")
dataset2 = dendropy.DataSet.get(
        url="http://purl.org/phylo/treebase/phylows/study/TB2:S1925?format=nexml",
        schema="nexml")
get_from_path(src, schema, **kwargs)

Factory method to return new object of this class from file specified by string src.

Parameters:
  • src (string) – Full file path to source of data.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

get_from_stream(src, schema, **kwargs)

Factory method to return new object of this class from file-like object src.

Parameters:
  • src (file or file-like) – Source of data.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

get_from_string(src, schema, **kwargs)

Factory method to return new object of this class from string src.

Parameters:
  • src (string) – Data as a string.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

get_from_url(src, schema, strip_markup=False, **kwargs)

Factory method to return a new object of this class from URL given by src.

Parameters:
  • src (string) – URL of location providing source of data.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

pdo (phylogenetic data object) – New instance of object, constructed and populated from data given in source.

get_tree_list(label)[source]

Returns a TreeList object specified by label.

new_char_matrix(char_matrix_type, *args, **kwargs)[source]

Creates a new CharacterMatrix (of class char_matrix_type) and adds it to the char_matrices of self.

new_taxon_namespace(*args, **kwargs)[source]

Creates a new TaxonNamespace object, according to the arguments given (passed to TaxonNamespace()), and adds it to this DataSet.

new_taxon_set(*args, **kwargs)[source]

DEPRECATED: Use new_taxon_namespace() instead.

new_tree_list(*args, **kwargs)[source]

Creates a new TreeList instance, adds it to this DataSet.

Parameters:
  • *args (positional arguments) – Passed directly to TreeList constructor.
  • **kwargs (keyword arguments, optional) – Passed directly to TreeList constructor.
Returns:

t (TreeList) – The new TreeList instance created.

read(**kwargs)[source]

Add data to self from data source.

Mandatory Source-Specification Keyword Argument (Exactly One Required):

  • file (file) – File or file-like object of data opened for reading.
  • path (str) – Path to file of data.
  • url (str) – URL of data.
  • data (str) – Data given directly.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the format of the data in the source (e.g., “nexus”).

Optional General Keyword Arguments:

  • exclude_trees (bool) – If True, then all tree data in the data source will be skipped.
  • exclude_chars (bool) – If True, then all character data in the data source will be skipped.
  • taxon_namespace (TaxonNamespace) – The TaxonNamespace instance to use to manage the taxon names. If not specified, a new one will be created unless the DataSet object is in attached taxon namespace mode (self.attached_taxon_namespace is not None but assigned to a specific TaxonNamespace instance).
  • ignore_unrecognized_keyword_arguments (bool) – If True, then unsupported or unrecognized keyword arguments will not result in an error. Default is False: unsupported keyword arguments will result in an error.

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is interpreted and processed, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

ds = dendropy.DataSet()
ds.read(
        path="pythonidae.chars_and_trees.nex",
        schema="nexus")
ds.read(
        url="http://purl.org/phylo/treebase/phylows/study/TB2:S1925?format=nexml",
        schema="nexml")
read_from_path(src, schema, **kwargs)

Reads data from file specified by src.

Parameters:
  • src (string) – Full file path to source of data.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

n (tuple [integer]) – A value indicating size of data read, where “size” depends on the object:

read_from_stream(src, schema, **kwargs)

Reads from file (exactly equivalent to just read(), provided here as a separate method for completeness).

Parameters:
  • src (file or file-like) – Source of data.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

n (tuple [integer]) – A value indicating size of data read, where “size” depends on the object:

read_from_string(src, schema, **kwargs)

Reads data from the string src.

Parameters:
  • src (string) – Data as a string.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

n (tuple [integer]) – A value indicating size of data read, where “size” depends on the object:

read_from_url(src, schema, **kwargs)

Reads a URL source.

Parameters:
  • src (string) – URL of location providing source of data.
  • schema (string) – Specification of data format (e.g., “nexus”).
  • **kwargs (keyword arguments, optional) – Arguments to customize parsing, instantiation, processing, and accession of objects read from the data source, including schema- or format-specific handling. These will be passed to the underlying schema-specific reader for handling.
Returns:

n (tuple [integer]) – A value indicating size of data read, where “size” depends on the object:

unify_taxon_namespaces(taxon_namespace=None, case_sensitive_label_mapping=True, attach_taxon_namespace=True)[source]

Reindexes taxa across all subcomponents, mapping them to a single taxon namespace.

write(**kwargs)

Writes out self in schema format.

Mandatory Destination-Specification Keyword Argument (Exactly One of the Following Required):

  • file (file) – File or file-like object opened for writing.
  • path (str) – Path to file to which to write.

Mandatory Schema-Specification Keyword Argument:

  • schema (str) – Identifier of the format in which to write the data (e.g., “nexus”).

Optional Schema-Specific Keyword Arguments:

These provide control over how the data is formatted, and supported argument names and values depend on the schema as specified by the value passed as the “schema” argument. See “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” for more details.

Examples:

d.write(path="path/to/file.dat",
        schema="nexus",
        preserve_underscores=True)
# A file passed via "file" must be opened for writing.
with open("path/to/file.dat", "w") as f:
    d.write(file=f,
            schema="nexus",
            preserve_underscores=True)
write_to_path(dest, schema, **kwargs)

Writes to file specified by dest.

write_to_stream(dest, schema, **kwargs)

Writes to file-like object dest.