Population Genetic Summary Statistics¶
dendropy.calculate.popgenstat module provides functions that calculate some common population genetic summary statistics.
For example, given a
DnaCharacterMatrix as an argument, the
num_segregating_sites function returns the raw number of segregating sites,
average_number_of_pairwise_differences returns the average number of pairwise differences, and
nucleotide_diversity returns the nucleotide diversity.
More complex statistics are provided by the
Objects of this class are instantatiated with two lists of
DnaCharacterDataSequence objects as arguments, each representing a sample of DNA sequences drawn from two distinct but related populations.
Once instantiated, the following attributes of the
PopulationPairSummaryStatistics object are available:
- The average number of pairwise differences between every sequence across both populations.
- The average number of pairwise differences between every sequence between both populations.
- The average number of pairwise differences between every sequence within each population.
- The net number of pairwise differences.
- The number of segregating sites.
- Watterson’s theta.
- Wakeley’s psi.
- Tajima’s D.
The following example calculates the suite of population genetic summary statistics for sequences drawn from two populations of sticklebacks.
The original data consists of 23 sequences, with individuals from Eastern Pacific populations identified by their taxon labels beginning with “
EPAC” and individuals from Western Pacific populations identified by their taxon labels beginning with “
The taxon labels thus are used as the basis for sorting the sequences into the required lists of
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
#! /usr/bin/env python import dendropy from dendropy.calculate import popgenstat seqs = dendropy.DnaCharacterMatrix.get( path="orti1994.nex", schema="nexus") p1 =  p2 =  for idx, t in enumerate(seqs.taxon_namespace): if t.label.startswith('EPAC'): p1.append(seqs[t]) else: p2.append(seqs[t]) pp = popgenstat.PopulationPairSummaryStatistics(p1, p2) print('Average number of pairwise differences (total): %s' \ % pp.average_number_of_pairwise_differences) print('Average number of pairwise differences (between populations): %s' \ % pp.average_number_of_pairwise_differences_between) print('Average number of pairwise differences (within populations): %s' \ % pp.average_number_of_pairwise_differences_within) print('Average number of pairwise differences (net): %s' \ % pp.average_number_of_pairwise_differences_net) print('Number of segregating sites: %s' \ % pp.num_segregating_sites) print("Watterson's theta: %s" \ % pp.wattersons_theta) print("Wakeley's Psi: %s" \ % pp.wakeleys_psi) print("Tajima's D: %s" \ % pp.tajimas_d)
Lines 6-12 build up the two lists of
DnaCharacterDataSequence objects by sorting the original sequences into their source populations based on the taxon label (with operational taxonomic units with labels beginning with “
EPAC” coming from the Eastern Pacific, and assigned to the list
p1, while those that begin with “
WPAC” coming from the Western Pacific, and assigned to the list
These lists are then passed as the instantiation arguments to the
PopulationPairSummaryStatistics constructor in line 14.
The calculations are performed immediately, and the results reported in the following lines.