Stochastic Annealing for Determining Species Boundaries
from Variable DNA Sequences




Paul D. Bridge1, Mark A. O’Neill2 & Jamshid Fatehi3


1Mycology Section, The Herbarium, Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB and School of Biological & Chemical Sciences, Birkbeck, University of London, Malet St, London WC1E 7HX: e-mail p_bridge@hotmail.com

The molecular systematics of fungal taxa has largely been established through sequence analysis of the ribosomal RNA (rRNA) gene cluster. The cluster consists of genes for the production of the individual 5.8S, small and large ribosomal subunits, separated by internally transcribed spacer (ITS) regions. The individual copies of the cluster are arranged linearly and are separated by an intergenic spacer (IGS). DNA sequence within the subunit genes are in general relatively conserved, although they also contain some variable domains (see Bruns et al., 1992; Hibbert et al., 1992). Gene sequences, particularly from the small subunit have been widely used to establish phylogenetic relationships between genera (e.g. Hausner et al., 1993). The spacer sequences however are less conserved and have proved to be useful for species identification, and for determining relationships between closely related species (e.g. Gardes & Bruns, 1991; Nazar et al., 1991).


Although ITS sequences have proved useful in identification at the species level, the level of variability within these sequences is not consistent between different species. In some fungal species ITS sequences may differ by as little as 0.5% between different isolates, whereas in some other species sequences may vary by 15% or more between individual isolates (see Seifert et al., 1995). There has been considerable debate concerning the application of species concepts in fungi; and in the absence of a classical definition, different criteria have been accepted for different groups.


In some fungal groups ITS sequences have successfully been used to determine hierarchical relationships. One example of this is in the genus Colletotrichum, where ITS variation of 2-3% has been considered to be the normal range within a species complex, and sequence variation between species complexes being around 5% or greater (see Sreenivasaprasad et al., 1996). Although clear cut-offs can be used to describe species and species complexes in some groups, the situation may become very complicated when considering a grouping of related species with differing levels of sequence variation. Under these circumstances within species sequence variation might be expected to approach, or even overlap with, between species variation. This sort of inter-specific overlap can also give rise to problems when trying to classify groups of higher organisms which are in the process of speciation. For example, the Hyles euphorbiae complex [Lepidoptera: sphingidae] and the Bombus, Megabombus and Colletes complexes [Hymenoptera: apidae] (see for example Pittaway, 1993). A further problem can arise when comparing groups of the same taxonomic rank, which contain widely different levels of intra-group variation. The entomopathogenic fungus Metarhizium consists of both host specific and generalised pathogens, and these have been grouped at both species and variety level (Driver & Milner, 1997). Hierarchic methods were shown to provide inaccurate representations of the delineation of these taxa and the possible relationships between them and the non-hierarchic principle component analysis proved more appropriate (Bridge, 1998).


One problem that may be encountered at the level of species and variety is the direction of current and recent linear changes. The use of heirarchic mathematical approaches, such as those represented by dendrograms or cladograms, are appropriate methods to model evolutionary progressions, particularly as in most instances evolution can be assumed to have occurred in a single direction. Hennig (1966) pointed out that phylogenetic analyses were inappropriate for currently interbreeding populations, and this may be particularly relevant among organisms such as bacteria, yeasts and fungi, which are often considered to contain genetically plastic species and varieties, where lateral gene transfer may occur through a range of mechanisms from transposons to interspecific mating (see Woo et al., 1998; Brasier, 1997). A complication of lateral transfer is that it will occur between groups placed at the end points of hierarchic representations, and so it is not accurately described by extending the linear mathematics of the evolutionary line. This has recently been demonstrated by Wang & Zhang (2000) who showed that multiple lateral transfers within the rRNA genes of actinomycete bacteria gradually destroyed the evolutionary history recorded in the nucleotide sequence. There is therefore a need to investigate the use of non-directional mathematics when investigating relationships between closely related, possibly rapidly changing, taxa. Within the fungi these are often taxa described at the species, variety of pathogenic form level.

The generally established Phylogenetic Species Concept (Nixon & Wheeler, 1990) does not adequately cope with the levels of plasticity that are becoming apparent in such cases (see Goldstein & de Salle, 2000), and application of Population Aggregation Analysis (Davis & Nixon, 1992) would lead to the sub-division of many existing species. Non-hierarchic methods such as principle component analysis are basically linear in their initial comparison of variance, but provide a non-directional representation of the final results. Non-linear mathematical approaches are by definition direction independent. Specifically, they provide a method of data analysis which is independent of any a priori assumption, including that of divergent evolution from a single common progenitor. In this way they are particularly suited for the analysis of fuzzy data that is obtained from complex biological relationships. They have proven to be particularly powerful in the field of neuroinformatics (see Young et al., 1996; Hilgetag et al., 1995) and may provide a significant tool for the analysis of DNA sequence data.



Stochastic annealing with closely related DNA sequences


The legume Ascochyta complex is a group of species of the genera Ascochyta and Phoma that are significant pathogens of many leguminous hosts. Ascochyta and Phoma are primarily differentiated on the basis of their conidia. In Ascochyta the conidia are predominantly septate, whereas in Phoma the conidia are largely non septate (Boerema & Bollen, 1975). These are not exclusive characters and refer only to the majority of the conidia. In one instance, A. suboltshauseri, septate, Ascochyta like conidia are formed in planta, and non-septate, Phoma like conidia are produced in culture. Ascochyta and Phoma are anamorphic genera, and teleomorphs, where known, have been placed in the genus Didymella (For full review see Fatehi, 2000).


Various species concepts have been applied to these taxa at different times. Characteristics that have been used include original host, size and type of lesion, production of a particular metabolite and spore size and shape (e.g. Jones, 1927; Sutton, 1980). There is evidence of host preference among some isolates, but host specificity of individual species is debatable. There have also been some discrepancies in the ranks applied to taxa, with some groups considered different species by some authors, and as special pathogenic forms of a single species by others (e.g. Gossen et al., 1986). Determination of relationships in this group by molecular methods is complicated by the presence of more than one type of ITS sequence in some isolates. These multiple forms may be due to non-conserved mutation in some copies of the rRNA gene cluster, or may alternatively be the result of heterokaryotic multinucleate conidia (Fatehi & Bridge, 1998).


The complete ITS1-5.8S-ITS2 sequences from 14 isolates of Phoma and Ascochyta species associated with legumes, and two isolates of the presumed teleomorph Didymella were obtained (Fatehi, 2000), and the data was analysed by four different approaches, UPGMA cluster analysis, parsimony, principle component analysis and stochastic annealing.

Cluster analysis: The sequence divergence between sequences was calculated for all isolates by pairwise comparison of sequences. Sequence divergence was calculated as the percentage of bases different between each pair of sequences. The matrix of sequence divergences was then clustered by UPGMA average linkage through the package MVSP (Kovach Computing Services, Anglesea).

Parsimony: The most parsimonious tree was determined by analysis of the aligned sequence data through PAUP (Swofford, 1993). An unrooted tree was constructed for maximum parsimony using the outgroup method.

Principle component analysis: Principle component analysis was carried out on sequences after calculation of divergence through the program CLUSTALW. Principle component ordinations were then obtained through the JALVIEW utility available on-line at the European Bioinformatics Institute (EBI).

Stochastic annealing: Stochastic annealing was undertaken through the program seriate running under the P3) platform. For this operation transitions (e.g. purine to purine or pyramidine to pyramidine) were given a weight of 0.5, in comparison to transversions (pyramidine to purine or purine to pyramidine) which were given a weight of 1.



Initial indications


Four of the 16 sequences were identical (A. pisi (is2), A. pisi (CH3) A. pisi (CBS), A. pisi (IMI) and D. viciae), and the divergence between the others varied between 0.36% and 8.89% (Table 1). Plotting the frequency of the different levels of sequence divergence did not identify any clear cutoff values that could be investigated for species or species complex delineation. The dendrogram obtained from the sequence divergence gave a simple structure of one group of 10 organisms, with the remaining organisms linked to this in increasing order of divergence. Although there was some indication of the limit of the core group of organisms, there was no clear delineation. The sequence obtained from D. fabae, the teleomorph of A. fabae was linked singly to the core group, which in turn contained a sequence from the anamorphic A. fabae, despite the true difference between these isolates being only 0.36%. The parsimony tree gave a very similar arrangement of the organisms, and showed P. rabiei and P. medicaginis as closely linked to the core group.


The ordination in the first 3 principle components (Fig 2c) gave a core group of organisms that corresponded to part of the large cluster seen in the cluster analysis. This group contained 10 organisms (1-10; Table 1), with the remaining 6 organisms dispersed both from this group and from each other. The ordination placed P. rabiei outside this main group. The seriation programme gave a series of reordered matrices, each of which showed minimum binding energy. These were all largely consistent, and in every case placed the same 10 organisms grouped by PCA together at the top of the matrix. The consensus reordering obtained from the seriate analysis is that given in Table 1.



Implications


The data set analysed in this study is unusual in that it contains a number of different issues that are often encountered in systematic studies. In this case the sampling sizes, while similar in terms of specific names, can be seen to be very unequal, with comparisons being made between a group of 10 and 6 single organisms. This is due in part to uncertainties and inconsistencies within this group of fungi. The generic division between Ascochyta and Phoma has been questioned, and in the past, different species epithets have been used to distinguish between very similar organisms isolated from different hosts (see Punithalingam, 1979; Boerema et al., 1993). Previous molecular studies within this group of fungi have suggested the possibility of recent divergence or genetic exchange among some host recognised species, particularly A. pisi and A. fabae. This was seen as the possession of two distinct ITS sequences within these isolates, which appeared after PCR amplification to comprise of a major and minor type (Fatehi & Bridge, 1998). Detailed analysis suggested that the minor ITS type in some isolates was the same as the major type in others, and vice versa. This may have arisen due to horizontal transfer of one nuclear type in the multinucleate conidia, or may represent a radiating group from a single founder taxon (Fatehi & Bridge, 1998). In this study only the major amplification product has been sequenced. Of the four methods evaluated here, Principle Component Analysis and seriation both give results that stress the close relationship within the A. pisi/A. fabae complex. These species are pathogens on peas and beans respectively, and the species A. vicia-villosa and A. vicia-panonicia were also isolated from Vicia species. This is in contrast to A. rabiei, which is commonly found on chickpeas.


Given the similarities in host, together with the already reported ITS heterogeneity and overlap, it would seem reasonable to conclude that the 10 organisms that grouped together by these methods constituted a single taxon, probably at species level. Once this assumption has been made, it is possible to make a more accurate interpretation of the cluster analysis and the parsimony tree. It is also possible to return to these representations and the original data and determine some cutoff values that could then be applied for the identification of further sequences. This is however only clear when the results of the PCA and the seriation analyses have been taken into account. Although significance levels can be applied to branch points in dendrograms, most analyses assume that the variability behind the branchpoint is normally distributed (see Dudzinski, 1975). While this may be the case for a branch point describing a single species, it is not likely to be the case for branch points differentiating species with different levels of variability. This highlights a problem that can occur when considering unknown sequences, and in this case it was not possible to accurately determine the species boundaries purely from tree-based representations. This raises a question as to how much of the answer needs to be known before the results can be interpreted. In reality it is the empirical judgement of the experienced taxonomist that provides this.


The seriation approach used here is an extension of a methodology originally developed to seriate connection matrix bit-vectors derived from neuroanatomical connection data of the visual areas of the rhesus macaque cortex (see Young et al., 1995;). The basic seriation algorithm was extended to permit the seriation of strings of character data (e.g. FASTA format DNA sequence data) rather than simple bit-vectors. A cost function was included to represent the different costs of transversions and transitions in terms of subsequent DNA binding energy. It is expected that further extensions to the system will permit it to be used for automated sequence alignment using stochastically controlled inflation and erosion of the DNA sequence data. In addition, output data can be generated in a form that can be visualised in a 2- or 3- dimensional cluster space (using for example, an adapted version of the xtopol application which was developed to perform similar sorts of display task for neuroinformatic data) (Figure 1).


The simple version of the seriate application used here did however identify the association within the core A. pisi/A.fabae group through their placement and movement in the lowest energy re-orderings. This approach has some similarities to the Cladistic Haplotype Aggregation as described by Brower (1999), but in this case the distinct aggregations are identified by low energy levels in the annealing, as opposed to topological arrangement. To obtain a separation/grouping comparable to that obtained from PCA with an initial simplified version is encouraging, and suggests that further development will lead to a powerful tool for sequence comparison. The utility of PCA for analysing molecular data in fungi has been suggested previously for band data from closely related taxa of the same level (Bridge, 1998). The extraction of correlated characters as eigenvectors to obtain the principle coordinates acts to bias the final ordination towards that supported by the greatest number of correlated characters. This can act as a filter in situations where correlations are low and there is a high level of random variation, as the small number of correlated characters will form the principle components, ahead of the uncorrelated random variation (Bridge, 1998). In this case the majority of the data is correlated, as the sequences are very similar, but as this similarity extends across the data set, the level of variability in these characters is low, and the bulk of the variability in the data is from the random variation. However, this would appear to be filtered by the methodology, and the result provides a useful method for determining the significance of branches within trees. This appears to be particularly important in situations where significant genetic events have occurred in different directions, and these are in turn masked by multi-directional random variation. Such a situation could be expected where isolation or radiation has occurred from a single meiotic population.


The approach taken here was a pilot study based on fungal sequence data, and was further limited by the relatively low level of sequence variation. The non-hierarchic and non-linear methods are organism independent and so may be expected to provide useful insights into plant or insect groups, such as Brassicas and Hyles, where either interspecific or varietal crosses may occur naturally in the environment.



References


Boerema, G.H. and Bollen, G.J. 1975. Conidiogenesis and conidial septation as differentiating criteria between Phoma and Ascochyta. Persoonia 8: 111-144.

Boerema, G.H., Pieters, R. and Hamers, M.E.C. 1993. Check-list for scientific names of common parasitic fungi. Supplement Series 2c,d (additions and corrections). Fungi on field crops: pulse (legumes), forage crops (herbage legumes), vegetables and cruciferous crops. Netherlands Journal of Plant Pathology 99: Supplement 1, 1-32.

Boerema, G.H., de Gruyter, J., and Noordeloos, M.E. 1997. Contributions towards a monograph of Phoma (coelomycetes) – IV. Persoonia 16: 335-371.

Brasier, C.M. 1997. Fungal species in practice; identifying species units in fungi. In. Pp. 135-170 in: Claridge, M., Dawah, H. & Wilson, M. (ed.), The Units of Biodiversity. Chapman & Hall.

Bridge, P.D. 1998. Numerical analysis of molecular variability, a comparison of hierarchic and nonhierarchic approaches. Pp. 291-308 in: Bridge, P.D., Couteaudier, Y. & Clarkson, J.M. (ed.) Molecular Variability of Fungal Pathogens. CAB International, Wallingford.

Brower, A.V.Z. 1999. Delimitation of phylogenetic species with DNA sequences: a critique of Davis and Nixon’s Population Aggregation Analysis. Systematic Biology 48: 199-213.

Bruns, T.D., Vilgalys, R., Barns, S.M., Gonzalez, D., Hibbett, D.S., Lane, D.J., Simon, L., Stickel, S., Szaro, T.M., Weisburg, W.G. and Sogin, M.L. 1992. Evolutionary relationships within the fungi: analysis of nuclear small subunit rRNA sequences. Molecular Phylogenetics and Evolution 1: 231-241.

Davis, J.I. and Nixon, K.C. 1992. Populations, genetic variation, and the delineation of phylogenetic species. Systematic Biology 41: 421-435.

Driver, F. and Milner, R.J. 1998. PCR applications to the taxonomy of entomopathogenic fungi. Pp.153-186 in: Bridge, P.D., Arora, D.K., Elander, R.P. & Reddy, C.A. (ed.), Applications of PCR in Mycology, CAB International, Wallingford.

Dudzinski, M.L. 1975. Principal components analysis and its use in hypothesis generation and multiple regression. Pp. 85-105 in: Bofinger, V.J. & Wheeler, J.L. (ed), Developments in Field Experiment Design and Analysis. Commonwealth Agricultural Bureaux, Slough.

Fatehi, J. 2000. PhD Thesis, University of Reading, UK

Fatehi, J. and Bridge, P.D. 1998. Detection of multiple rRNA-ITS regions in isolates of Ascochyta. Mycological Research 102: 762-766.

Gardes, M. and Bruns, T.D. 1991. Rapid characterization of ectomycorrhizae using RFLP pattern of their PCR amplified-ITS. Mycological Society Newsletter 41: 14.

Goldstein, P.Z. and DeSalle, R. 2000. Phylogenetic species, nested hierarchies, and character fixation. Cladistics 16: (in press).

Gossen, B.D., Sheard, J., Reauchamp, C.J. and Morrall, R.A.A. 1986. Ascochyta lentis renamed Ascochyta fabae f.sp. lentils. Canadian Journal of Plant Pathology 8: 154-160.

Hausner, G., Reid, J. and Klassen, G.R. 1993. On the phylogeny of Ophiostoma, Ceratocystis s.s., and Microascus, and relationships within Ophiostoma based on partial ribosomal DNA sequences. Canadian Journal of Botany 71: 1249-1265.

Hennig, W. 1966. Phylogenetic Systematics. University of Illinois Press, Urbana, USA

Hibbert, D.S. 1992. Ribosomal RNA and fungal systematics. Transactions of the Mycological Society of Japan 33: 533-556.

Hilgetag C.C., O'Neill, M.A. and Young, M.P. 1996. Optimal hierarchical orderings for the primate visual system using a novel network processor. Science 271: 776-777.

Jones, L.K. 1927. Studies of the nature and control of blight, leaf and pod spot, and footrot of peas caused by species of Ascochyta. Pp.46 in: Bulletin of New York State Agricultural Experimental Station, 547.

Nazar, R.N., Hu, X., Schmidt, J., Culham, D. and Robb, J. 1991. Potential use of PCR amplified ribosomal intergenic sequences in the detection and differentiation of Verticillium wilt pathogen. Physiological and Molecular Plant Pathology 39: 1-11.

Nixon, K.C. and Wheeler, Q.D. 1990. An amplification of the phylogenetic species concept. Cladistics 6: 211-223.

Pittaway A.R. 1993. The Hawkmoths of the Western Palaearctic. Harley Books.

Punithalingam, E. 1979. Graminicolous Ascochyta species. Mycological Papers 142: 1-214.

Seifert, K.A., Wingfield, B.D. and Wingfield, M.J. 1995. A critique of DNA sequence analysis in the taxonomy of filamentous ascomycetes and ascomycetous anamorphs. Canadian Journal of Botany 73 (suppl. 1): S760-767.

Sreenivasaprasad, S., Mills, P.R., Meehan, B.M. & Brown, A.E. 1996. Phylogeny and systematics of 18 Colletotrichum species based on ribosomal DNA spacer sequences. Genome 39: 499-512.

Sutton, B. C. 1980. The Coelomycetes. CAB International Mycological Institute: Kew.

Swofford, D.L. 1993. PAUP: Phylogenetic Analysis Using Parsimony, version 3.1.1. Illinois Natural History Survey, Champaign, Illinois.

Wang, Y. and Zhang, Z. 2000. Comparative sequence analyses reveal frequent occurrence of short segments containing an abnormally high number of non-random base variations in bacterial rRNA genes. Microbiology 146: 2845-2854.

Woo, S.L., Noviello, C. and Lorito, M. 1998. Sources of molecular variability and applications in characterization of the plant pathogen Fusarium oxysporum. Pp. 187-208 in Bridge, P.D., Couteaudier, Y. & Clarkson, J.M. (ed) Molecular Variability of Fungal Pathogens. CAB International, Wallingford

Young M.P., Scannell, J.W., O'Neill, M.A., Hilgetag, C.C., Burns, G. and Blakemore, C.B. 1995. Non metric multi dimensional scaling in the analysis of neuroanatomical connection data. Philosophical Transactions of the Royal Society B 348: 281-308.


Table 1. Sequence divergence



1. A. viciae-villosae

0
















2. D. fabae


2.13

0















3. A. pisi (is2)

0.71

2.13

0














4. A. fabae

0.36

1.42

0.71

0













5. A. viciae-pannonicae

1.07

2.85

0.36

1.07

0












6. A. pisi (CBS)

0.71

2.49

0

0.71

0.36

0











7. A. pisi (CH3)

0.71

2.49

0

0.71

0.36

0

0










8. D. viciae


0.71

2.49

0

0.71

0.36

0

0

0









9. A. pisi (IMI)

0.71

2.49

0

0.71

0.36

0

0

0

0








10. Ascochyta sp.

2.13

3.56

1.42

2.13

1.78

1.42

1.42

1.42

1.42

0







11. A. rabiei


2.85

4.98

2.49

2.85

2.85

2.49

2.49

2.49

2.49

3.21

0






12. P. medicaginis


4.63

6.41

3.91

4.27

3.91

3.91

3.91

3.91

3.91

5.34

3.21

0





13. P. exigua


5.34

6.76

4.63

5.69

4.27

4.63

4.63

4.63

4.63

5.34

4.27

3.91

0




14. A. boltshauseri


5.34

7.11

4.27

5.34

4.98

4.27

4.27

4.27

4.27

4.63

5.34

5.69

4.27

0



15. A. pinodes


4.98

6.41

4.27

4.27

4.98

4.27

4.27

4.27

4.27

5.69

5.34

6.76

6.05

7.11

0


16. P. glomerata

4.98

6.41

4.63

4.63

4.98

3.91

3.91

3.9

3.91

5.34

5.69

7.11

6.05

8.89

4.63

0



1


2


3


4


5


6


7

8


9


10


11


12


13


14


15


16


Figure 1. Xtopol visulisation of consensus seriation.






























1 Mycology Section, The Herbarium, Royal Botanic Gardens, Kew, Surrey TW9 3AH

2Bee Systematics and Biology Unit, Hope Entomological Collections, Parks Road, Oxford OX1 3PW

3Department of Forest Mycology and Pathology, Swedish University of Agricultural Sciences, Box 7026, 750 07 Uppsala, Sweden