ARTICLE IN PRESS
Journal of Theoretical Biology 225 (2003) 301–325
Comparison methods for branching and axillary flowering sequences Y. Gue! dona,*, P. Heureta, E. Costesb a
Unit!e Mixte de Recherche CIRAD/CNRS/INRA/Universit!e Montpellier II, Botanique et Bioinformatique de l’Architecture des Plantes, TA 40/PS2, 34398 Montpellier Cedex 5, France b Unit!e Mixte de Recherche INRA/AgroM/CIRAD/IRD BDPPC, Equipe Architecture et Fonctionnement des Esp"eces Fruiti"eres, 2, Place Viala, 34060 Montpellier Cedex 1, France Received 11 December 2002; received in revised form 16 June 2003; accepted 20 June 2003
Abstract Comparing branching and axillary flowering patterns accurately is a major issue both in botany and in various agronomic contexts. Data take the form of sequences which naturally represent the underlying structural information of branching and axillary flowering patterns. Various comparison methods are proposed based either on sequence alignment or on the computation of dissimilarity measures between (hidden) Markovian models built from sets of sequences. Sequence alignment is a natural complement to the exploratory tools and statistical models proposed in Gu!edon et al. (J. Theor. Biol. 212 (2001) 481) with the distinctive feature of applying to individual sequences. Comparison methods may also be used to reveal some grouping within a set of sequences or to evaluate the strength of a predefined grouping of sequences. The proposed approach is illustrated by examples corresponding to different plant species and different biological or agronomic objectives. r 2003 Elsevier Ltd. All rights reserved. Keywords: Branching pattern; Clustering; Markovian models; Plant architecture; Sequence alignment
1. Introduction The main outcome of the architectural analysis initially proposed by Halle! and Oldeman (1970) (see also Halle! et al., 1978) is to provide the biologist with the main concepts and tools to tackle the study of plant architecture development, whatever the development stage or the species considered. In this context, the analysis of branching and axillary flowering structures is a key element. Branching expression is naturally represented by sequences whose index parameter is the node rank, the attached variables qualifying either the parent entity (phyllotaxis, length of the internode) or the offspring entities (type of axillary production such as immediate vs. 1-year-delayed offspring shoot). Most of the measured variables are discrete (nominal, ordinal or interval-scaled) and correspond to morphological information related to branching expression. In Gue! don et al. (2001), the focus was on exploratory analysis and statistical modeling for analysing branching and axillary flowering patterns. The main objective was
to throw light on structures or regularities not directly apparent in the data and thereby deepen our biological understanding of the underlying mechanisms that control the branching and the axillary flowering of plants over time and space. Two main categories of branching/axillary flowering structures were identified at two distinct scales: * *
To further our biological knowledge, it is thus natural to try to compare these branching/axillary flowering structures within the architecture, i.e. between different categories of macroscopic botanical entities (growth units, annual shoots or axes), or between cultivars or provenances in more agronomic contexts. In this paper, we therefore investigate comparison methods that apply in two main contrasted contexts: *
*
*Corresponding author. Tel.: 467616578; fax: 467615668. E-mail addresses:
[email protected] (Y. Gu!edon),
[email protected] (P. Heuret),
[email protected] (E. Costes). 0022-5193/$ - see front matter r 2003 Elsevier Ltd. All rights reserved. doi:10.1016/S0022-5193(03)00249-2
repeated patterns at a local scale, succession of homogeneous zones at a global scale.
a local context where two individual sequences are aligned and the objective, for instance, is to locate homologous zones along the sequences being aligned, a global context where the aim is to reveal a grouping within a set of sequences or to evaluate the strength of a predefined grouping of sequences. This grouping
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
302
may be based on a distance matrix between sequences resulting from their pairwise alignment. In this case, the local and the global analysis contexts share the same methodological basis. We also investigate another approach which relies on a first step of statistical modeling of sub-samples of sequences. Sequence alignment is a natural complement to the exploratory tools and statistical models presented in Gue! don et al. (2001) for the investigation of biological questions related to branching and axillary flowering while the research of some grouping in a set of sequences, or the evaluation of a predefined grouping, is mainly useful in agronomic studies where axis categories within the architecture or between clones, cultivars or species may be compared on the basis of their branching patterns. Sequence alignment is one of the main tools used in computational molecular biology; see Waterman (1995) and Gusfield (1997). With respect to this context, the differences are numerous: *
*
branching and axillary flowering sequences are short (less than a few hundred nodes) particularly compared to DNA sequence which may be very long, branching and axillary flowering sequences may be multivariate and the variables of interest are not always nominal (or qualitative) like in molecular biology where the two main alphabets are the four nucleotides for DNA and the 20 amino acids for proteins. In our context, the variables of interest may be nominal (e.g. different types of axillary productions which cannot be ordered in a meaningful way), ordinal (e.g. latent bud (0), aborted branch (1) and developed branch (2)) or interval-scaled (e.g. the number of axillary flowers).
Hence, in this paper, we will investigate adaptations and specializations of sequence alignment methods to our particular context. The remainder of this paper is organized as follows. The main biological concepts of plant architecture and morphology are introduced in Section 2. In Section 3, a typology of sequences extracted from plant architectures is proposed which relies both on the type of dynamics exhibited by the sequences and the types of variables measured for each elementary botanical entity. Different categories of comparison methods are introduced using selected examples of biological interest in Section 4. Section 5 addresses various issues concerning the different applications of sequence comparison methods.
2. Biological framework The architectural development of a plant is the result of many elementary morphological events distributed
over time and space. Whatever the size, shape and complexity of the resulting architecture, only a few categories of homologous botanical entities can be identified in the caulinar shoot system of an adult plant (Barthe! le! my, 1991). These types of botanical entities (Fig. 1) can be ordered from the most elementary which is the node—or more precisely the set {node, internode, leaf(ves) and axillary bud(s)} referred to as metamer; see White (1979)—to more macroscopic entities such as the axis (i.e. succession of metamers built up by a single meristem). Plant development is the result of two basic mechanisms: the apical growth process and the branching process. Each of these two basic mechanisms corresponds to a particular meristem activity. An apical meristem is defined as a set of embryonic cells located in the apical part of each axis; for a recent review, see Lyndon (1998). These particular cells are always able to divide and thus to generate various tissues (pith, epiderm, vascular bundles, etc.) or organs (leaf, flower, etc.). The apical meristem activity also generates small lateral sets of embryonic cells, the axillary meristems, which may give rise to axillary or offspring shoots. In the case of the apical growth process, a succession of metamers is built up by a single apical meristem. If no resting periods are observed during the elongation, the growth is qualified as continuous and leads to the edification of an axis made up of a succession of metamers. If the apical growth takes the form of a succession of elongation and resting periods (with annual periodicity for temperate species), the growth is qualified as rhythmic. This temporal organization translates into a spatial organization which may be identified by morphological markers—e.g. scale leaves and short internodes corresponding to slowdowns or stops of growth; see Caraglio and Barthe! le! my (1997). In this way, different types of macroscopic botanical entities may be defined such as growth units (portion of leafy axis, i.e. succession of metamers built up between two resting phases; see Halle! and Martin (1968)) or annual shoots (portion of a leafy axis built up over a single year for temperate species, and which may be composed of one or several successive growth units; see Barthe! le! my (1991) and Godin and Caraglio (1998)). In the case of the branching process, an offspring entity is built up by an axillary meristem. This offspring entity, which may be either a flower—resulting from the transformation of this meristem—or a shoot, will be globally referred to as an axillary production. The entities borne on the successive nodes of a given ‘‘parent’’ entity may exhibit a remarkable structure. For example, many temperate species exhibit gradients on growth units, for instance vigor gradients associated with acrotony, i.e. a decrease in the dimensions (length, diameter) of the offspring entities from the top to the base of the parent growth unit.
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
303
Top
leaf
metamer
limit between two successive growth units
(a)
(b)
Two-year-old M7 rootstock
Base
(c)
... 3 4 4 4 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 2 1 0 0 1 1 1 0 1 0 1 2 0 0 0 0 0 0 0
Fig. 1. Apple tree: (a) Growth unit organization results from the succession of metamers. (b) Trunk grafted on M7 rootstock (cultivar ‘Reinette B.’). (c) For each node, the nature of the axillary production was recorded (0: latent bud; 1: 1-year-delayed short shoot; 2: 1-year-delayed long shoot; 3: 1-year-delayed flowering shoot; 4: immediate shoot).
Both the apical growth and the branching processes induce sequential structures at the node succession scale along growth units (sequence of internode lengths or sequence of axillary productions). Hence, the node constitutes a natural index parameter from which sequences can be built. Depending on the rhythmic or continuous character of apical growth, these sequences may correspond to growth units, annual shoots or axis. It should be noted that the year of a shoot or the rank of a growth unit are also valid index parameters from which more macroscopic sequences can be built.
3. Characterization of branching/flowering sequences Branching/flowering sequences can be characterized according to the three following criteria: *
The level of description of the plant which defines the index parameter of the sequence. Our focus will be on sequences whose index parameter is the node (or metamer). In what follows, sequences corresponding
*
*
to growth units, annual shoots and leafy axes are used as examples. The number and the types of the measured variables attached to each successive botanical entity in the sequence. The dynamic properties of the sequences: stationary sequences vs. nonstationary sequences, e.g. sequences exhibiting a succession of well-differentiated transient phases.
3.1. Types of variable A measured variable qualifies either the successive botanical entities that compose a sequence (e.g. phyllotaxis, length of the internode) or the offspring entities borne by these successive parent entities (e.g. types of axillary production). These latter variables are the main focus in most of this paper. When one wants to qualify an entity or more generally a branched system borne by a given entity in a sequence, most of the relevant variables are discrete. Hence, in the context of the analysis of branching and
ARTICLE IN PRESS 304
Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
axillary flowering patterns, discrete variables, which may be either nominal or interval-scaled, outweigh continuous variables. Examples of nominal variables including binary variables are numerous: absence/ presence of an offspring shoot or an axillary flower. Discrete interval-scaled variables also occur frequently (e.g. number of axillary flowers). In this case the values are ordered and follow a linear scale. Continuous variables (which are always interval-scaled) express mainly geometrical features such as the length of the internode. At first sight, a discrete variable is either nominal or interval-scaled. When the values taken by a nominal variable can be ordered in some meaningful way, this variable can be considered as ordinal. The possible axillary productions chosen among bud (0), aborted shoot (1) and developed shoot (2), can thus be considered as an ordinal variable. A filtering process can also be applied to an interval-scaled variable. When the interval between values is meaningless and the only trustworthy information lies in the value order, an interval-scaled variable can be considered as ordinal. In the former case (nominal - ordinal), an order structure is added in some subjective way while, in the latter case (interval-scaled - ordinal), the implicit metric corresponding to the interval between values is replaced by the rank metric while the order structure is preserved. Both interval-scaled and ordinal variables are inherently quantitative in the sense that each possible value in the scale can be compared in terms of whether it corresponds to a greater or smaller magnitude than another value. This is a direct consequence of the order structure shared by these two types of variable. 3.2. Types of dynamics At first sight, stationary sequences should be distinguished from nonstationary sequences. In the case of nonstationary sequences, composition changes may be abrupt or progressive. In the latter case, we speak of trend (long-term or smoothed change). For branching/ flowering sequences, composition changes are often abrupt (for instance, a long offspring shoot zone at the top of a parent shoot following a zone with a mixture of latent buds and short offspring shoots) and the sequences may be viewed as a succession of homogeneous zones or segments where the composition properties do not change substantially within each zone, but change markedly between zones.
4. Comparison methods Comparison methods are natural complements to the exploratory tools and statistical modeling approach described in Gue! don et al. (2001). In agronomic
contexts where the understanding of the biological processes becomes less important than optimization processes, for example in breeding programs, comparison methods play a central role. A first family of methods applies directly to the original data and can be considered as model-free while a second family of methods relies on an intermediate stage of model building. The outputs of these two families of methods share the same structure: a dissimilarity matrix where each entry takes the form of a rough dissimilarity normalized by a length. This particular form of dissimilarities requires some adaptations of the usual clustering algorithms. The presentation of the various methods for comparing sequences will be illustrated using four contrasted data samples introduced below. Example 1—branching of apple tree trunk annual shoots (Gue! don et al., 2001; Costes and Gue! don, 2002). Seven apple cultivars (Malus domestica Borkh, Rosaceae) chosen for their diverse growth and fruiting habits were planted in Montpellier (south of France). The seven cultivars were: ‘Belre" ne’, ‘Elstar’, ‘Fuji’, ‘Imperial Gala’, ‘Granny Smith’, ‘Reinette blanche du Canada’ and ‘McIntosh Wijcik’. Twenty trees per cultivar, grafted on rootstock M.7, were planted in the field and cut back to one bud 1 year after transplantation. The trees were then allowed to develop without pruning. The location of the immediate offspring shoots (shoots developed without delay with respect to the parent node establishment date) was recorded after 1 year of growth while the location of 1-year-delayed offspring shoots was recorded after 2 years of growth. Among these 1-yeardelayed offspring shoots, we distinguished between short shoots (poor elongation of the internodes), long shoots (important elongation of the internodes) and flowering shoots. The first annual shoots of the trunks were described node by node where, for each node, the type of axillary production chosen among latent bud (0), 1-year-delayed short shoot (1), 1-year-delayed long shoot (2), 1-year-delayed flowering shoot (3), and immediate shoot (4) was recorded (see Fig. 1). It should be noted that since these measured annual shoots were monocyclic, they could also be qualified as growth units. The branching structure of the first annual shoot of the trunks after 2 years of growth, which is assumed to be a good predictor of the adult structure of the tree, may serve as a basis for the early detection of interesting tree structures. The measurements are illustrated by the three ‘Reinette B.’ sequences described from the top to the base in Fig. 2. These three sequences exhibit a succession of six well-differentiated zones; these zones can be obtained as a byproduct of a statistical modeling approach presented in Gue! don et al. (2001) and Costes and Gue! don (2002). Each of these six zones is characterized by a given mixture of axillary productions:
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
305
2221133030030030034444444440000000000010000100110012100111010120000000
zone 1
zone 2
zone 3
zone 4
zone 5
zone 6
22230333030330030444444444404000000000001211011022121102000000000000
zone 1
zone 2
zone 3
zone 4
zone 5
zone 6
22303003030300300444444444444000000000200002111110012201012020000000
zone 1
zone 2
zone 3
zone 4
zone 5
zone 6
0: latent bud, 1: one-year-delayed short shoot, 2: one-year-delayed long shoot, 3: one-year-delayed flowering shoot, 4: immediate shoot. Fig. 2. Apple tree, cultivar ‘Reinette B.’: optimal segmentation of three observed sequences.
(1, 2) for the first zone, (0, 3) for the second, 4 for the third, 0 for the fourth, (0, 1, 2) for the fifth and 0 for the sixth. The focus in this paper will be on the sample of sequences corresponding to the cultivar ‘Reinette B.’ and, to a lesser extent, on that corresponding to the cultivar ‘Belre" ne’. The comparison of the branching structures of the seven cultivars, based either on the pairwise alignment of all the sequences or on the building of a statistical model for each cultivar, is reported in Gue! don and Costes (1999). Example 2—branching and axillary flowering of Cecropia trunk and branches (Heuret et al., 2002). The Cecropia individuals (Cecropia obtusa Tre! cul, Cecropiaceae) studied belong to a naturally regenerated stand at the entrance to the Paracou site (5 18 N; 52 55 W) set up by CIRAD-For#et in French Guyana, some 40 km from Kourou. French Guyana is subject to seasonal variations punctuated by a 3-month dry season from midAugust to mid-November and a rainy season for the remaining 9-months. In some years, an irregular short dry season may also occur around March. Cecropias are pioneer trees that colonize cleared areas and open areas with high light levels. In many tropical plants, including Cecropia, the absence of obvious temporal morphological markers such as cataphyll scars of temperate species renders the retrospective analysis of plant development difficult. Our objective is thus to infer some knowledge relative to the temporal succession of events on the basis of a posteriori measurements of plant development. There are three lateral buds at each node (Fig. 3a). The central bud is vegetative and may potentially give rise to an offspring shoot. The buds on either side correspond to two proximal axillary buds of the central bud, and may give rise to inflorescences, which are thus structured in pairs. Consequently, sequences measured
on Cecropia axes are bivariate (Fig. 3b), the first variable qualifying central bud development (branching variable), the second the development of the two lateral buds (flowering variable). The branching variable, with possible outcomes bud (0), aborted shoot (1) and developed shoot (2), can be considered as an ordinal variable since the three possible values can be ordered in a meaningful way (corresponding to the code order). This reasoning transposes to the flowering variable with possible (ordered) outcomes: no inflorescence (0), aborted inflorescence (1) and developed inflorescence (2). The analysis is illustrated with a single individual but has been performed on 10 individuals found to exhibit similar behaviors. Example 3—branching and axillary flowering of apricot tree growth units (Gue! don et al., 2001). A sample of 48 growth units (first in the annual shoots which may be composed of several growth units) of apricot tree (Prunus armeniaca L., Rosaceae), cultivar ‘Lambertin’, grafted on rootstock ‘Manicot’ was described node by node from the base to the top. The type of axillary production—chosen among latent bud (0), 1-yeardelayed short shoot (1), 1-year-delayed long shoot (2) and immediate shoot (3)—and the number of associated flowers were recorded for each node (Fig. 4). The branching variable can be considered as ordinal, the types of axillary production being ordered by increasing vigor. The branching and flowering variables correspond to events that do not occur simultaneously in plant development and were thus measured at two different dates (beginning of the growth period for the flowering and end of the growth period for the branching). These are nevertheless assumed to be strongly related since the flowers are always borne by the offspring shoots in positions corresponding to prophylls (the first two foliar organs of an offspring
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
306
abrs
ss
dbr yl c ss
ls
dis ls 2 cm
(a)
0 1 0 0 0 0
2
0
2
0
2 1 0 0 0 0 0 0 0 branching
ib vb
0 0 1 1 1 1
7.5 cm
(b)
0 1 1 1 2 2 2 2 2 flowering
Fig. 3. (a) Morphological markers in Cecropia: yl = young leaf, ls=leaf scar, c=calyptra (stipule), ss = stipule scar, vb = vegetative bud, ib=inflorescence bud, abrs = aborted branch scar, dbr=developed branch, dis=developed inflorescence scar. (b) Axis described as a sequence of nodes where the branching and the flowering states are deduced from the observed scares.
shoot). The establishment of these growth units can be roughly described as follows: a short preformed part composed of approximately five nodes elongates rapidly at the beginning of the growth period followed by a neoformed part with a highly variable number of nodes. It may be assumed that the number of nodes of this neoformed part is roughly proportional to its elongation duration. The aim of this study was to characterize both flower distribution along the growth units and the association of flowers with axillary shoots. Example 4—branching of Arizona cypress axes. Branching sequences were observed on different axis categories of 4-year-old Arizona cypresses (Cupressus arizonica Greene, Cupressaceae). These trees were planted at the ‘Aix les Milles’ nursery in the south-east of France. Only axes with an opposite decussate leaf arrangement (phyllotaxis 2) were retained for analysis. The measured variable for each node, i.e. the number of offspring shoots could thus take only three possible values: 0, 1 and 2. The observed sequences exhibited remarkable patterns such as 101001 and 2002. Among the axes with an opposite decussate leaf arrangement, we distinguished on the basis of architectural analysis criteria (Barthe! le! my et al., 1999) which do not include branching patterns—see Fig. 5 * *
*
first-order axis (A1), second-order axis borne on the upper/median/lower third of the first-order axis (A2.up/A2.median/ A2.1ow), third-order axis borne on the 25th node counted from the base of a second-order axis, itself borne on the upper/median/lower third of the first-order axis (A3.p25.up/A3.p25.median/A3.p25.low),
*
*
*
third-order axis borne on the median part of a second-order axis, itself borne on the upper/median/ lower third of the first-order axis (A3.median.up/ A3.median.median/A3.median.low), third-order axis borne on the basal part of a secondorder axis, itself borne on the upper/median/lower third of the first-order axis (A3.base.up/A3.base.median/A3.base.low), fourth-order axis borne on the median part of a thirdorder axis, itself borne on the median part of a second-order axis, itself borne on the upper/median/ lower third of the first-order axis (A4.median. median.up / A4.median.median.median / A4.median. median.low).
It should be noted that A3.p25.up and A3.median.up correspond to very close axes in terms of position in the tree. The characteristics of these different samples of branching sequences are summarized in Table 1. The objective of this study was to compare the different axis categories on the basis of their branching patterns. These four examples include both stationary (Arizona cypress) and nonstationary sequences (apple tree, apricot tree and Cecropia) and the particular case of cyclic or seasonal behaviors is illustrated by the Cecropia example. These sequences correspond to growth units (apricot tree) or annual shoots (apple tree) in the case of the rhythmic growth of temperate species, and to axes (the growth of Arizona cypress axes is affected by a slowdown in winter while the growth of the Cecropia axes can be considered as continuous). Nominal (apple tree), ordinal (Cecropia and apricot tree) and intervalscaled variables (apricot tree, Arizona cypress) are represented and multivariate sequences (Cecropia and apricot tree) illustrate different combinations of
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
Top
Base
... ... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 branching
2 3 3 3 4 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 1 1 1 1 1 2 2 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 flowering
Fig. 4. Apricot tree: Growth unit of cultivar ‘Lambertin’ where the nature of the axillary production (0: latent bud; 1: 1-year-delayed short shoot; 2: 1-year-delayed long shoot; 3: immediate shoot) and the number of associated flowers were recorded for each successive node.
variables. The basis of sequence alignment is illustrated by the apple tree example (univariate nonstationary sequences built from a nominal variable) while both the alignment of multivariate sequences and the extremityspace free variant of sequence alignment for the study of synchronisms are illustrated by the Cecropia and the apricot tree examples. The Arizona cypress example (univariate stationary sequences built from an intervalscaled variable) illustrates the use of model comparison methods. 4.1. Sequence alignment Sequence alignment relies on dynamic programming algorithms to find the optimal alignment between two sequences and is widely used in computer science to compare symbol strings, in speech processing to
307
compare segments of speech signals and in molecular biology to analyse either nucleic acids or proteins (Sankoff and Kruskal, 1999; Waterman, 1995; Gusfield, 1997). The principle of sequence alignment is to seek the appropriate correspondence between two sequences of different lengths by optimizing over all possible correspondences that satisfy suitable conditions, such as preserving the order of the elements in the sequences. If we compare two sequences, the most obvious type of difference between them is the substitution of one element for another at the same rank in the sequence. Other types of difference include deletion and insertion of elements. These are local differences. In the most classical formulation, a dissimilarity between two sequences is defined as the minimum sum of the costs attached to the three possible elementary operations, namely deletion, insertion and substitution, required to transform one sequence into another (Gusfield, 1997; Sankoff and Kruskal, 1999). Let | denote the null element. The three possible elementary operations can be denoted in the rewriting rule form by deletion of x: x-| dðx; |Þ; insertion of y: |-y dð|; yÞ; substitution of x by y: x-y dðx; yÞ; where x and y are arbitrary elements and d is the cost attached to the corresponding elementary operation. In what follows, we assume that d is defined for all elementary operations and all possible elements including the null element |: If d satisfies the metric properties, then the dissimilarity between sequences D built from d also satisfies the metric properties (Sankoff and Kruskal, 1999). In particular, d satisfies dðx; |Þ þ dð|; yÞXdðx; yÞ for all x and y (triangle inequality). This means, in sequence alignment, that we cannot find an insertion followed by a deletion or vice versa. In the sequel, we assume that d and by extension D satisfy the metric properties and will thus speak of distances between elements or sequences. t1 Let xt1 abbreviates x0 ; y; xt1 Þ and y0Z1 be 0 ðx0 two sequences of lengths t and Z; respectively. Let xt0 and yu0 designate the initial segments (including the case where the initial segment has length 0, namely x1 or y1 ; which is simply the empty sequence). In the first stage, we work forward recursively and compute distance Dðxt0 ; yu0 Þ for successively larger t and u: The recursion of this first stage initialized by Dðx1 ; y1 Þ ¼ 0; Dðxt0 ; y1 Þ ¼ Dðxt1 0 ; y1 Þ þ dðxt ; |Þ
‘deletion’;
Dðx1 ; yu0 Þ ¼ Dðx1 ; yu1 0 Þ þ dð|; yu Þ
‘insertion’;
ð1Þ
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
308
A1
3. 5. p2 up
A3.median.up
A3.base.up
A A2.up
A4.median.median.up A3.base.median
A 3. 5. p2 ed m
A2.median
n ia
A3.median.median
0 1 0 1 0 1 0 1 0 1 0 0 0
A4.median.median.median
3.
A3.base.up
A 5. p2
A2.low
w lo
A3.median.low A4.median.median.low
(a)
(b)
Fig. 5. (a) The different axis categories measured on individuals of Arizona cypress. (b) Axis described as a sequence of nodes where the number of branches is recorded for each node.
Table 1 Arizona cypress example: characteristics of the samples corresponding to each axis category and Markov chain orders; ðm; sÞ are for the sequence lengths No. sequences
m
s
Cumul. Model length order
Al
73
60.3 13.6 4403
6
A2.up A2. median A2.1ow
38 38 28
64 9.2 2431 76.7 11 2914 75.8 10.5 2122
6 6 6
65 67 59 101 97 106 73 105 73
32.6 7.2 2116 53.1 7.2 3557 51.6 9.9 3046 37.8 8.5 3821 55.7 6.1 5399 58.7 10.3 6219 37.1 7.6 2705 38.1 7.3 3999 38.3 7.8 2798
3 6 6 3 6 6 3 3 3
13 29.4 30.7
3 3 3
A3.p25.up A3.p25.median A3.p25.low A3.median.up A3.median.median A3.median.low A3.base.up A3.base.median A3.base.low
A4.median.median.up 30 A4.median.median.median 96 A4.median.median.low 109
2.4 390 5 2825 9.1 3346
is given by Dðxt0 ; yu0 Þ
8 t1 u ‘deletion’; > < Dðx0 ; y0 Þ þ dðxt ; |Þ t u1 ¼ min Dðx0 ; y0 Þ þ dð|; yu Þ ‘insertion’; ð2Þ > : t1 u1 Dðx0 ; y0 Þ þ dðxt ; yu Þ ‘substitution’:
These distances are stored in an array of size ðt þ 1Þ ðZ þ 1Þ: A ‘‘backpointer’’, the choice of which depends on the minimizing entry above, is stored. This backpointer gives the optimal preceding cell chosen between ðt 1; uÞ; ðt; u 1Þ and ðt 1; u 1Þ for each cell ðt; uÞ: A second stage, often referred to as ‘‘backtracking’’, consists of tracing backward, i.e. from cell ðt 1; Z 1Þ to cell ð1; 1Þ along the backpointers to retrieve the optimal alignment between sequences xt1 0 and yZ1 0 : In our context, the choice of the reference sequence and the test sequence is purely arbitrary (recall that D satisfies the symmetry property): if the roles of the two sequences were exchanged, deletions would become insertions and vice versa. Hence, these two operations are globally referred to as ‘‘indel’’ (insertion–deletion).
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
In the presentation of the results, we make the distinction for substitutions between matches or identities ðxt ¼ yu Þ and mismatches ðxt ayu Þ because of the weight of variables defined on small sets of possible values (which may be nominal, ordinal or intervalscaled). When dealing with multivariate sequences built from variables of different types, an important issue is the standardization of these elementary variables (Kaufman and Rousseeuw, 1990). The objective of standardization is to avoid dependence on the variable type and, for interval-scaled variables, on the choice of measurement units by converting the original variables to unitless variables. A unified approach to standardization is described in Appendix A which applies to nominal, ordinal and interval-scaled variables. The merit of this approach is that a measure of dispersion is derived for each type of variable in an homogeneous manner; see Hastie et al. (2001) where this type of approach is outlined. Closed-form expressions leading to efficient implementations are derived in the different cases (nominal, ordinal and interval-scaled variables combined with the L1 and L2 metrics) and connections to usual dispersion measures (Gini index for nominal variables, (rank) variance) are established. In the examples of alignment of multivariate sequences, the L1 metric—which is less sensitive to outliers—is systematically used. The set of possible operations can be either reduced or extended. In the former case, only matches and indels are allowed (mismatches are not allowed). This purely structural approach to sequence alignment is particularly pertinent for discrete sequences defined on a small set of possible values since it permits to extract as a byproduct the longest common subsequence of the two sequences. In the latter case, the transpositions (i.e. the interchange of adjacent elements in the sequence with the additional constraint that each element can participate in no more than one transposition) are added to deletions, insertions and substitutions. In the rewriting rule form, a transposition can be written as xy-yx ðwith xayÞ: If t > 0; u > 0 and xt1 xt ¼ yu yu1 with xt1 ayu1 and consequently xt ayu ; the following entry is then added to the minimization in Eq. (2) u2 t u Dðxt2 0 ; y0 Þ þ dðxt1 ; yu1 Þ:
An algorithm for sequence alignment with a more general definition of transpositions is described in details in Oommen and Loke (1997). The definition of indel and transposition costs is discussed in Appendix B. For the presentation of the sequence alignments in Figs. 6, 7, 10–12, we adopted the following convention: the upper row consists of the ‘‘test’’ sequence possibly interspersed with null elements. The middle row consists of the sequence of elementary operations (‘d’ for deletion, ‘i’ for insertion, ‘s’ for substitution of one
309
element by another (mismatch), two successive ‘t’ for a transposition, matches are represented by ‘j’). The lower row consists of the ‘‘reference’’ sequence possibly interspersed with null elements. The null elements are represented by dashes (dashes in the test sequence correspond to deletions while dashes in the reference sequence correspond to insertions). The alignment of two close apple tree, cultivar ‘Reinette B.’, sequences (Gue! don and Costes, 1999) with different parameterizations is shown in Fig. 6. In the case where only matches and indels are allowed (Fig. 6a), the indels are primarily located in the alignment of the zones characterized by a mixing of latent buds (0), 1-year-delayed short shoots (1) and 1year-delayed long shoot (2) (zone 5 in Figs. 2 and 6a), and secondarily in the alignment of the zones characterized by a mixing of latent buds (0) and 1-year-delayed flowering shoots (3) (zone 2 in Figs. 2 and 6a), and in the alignment of the basal unbranched zone. If mismatches are allowed (Fig. 6b), they occur mainly in the alignment of the zones characterized by a mixing of latent buds (0), 1-year-delayed short shoots (1) and 1-year-delayed long shoot (2) (zone 5 in Figs. 2 and 6b). In this example, we favored as much as possible indel operations (a ¼ 0:51 where the coefficient a relates substitution costs to indel costs; see Appendix B). To illustrate the interest of transpositions, we chose to show an example where the transposition does not generate supplementary cost with respect to two successive matches (hence b ¼ 0 where the coefficient b relates transposition costs to substitution and indel costs; see Appendix B). The number of transpositions is therefore maximized and this fits to the assumption of sequences made of a succession of homogeneous zones, each characterized by a given mixture of types of axillary productions. In Fig. 6c, the transpositions are primarily located in the alignment of the zones characterized by a mixing of latent buds (0) and 1year-delayed flowering shoots (3) (zone 2 in Figs. 2 and 6c), and in the alignment of the zones characterized by a mixing of latent buds (0), 1-year-delayed short shoots (1) and 1-year-delayed long shoot (2) (zone 5 in Figs. 2 and 6c). In this last zone, the use of the transposition operation allow to reduce the number of mismatches with respect to the alignment without transpositions. The introduction of mismatches and transpositions refines our interpretation of the alignment of the two sequences, with in particular the assessment of mixing hypothesis within zones on the basis of transpositions. Because of the fixed cost of substitution and consequently the fixed cost of indels, the global cost is directly expressed in the number of indels and mismatches. In the second example in Fig. 7, the two sequences are far more contrasted with no flowering zone in the first (this generates many deletions) and no immediate shoot zone in the second (this generates many insertions). This
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
310
(a) 13 deletions, 13 insertions, 55 matches -2230300-3030-30030044444444444400-00000002000021-11-110012201012--02-----0000000 d|||||iid||||d||||i|ii||||||||||i|d|||||||i||||i|d||d||i|i||iii||dd||ddddd||||||| 222303--3303033003-0--4444444444-040000000-0000-1211011-0-22---121102000000000000
zone 2
zone 5
(b) 6 deletions, 6 insertions, 55 matches, 7 mismatches -22303003030-300300444444444444000000000200002111110-012---201012020000000 d|||||is||||d||||i|||||||||||s||i||||||||i|||s|||s|||ds||ddd |i|is|s||||||| 222303-3303033003- 0444444444404-00000000-00012110110221211020-0-0000000000
zone 2
zone 5
(c) 5 deletions, 5 insertions, 45 matches, 4 mismatches, 14 transpositions -223030030303003004444444444440000000002000021111100122-01012---020000000 d|||||stttts||||i|i||||||||||tti|||||||i||||tti||tttt||ds|tt|ddd|s||||||| 2223033303033003-0-444444444404-0000000 -000012-11011022121102000000000000
zone 2
zone 5
0: latent bud, 1: one-year-delayed short shoot, 2: one-year-delayed long shoot, 3: one-year-delayed flowering shoot, 4: immediate shoot. Fig. 6. Apple tree, cultivar ‘Reinette B.’: Sequence 15 (length 68) aligned on sequence 14 (length 68); The most different zones are underlined.
Fig. 7. Apple tree, cultivar ‘Reinette B.’: Sequence 3 (length 74) aligned on sequence 9 (length 68) (17 deletions, 23 insertions, 45 matches, 4 mismatches, 2 transpositions); The most different zones are underlined.
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
311
1
Probability
0.8
0.6
indel match mismatch transposition
0.4
0.2
0 0
5
10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 Edit operation rank
Fig. 8. Apple tree, cultivar ‘Reinette B.’: probabilities of the elementary operations computed from the pairwise alignment of sequences.
example can be used to illustrate the fact that the cost of aligning two ‘reversed’ sequences (i.e. sequences described from the base to the top instead of from the top to the base) is equal to the cost of aligning the two original sequences even if the succession of elementary operations is not the reverse of the original succession of elementary operations (the reason is that there is not a single but many optimal alignments which are close variants of each other). This example highlights an important difference with the building of statistical models from nonstationary sequences where the direction of description of sequences is a critical choice in the stage of model specification; see Gue! don et al. (2001). It is possible to give an overview of the pairwise alignment of a set of sequences by extracting the elementary operation sequences corresponding to each alignment. This point is illustrated in Fig. 8 where the probabilities of the different elementary operations are extracted for each successive rank in the alignments (intensity point of view; see Gue! don et al. (2001)). It should be noted that the transposition operation which affects two successive ranks is counted once for each rank. In Fig. 8, four successive zones can be roughly distinguished: *
The first zone up to rank 25 with about 50% of match corresponds to the three first branching zones (zones 0 to 2 in Fig. 2). The very beginning, with a high percentage of indel, corresponds to the short unbranched apical zone which is only present on
*
*
*
some sequences (but not present in the three example in Fig. 2). The second zone (roughly from rank 25 to rank 45) with about 80% of match corresponds to the immediate branching zone and the subsequent latent bud zone (zones 3 and 4 in Fig. 2). These two successive zones are characterized by little mixing of types of axillary productions. The third zone (roughly from rank 45 to rank 70) with about 50% of match and the highest percentage of transposition corresponds to the branching zone characterized by a mixture of latent buds, 1-yeardelayed short shoots and 1-year-delayed long shoots (zone 5 in Fig. 2). The final zone where the match percentage tends towards 100% corresponds to the basal unbranched zone (zone 6 in Fig. 2).
Hence, the more or less mixed composition of a zone is directly expressed in the elementary operation distributions as a function of the rank of the operation. Extremity-space free alignment A commonly used variant of sequence alignment called extremity-space free alignment (Gusfield, 1997) may be very useful for comparing sequences of different lengths especially when one can assume that two sequences coincide at one extremity while they do not coincide at the other. Enabling free spaces at one extremity or the other encourages one sequence to align in the interior of the other. We do not consider here the
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
312
case of free spaces at both extremities which is useful when the prefix of one sequence corresponds to the suffix of the other. Free beginning spaces: To implement free spaces at the beginning of the sequences, it is simply necessary to modify the base condition (1) while the global recurrence is unchanged Dðxt0 ; y1 Þ ¼ Dðxt1 0 ; y1 Þ
‘deletion’;
Dðx1 ; yu0 Þ
‘insertion’:
¼
Dðx1 ; yu1 0 Þ
Hence, Dðxt0 ; y1 Þ ¼ 0;
Dðx1 ; yu0 Þ ¼ 0:
Free end spaces: To implement free spaces at the end of the sequences, recursion (2) should be modified for the deletion when u ¼ Z 1; Dðxt0 ; yZ1 0 Þ 8 Z1 > Dðxt1 > 0 ; y0 Þ < ¼ min Dðxt0 ; yZ2 0 Þ þ dð|; yZ1 Þ > > : t1 Z2 Dðx0 ; y0 Þ þ dðxt ; yZ1 Þ
‘deletion’; ‘insertion’; ‘substitution’;
and for the insertion when t ¼ t 1; u Dðxt1 0 ; y0 Þ 8 t2 u ‘deletion’; > < Dðx0 ; y0 Þ þ dðxt1 ; |Þ t1 u1 ¼ min Dðx0 ; y0 Þ ‘insertion’; > : t2 u1 Dðx0 ; y0 Þ þ dðxt1 ; yu Þ ‘substitution’:
It is also possible to keep recursion (2). However, in this case, the distance corresponding to the optimal alignment is not necessarily found in cell ðt 1; Z 1Þ: Rather, the distance corresponding to the optimal alignment with free end spaces is the maximum distance over all cells in row t 1 or column Z 1 (Gusfield, 1997). Both standardization and beginning-space free alignment can be illustrated using the Cecropia example. In this example, the focus is on the synchronism of branching and axillary flowering between a parent shoot and its offspring shoots. It should be noted that flowering occurs each year just before the dry period. A first hypothesis consists of assuming that the two aligned sequences, one corresponding to the trunk (more precisely the part of the trunk located above the insertion of the selected branch; see Fig. 9) and the other to a branch, are globally synchronized, and this can be analysed using the most common alignment algorithm, i.e. without beginning-space free. Previous studies (Heuret et al., 2002) suggested that the rhythm of emission of the first internodes of branches was higher than the rhythm of emission of the internodes of the trunk at the same time (and these first internodes of branches are particularly long). Thereafter, the rhythm of emission on the branches decreases rapidly and
Fig. 9. Habit of Cecropia: The alignment of a branch on the part of the trunk located above the insertion point of the branch is presented in Fig. 10 (a and b in the diagram) and the alignment of two successive branches is presented in Fig. 11 (b and c in the diagram).
stabilizes at the base regime corresponding to the rhythm of emission on the trunk. This second hypothesis can be investigated using the beginning-space free variant of the sequence alignment. The alternative hypotheses, with or without beginning-space free, only affect the beginning of the alignment before the second flowering zone; see Fig. 10. In the case of the alignment without beginning-space free, the first flowering zone is shifted by at least five internodes between the trunk and the branch (while the subsequent flowering zones match very well). This shift corresponds to approximately 2 months of growth on the basis of an average rhythm of emission of a new metamer every 10 days. This entails many mismatches at the beginning of the alignment. In the case of the beginning-space free variant of the alignment, the three flowering zones match very well. The free beginning of the branch is composed of 12 (unbranched and nonflowered) internodes. This should be interpreted as an upper bound to the run length of long internodes emitted over a short time period at the beginning of growth since the alignment of the first
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
313
Fig. 10. Cecropia: Alignment of a branch (length 78) on the part of the trunk (length 82) located above the insertion point of the branch (without beginning-space free: 8 deletions, 4 insertions, 49 matches, 25 mismatches; with beginning-space free: 16 deletions, 0 insertion, 55 matches, 11 mismatches, 12 beginning spaces).
Fig. 11. Cecropia: Alignment of two successive branches of lengths 78 and 77 (0 deletion, 1 insertion, 65 matches, 12 mismatches).
branching zone of the trunk generates many deletions. An initial run length of say 6 seems to be a more realistic estimate but the alignment method does not allow to estimate this length precisely. The desynchronism of solely the first flowering zone highlighted by the alignment without the beginningspace free does not seem realistic to us from a biological point of view. This is notably confirmed by the alignment of the branch previously aligned on the part of the trunk located above its insertion with the branch immediately after (without beginning-space free) which shows perfect synchronism between the three flowering zones (Fig. 11). Both standardization and the end-space free variant can be illustrated using the apricot tree example. It is
interesting to apply the end-space free variant to detect the commonality of branching and flowering patterns between growth units of different lengths. The two examples shown in Fig. 12 illustrate this point. It should be noted that the two growth units aligned in Fig. 12a belong to the same tree. If the branching variable is considered as nominal instead of ordinal, the alignment differs substantially: (2 deletions, 4 insertions, 44 matches, 13 mismatches) instead of (0 deletion, 2 insertions, 40 matches, 19 mismatches) and 11 end spaces in both cases. The main difference concerns the lower number of mismatches, compensated for by a higher number of indels. The reason is that the replacement of a 1-year-delayed short shoot (1) by either a latent bud (0) or a 1-year-delayed long shoot
ARTICLE IN PRESS 314
Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
Fig. 12. Apricot tree: alignment of sequences with the end-space free variant.
(2) is far less expensive in the ordinal case than in the nominal case. Among the 19 mismatches in the alignment above, 12 require a replacement for the branching variable every one involving a 1-year-delayed short shoot. The two examples shown in Fig. 12 illustrate the type of difference encountered between sequences of different lengths within these apricot tree branching and axillary flowering sequences. These sequences differ mainly by the underlying growth duration of the parent shoot. Hence, the branching and axillary flowering structure on the common part may be very similar even if the lengths of the two sequences are very different. Growth cessation cannot be predicted on the basis of these morphological characters and is more likely to be influenced by competition between primary and secondary growth on the one hand, and fruiting on the other; see Costes et al. (2000). 4.2. Adaptation of clustering methods to unequal length sequence grouping The aim of cluster analysis is to group data into homogeneous groups or clusters (Kaufman and Rousseeuw, 1990; Gordon, 1999). Cluster analysis can be viewed as an exploratory tool of a dissimilarity
matrix which enables to reveal a structure embedded in the data. The two main categories of clustering methods are partitioning methods and hierarchical methods. In the former case, the resulting set of clusters, whose number is fixed by the user, satisfies the requirements of a partition: * *
each cluster must contain at least one object, each object must belong to exactly one cluster.
In the latter case, a hierarchically-nested set of partitions is constructed; the respective advantages of these two categories of clustering methods are thoroughly discussed in Kaufman and Rousseeuw (1990). The distance between two sequences resulting from their alignment has the form * yÞ ¼ Dðx; yÞ Dðx; Lðx; yÞ which can be interpreted as a ‘rough’ distance Dðx; yÞ between the two objects x and y normalized by a length Lðx; yÞ: For the distance between the two sequences xt1 0 and yZ1 of lengths t and Z; respectively, we have 0 Z1 Dðx; yÞ ¼ Dðxt1 0 ; y0 Þ and Lðx; yÞ ¼ t þ Z: This particular form of distance can either be ignored or taken into account in the calculations involved in the clustering algorithms. In the first case, averaging is
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
directly performed on the normalized distances * yÞ þ Dðx * 0 ; y0 Þ Dðx; 2 while in the second case, the elementary distances are combined as follows: Dðx; yÞ þ Dðx0 ; y0 Þ Lðx; yÞ * yÞ ¼ Dðx; 0 0 Lðx; yÞ þ Lðx ; y Þ Lðx; yÞ þ Lðx0 ; y0 Þ Lðx0 ; y0 Þ * 0 ; y0 Þ: þ Dðx Lðx; yÞ þ Lðx0 ; y0 Þ In this latter case, if we assume a fixed cost for an indel operation of a given element in a sequence, an indel will keep the same weight whatever its occurrence context (for instance in the alignment of two ‘short’ sequences or in the alignment of two ‘long’ sequences). Distances combined in this way correspond actually to the concatenation of alignments. Hence, we adopted this latter solution. With the proposed convention, within/ between-cluster distances are interpretable as elementary distances resulting from the alignment of two sequences in terms of elementary operations. Each cluster is represented by all its objects (and not by the most centrally-located object or more generally by a subset of the most centrally-located objects) and an object is assigned to a cluster on the basis of its average dissimilarity to the objects of a given cluster C; P Dðx; yÞ * CÞ ¼ PyAC;yax Dðx; : ð3Þ yAC;yax Lðx; yÞ By construction, this kind of partitioning method attempts to find ‘spherical’ clusters, i.e. clusters that are roughly bell-shaped. The interpretation of the clustering results relies mainly on object–cluster dissimilarities (3) or betweencluster dissimilarities P P xAA yAB;yax Dðx; yÞ * DðA; BÞ ¼ P P ; ð4Þ xAA yAB;yax Lðx; yÞ where A and B are two clusters. For instance, the average dissimilarities from objects x of cluster C to other objects of C can be ordered to give an idea of the ‘density’ of the cluster, or the withinand between-cluster dissimilarities can be computed for each cluster P x;yAC;xay Dðx; yÞ D* within ðCÞ ¼ P ; x;yAC;xay Lðx; yÞ P P Dðx; yÞ xAC * PyeC Dbetween ðCÞ ¼ P : xAC yeC Lðx; yÞ Common measures of adequacy of a partition are the diameter of a cluster, i.e. the dissimilarity between the * yÞ; and the separamost dissimilar object maxx;yAC Dðx; tion (Kaufman and Rousseeuw, 1990) or split (Gordon, 1999) of a cluster, i.e. the smallest dissimilarity between
315
an object in the cluster and an object outside the cluster * yÞ: minxAC;yeC Dðx; In the case of both partitioning methods and hierarchical methods, the adaptation of both the algorithms and diagnostic tools to the specific combinations of dissimilarities normalized by lengths is direct. In the case of hierarchical clustering, we selected the unweighted pair-group average method (Kaufman and Rousseeuw, 1990). This agglomerative method is initialized with all objects apart (all the initial clusters are thus composed of a single object) and at each step, the two closest clusters are merged on the basis of the smallest between-cluster dissimilarity (4). The strength of the clustering structure obtained by an agglomerative method can be quantified by the agglomerative coefficient defined as (Kaufman and Rousseeuw, 1990) AC ¼
n 1X lmax lðiÞ ; n i¼1 lmax
where n is the size of the data set, lðiÞ is the dissimilarity corresponding to the first merging of object i with another object/cluster and lmax is the dissimilarity corresponding to the final merging. The AC is a dimensionless quantity between 0 and 1. The larger the value of AC; the stronger the clustering structure. The application of clustering methods to a distance matrix resulting from the pairwise alignment of sequences can be illustrated using the apple tree branching sequences corresponding to the cultivars ‘Belre" ne’ and ‘Reinette B’. When clustering sequences on the basis of the distance matrix resulting from their pairwise alignment, it is important to contrast the distances as much as possible. To do this, it is important to allow every possible elementary operations, including transpositions, that in this case are fully justified on a biological basis. We also chose to introduce an explicit distance matrix between possible outcomes (substitution matrix) to express the biological assumption that immediate branching and 1-year-delayed branching should be clearly distinguished. Hence, the cost of replacing a 1-year-delayed short shoot (1) by a 1-yeardelayed long shoot (2) should be for instance lower than the cost of replacing a 1-year-delayed short shoot (1) by an immediate shoot (4). We thus introduce the following distances between possible outcomes: dðx; yÞ ¼ l;
x; y ¼ 0; 1; 2; 3; xay;
dðx; 4Þ ¼ 2l;
xa4;
where l is a constant. This type of scheme for weighting substitution is also common in molecular biology with, in particular, the use of the amino acid substitution matrices such as PAM or BLOSUM (Gusfield, 1997). The agglomerative coefficient is then 0.49 instead of 0.44 for the hierarchical clustering based on the distance
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
316
Table 3 Apple tree example: interpretation of the within-/between-cluster distances in terms of elementary operations
Belr"ene-Belr"ene Reinette B.-Reinette B. Belr"ene-Reinette B.
No. of operations
Indel (%)
Match (%)
Mismatch (%)
Trans. (%)
20550 18646 21849
17 12.1 21.5
60.5 59.4 49.9
17.2 20.1 24.8
5.3 8.4 3.8
Table 4 Apple tree example: partition into three clusters on the basis of the distance matrix resulting from the pairwise alignment of ‘Belr"ene’ and ‘Reinette B.’ sequences (global within-cluster distance: 0.186, global between-cluster distance: 0.278, ratio: 0.667)
1 2 3
Fig. 13. Apple tree, cultivars ‘Belr"ene’ and ‘Reinette B.’: dendrogram obtained when clustering sequences.
Table 2 Apple tree example: partition into two clusters on the basis of the distance matrix resulting from the pairwise alignment of ‘Belr"ene’ and ‘Reinette B.’ sequences (global within-cluster distance: 0.201, global between-cluster distance: 0.279, ratio: 0.722)
Belr"ene Reinette B.
Belr"ene
Reinette B.
Diameter
Separation
0.214 0.279
0.279 0.187
0.367 0.319
0.17 0.17
matrix computed from the purely structural alignment with only indels and matches. The dendrogram (Fig. 13) shows clearly two groups corresponding to the two cultivars ‘Belre" ne’ and ‘Reinette B.’ and a third group with two outlying sequences in the sense that sequence 7 for ‘Belre" ne’ and sequence 9 for ‘Reinette B.’ have the largest average distance to other sequences of the cultivar. For instance, sequence 9 is the only ‘Reinette B.’ sequence without an immediate branching zone. One should remark that sequences 14 and 15 of ‘Reinette B.’, used as examples in Section 4.1, are very close.
1
2
3
Between-cluster distance
Diameter
Separation
0.193 0.277 0.286
0.277 0.178 0.276
0.286 0.276 0.215
0.278 0.277 0.281
0.295 0.25 0.262
0.145 0.17 0.145
Using a partitioning algorithm with the number of clusters fixed at 2, the two clusters obtained correspond to the two cultivars (Table 2). The branching sequences can be ordered by ascending average distance to other sequences of the cultivar, i.e. from the least (most central) to the most outlying (the average distance is given for the most central and the two most outlying sequences): Belre" ne (15 sequences): 15 (0.183), 13, 10, 1, 9, 4, 8, 12, 6, 5, 14, 2, 3, 11 (0.25), 7 (0.3), Reinette B. (16 sequences): 8 (0.158), 14, 7, 15, 11, 16, 10, 13, 1, 6, 4, 12, 3, 5, 2 (0.205), 9 (0.258). This confirms that sequence 7 for ‘Belre! ne’ and sequence 9 for ‘Reinette B.’ are outlying sequences. The within-cluster and between-cluster distances are still interpretable in terms of elementary operations. For instance, the between-cluster distance is characterized by a higher rate of indel and mismatch and a lower rate of match and transposition compared to the within-cluster distances (Table 3). If the number of clusters is fixed at 3 (see Table 4), we obtain a structure very close to that of the dendrogram in Fig. 13, the main difference being the grouping of sequence 3 of Belre" ne with the two outlying sequences: cluster 1 (13 sequences-Belre! ne): 13, 10, 1, 12, 15, 8, 4, 9, 6, 5, 14, 2, 11, cluster 2 (15 sequences-Reinette B.): 8, 14, 7, 15, 11, 16, 13, 10, 4, 12, 1, 6, 3, 5, 2, cluster 3 (3 sequences): 7, 3 (Belre! ne), 9 (Reinette B.).
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
4.3. Model comparison Analysing branching and axillary flowering patterns often requires the comparison of different axis categories or different plant groups corresponding, for instance, to different cultivars in an agronomic context on the basis of sub-samples of sequences. A first approach described in the previous sections consists of computing the distance matrix resulting from the pairwise alignment of all the sequences, then of evaluating the strength of the a priori known clustering structure corresponding to the axis categories or to the cultivars. A completely different approach relies on a first step of statistical model building for each subsample of sequences. In our case, these models for analysing branching patterns are (hidden) Markovian models as discussed in Gue! don et al. (2001); see Guttorp (1995) for a general reference about statistical methods for Markovian models. The question is then how to compute a dissimilarity between pairs of (hidden) Markovian models. In what follows, we use the term Markovian model in a broad sense and assume that Markovian models include hidden Markovian models. Following Juang and Rabiner (1985), we chose to use the Kullback–Leibler (directed) divergence (Kullback, 1968; Cover and Thomas, 1991) as a dissimilarity measure between pairs of Markovian models. This * 0 ; yÞ has the form dissimilarity measure denoted by Dðy * 0 ; yÞ ¼ 1 flog PðX0t1 ¼ xt1 Dðy 0 ; y0 Þ t log PðX0t1 ¼ xt1 0 ; yÞg Dðy0 ; yÞ ; ð5Þ ¼ t where y0 and y denote the parameters of Markovian models. The observed sequence xt1 is assumed to be 0 generated by the model of parameters y0 and log PðX0t1 ¼ xt1 0 ; y0 Þ is the log-likelihood of the sequence xt1 for the model of parameters y0 : Hence 0 Dðy0 ; yÞ expresses the ‘divergence’ from the reference model y0 to the target model y on the basis of loglikelihoods of the sequence xt1 0 : The likelihood PðX0t1 ¼ xt1 ; y Þ can be interpreted as a within-model 0 0 similarity while the likelihood PðX0t1 ¼ xt1 0 ; yÞ can be interpreted as a between-model similarity (since the more the models y0 and y are alike, the larger the likelihood PðX0t1 ¼ xt1 0 ; yÞ). Hence, Dðy0 ; yÞ can be interpreted as the logarithm of the within/betweenmodel similarity measure ratio. In the case of ergodic models (i.e. irreducible and aperiodic in the case of finite-state Markovian models), for large values of * 0 ; yÞ is the large-sample average Kullback–Leibler t; Dðy (directed) divergence per observation from y0 to y (Leroux, 1992). This dissimilarity is not a metric since both the symmetry and the triangle inequality properties are not verified.
317
For further processing and in particular for cluster analysis, it may be desirable to work with a symmetrized version of dissimilarity (5) Ds ðy1 ; y2 Þ ¼ Ds ðy2 ; y1 Þ Dðy1 ; y2 Þ þ Dðy2 ; y1 Þ ¼ t1 þ t2 t1 * 1 ; y2 Þ þ t2 Dðy * 2 ; y1 Þ; Dðy ¼ t1 þ t2 t1 þ t2
ð6Þ
where t1 and t2 are the (cumulated) lengths of the sequences generated by models y1 and y2 ; respectively. If t1 ¼ t2 ; we indeed obtain Ds ðy1 ; y2 Þ ¼
* 1 ; y2 Þ þ Dðy * 2 ; y1 Þ Dðy : 2
If the pair of models being compared are both ergodic, the most immediate solution to evaluate dissimilarity (5) consists of generating a single sufficiently long sequence with model y0 : If one wants to take into account the effect of the initial state distribution (which is in general different from the stationary state distribution), it may be preferable to generate a sample of sequences of fixed length t: The issue is then how to choose the balance between the transient phase at the beginning of the sequences—corresponding to the change from the initial state distribution to the approximately stationary state distribution—to the following stationary phase. The joint log-likelihoods of the generated sequences for the different models including y0 are then computed (the normalization term is then the cumulated length of the generated sequences) and the resulting dissimilarities are the differences between the normalized joint log-likelihoods for y0 and y: This procedure is repeated for each model. The symmetrized Kullback–Leibler divergence shares the same general form with the distance between two sequences resulting from their alignment * yÞ ¼ Dðx; yÞ: Dðx; Lðx; yÞ For the symmetrized Kullback–Leibler divergence (6) where t1 and t2 are the (cumulated) lengths of the sequences generated by models y1 and y2 ; respectively, we have Dðx; yÞ ¼ Dðy1 ; y2 Þ þ Dðy2 ; y1 Þ and Lðx; yÞ ¼ t1 þ t2 : Hence, the remarks made for clustering methods in Section 4.2 also apply to this type of dissimilarity matrix. For each axis category of Arizona cypress, a highorder Markov chain was estimated from the branching sequences. The order was estimated jointly with the model parameters using model selection methods based on the Bayesian Information Criterion (the Markov chain orders are given in Table 1); for more details, see Gue! don et al. (2001). The branching sequences are roughly stationary except for the second-order axes
ARTICLE IN PRESS 318
Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325 Table 5 Arizona cypress example: distribution of the number of offspring shoots for each sample corresponding to each axis category 0
1
2
Cumul.length
Al
0.605
0.341
0.054
4403
A2.up A2.median A2.1ow
0.608 0.602 0.601
0.307 0.308 0.294
0.085 0.09 0.105
2431 2914 2122
A3.p25.up A3.p25.median A3.p25.low A3.median.up A3.median.median A3.median.low A3.base.up A3.base.median A3.base.low
0.678 0.642 0.629 0.677 0.644 0.634 0.721 0.698 0.638
0.213 0.235 0.248 0.221 0.231 0.237 0.217 0.201 0.28
0.109 0.123 0.123 0.102 0.125 0.129 0.062 0.101 0.082
2116 3557 3046 3821 5399 6219 2705 3999 2798
A4.median.median.up A4.median.median.median A4.median.median.low
0.895 0.866 0.843
0.1 0.126 0.145
0.005 0.008 0.012
390 2825 3346
Fig. 14. Arizona cypress: dendrogram obtained when clustering the high-order Markov chains corresponding to each axis category.
where a basal zone made of a succession of 200200... can be identified on some axes. The Kullback–Leibler divergence between models was estimated on the basis of 10 000 generated sequences of length 100 for each model; this length corresponds to the order of magnitude of the length of the observed sequences from which the models were estimated. The results obtained on the basis of 1000 generated sequences of length 1000 were very similar showing that the nonstationarity of some of the measured sequences used for the estimation of the branching pattern models had only a small effect, the main differences concerning the second-order axes. The agglomerative coefficient AC ¼ 0:78 indicates a strong cluster structure, particularly the clear separation between first-/second-order axes, third- and fourthorder axis (Fig. 14). It should be noted that the axis categories were a priori defined on the basis of criteria (mainly branching order, direction of growth, sexuality, self-pruning) which did not include branching patterns. The branching patterns show marked differences between these categories and hence contribute to an a posteriori validation of the definition of these categories. First- and second-order axis are characterized by the repeated occurrence of the patterns 101001 and 2002. On third-order axis, the pattern 101001 splits into two independent patterns, 101 and 1001 (which do not alternate systematically). Finally, fourth-order axis have only a few axillary shoots (globally, the branching rate as given by the ratio of the number of developed shoots to the number of axillary meristems decreases while the
branching order increases; see Table 5). These branching patterns have a geometrical counterpart which helps to understand the strategy of space occupancy of the plant: Offspring shoots borne by first- and second-order axis are distributed over four main directions and mimic a spiral arrangement while offspring shoots borne by third-order axis tend towards bilateral symmetry (Barthe! le! my et al., 1999). The examination of the marginal distributions (see Table 5) show smaller differences between some third-order axis categories (for instance A3.base.low) and second-order axis with respect to pairs of third-order axis categories (A3.x.low compared to A3.x.up). This confirms that global indicators (corresponding to zero-order Markov chains estimated for each axis category) not reflecting the local arrangement of the offspring shoots, do not enable the axis categories to be accurately clustered.
5. Discussion On the basis of the four examples discussed in the preceding section, three main types of application can be distinguished: (i) Alignment of two sequences: Both the Cecropia and apricot tree examples illustrate the usefulness of extremity-space free alignments in the study of synchronicity between axes growing simultaneously, and this on the basis of a posteriori measurements. The alignment of two sequences is actually a very valuable exploratory tool with the
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
specificity of applying to individual sequences while most of the exploratory tools described in Gue! don et al. (2001) apply to samples of sequences. (ii) Statistical analysis of a set of alignments: This type of application is illustrated by the analysis of the mixing properties within the branching zones of the apple tree where some statistical information is extracted from the elementary operation sequences corresponding to all the pairwise alignments. Another approach would consist of computing a multiple alignment on the basis of the set of sequences corresponding to a given cultivar. Multiple sequence alignment is a well-known generalization of pairwise sequence alignment. The most commonly used approach to multiple alignment is iterative (Gusfield, 1997) or progressive alignment (Durbin et al., 1998). This works by constructing a succession of pairwise alignments (where two objects aligned may be simple sequences or alignments already computed). The iterative alignment methods are heuristic but the quality of the results depends on a ‘guide tree’ which determines the order of the alignments. A guide tree is a binary tree whose leaves represent sequences and whose interior vertices represent alignments. A very elementary guide tree may be obtained by ordering the sequences by ascending average distance to the other sequences, i.e. from the most central to the most outlying. Another more refined approach consists of using the results of a hierarchical clustering method (such as those shown in Fig. 13) as a guide tree. Multiple alignment is illustrated by the example in Fig. 15a where the alignment of 5 relatively close (according to the dendrogram in Fig. 13) apple tree, cultivar ‘Reinette B’, sequences is shown (including sequences 14 and 15 previously used as example in
319
Section 4.1). The statistical analysis of multiple alignment of sequences consists mainly of extracting the probabilities of the different possible values (in the case of the apple tree example, axillary productions chosen among latent bud, 1-yeardelayed short shoot, 1-year-delayed long shoot, 1year-delayed flowering shoot and immediate shoot augmented with the null element corresponding to indels) for each successive rank. This information is summarized in the so-called consensus sequence which enables to display the commonalities within a set of sequences in a columnwise manner (Gusfield, 1997; Durbin et al., 1998); see Fig. 15b. Multiple alignment may also be particularly useful as a complement to statistical models such as hidden semi-Markov chains for analysing successions of homogeneous zones in sequences. For instance, multiple alignment may be applied to small samples of sequences which cannot be used as a basis for statistical model building, or to subsamples of sequences identified by the statistical modeling approach. (iii) Revealing a grouping among branching sequences or evaluating the strength of a predefined grouping: This type of application is illustrated by the apple tree example for nonstationary sequences in the context of cultivar comparison and the Arizona cypress example for stationary sequences in the context of architectural analysis. Architectural diagnosis relies on the identification of a small set of categories of botanical entities and the grouping of homologous entities on the basis of various criteria, including branching. In this context, the branching is usually qualified as continuous (offspring shoots on each node), rhythmic (clumps of offspring shoots evenly spaced) or diffuse (all the other branching structures). On the basis of this rough criterion, the
Fig. 15. Apple tree, cultivar ‘Reinette B.’: alignment of sequences 10, 11, 12, 14 and 15. Aligned columns are annotated if all values are identical or highly conserved (+ corresponds to four identical values).
ARTICLE IN PRESS 320
Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
branching of all the Arizona cypress axes would be qualified as diffuse and could not be used as a discriminant information. Hence, the proposed approach for comparing branching patterns contributes to an improvement in the architectural analysis of the Arizona cypress. Analysing and comparing branching structure is also of major importance for the study of fruit tree and forest tree species in agronomic contexts. In the former case, flowering location is directly affected by the branching structure and pruning strategies should be devised in accordance with the branching structure. In the latter case, the branching structure has a direct effect on wood quality. More generally, the comparison of branching structures can be used to compare various conditions or treatments related to stand density, thinning or pruning, practices, fertilization, etc. Studying the homogeneity/heterogeneity of a sample of sequences which is useful in itself, is also a prerequisite for statistical modeling such as discussed in Gue! don et al., (2001): The apple tree example illustrates the usefulness of sequence alignment for the identification of outliers. This breaks down into two steps: *
*
On the basis of the distance matrix resulting from the pairwise alignment of the branching sequences of a given cultivar, these sequences are ordered by ascending average distance to other sequences of the sample, i.e. from the most central to the most outlying. The alignments with the outlying sequences can then be examined in details for purposes of biological interpretation.
Clustering sequences is not restricted to the approaches presented in this paper. Another possibility is notably illustrated in Godin et al. (1999) where a hidden semi-Markov chain built from the branching structure of apple tree trunk annual shoots was used as a device for discrimination within a hybrid family. Hence, the objective of the statistical modeling was to identify a discrimination rule to divide the initial sample into two sub-samples corresponding to each parent. It would be also interesting to investigate the type of dissimilarity matrices presented in this paper by multidimensional scaling methods (Cox and Cox, 2001). The basic principle of these methods is to produce a two-dimensional ‘map’ with the aim of spotting clusters and/or outliers. The application scope of sequence alignment on the one hand, and the computation of dissimilarities between Markovian models on the other are clearly separated with little overlap. Sequence alignment is mainly useful for nonstationary sequences, particularly
those exhibiting successions of homogeneous zones. The computation of dissimilarities between Markovian models is only fully justified for models built from stationary sequences. This latter approach can be adapted in particular contexts to models built from nonstationary sequences. This point is illustrated in Gue! don and Costes (1999) where hidden semi-Markov chains built from branching sequences of the first annual shoots of the trunk (including the ‘Belre" ne’ and ‘Reinette B.’ sequences used as examples in this paper) were compared for seven apple tree cultivars. For such ‘left–right’ models, i.e. models composed of a succession of transient states and a final absorbing state, samples of sequences were generated for each model to estimate Kullback–Leibler divergences between models. The relative homogeneity of the sequence lengths (see Gue! don et al. 2001) between the cultivars in this case justified the approach selected. As illustrated by the apple tree and the Cecropia examples, sequence alignment methods match similar homogeneous zones while preserving the order of the elements in the sequences. Similar zones on the two sequences being aligned are synchronized using indel operations while both the indel, substitution and transposition operations may be used to handle the local differences in composition between matched zones. This behavior of sequence alignment is not adapted to stationary sequences made of a succession of short patterns as illustrated by the Arizona cypress example and may lead to irrelevant interpretations and conclusions. The stationarity property means that the ordering of patterns along sequences is not a distinguishing feature of a given sequence. Hence, preserving the order of the elements in the sequences, and therefore the order of the patterns may induce unjustified extra costs (corresponding to the local alignment of different patterns such as 101 aligned on 2002 for instance). The elementary operations do not enable to account properly for the differences between patterns which would require to work on windows of length equal to the length of the patterns. A potential solution would consist of introducing generalized substitutions where a block of m consecutive elements may be replaced by a block of n consecutive elements; see Sankoff and Kruskal (1999). This generalization induces an increase in the complexity of the sequence alignment algorithm but the main problem, in this case, is the definition of appropriate substitution costs for the different possible patterns. About sequence alignment Sequence alignment requires the user to specify the allowed elementary operations, the types of variables or the substitution costs in the case of nominal variables. The biological significance of the resulting alignment may be greatly affected by the choice of parameter
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
321
Fig. 16. Sequence measured on the trunk of a 75-year-old Corsican pine (66 last years from 1927 to 1992).
settings. This problem is also well known in computational molecular biology; see for instance Gusfield (1997). Comparatively, the computation of dissimilarities between Markovian models is far less dependent on user choices or preferences. The indel costs may be defined with reference to maximum substitution costs either for each element or globally. The former approach, which is adaptative, has the advantage of minimizing the indel costs and consequently of maximizing the occurrence of indels (which are essential for a structural interpretation of the alignment). The main drawback of this approach is the arbitrariness which consists in penalizing more strongly outlying elements in comparison with more ‘central’ elements. The latter approach with a fixed indel cost is highly sensitive to outlying elements (which may define a very high indel cost). In the examples, we applied the adaptative approach. It should be noted that the approach with a fixed indel cost only affects the alignments of apricot tree growth units while the other alignments are unchanged. For sequences built from variables with a high number of possible values (in practice interval-scaled variables), the alignment of two sequences, with mainly mismatches and indels compensating for the difference in length between the two sequences, is not very informative. This point can be illustrated by more macroscopic sequences where the index parameter is not the node rank. An example of such sequences is provided by 6 trunks of 75 year-old Corsican pine (Pinus nigra Arn. ssp. laricio Poir., Pinaceae) planted in the state forest of Orle! ans (center of France). The morphological features were not observable for the nine first years and these years were therefore not considered. The trunks were described, annual shoot by annual shoot, from the base to the top and two interval-scaled variables were recorded for each annual shoot: the length (in cm) and the number of branches per tier. Hence, information from both the parent entity (length of the annual shoot) and the offspring entities (number of branches per tier) were combined in these measurements. An example of such a sequence is shown in
Fig. 16 which is characterized by a rapid increase for the two variables then a slow decrease with inter-annual fluctuations. The pairwise alignments of the six sequences (all of length 66 corresponding to the growth and branching of Corsican pine trunks from 1927 to 1992) is made almost uniquely of mismatches (98.7%) if a ¼ 0:51 where the coefficient a relates substitution costs to indel costs; see Appendix B. One further problem is the fact that indels may not be justified on a biological basis if one considers that aligning a shoot of year t on a shoot of year t þ r with ra0 is irrelevant. In this example, if a > 0:71; there are no indels and the proportion of elementary operations is 99.8% for mismatch and 0.02% for match. While a detailed examination of the alignment of two sequences is not really informative, the distance matrix resulting from the pairwise alignment of such sequences stays indeed informative. Hierarchical clustering enables to identify two main groups which correspond to two plots with different plantation densities (200 and 290 stems/ha) and an outlying sequence (Fig. 17). A detailed examination of the data showed that this outlying sequence corresponded to a highly branched tree with a total of 473 branches over the 66 years measured instead of 310, 346, 323, 326 and 294 branches for the five other trees. Classical methods for analysing this type of sequences belong to the field of time series analysis and rely on a decomposition principle. The different sources of variations which are mainly, in our context, the trend (i.e. a ‘long-term change in the mean level’) and the local fluctuations (residuals from the trend) are separated and then analysed individually. We checked that very similar dendrograms (see Fig. 17) are obtained from the trends (obtained by passing the data through a symmetric smoothing filter) and from the rough sequences. Most of the methods for comparing sequences in dendrochronology (the variables in this case represent tree ring widths) apply to the rapidly varying component (local fluctuations) which is generally assumed to be strongly affected by climatic conditions. For instance, Schweingruber (1988) quantified the concordance of the first-order differenced sequences within a set of sequences
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
322
*
1
*
2
ordinal (ordered with the rank metric), interval-scaled (ordered with the implicit metric corresponding to the interval between two values).
3 4 6 5
200 stems/ha
290 stems/ha
Fig. 17. 75-year-old Corsican pine: dendrogram obtained when clustering sequences.
A standardization procedure requires constituting the empirical marginal distribution of the variable of interest for the sequences being aligned (to built a marginal distribution, a ‘reference’ sample is constituted from all the elements of the sequences for the variable of interest). The basic idea is to compute a measure of dispersion or ‘spread’ for each elementary variable. For the L1 metric, this measure of dispersion for the e-th elementary variable denoted by absde is given by absde ¼
while Monserud (1986) examined the association between residual sequences using cross-correlation functions. The type of method, presented in Section 4.1, for aligning sequences can be transposed to the more general framework of aligning rooted tree graphs which may represent either branched systems or entire plants. Ferraro and Godin (2000, 2003a, b) developed core algorithms for this task. Their pioneering work suggests many research directions such as the study of the effect of the local distance between elements on the overall distance between rooted tree graphs with reference to a rough distance resulting from a purely structural alignment, or the development of extremity-space free alignments for rooted tree graphs. The remarks made in Section 4.2 for the clustering of sequences transpose directly to the clustering of rooted tree graphs which are assumed to be of unequal size. The comparison methods presented in this paper are fully implemented in the AMAP-mod software (Godin et al., 1997, 1999, http://www.cirad.fr/presentation/ programmes/amap/logiciels/amap mod).
where dðxi;e ; xj;e Þ is the distance between xi;e and xj;e : For the L2 metric, this measure of dispersion denoted by sde is given by sde ¼
The authors thank Avner Bar-Hen, Yves Caraglio and Pascal Ferraro for their helpful comments and a referee for helpful suggestions. Appendix A. Standardization The aim of standardization is to render the local distances independent of the type of the elementary variables (nominal, ordinal or interval-scaled) and, for interval-scaled variables, also independent of the choice of the measurement units. The types of variables considered are: *
nominal (unordered) with the implicit 0/1 metric or with an explicit metric,
n n X X 1 dðxi;e ; xj;e Þ2 : nðn 1Þ i¼1 j¼1
The L1 metric approach is less sensitive to outliers than the L2 metric approach since the basic quantities are not squared in the computations. These measures of dispersion are intuitively appealing since they are proportional to the sum of dissimilarities between all the pairs of elements in the reference sample. In the case of discrete (categorical) interval-scaled variables, the sample dispersions are estimates of X X jy zjpy pz y
z
L1 case ðmean absolute differenceÞ; X X ð y zÞ2 py pz y
Acknowledgements
n n X X 1 jdðxi;e ; xj;e Þj; nðn 1Þ i¼1 j¼1
z
L2 case ðmean square differenceÞ; where fpy g is the probability mass function of the variable of interest. A.1. Nominal variables In the case of a nominal variable with the 0/1 metric, absde yields n n X X 1 Iðxi;e axj;e Þ nðn 1Þ i¼1 j¼1 X 1 ¼ fy ðn fy Þ nðn 1Þ y ( ) X fy 2 n 1 ¼ ; ðn 1Þ n y
absde ¼
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
where I( ) is the indicator function and fy is the frequency of outcome y in the reference sample of size n: Indeed, we have sde ¼ absde : In this case, absde gives the percentage of mismatches (or disagreements) between elements. This measure of dispersion is known as the Gini index or impurity.
n X 2 iði 1Þ nðn 1Þ i¼2 2 ! nþ1 2 ¼ nðn 1Þ 3
¼
¼ A.2. Interval-scaled variables In the case of an interval-scaled variable with the L2 metric, sde is the sample mean square difference n n X X 1 ðxi;e xj;e Þ2 nðn 1Þ i¼1 j¼1 8 !2 9 < X = n n X 2 n : ¼ x2i;e xi;e ; nðn 1Þ : i¼1 i¼1
sde ¼
Hence, sde is twice the sample variance. A.3. Ordinal variables In the case of an ordinal variable, observed values are replaced by ranked values and the measures of dispersion are computed on the ranked values. Let us denote by yi;e the ranked value corresponding to xi;e : In the case of the L2 metric, sde yields n n X X 1 ð yi;e yj;e Þ2 nðn 1Þ i¼1 j¼1 8 !2 9 < = n n X X 2 n ¼ y2i;e yi;e ; nðn 1Þ : i¼1 i¼1 2 2 n ðn þ 1Þð2n þ 1Þ n2 ðn þ 1Þ2 ¼ nðn 1Þ 6 4 nðn þ 1Þ ¼ 6
sde ¼
if no ties occurs. The dispersion measure sde is indeed twice the sample rank variance. In the case of ties, a correction factor should be included in the computation of sde (Siegel and Castellan, 1988) ( ) X 1 2 2 sde ¼ nðn 1Þ fz ð fz 1Þ ; ðA:1Þ 6ðn 1Þ z where fz is the number of tied ranks in the z-th grouping. In the case of the L1 metric, if we assume that the yi;e are arranged in their natural order, absde yields absde ¼
n i1 X X 2 ð yi;e yj;e Þ nðn 1Þ i¼2 j¼1
323
nþ1 3
if no ties occurs. In the case of ties, a correction factor should be included in the computation of absde ; ! ( !) X fz þ 1 nþ1 2 absde ¼ nðn 1Þ 3 3 z ( ) X 1 2 2 nðn 1Þ ¼ fz ð fz 1Þ : ðA:2Þ 3nðn 1Þ z One should note than the measures of dispersion for the L2 metric (A.1) and the L1 metric (A.2) are related by sde n ¼ : absde 2 A.4. Interval-scaled or ordinal variables In the case of a variable with a total order structure (either interval-scaled or ordinal) and no ties, if we assume that the xi;e are arranged in their natural order, absde yields (Kendall and Stuart, 1977) absde ¼
n1 X 2 iðn iÞðxiþ1;e xi;e Þ: nðn 1Þ i¼1
In the particular case of an ordinal variable, we thus obtain n1 X 2 iðn iÞ nðn 1Þ i¼1 ! nþ1 2 ¼ nðn 1Þ 3
absde ¼
¼
nþ1 : 3
In the case of a discrete (categorical) variable with a total order structure (either interval-scaled or ordinal), we obtain (the assumption of no ties is relaxed) absde ¼
r1 X 2 Fx ðn Fxp Þðxpþ1 xp Þ; nðn 1Þ p¼1 p
where xp is the p-th possible value (respectively the p-th possible rank), i.e. fxp > 0; and Fxp is the cumulated frequency of possible values (respectively ranks) up to xp : Note that the possible values (respectively ranks) xp are not evenly spaced.
ARTICLE IN PRESS 324
Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325
A.5. Mixed variables and explicit weighting
References
Some positive weights we ; that are chosen in a subjective way on the basis of application knowledge, can be introduced. We thus obtain for the overall standardized distance between vectors xi and xj ; P jdðxi;e ; xj;e Þj dðxi ; xj Þ ¼ e we L1 case; absde sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P dðxi;e ; xj;e Þ2 L2 case dðxi ; xj Þ ¼ e we sde P with e we ¼ 1:
Barth!el!emy, D., 1991. Levels of organization and repetition phenomena in seed plants. Acta Biotheoret. 39, 309–323. Barth!el!emy, D., Grosfeld, J., Bouroulet-Hallard, F., Ducatillon, C., 1999. In: Tessier du Cros, E. (Ed.), Cypress: A Practical Handbook. Studio Leonardo, Florence, pp. 26–33. Caraglio, Y., Barth!el!emy, D., 1997. Revue critique des termes relatifs a" la croissance et a" la ramification des tiges des v!eg!etaux vasculaires. In: Bouchon, J., de Reffye, P., Barth!el!emy, D. (Eds.), Mod!elisation et Simulation de l’Architecture des V!eg!etaux, Science Update. INRA e! ditions, Paris, pp. 11–87. Costes, E., Gu!edon, Y., 2002. Modelling branching patterns on 1-yearold trunks of six apple cultivars. Ann. Bot. 89, 513–524 doi:10.1093/aob/mcf078. Costes, E., Fournier, D., Salles, J., 2000. Changes in primary and secondary growth as influenced by crop load effects in ‘Fantasmes ’ apricot trees. J. Hort. Sci. Biotechnol. 75, 510–519. Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley, New York. Cox, T.F., Cox, M.A.A., 2001. Multidimensional Scaling, 2nd Edition. Chapman & Hall, London. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.J., 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge. Ferraro, P., Godin, C., 2000. A distance measure between plant architectures. Ann. For. Sci. 57, 445–461. Ferraro, P., Godin, C., 2003a. An edit distance between quotiented trees. Algorithmica 36, 1–39. Ferraro, P., Godin, C., 2003b. Optimal mappings with minimum number of connected components in tree-to-tree comparison problems. J. Algorithms, in press. Godin, C., Caraglio, Y., 1998. A multiscale model of plant topological structures. J. theor. Biol. 191, 1–46. Godin, C., Gu!edon, Y., Costes, E., 1999. Exploration of a plant architecture database with the AMAPmod software illustrated on an apple tree hybrid family. Agronomie 19, 163–184. Godin, C., Gu!edon, Y., Costes, E., Caraglio, Y., 1997. Measuring and analysing plants with the AMAPmod software. In: Michalewicz, M.T. (Ed.),, Plants to Ecosystems—Advances in Computational Life Sciences, Vol. I. CSIRO Publishing, Collingwood, Victoria, pp. 53–84. Gordon, A.D., 1999. Classification, 2nd Edition. Chapman & Hall, London. Gu!edon, Y., Costes, E., 1999. A statistical approach for analyzing sequences in fruit tree architecture. In: Wagenmakers, P.S., van der Werf, W., Blaise, Ph. (Eds.), Proceedings of the Fifth International Symposium on Computer Modelling in Fruit Research and Orchard Management, Wageningen, The Netherlands, Acta Horticulturae 499, 281–288. Gu!edon, Y., Barth!el!emy, D., Caraglio, Y., Costes, E., 2001. Pattern analysis in branching and axillary flowering sequences. J. theor. Biol. 212, 481–520 doi: 10.1006/jtbi.2001.2392. Gusfield, D., 1997. Algorithms on Strings, Trees, and Sequences— Computer Science and Computational Biology. Cambridge University Press, Cambridge. Guttorp, P., 1995. Stochastic Modeling of Scientific Data. Chapman & Hall, London. Hall!e, F., Martin, R., 1968. Etude de la croissance rythmique chez l’h!ev!ea Hevea brasiliensis Mull.-Arg. Euphorbiac!ees-crotonoid!ees. Adansonia 8, 475–503. Hall!e, F., Oldeman, R.A.A., 1970. Essai sur l’Architecture et la Dynamique de Croissance des V!eg!etaux. Masson, Paris. Hall!e, F., Oldeman, R.A.A., Tomlinson, P.B., 1978. Tropical Trees and Forests: An Architectural Analysis. Springer, Berlin.
Appendix B. Elementary operations and local distance B.1. Indels Let | denote the null element and assume that d is defined for all elementary operations and all possible elements including the null element |: In order to preserve the metric properties of the local distance d in the global distance D between sequences, d should satisfy dðx; |Þ þ dð|; yÞXdðx; yÞ for all x and y (triangle inequality). Hence, the indel cost for x should be related to the maximal substitution cost for x; ! X maxye jdðxe ; ye Þj dðx; |Þ ¼ dð|; xÞ ¼ a we L1 case; absde e sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X maxye dðxe ; ye Þ2 dðx; |Þ ¼ dð|; xÞ ¼ a L2 case w e e sde with a > 0:5: The above indel costs penalize outlying values more strongly than ‘central’ values. B.2. Transpositions The cost for interchanging x and y ðxayÞ should satisfy 0pdðxy; yxÞodðx; yÞ þ dð y; xÞ; 0pdðxy; yxÞodðx; |Þ þ dð y; yÞ þ dð|; xÞ; 0pdðxy; yxÞodð|; yÞ þ dðx; xÞ þ dð y; |Þ: Hence 0pdðxy; yxÞo2 minðdðx; yÞ; dðx; |Þ; dð y; |ÞÞ: The transposition cost is related to the indel and substitution costs in the following manner: dðxy; yxÞ ¼ b minðdðx; yÞ; dðx; |Þ; dð y; |ÞÞ with 0pbo2:
ARTICLE IN PRESS Y. Gu!edon et al. / Journal of Theoretical Biology 225 (2003) 301–325 Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York. Heuret, P., Barth!el!emy, D., Gu!edon, Y., Coulmier, X., Tancre, J., 2002. Synchronization of growth, branching, and flowering processes in the South American tropical tree Cecropia obtusa (Cecropiaceae). Amer. J. Bot. 89 (7), 1180–1187. Juang, B.-H., Rabiner, L.R., 1985. A probabilistic distance measure for hidden Markov models. AT&T Tech. J. 64 (2), 391–408. Kaufman, L., Rousseeuw, P.J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York. Kendall, M.G., Stuart, A., 1977. The Advanced Theory of Statistics, Vol. I, Distribution Theory, 4th Edition. Charles Griffin, London. Kullback, S., 1968. Information Theory and Statistics. Dover, New York. Leroux, B.G., 1992. Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl. 40, 127–143.
325
Lyndon, R.F., 1998. The Shoot Apical Meristem: Its Growth and Development. Cambridge University Press, Cambridge. Monserud, R.A., 1986. Time-series analyses of tree-ring chronologies. Forest Sci. 32 (2), 349–372. Oommen, B.J., Loke, R.K.S., 1997. Pattern recognition of strings with substitutions, insertions, deletions and generalized transpositions. Pattern Recognition 30 (5), 789–800. Sankoff, D., Kruskal, J.B. (Eds.), 1999. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. CSLI Publications, Stanford. Schweingruber, F.H., 1988. Basics and Applications of Dendrochronology. Kluwer Academic Publishers, Dordrecht, Holland. Siegel, S., Castellan, N.J., 1988. Nonparametric Statistics for the Behavioral Sciences, 2nd Edition. McGraw-Hill, New York. Waterman, M.S., 1995. Introduction to Computational Biology— Maps, Sequences and Genomes. Chapman & Hall, London. White, J., 1979. The plant as a metapopulation. Ann. Rev. Ecol. System 10, 109–145.