J. theor. Biol. (2001) 212, 481}520 doi:10.1006/jtbi.2001.2392, available online at http://www.idealibrary.com on
Pattern Analysis in Branching and Axillary Flowering Sequences Y. GUED DON*-, D. BARTHED LED MY*, Y. CARAGLIO*
AND
E. COSTES?
* ;niteH Mixte de Recherche, CIRAD/CNRS/INRA/;niversiteH Montpellier II, Botanique et Bioinformatique de l1Architecture des Plantes, ¹A 40/PS2, 34398 Montpellier Cedex 5, France and ? ;niteH Mixte de Recherche INRA/AgroM/CIRAD/IRD BDPPC Equipe Architecture et Fonctionnement des Espe` ces Fruitie` res, 2, Place
In the architectural approach to the study of plants, a major issue is to analyse branching and axillary #owering patterns. Due to the structured expression of the branching process and the noisy character of the observed patterns, we propose an analysis framework which is both structural and probabilistic. Data take the form of sequences which naturally represent the underlying structural information of branching and axillary #owering patterns and allow the application of a large number of methods ranging from exploratory analysis to stochastic modeling. The primary aim of the proposed analysis methods is to reveal patterns not directly apparent in the data, and thus to deepen our biological understanding of the underlying mechanisms that control the branching and the axillary #owering of plants over time and space. The proposed approach is illustrated using a set of examples corresponding to di!erent plant species and di!erent biological or agronomic objectives. 2001 Academic Press
1. Introduction The main outcome of the architectural analysis initially proposed by HalleH & Oldeman (1970) (see also HalleH et al., 1978) is to provide the biologist with the main concepts and tools to tackle the study of plant architecture development, whatever the development stage or the species considered. A "rst step in quantifying plant architecture is to use the architectural analysis concepts for the sampling of morphologically equivalent entities within plants; see for instance de Re!ye et al. (1991). One shortcoming with this approach is that the structural information, corresponding roughly to the spatial arrangement of botanical entities within the architecture, is only
- Author to whom correspondence should be addressed. E-mail:
[email protected] 0022}5193/01/200481#40 $35.00/0
used in the sampling stage, not in the analysis stage which relies on unstructured samples representing for instance the number of internodes of growth units in selected architectural positions. This is a major drawback particularly when analysing branching and axillary #owering patterns which is the focus of this paper. It is well known that branching generates remarkable patterns (for instance tiers of vigorous branches located below a growth stop or well-de"ned axillary #owering zones). Branching expression is naturally represented by sequences whose index parameter is the node rank, the attached variables qualifying either the parent entity (phyllotaxis, length of the internode) or the o!spring entities (type of axillary production such as immediate vs. 1-year-delayed o!spring shoot). Most of the measured variables are discrete (nominal, ordinal or interval-scaled) and correspond 2001 Academic Press
482
Y. GUED DON E¹ A¸.
to morphological information related to the branching expression. A large panoply of statistical methods are available for sequences, ranging from exploratory analysis to model building (a detailed account of the bibliography is given in Section 4). A large proportion of the models and methods presented in this paper for analysing sequences built from discrete variables are also used for analysing DNA sequences (see Baldi & Brunak, 1998 and Durbin et al., 1998) even if, in this latter case, data consist of a single or a few long sequences instead of a set of short sequences. In particular, our application domain shares with DNA sequence analysis two classes of generic problems, namely the analysis of successions of homogeneous zones and the analysis of repeated patterns which will be discussed on selected examples in this paper. We assume that branching/#owering sequences, observed a posteriori at the macroscopic scale, are the result of many biological phenomena operating at di!erent scales, at di!erent dates (corresponding for instance to the organogenesis and the elongation of the parent shoot and the o!spring shoots), with complex interactions and with the possible occurrence of artifacts. It is therefore impossible to build realistic mechanistic models on the basis of these observed sequences and our objective is rather to provide tools and models to highlight in the branching and axillary #owering of leafy stems of vascular plants remarkable structures or regularities not directly apparent either in the "eld or in the data. The remainder of this paper is organized as follows. The main biological concepts of plant architecture and morphology are introduced in Section 2. In Section 3, a typology of sequences extracted from plant architectures is proposed which relies both on the type of dynamics exhibited by the sequences and on the types of variables measured for each elementary botanical entity. Statistical methods of both an exploratory and inferential nature are introduced in selected examples of biological interest in Section 4. The overall analysis and the biological interpretation of the results are reviewed on an enlarged set of examples in Section 5. Section 6 is devoted to a discussion of various issues concerning both experimental design, sampling and relations to plant functioning.
2. Biological Framework The architectural development of a plant is the result of many elementary morphological events distributed over time and space. Whatever the size, shape and complexity of the resulting architecture, only a few categories of homologous botanical entites can be identi"ed in the caulinar shoot system of an adult plant (BartheH leH my, 1991). These types of botanical entities [Fig. 1(a)] can be ordered from the most elementary which is the node*or more precisely the set +node, internode, leaf(ves) and axillary bud(s), referred to as metamer; see White (1979)*to more macroscopic entities such as the axis (i.e. succession of metamers built up by a single meristem). Plant development is the result of two basic mechanisms: the apical growth process and the branching process. Each of these two basic mechanisms corresponds to a particular meristematic activity. A meristem is de"ned as a set of embryonic cells located in the apical part of each axis; for a recent review, see Lyndon (1998). These particular cells are always able to divide and thus to generate new cells which can then di!erentiate into specialized cells and constitute various tissues (pith, epiderm, vascular bundles, etc.) or organs (leaf, #ower, etc.). The apical meristem activity also generates small lateral sets of embryonic cells, the axillary meristems, which may give rise to axillary or o!spring shoots. In the case of the apical growth process, a succession of metamers is built up by a single apical meristem. This process is a!ected by environmental conditions which induce some rhythmicity in the growth (for example annual periodicity for temperate species). This temporal organization translates into a spatial organization which may be identi"ed by morphological markers *e.g. scale leaves and short internodes corresponding to slowdowns or stops of growth; see Caraglio & BartheH leH my (1997). In this way, di!erent types of macroscopic botanical entities may be de"ned such as growth units (portion of leafy axis, i.e. succession of metamers built up between two resting phases; see HalleH & Martin, 1968) or annual shoots (BartheH leH my, 1991; Godin & Caraglio, 1998). An annual shoot corresponds to the portion of a leafy axis built up over a single year for temperate species and may be composed of one or several successive growth units.
BRANCHING/FLOWERING SEQUENCES
483
FIG. 1. Apple tree. (a) Growth unit organization results from the succession of metamers. (b) Trunk grafted on M7 rootstock (cultivar &&Reinette B.''). (c) For each node, the nature of the axillary production was recorded (0: latent bud; 1: 1-year-delayed short shoot; 2: 1-year-delayed long shoot; 3: 1-year-delayed #owering shoot; 4: immediate shoot).
In the case of the branching process, an o!spring entity is built up by an axillary meristem. This o!spring entity, which may be either a #ower, resulting from the transformation of this meristem, or a shoot, will be globally referred to as an axillary production. The entities borne on the successive nodes of a given &&parent'' entity may exhibit a remarkable structure. For example, many temperate species exhibit gradients on
growth units, for instance vigor gradients associated with acrotony i.e. a decrease in the dimensions (length, diameter) of the o!spring entities from the top to the base of the parent growth unit. The establishment of these remarkable branching/#owering structures is controlled by biochemical signals (see Raven et al., 1999, Chapter 28), sent from the apical meristem to the sub-apical axillary meristems and whose
484
Y. GUED DON E¹ A¸.
intensity decreases as a function of the distance from the apex (Champagnat, 1961; CrabbeH , 1987). Both the apical growth and the branching processes induce sequential structures at the node succession scale along growth units (sequence of internode lengths or sequence of axillary productions). Hence, the node constitutes a natural index parameter from which sequences can be built. It should be noted that the year of a shoot or the rank of a growth unit are also valid index parameters from which more macroscopic sequences can be built. 3. Characterization of Branching/Flowering Sequences Branching/#owering sequences can be characterized according to the following three criteria. E
E
E
The level of description of the plant which de"nes the index parameter of the sequence. Our focus will be on sequences whose index parameter is the node (or metamer). The number and the types of the measured variables attached to each successive botanical entity in the sequence. The dynamic properties of the sequences: stationary sequences vs. non-stationary sequences e.g. sequences exhibiting a succession of welldi!erentiated transient phases or exhibiting a trend i.e. a &&long-term or smoothed change''. 3.1. TYPES OF VARIABLE
A measured variable quali"es either the successive botanical entities which compose a sequence (phyllotaxis, length of the internode) or the o!spring entities borne by these successive parent entities (types of axillary production). The latter variables will be the main focus for most of this paper. When one wants to qualify an entity or more generally a branched system borne by a given entity in a sequence, most of the relevant variables are discrete. Hence, in the context of the analysis of branching and axillary #owering patterns, discrete variables which may be either nominal or interval-scaled outweigh continuous variables. Examples of nominal variables including binary variables are numerous: absence/
presence of an o!spring shoot or an axillary #ower. Discrete interval-scaled variables also occur frequently: number of axillary #owers for instance. In this case the values are ordered and follow a linear scale. Continuous variables (which are always interval-scaled) express mainly geometrical features such as the length of the internode. At "rst sight, a discrete variable is either nominal or interval-scaled. When the values taken by a nominal variable can be ordered in some meaningful way, this variable can be considered as ordinal. The possible axillary productions chosen among latent bud (0), 1-year-delayed short shoot (1), 1-year-delayed long shoot (2) and immediate shoot (3) (shoots developed without delay with respect to the parent node establishment date) can be considered as an ordinal variable, the code order expressing a gradient of vigor. A "ltering process can also be applied to an interval-scaled variable. When the interval between values is meaningless and the only trustworthy information lies in the value order, an interval-scaled variable can be considered as ordinal. In the former case (nominal P ordinal), an order structure is added in some subjective way while, in the latter case (interval-scaled P ordinal), the implicit metric corresponding to the interval between values is replaced by the rank metric while the order structure is preserved. Both interval-scaled and ordinal variables are inherently quantitative in the sense that each possible value in the scale can be compared in terms of whether it corresponds to a greater or smaller magnitude than another value. This is a direct consequence of the order structure shared by these two types of variable. 3.2. TYPES OF DYNAMICS
At "rst sight, stationary sequences should be distinguished from non-stationary sequences. In the case of non-stationary sequences, composition changes may be abrupt or progressive. In the latter case, we speak about trend (long-term or smoothed change). For branching/#owering sequences, composition changes are often abrupt (for instance, a long o!spring shoot zone at the top of a parent shoot following a zone with a mixture of latent buds and short o!spring shoots) and the sequences may be viewed as a succession
BRANCHING/FLOWERING SEQUENCES
485
FIG. 2. Tassili cypress branches. (a) Two opposite axes were described at nodes 20, 40, 60 and 80 from trunk apex. (b) For each node of the selected branch, the number (0, 1 or 2) of o!spring shoots was recorded.
of homogeneous zones or segments where the composition properties do not change substantially within each zone, but change markedly between zones. In some cases, a succession of transient phases is followed by a stationary phase (see the apricot tree example in Section 5.4). 4. Sequence Analysis Methods From the characterization of sequences presented above, it is possible to propose various statistical analysis methods which apply to speci"c categories of sequences. In this section, we will mainly distinguish between descriptive methods (Section 4.1) and inferential methods (Section 4.2) that gives a summary of the information contained in a sample of sequences. In a classical analysis process, the former are employed in a "rst step known as exploratory analysis, while the use of the latter requires to make some assumptions on the basis of the results of an exploratory analysis. Depending on the "nal objective, inferential methods can be used in an exploratory analysis while some descriptive methods involve a preliminary modeling step. This highlights the blurred borderline between descriptive and inferential methods. In this section, the objective is by no way to be exhaustive
but rather to present key methods in each category. The presentation of the various methods for analysing sequences will be illustrated using two data samples which are introduced below. Example 12branching of ¹assili cypress secondorder axes. The experimental sample is composed of 276 second-order axes measured at selected positions on the trunk of 35 3-year-old Tassili cypresses (Cupressus dupreziana A. Camus, Cupressaceae). These trees were planted at the INRA Ruscas nursery in the south-east of France. Two opposite axes, located at 20, 40, 60 and 80 nodes from the apex of the trunk, were described node by node from the base to the top and the number of o!spring shoots was recorded (Fig. 2). Only the basal branched part with an opposite decussate leaf arrangement (phyllotaxis 2) was retained for analysis and the apical unbranched section was removed. O!spring axis elongation occurs after a short delay with respect to the establishment date of the parent node. Hence, when a growing axis is observed at a given data, the apical section is always unbranched. The measured variable for each node i.e. the number of o!spring shoots can take only three possible values: 0, 1 and 2. We have previously veri"ed that the four sub-samples corresponding to the four selected positions on the trunk were
486
Y. GUED DON E¹ A¸.
FIG. 3. Apple tree, cultivar &&Reinette B.'': optimal segmentation of three observed sequences. (0: Latent bud; 1: 1-yeardelayed short shoot; 2: 1-year-delayed long shoot; 3: 1-year-delayed #owering shoot; 4: immediate shoot).
homogeneous and hence could be merged. The observed sequences exhibit remarkable patterns in the succession of the number of o!spring shoots such as 101, 2001 and 2002 (Fig. 2). Example 22branching of apple tree trunk annual shoots (GueH don & Costes, 1999). Seven apple cultivars (Malus domestica Borkh, Rosaceae) chosen for their diverse growth and fruiting habits were planted in Montpellier (south of France). The seven cultivars were: &&Belre`ne'', &&Elstar'', &&Fuji'', &&Imperial Gala'', &&Granny Smith'', &&Reinette blanche du Canada'' and &&McIntosh Wijcik''. Twenty trees per cultivar, grafted on rootstock M.7, were planted in the "eld and cut back to one bud 1 year after transplantation. The trees were then allowed to develop without pruning. The location of the immediate o!spring shoots was recorded after 1 year of growth while the location of 1-yeardelayed o!spring shoots was recorded after 2 years of growth. Among these 1-year-delayed o!spring shoots, we distinguished between short shoots (poor elongation of the internodes), long shoots (signi"cant elongation of the internodes) and #owering shoots. In these measurements, we quali"ed both the immediate branching which follows the establishment growth from the base to the top and the 1-year-delayed branching organized from the top of the parent shoot. After an exploratory analysis of the sample of sequences, we chose to focus on the 1-year-delayed branching structure and describe the "rst annual shoots of the trunks node by node from the top to the base where, for each node, the type of axillary production chosen among latent bud (0), 1-yeardelayed short shoot (1), 1-year-delayed long shoot (2), 1-year-delayed #owering shoot (3), and
immediate shoot (4) was recorded (see Fig. 1). It should be noted that solely the immediate shoots described in the opposite direction, i.e. from the base to the top, were analysed distinguishing immediate short shoots from immediate long shoots in a previous study (Costes & GueH don, 1997). The measurements are illustrated by the three &&Reinette B.'' sequences in Fig. 3. These three sequences exhibit a succession of six welldi!erentiated zones (these zones can be obtained as a byproduct of a statistical modeling approach presented in Section 4.2). Each of these six zones is characterized by a given mixture of axillary productions: (1, 2) for the "rst zone, (0, 3) for the second, 4 for the third, 0 for the fourth, (0, 1, 2) for the "fth and 0 for the sixth. 4.1. EXPLORATORY ANALYSIS
Exploratory tools ranging from simple sequence visualization to the extraction of various characteristics are essential to provide a "rst comprehensive overview of sequence structure. This is reinforced by the intrinsic complexity of the sequence object compared to more simple unstructured objects such as vectors. 4.1.1. Families of Characteristic Distributions for Discrete Sequences To present the exploratory analysis of a sample of sequences built from a discrete (either nominal, ordinal or interval-scaled) variable (discrete sequences for short), we will assume that a discrete sequence s , s , 2 , s , 2 is generated by a dis R crete-time discrete-state-space stochastic process i.e. a sequence of discrete random variables S , S , 2 , S , 2 whose index parameter t is also R
487
BRANCHING/FLOWERING SEQUENCES
discrete. The index parameter t is usually called &&time'' even if it does not have the meaning of physical time. The possible values taken by the discrete random variables S , S , 2 , S , 2 R will be called the states of the process. For instance S "j means that the process S is in state R j at time t. We proposed to transpose the three points of view used for the speci"cation of point processes (Cox & Isham, 1980), i.e. the intensity, interval and counting points of view, to de"ne characteristics of discrete sequences (GueH don et al., 1999; GueH don & Costes, 1999). Note that in the point process context, &&intensity'' refers to conditional distributions, while in our context, the intensity characteristics are unconditional distributions. &&Intensity'' refers to the random state occupied at a "xed time while &&interval'' refers to the random time taken to reach a "xed state or to the random time spent in a "xed state. Finally, &&counting'' refers to the random number of occurrences of a "xed pattern in a sequence of "xed length. These characteristics take the form of families of discrete distributions, one distribution per time step for the intensity point of view and one distribution per state for the interval and counting points of view. The most obvious characteristic distributions are the unconditional distributions of being in state j at successive times t (intensity point of view): (P (S "j); j"0, 1, 2 , J!1), R
h (t)"P (S "j, S Oj, v"1, 2 , t), H R R\T
E
sojourn time in state j (or run length of state j): d (u)"P (S Oj, S "j, H R>S> R>S\T v"0, 2 , u!2"S "j, S Oj), R> R
u"1, 2, 2 .
(4)
For the initial state (sojourn time starting out at time 0), we adopt the following convention: d (u)"P (S Oj, S "j, v"1, 2 , u!1"S "j), H S S\T u"1, 2, 2
(5)
which means that the process enters a &&new'' state at time 0. First passage times and recurrence times are counted in number of transitions (tPt#1, 2 , t#u!1Pt#u for recurrence times) while sojourn times are counted in number of time steps (t#1, 2 , t#u). For each state j, we also de"ne the following two types of counting measure and the associated distributions. E Number of runs (or clumps) of state j per sequence of length q: P (N (q!1)"n)"P H
(2)
recurrence time in state j:
f (u)"P (S "j, S Oj, HH R>S R>S\T v"1, 2 , u!1"S "j), u"1, 2, 2 , R (3)
O\ I (s "j, s Oj) R R\ R
#I (s "j) ,
(1)
where J is the number of states. For each state j, we de"ne the following three types of interval and the associated distributions: E time up to the "rst occurrence of state j (or "rst passage time in state j):
t"0, 1, 2 ,
E
n"0, 2 ,
q#1 , 2
(6)
where I( ) denotes the indicator function. In this de"nition, the starts of runs are counted. Hence, both the complete time intervals, such as de"ned in (4), and the "nal right-censored time intervals are counted. E Number of occurrences of state j per sequence of length q: P (N (q!1)"n)"P H n"0, 2 , q.
O\ I (s "j)"n , R R (7)
In the practical case of a sample of sequences of di!erent lengths, the counting distributions
488
Y. GUED DON E¹ A¸.
FIG. 4. Apple tree, cultivar &&Reinette B.'': extraction of interval and counting data from a sequence. (a) 1-year-delayed #owering shoot (3): 5 runs of length 1, 3, 1, 2, 1; 8 occurrences. (b) Immediate shoot (4): 2 runs of length 10, 1; 11 occurrences. (i) time up to the "rst occurrence; (ii) recurrence time; (iii) sojourn time.
become "nite mixtures where the mixing weights are the probabilities of each possible sequence length
P (N "n)" P (N (q!1)"n"B"q) P (B"q). H H O The empirical equivalents of the intensity [eqn (1)], interval [eqns (2)}(4)] and counting [eqns (6) and (7)] characteristic distributions can be extracted from a sample of discrete sequences. This constitutes a set of exploratory tools (see an example of extraction of interval and counting data in Fig. 4). The families of characteristic distributions play di!erent roles in the exploratory analysis. The probabilities of the states as a function of the index parameter (intensity point of view) give an overview of process &&dynamics''. This overview is complemented for the initial transient phases by the distributions of the time up to the "rst occurrence of a state. The local dependencies are expressed both in the recurrence time distributions, the sojourn time distributions and the distributions of the number of runs of a state per sequence. These three types of characteristic distributions can help highlight the scattered or aggregate distribution of a given state along sequences or some kind of pseudoperiodicity in the successive occurrences of a given state along sequences. The exploratory tools presented above can be illustrated using the Tassili cypress branching sequences where the observed variable is the
number of o!spring shoots per node. The intensity characteristics extracted from the observed sequences (Fig. 5) show oscillations of period 2 around roughly constant values that die out progressively. A deeper understanding of these local oscillations is given by the empirical recurrence time distributions which show contrasted behaviors for the three possible observations (0, 1 or 2 o!spring shoots). In particular, the return to state &&1 shoot'' requires most of the time two transitions (i.e. two internodes) [Fig. 6(a)] while the return to state &&2 shoots'' requires most of the time three transitions [Fig. 6(b)]. It can easily be veri"ed on the measured sequences that in both cases the intermediate observations are nearly always 0. The apple tree sequences (cultivar &&Reinette B.'') show a very contrasted behavior where the probabilities of the di!erent axillary productions as a function of the node rank exhibit a succession of transient phases (Fig. 7). Subsequent to the very beginning of the sequences, which is less immediate to interpret, "ve zones can be identi"ed: E
E E E
centered on rank 11, a zone characterized by a mixture of latent buds and 1-year-delayed #owering shoots, centered on rank 26, an immediate shoot zone, centered on rank 38, a latent bud zone, centered on rank 53, a zone characterized by a mixture of latent buds, 1-year-delayed short shoots and 1-year-delayed long shoots, and
489
BRANCHING/FLOWERING SEQUENCES
FIG. 5. Tassili cypress branches: intensity point of view. ( 2 shoots.
) Observed no-shoot; (
) observed 1 shoot; (
) observed
FIG. 6. Tassili cypress branches: interval point of view; recurrence times: (a) 1 shoot; (b) 2 shoots.
E
a "nal latent bud zone corresponding to the basal part of the main shoot (recall that the sequences are described from the top to the base). 4.1.2. Exploration of ¸ocal Dependencies in Stationary Sequences by Sample Correlation Functions
An important tool to explore stationary sequences built from interval-scaled variables is
provided by a series of quantities called sample autocorrelation coe$cients which measure the correlation between observations at di!erent distances apart. Given q observations in a sequence, x , 2 , x , we can form q!k pairs of observa O\ tions (x , x ), (x , x ), 2 , (x , x ) re I I> O\I\ O\ garding the "rst observation in each pair as one variable and the second observation as a second variable denoted by X and X , respectively. I The autocorrelation coe$cient at lag k for a single sequence of length q is thus given by
490
(
Y. GUED DON E¹ A¸.
FIG. 7. Apple tree, cultivar &&Reinette B.'': intensity point of view: ( ) Observed latent bud; ( ) observed long shoot; ( ) observed #owering shoot; ( ) observed immediate shoot.
(Chat"eld, 1996; Brockwell & Davis, 1996) O\I\ (x !xN ) (x !xN ) R R>I I R , r(k)" ( O\I\ (x !xN ) O\I\ (x !xN ) R R>I I R R k"0, 1, 2 ,
(8)
where xN " O\I\ x /(q!k) is the mean of R R observations the "rst q!k and xN " I O\I\ x /(q!k) is the mean of the last q!k R>I R observations. The rather complicated expression (8) is usually approximated by [see Brockwell & Davis (1996) for mathematical justi"cations of this approximation] O\I\ (x !xN ) (x !xN )/(q!k) R R>I , r(k)" R O\ (x !xN )/q R R k"0, 1, 2 ,
(9)
where x " O\ x /q is the overall mean. GenerR alization to aR sample of sequences is direct. Some authors also drop the factor q/(q!k) in eqn (9) which is close to one for q large compared to k. We do not adopt this convention since we want
) observed short shoot;
to apply this kind of exploratory tool to samples or relatively short sequences. It should be noted that the sample autocorrelation function is an even function of the lag in that r(k)"r(!k). It is obviously desirable to indicate the uncertainty in an autocorrelation coe$cient when examining a sample autocorrelation function. The usual rule of thumb is that under the null condition of no correlation (purely random sequence), the autocorrelation coe$cient at lag k has standard error which is roughly $1.96/(q!k (Diggle et al., 1994). If the marginal distribution of the variable of interest is too far from normal, the sample Pearson (or linear) autocorrelation function (8) cannot be used, especially if one wants to test the null hypothesis that there is no serial correlation in a stationary sequence (Kendall & Stuart, 1979). It is then possible to build a sample autocorrelation function on the basis of either Spearman or Kendall rank correlation coe$cient (Kendall et al., 1983; Siegel & Castellan, 1988). This may also be useful to explore stationary sequences built from an ordinal variable in the case where an order structure is added to a nominal variable (see Section 3.1).
BRANCHING/FLOWERING SEQUENCES
491
The rank correlation coe$cient due to Spearman is the ordinary correlation coe$cient between the ranked values. Hence, while the Pearson correlation coe$cient looks for a linear relation between two variables, the Spearman rank correlation coe$cient looks for a monotone relation. The sample Spearman rank autocorrelation coe$cient at lag k is given by O\I\ (y !yN ) (y !yN ) R R>I I R , r (k)" Q ( O\I\ (y !yN ) O\I\ (y !yN ) R R>I I R R k"0, 1, 2 , where q!k#1 yN "yN " I 2 and y is the ranked value at time t corresponding R to x . The ranks should be recomputed for each R successive autocorrelation coe$cient r (k) both Q for the variables X and X . The ranked values I y on the one hand and the ranked values y on R R>I the other are computed from two distinct rank sets corresponding to the "rst q!k observations and the last q!k observations, respectively. The usual simpli"cations in the calculation of the sample Pearson autocorrelation function (9) can be transposed to the sample Spearman rank autocorrelation function. The ranked values are computed once only for the overall sequence and the rank means yN and yN are replaced by the I overall rank mean yN : O\I\ (y !yN ) (y !yN )/(q!k) R R>I , r (k)" R Q O\ (y !yN )/q R R k"0, 1, 2 ,
(10)
where q#1 . yN " 2 Details about e$cient arrangements in the computations of Spearman rank correlation coe$cients in the case of tied observations can be found in Siegel & Castellan (1988). Since we want to apply this kind of exploratory tool to samples of relatively short sequences, we adopt the same
FIG. 8. Tassili cypress branches: sample Spearman rank ) Observed sequences; ( ) autocorrelation functions: ( randomness 95% con"dence limit.
convention as in eqn (9) i.e. the factor q/(q!k) in eqn (10) is not dropped. To compute con"dence limits, the usual rule of thumb is that under the null condition of no correlation (purely random sequences), the Spearman rank autocorrelation coe$cient at lag k has standard error which is roughly $1.96/(q!k. Another commonly used measure of association for ordinal variables is the Kendall rank correlation coe$cient (Kendall et al., 1983; Agresti, 1984; Siegel & Castellan, 1988) which is based on the numbers of concordant and discordant pairs of observations in a sample. This can also serve as a basis for constructing sample rank autocorrelation functions. The behavior of sample rank autocorrelation functions can be illustrated using the Tassili cypress branching sequences (Fig. 2) where the sample Spearman rank autocorrelation function (the behavior of the sample Kendall rank autocorrelation function would be strictly equivalent) shows a strong alternating e!ect (see Fig. 8) which is also expressed in the empirical probabilities of the possible events, chosen between 0, 1 and 2 o!spring shoots, as a function of the node rank (see Fig. 5). The di!erent behaviors for &&1 shoot'' and &&2 shoots'' highlighted by the
492
Y. GUED DON E¹ A¸.
of statistical models can be organized according to the type of dynamics (either stationary or nonstationary at "rst sight). E
FIG. 9. Tassili cypress branches: sample autocorrelation functions for binary sequences corresponding to 1 shoot or 2 shoots. ( ) 1 shoot; ( ) 2 shoots; ( ) randomness 95% con"dence limit.
corresponding empirical recurrence time distributions extracted from the original sequences (Fig. 6) can also be investigated by means of sample autocorrelation functions computed from binary sequences (Fig. 9*for &&1 shoot'', 0 and 2 are aggregated in a single state while for &&2 shoots'', 0 and 1 are aggregated in a single state). Note that for binary sequences, the sample autocorrelation function (in the Pearson sense) and the sample Spearman rank autocorrelation function are identical. The autocorrelation function for the &&1 shoot'' binary sequences clearly shows an alternating e!ect corresponding to the pattern 101 while the autocorrelation function for the &&2 shoots'' binary sequences shows a pseudo-periodicity of length 3 corresponding to the pattern 2002. 4.2. MODEL BUILDING
Model building consists basically in "nding the &&best approximating model'' from a data sample respecting a principle of parsimony. The primary goal of model building is to extract information which is hidden in the experimental data (because of noise, and combinatorial complexity related to multivariate observations). The di!erent families
E
Stationary discrete sequences: the main objective in the analysis of such sequences is to identify repeated patterns, that is short subsequences occurring very frequently. The basic models are Markov chains of "xed order (Guttorp, 1995) which can be extended in various ways, for instance by allowing a variable order (Weinberger et al., 1995; Ron et al., 1996; BuK hlmann & Wyner, 1999), or which can serve as a basis for more general models such as hidden Markov chains; see Poritz (1988) or Rabiner (1989) for tutorial introductions, see also the monograph by MacDonald & Zucchini (1997). Non-stationary discrete sequences: the analysis of this type of sequence requires the introduction of less classical models such as non-homogeneous Markov chains and especially hidden semi-Markov chains (GueH don, 1998; GueH don et al., 1999; GueH don & Costes, 1999). Hidden semi-Markov chains are particularly suitable for analysing sequences that take the form of successions of homogeneous zones.
This section focuses on families of models for analysing discrete sequences (either stationary or non-stationary, univariate or multivariate) which both are of primary importance for the analysis of branching and axillary #owering patterns and are partly non-classical in the "eld of statistical modeling. These models are generalizations of simple "rst-order time-homogeneous Markov chains (higher-order Markov chains, variable order Markov chains) or integrate Markov chains as basic components (semi-Markov chains, hidden Markov chains, hidden semi-Markov chains). Therefore, these models will be globally referred to as (hidden) Markovian models (see Appendix A for formal de"nitions). In the family of (hidden) Markovian models, (hidden) Markov chains, which are mainly useful for the analysis of stationary sequences, should be distinguished from (hidden) semi-Markov chains, which are mainly useful for the analysis of non-stationary sequences. The implicit geometric distributions of sojourn times (see Appendix A
BRANCHING/FLOWERING SEQUENCES
for a justi"cation) make "rst-order Markov chains too constraining for the modeling of non-stationary sequences (or the geometric tail from sojourn time values greater than or equal to the order in the case of higher-order Markov chains). By contrast, semi-Markov chains do not explicitly represent local dependencies and this is a major drawback for the modeling of repeated patterns in stationary sequences. Hence, (hidden) Markov chains and (hidden) semi-Markov chains should be considered as complementary tools when analysing sequences. The core of the proposed data analysis methodology consists in iterating an elementary loop of model building until a satisfactory result is obtained. This elementary loop decomposes into the following three stages. (i) Model speci,cation. For Markovian models, this stage reduces to the choice of the family of candidate models (either high-order Markov chains with a maximum "xed order or semiMarkov chains). For hidden Markovian models, this stage consists mainly in dimensioning the embedded Markov chain and in making hypotheses on its structural properties on the basis of the characteristic distributions extracted from a sample of sequences (especially intensity characteristics). These hypotheses may be globally stated in some canonical cases (irreductible Markov chains or &&left}right'' Markov chains i.e. Markov chains composed of a succession of transient states and a "nal absorbing state). In the general case, structural constraints are expressed by prohibiting transitions i.e. by setting the corresponding probabilities to zero. Using the same principle, constraints can also be expressed on the initial probabilities and the observation probabilities. (ii) Model inference. An estimate of the parameters of Markov chains h can be obtained directly by maximizing the likelihood of the sequence f (sO\; h) (Billingsley, 1961; Guttorp, 1995); sO\ is a concise notation for s , 2 , s . Let f denote the observed O\ GH frequency of pairs of states in which state i is followed by state j. The maximum likelihood estimate of p is GH f pL " GH with f " f . G GH GH f G H
(11)
493
Hence, the estimates of (p ; i, j"0, 2 , J!1) GH are directly obtained by simple counting along sequences. The maximum likelihood estimation of the parameters of hidden Markovian models requires iterative optimization techniques which are applications of the Expectation}Maximization (EM) algorithm (Baum et al., 1970; Dempster et al., 1977; McLachlan & Krishnan, 1997). In the case of hidden Markovian models, it is often useful to have a knowledge of the state sequence sJ O\ that best &&explains'' the observed sequences xO\. sJ O\"arg max f (sO\, xO\; h). s ,2, s O\ This most likely state sequence can be obtained by a dynamic programming method usually referred to as the Viterbi algorithm; see Rabiner (1989) and MacDonald & Zucchini (1997) for hidden Markov chains and GueH don & Cocozza-Thivent (1990) for hidden semi-Markov chains. The most likely state sequences are particularly relevant for analysing transient phases where most of the likelihood of the observed sequences is captured in the corresponding most likely state sequences. The inference scope can be enlarged by including in the parameters to be estimated some structural parameters (for instance, the order of an irreductible Markov chain or the number of states of an underlying irreductible Markov chain in the case of a hidden Markov chain). This more general inference paradigm is referred to as model selection; see Burnham & Anderson (1998) for a recent review with particular emphasis on information-theoretic criteria. (iii) Model validation. The accuracy of the estimated model is mainly evaluated by the "t of characteristic distributions computed from model parameters to the corresponding empirical characteristic distributions extracted from the observed sequences; see GueH don (1998, 1999) for hidden semi-Markov chains. Recall that the families of characteristic distributions play complementary roles in the validation stage (see Section 4.1 where these roles are discussed in the context of the exploratory analysis).
494
Y. GUED DON E¹ A¸.
The analysis of stationary discrete sequences can be illustrated using the example of the Tassili cypress branching process. This type of stationary sequences frequently exhibits a short-term memory property. If we consider the empirical distribution of the next observation, given the preceding subsequence of some given length, then there exists a length ¸ (the memory length) such that the conditional distribution does not change substantially if we condition it on preceding subsequences of length greater than ¸. The usual approach to study local dependencies (or short-term memory) in stationary sequences consists in jointly estimating the parameters and the order of a model. This requires de"ning a family of competing models and "xing a maximum order. In our case, we chose to start our investigations with simple Markov chains with a maximum order "xed at 5. These three-state Markov chains are irreductible and aperiodic, hence ergodic. The competing models (zero-order Markov chain i.e. simple distribution, "rst-order Markov chain to "fth-order Markov chain) form a nested family. A possible criterion for model selection is the Bayesian Information Criterion (BIC) (Schwarz, 1978; Katz, 1981). For each possible order r, the following quantity is computed BIC(r)"2 log ¸ !JP (J!1) log n, P
(12)
where ¸ is the likelihood of the observed seP quences for the r-th-order estimated Markov chain, n is the cumulative length of the observed sequences and J is the number of states of the Markov chain. Hence JP (J!1) is the number of free parameters of a saturated J-state r-th-order Markov chain. A Markov chain is said to be saturated if all its transition probabilities are strictly positive. Relation (12) can be interpreted as the di!erence between the log-likelihood of the observed sequences evaluated by substituting the maximum likelihood estimates of the parameters and a penalty term which is a function of both the size of the sample and the number of free parameters of the estimated model. The order cannot be estimated using the classical maximum likelihood approach because increasing the order r automatically increases the likelihood. Hence a penalty term, decreasing
TABLE 1 ¹assili cypress example: estimation of the order of a three-state Markov chain
Order 0 1 2 3 4 5
No. of free parameters 2 5 13 31 56 98
2 log ¸ P
BIC (r)
!12175.6 !9885.1 !9164 !8842.8 !8642.6 !8466.8
!12193.3 !9929.5 !9279.4 !9118.1 !9139.9 !9337.1
TABLE 2 ¹assili cypress example: estimation of the order of a two-state Markov chain
Order 0 1 2 3 4 5
No. of free parameters 1 2 4 8 16 32
2 log ¸ P
BIC (r)
!9474.1 !8260 !7679 !7673.2 !7593.3 !7548.9
!9483 !8277.7 !7714.6 !7744.2 !7735.4 !7833.1
in r is added to the maximum likelihood and the resulting quantity is maximized with respect to r. In the case of the Tassili cypress example, the BIC favors the third-order model (Table 1). The rules of thumb of Je!reys (1961, Appendix B) suggest that a di!erence of BIC of at least 2 log 100"9.2 is needed to deem the model with the higher BIC substantially better. This result must be related to the empirical recurrence time distributions extracted from the observed sequences (Fig. 6). The patterns most frequently exhibited are 101, 2001 and 2002. Representing the former requires a second-order model while representing the two latter requires a thirdorder model. This can be further investigated by neutralizing the event &&2 o!spring shoots'' i.e. by aggregating 0 and 2 in a single state. On the resulting binary sequences, the BIC favors the second-order model (Table 2) as expected. This suggests that the memory length is not "xed but depends on the context (i.e. the current subsequence of maximum length 3). This can also be
BRANCHING/FLOWERING SEQUENCES
FIG. 10. Tassili cypress branches: variable order Markov chain. (a) Memory tree: the root is labeled e for empty. The dotted vertices correspond to the memories which are further decomposed while the vertices with no sons represent the memories e!ectively taken into account. (b) Transition graph: each vertex represents a possible memory and each possible transition is labeled with the corresponding output. The dotted arcs correspond to the less probable transitions.
deduced from Table 1 where the number of free parameters is 31 for the estimated third-order Markov chain instead of 54 for a saturated threestate third-order Markov chain. A tentative variable order Markov chain with 13 free parameters is depicted in Fig. 10: Fig. 10(a) represents the memory tree. Note that each interior vertex possesses exactly three sons (corresponding to the three possible outputs in the sequences). The memories +000, 100, 200, 10, 20, 1, 2, are chosen in such a manner to generate
495
the remarkable patterns 2001, 2002 and 101. The graph of possible transitions is represented in Fig. 10(b). The less probable transitions (p(0.05) represented by dotted arcs in Fig. 10(b) help to highlight the remarkable patterns generated by the model. It should be noted that a transition is possible from every state to state 1 and state 2 [except the self-transition in state 2 since there cannot be two consecutive pairs of o!spring shoots; see the sample recurrence time distribution for 2 shoots in Fig. 6(b)]. The occurrence of these transitions corresponds to &®enerative'' times where the memory length goes back to 1 (for instance if the current memory is 200 and a 2 is emitted). Another interesting situation is where the memory length increases by one at each time step (for instance when the current memory is 2 and two 0 are successively emitted which corresponds to a path from state 2 to state 200 through state 20). For the model depicted in Fig. 10, we obtain a BIC of 9117.9 (2 log ¸"9002.4) which is very close to the BIC for the estimated third-order Markov chain (see Table 1). The BIC can be further improved by developing the memory 1 at the second order (i.e. the vertex 1 becomes an interior vertex with three sons corresponding to the memories +01, 11, 21,). In this latter case, the BIC is 9081.5 with 16 free parameters (2 log ¸"8939.4) These results should not be considered as de"nitive since the selection of a variable order Markov chain remains an open problem (Weinberger et al., 1995; Ron et al., 1996; BuK hlmann & Wyner, 1999). When we examine the di!erent "ts of characteristic distributions for the estimated third-order Markov chain, we see on the "t of the intensity characteristics that the amplitudes of the oscillations are not well modeled (Fig. 11). This may be due to a kind of noise-corrupting process. To test this assumption, we estimated a third-order hidden (or noisy) Markov chain. The iterative maximum likelihood estimation procedure was initialized for the underlying process by a &&smoothed version'' of the previously estimated third-order Markov chain where the smallest initial probabilities and transition probabilities were bounded from below at 0.1. The observation distributions were uniformly initialized with a small dissymmetry in favor of the output corresponding to the state. Finally, we obtain the
496
Y. GUED DON E¹ A¸.
FIG. 11. Tassili cypress branches: intensity point of view for an estimated third-order Markov chain: ( ) Observed no-shoot; ( ) theoretical no-shoot; ( ) observed 1 shoot; ( ) theoretical 1 shoot; ( ) observed 2 shoots; ( ) theoretical 2 shoots.
following estimated observation distributions for the three states: state 0: output state 1: output output state 2: output output
0: 0: 2: 0: 2:
1; 0.15, output 1: 0.83, 0.02; 0.05, output 1: 0.16, 0.79.
The oscillations mainly for outputs 0 and 1 are then better modeled than by the simple thirdorder Markov chain ("t of intensity characteristics in Fig. 12 compared to the equivalent "t in Fig. 11) and the BIC is substantially improved (Table 3). This shows that a simple third-order Markov chain under"ts the data and in particular leads to the mixing of structural information and noise in a single set of parameters. In the case of the third-order hidden Markov chain, the other characteristics are well "tted (see for instance the recurrence time distributions in Fig. 13). The sample recurrence time distribution for two shoots appears to be not well "tted by the corresponding theoretical distribution computed from the third-order hidden Markov chain parameters. This is a pitfall related to the fact that the
recurrence times may be of the same order of magnitude as the sequence lengths (this fairly subtle mathematical point is treated in the framework of renewal theory; see for example Cox, 1962). This point should be clari"ed by extracting the recurrence time distribution from a large sample of sequences generated by the estimated third-order hidden Markov chain and respecting the sequence length distribution of the original measured sample [see Fig. 13(d)]. The observed variable &&number of o!spring shoots'' is indeed a discrete interval-scaled variable. However, the contrasted behavior of the three possible outputs (0, 1 or 2 shoots) particularly regarding recurrence time distributions (Fig. 13) leads us to consider it simply as an ordinal variable. Furthermore, we are far from the case of normal or near-normal variations which renders the interpretation of Pearson correlation coe$cients unreliable. Hence, we chose to investigate local dependencies within the observed sequences using the Spearman rank autocorrelation function. The sample Spearman rank autocorrelation functions, where the rank autocorrelation coe$cients r (k) are plotted Q
497
BRANCHING/FLOWERING SEQUENCES
FIG. 12. Tassili cypress branches: intensity point of view for an estimated third-order hidden Markov chain: ( ) Observed no-shoot; ( ) theoretical no-shoot; ( ) observed 1 shoot; ( ) theoretical 1 shoot; ( ) observed 2 shoots; ( ) theoretical 2 shoots.
TABLE 3 ¹assili cypress example: comparison of third-order Markov chain with hidden third-order Markov chain No. of free parameters
2 log ¸
BIC
31 32
!8842.8 !8655.9
!9118.1 !8940
Third-order Markov chain Third-order hidden Markov chain
against the lag k, computed for the observed sequences and for large samples of sequences (10 000 sequences whose empirical length distribution is that of the observed sequences) generated either by the estimated third-order Markov chain or third-order hidden Markov chain are shown in Fig. 14. This con"rms the better modeling of the local dependencies by the hidden Markov chain compared to simple Markov chain even if some observed long-term dependencies are not well modeled. The behavior of the noisy model is e$ciently summarized in the corrections made to the observed sequences. These corrections correspond
to the nodes where the optimal state of the model computed by the Viterbi algorithm di!ers from the corresponding observation. The examples in Fig. 15 clearly show that the corrected sequences express the patterns 101 and 2002 more strictly than the observed sequences. In the case of the apple tree example, the probabilities of the outputs as a function of the node rank exhibit a succession of transient phases (see Fig. 7 for cultivar &&Reinette B.''). In the case of the cultivar &&Reinette B.'', we made the hypothesis of an embedded left}right "rst-order Markov chain composed of six successive transient states and a "nal absorbing state. The hidden semi-Markov
498
Y. GUED DON E¹ A¸.
FIG. 13. Tassili cypress branches: recurrence times: (a) No-shoot; (b) 1 shoot: ( (c) 2 shoots; (d) 2 shoots-detail: ( ) observed; ( ) theoretical; ( ) simulated.
FIG. 14. Tassili cypress branches: sample Spearman rank autocorrelation functions: ( ) Observed sequences; ( ) Markov chain; ( ) hidden Markov chain; ( ) randomness 95% con"dence limit.
) observed; (
) theoretical.
chain built from the &&Reinette B.'' sequences and the &&Belre`ne'' sequences are represented in Fig. 16(a) and (b), respectively. For &&Reinette B.'', the estimation of model parameters conserved only the transitions between consecutive states except the transition between states 2 and 4 (in the initial speci"cation, transitions from a given state to the three following states were allowed). The state occupancy distributions, particularly from state 2, have a low dispersion which expresses strong structuring in the succession of zones along the annual shoots. Each state is markedly di!erentiated from the immediately preceding and following states by the attached observation probabilities. The accuracy of the model is mainly evaluated by the "t of characteristic distributions computed from model parameters to the corresponding empirical characteristic distributions extracted from the observed sequences (Figs 17 and 18). It can be noted that the orders of magnitude of the sample sizes
499
BRANCHING/FLOWERING SEQUENCES
FIG. 15. Tassili cypress branches: correction of observed sequences using the estimated third-order hidden Markov chain. (upper row) Observed sequence (number of axillary shoots); (lower row) (bold) corrections (states of the model that di!er from the observed value at the same rank).
for the characteristics extracted from the sequences are very variable (Fig. 18): while there is one data item per sequence for the counting characteristics [Fig. 18(b)], there are on average many data items per sequence for the interval characteristics [Fig. 18(a)]. Since the optimal state sequences capture most of the likelihood of the observed sequences, the evaluation of model accuracy and the interpretation of the underlying biological phenomena may also rely on the optimal state sequences deduced from the observed sequences by the Viterbi algorithm (GueH don & Cocozza-Thivent, 1990). The optimal segmentation of three &&Reinette B.'' sequences is presented in Fig. 3. States 1}6 clearly correspond to six well-di!erentiated successive zones. The lengths of the segmented zones and the axillary productions observed in these zones re#ect the corresponding state occupancy and observation distributions shown in Fig. 16(a). This modeling approach enables to characterize the structure of the "rst annual shoots of apple tree trunks from the di!erent mixtures of types of axillary production encountered along the measured sequences. 5. Case Studies In this section, the biological application is brought to the forefront and the di!erent steps in the analysis of samples of sequences are reviewed for di!erent representative examples (including the two examples used for the presentation of sequence analysis methods in Section 4). Some new examples are introduced so as to provide, in
conjunction with the two examples already introduced, a more exhaustive coverage of the possible situations encountered in the analysis of branching and axillary #owering patterns. The objective here is to demonstrate the complementarity of the di!erent statistical methods and tools, exemplify the diversity of analysis processes, and discuss possible biological interpretations of the results. 5.1. BRANCHING OF TASSILI CYPRESS SECOND-ORDER AXES (SEE SECTION 4 FOR DESCRIPTION OF THE DATA SAMPLE)
Two hypotheses can be proposed to interpret the rhythmic character of branching (with the remarkable patterns 101, 2001 and 2002, see Figs 2 and 6 for the recurrence times in 1 or 2 shoots). The "rst is of physiological nature while the second, of physical nature, is more speculative. From a physiological point of view, it can be assumed that growth inhibitors di!use from an activated axillary meristem and consequently that the intensity of the inhibitory e!ect is proportional to the number of activated meristems at a given node; see Courtot & Baillaud (1961) and Fulford (1965). From a physical point of view, we can make the assumption that the apical meristem needs to be of a given size before splitting to generate one or two axillary meristems. Hence, the number of unbranched nodes after a branching node is roughly proportional to the time required for the apical meristem to recover a size enabling a new split. This time is
500
Y. GUED DON E¹ A¸.
FIG. 16. Apple tree: estimated hidden semi-Markov chains. Each state is represented by a vertex which is numbered. Vertices representing transient states are edged by a single line while the vertex corresponding to the "nal absorbing state is edged by a double line. The possible transitions between states are represented by arcs with the attached probabilities noted nearby. Arcs entering in states indicate initial states (states 0 and 1 for &&Reinette B.'' and states 0, 1 and 2 for &&Belre`ne''). The attached initial probabilities are noted nearby. The occupancy distributions are "gured above the corresponding vertices. The possible axillary productions observed in each zone are indicated inside the boxes, the font sizes being roughly proportional to the observation probabilities (for state 1 of &&Reinette B.'', these probabilities are 0.45, 0.13 and 0.42 for latent bud, 1-year-delayed short shoot and 1-year-delayed long shoot, respectively): (a) &&Reinette B.''; (b) &&Belere`ne'': (0) latent bud; (1) 1-year-delayed short shoot; (2) 1-year-delayed long shoot; (3) 1-year-delayed #owering shoot; (4) immediate shoot.
indeed longer after a double branching than after a single branching. It is also noteworthy that the noise component of the third-order hidden Markov chain, given by the estimated observation probabilities (see Section 4.2), is structured in a logical manner. Output 0 is always observed in state 0 while output
0 can be observed in state 1 and output 1 (and less frequently output 0) can be observed in state 2. This absence of axillary shoots may be the result of a local inhibitory e!ect due to the organogenesis of either an axillary bud borne on the preceding node or the opposite axillary bud. This hypothesis needs to be con"rmed by
BRANCHING/FLOWERING SEQUENCES
501
FIG. 17. Apple tree: intensity point of view. (a) &&Reinette B.''; (b) &&Belre`ne'': ( ) Observed latent bud; ( ) theoretical latent bud; ( ) observed short shoot; ( ) theoretical short shoot; ( ) observed long shoot; ( ) theoretical long shoot; ( ) observed #owering shoot; ( ) theoretical #owering shoot; ( ) observed immediate shoot; ( ) theoretical immediate shoot.
complementary observations (for instance histological measurements of the size of embryonic leaves, of apical meristems). 5.2. PHYLLOTAXIS AND BRANCHING OF TASSILI CYPRESS FIRST-ORDER AXES
The experimental sample is made of 33 3year-old Tassili cypress trunks (average number
of nodes 107.6 with standard deviation 8.3). These trunks were described node by node from the base to the top and both the phyllotaxis (which changes from four whorled leaves to three whorled leaves) and the number of o!spring shoots were recorded (see Fig. 19). Note that the number of whorled leaves (and hence of axillary buds) de"nes the maximum possible number of
502
Y. GUED DON E¹ A¸.
FIG. 18. Apple tree, cultivar &&Reinette B.'': interval and counting points of view for long shoots. (a) Interval point of view: recurrence time and sojourn time; (b) Counting point of view: number of runs per sequence and number of occurrences per ) observed; ( ) theoretical. sequence: (
o!spring shoots at a given node. We did not consider the two "rst nodes that are always of opposite decussate phyllotaxis (phyllotaxis 2) and the apical unbranched section. Two sub-samples can be identi"ed according to the rank of the transition from phyllotaxis 4 to phyllotaxis 3. These two sub-samples can be visualized on the empirical distribution of the number of transitions before the "rst occurrence of phyllotaxis 3 (see the observed distribution in Fig. 20). The "rst sub-sample (22 trees), characterized by a transition between rank 14 and rank 40 (average rank 27.8 with standard deviation 6.1), corresponds to a change of phyllotaxis between the "rst and the second year of growth while the second sub-sample (11 trees), characterized by a transition after rank 63, corresponds mainly to a change of phyllotaxis between the second and the third year of growth (note that for two trees,
the transition of phyllotaxis 3 is still unobserved after 3 years of growth). The average sequence length of the "rst sub-sample (change of phyllotaxis between the "rst and the second year of growth) is 110.8 nodes with standard deviation 7.6 while the average sequence length of the second sub-sample is 101.2 nodes with standard deviation 5.7. The sequence lengths for the two sub-samples are signi"cantly di!erent (Wilcoxon}Mann}Whitney statistic 3.17 instead of 2.58 for the 0.01 level). This di!erence may be explained by an underlying mechanism which can be roughly described as follows. The size of the meristem increases with the rank of the node during the establishment phase of growth but the size of the embryonic leaves increases even faster, mainly because of the sheath component. Hence, while the most usual behavior is to observe an increase in phyllotaxis in conjunction with an
503
BRANCHING/FLOWERING SEQUENCES
FIG. 19. Tassili cypress trunk. (a) Trunk described from the base to the top; (b) The phyllotaxis and the number of branches were recorded for each successive node.
FIG. 20. Tassili cypress trunk: time up to the "rst occurrence of phyllotaxis 3. (a) ( (b) Cumulative distribution functions: ( ) observed; ( ) theoretical.
) observed; (
) theoretical;
504
Y. GUED DON E¹ A¸.
FIG. 21. Tassili cypress trunk: estimated hidden semi-Markov chain (see the legend of Fig. 16 for the drawing conventions).
increase in the size of the meristem (Douady & Couderc, 1996), for cypresses, the phyllotaxis decreases because of the increasing size of the embryonic sheaths (and even faster for a vigorous stem). Hence, the di!erences in the dates of phyllotaxis change may be related to stem vigor and therefore to the di!erence in number of nodes of the stems. The variable &&phyllotaxis'' expresses a transition between two successive zones: there is no #uctuation of the phyllotaxis, simply a single transition from phyllotaxis 4 to phyllotaxis 3. The variable &&number of o!spring shoots'' expresses mainly local #uctuations. The estimated seven-state hidden semi-Markov chain is shown in Fig. 21. At the beginning of the sequences, the two main modalities are represented by the succession of the two upper-left states (with initial probability 0.74) for the "rst and by the succession of the two lower-left states (with initial probability 0.24) for the second, the latter corresponding to a strongly branched basal section of trunks (before rank 20). The three other states of the model corresponding to the apical section of trunks are mainly organized by the di!erent ranks of transition from phyllotaxis 4 to
phyllotaxis 3. Note that we have "xed in the speci"cation stage the observation of phyllotaxis 4 in the six transient states and the observation of phyllotaxis 3 in the "nal absorbing state. The main reason for this is that the two observed variables are strongly related since the phyllotaxis gives the maximum possible number of o!spring shoots. Hence, it would be nonsense to observe a mixture of phyllotaxes in a given state. Both the phyllotaxis change [see Figs 20 and 22(a)] and the two types of basal branching [Fig. 22(b)] are well modeled by the estimated hidden semi-Markov chain. 5.3. BRANCHING OF APPLE TREE TRUNK ANNUAL SHOOTS (SEE SECTION 4 FOR DESCRIPTION OF THE DATA SAMPLE)
The fact that the trees were grafted and cut back to a single bud 1 year after transplantation explains both the high number of nodes of the "rst annual shoot of the trunks and the small variability of this number of nodes for each cultivar (see Table 4). Due mainly to the 2-year-old root system, branching expression is maximal with both immediate o!spring shoots and 1-yeardelayed o!spring shoots.
BRANCHING/FLOWERING SEQUENCES
FIG. 22. Tassili cypress trunk: intensity point of view. (a) Phyllotaxis: ( ) observed 3; ( ) theoretical 3; ( observed 4; ( ) theoretical 4. (b) Number of axillary shoots: ( ) Observed 0; ( ) theoretical 0; ( ) observed 1; ( theoretical 1; ( ) observed 2; ( ) theoretical 2; ( ) observed 3; ( ) theoretical 3; ( ) observed 4; ( theoretical 4.
The detailed comparison of two apple cultivars (&&Reinette B.'' and &&Belre`ne'') on the basis of model parameters and characteristics is illustrated in Figs 16 and 17. The main di!erence between these two cultivars lies in the location of
505
) ) )
the 1-year-delayed short shoots. For &&Reinette B.'', short shoots are mainly located on the basal part of the main shoot (between ranks 40 and 70 counted from the top) while they are located mainly on the apical part of the main shoot for
506
Y. GUED DON E¹ A¸.
TABLE 4 Apple tree example: means and standard deviations of sequence lengths ( for cultivar 0Elstar1, the values given in brackets correspond to the sample without the outlying sequence of length 95) k Belre`ne Elstar Fuji I. Gala Granny S. Reinette B. Wijcik
89.5 66.1 (64.2) 73.9 70 75.1 73 57.3
TABLE 5 Apple tree example: estimation of the order of a ,ve-state Markov chain from the sequences corresponding to all the cultivars except 0=ijcik1 (the basal unbranched zone corresponding to the preformed part was not taken into account)
p 4.41 9.02 (4.86) 4.56 4.14 5.62 3.78 2.94
&&Belre`ne'' (before rank 25) [Figs 17(a) and (b)]. The structures of the two models are very similar (see Fig. 16), the main di!erences being the supplementary initial state (state 1) for &&Belre`ne'' and the di!erent balances between short and long shoots in the basal and apical zones (state 1 of &&Reinette B.'' compared to state 2 of &&Belre`ne'' and state 5 of &&Reinette B.'' compared to state 6 of &&Belre`ne''). Some of the most typical features of the organization in successive zones shared by cultivars &&Reinette B.'' and &&Belre`ne'' transpose to the other cultivars (except &&Wijcik''): some long shoots are located just below the top of the growth unit while #owering shoots (mixed mostly with latent buds) are located just below this top long shoot zone but above the main immediate shoot zone. It should be noted that the branching structures of the seven cultivars were also globally compared using two di!erent methods. The "rst non-parametric method relies on sequence alignment and clustering techniques applied to the distance matrix resulting from the pairwise alignment of all the sequences corresponding to the seven cultivars. The second, parametric method, is based on the Kullback}Leibler divergence between hidden semi-Markov chains built for each cultivar (GueH don & Costes, 1999; GueH don et al., 2001). This analysis of global organization in successive zones may be complemented by the analysis of the local succession of axillary productions within zones. A way to tackle this question is to estimate the one-step transition probability matrix from the sequence corresponding to all
No. of free parameters
Order 0 1 2 3 4
4 20 93 270 494
2 log ¸ P
BIC (r)
!16357.5 !12484.1 !11314.5 !10566 !9571.4
!16392.3 !12658 !12123.3 !12914.1 !13867.6
the cultivars except &&Wijcik'' (the basal unbranched zone corresponding to the preformed part was not taken into account). Recall that the types of axillary production are chosen among latent bud (0), 1-year-delayed short shoot (1), 1-yeardelayed long shoot (2), 1-year-delayed #owering shoot (3), and immediate shoot (4). The estimated transition probability matrix PK is given by
0.7 0.3 PK " 0.33 0.31 0.22
0.06 0.4 0.12 0.07 0.01
0.1 0.16 0.43 0.05 0.02
0.06 0.1 0.08 0.5 0.02
0.08 0.04 0.04 . 0.07 0.73
(13)
Since the average zone length is of magnitude 10, about 90% of the transitions used for estimating the transition probabilities correspond to transitions within zones while about 10% correspond to transitions between zones (for instance pL , pL and pL correspond more speci"cally to transitions between zones). The "rst point is that immediate shoots are not mixed with 1-year-delayed shoots while some mixing between the different categories of 1-year-delayed shoots may occur particularly between short shoots and long shoots (pL "0.16, pL "0.12). The main type of mixing concerns each category of o!spring shoot with latent bud (see pL with iO0). G The above analysis can be re"ned by estimating jointly the order and the parameters of a "ve-state Markov chain. The BIC favors the second-order model (Table 5). This result must be
507
BRANCHING/FLOWERING SEQUENCES
FIG. 23. Apple tree: "t of sojourn time distributions. (a) 1-year-delayed short shoot; (b) 1-year-delayed long shoot; ) observed; ( ) theoretical. (c) 1-year-delayed #owering shoot; (d) immediate shoot: (
related to the sojourn time distributions. It should be recalled that the sojourn time distributions for an r-th-order Markov chain have a geometric tail from a sojourn time value of r while the shape of the beginning of the distribution up to r!1 is arbitrary. The sojourn time distributions for the di!erent types of o!spring shoots (Fig. 23*the theoretical distributions are computed from the parameters of the estimated second-order Markov chain) clearly show very high frequencies for a sojourn time value of 1 corresponding to isolated o!spring shoots of a given type [and these isolated o!spring shoots are mainly surrounded by latent buds*see the estimated one-step transition probabilities, eqn (13)]. The deviations from the geometric distribution (obtained by continuing the geometric tails up to the sojourn time value of 1) are of the same magnitude for the di!erent types of o!spring shoots: the ratio of the theoretical probability of
a sojourn time value of 1 to the reference probability of a sojourn time value of 1 under the geometric distribution hypothesis is 2.65 for 1year-delayed short shoots [Fig. 23(a)], 3.88 for 1-year-delayed long shoots [Fig. 23(b)], 2.49 for 1-year-delayed #owering shoots [Fig. 23(c)] and 2.79 for immediate shoots [Fig. 23(d)]. This tendency to have many isolated o!spring shoots can be interpreted as the result of a local inhibitory e!ect resulting from the activation of an axillary meristem. 5.4. BRANCHING AND FLOWERING OF APRICOT TREE GROWTH UNITS
A sample of 48 growth units of apricot tree (Prunus armeniaca, Rosaceae), cultivar &&Lambertin'', grafted on rootstock &&Manicot'' was described node by node from the base to the top (Costes & Gue`don, 1996). These apricot trees
508
Y. GUED DON E¹ A¸.
were planted in the Cti# experimental station at Balandran in the south of France. The type of axillary production, chosen among latent bud, 1-year-delayed short shoot, 1-year-delayed long shoot and immediate shoot, and the number of associated #owers (0, 1, 2, 3 #owers or more) were recorded for each node (Fig. 24). In this example, a nominal variable (type of axillary production) is combined with a discrete interval-scaled variable (number of associated #owers). These two variables correspond to events that do not occur simultaneously in plant development and were thus measured at two di!erent dates (beginning of the growth period for the number of #owers and end of the growth period for the type of axillary productions). These events are, nevertheless, assumed to be strongly related since the #owers are always borne by the o!spring shoots in positions corresponding to prophylls (the two "rst foliar organs, termed a and b, of an o!spring shoot). The structure of the estimated hidden semi-Markov chain is represented in Fig. 25. The underlying semi-Markov chain is composed of two transient states followed by a "ve-state recurrent class. An interpretation is associated with each state, summarizing the combination of the estimated observation probabilities. The "rst transient state corresponds to the initial transient phases for both variables (before rank 11) while the second transient state corresponds to the end of the transient phase for the variable &&number of #owers'' (see the "t of the intensity characteristics in Fig. 26). The two less probable states in the recurrent class are the direct expression of biological hypotheses and were a priori de"ned in the speci"cation stage by appropriate constraints on model parameters: the &&resting'' state (unbranched, non-#owered) corresponds to zones of slowdown in the growth of the parent shoot. The immediate branching state corresponds to a rare event in this context and immediate branching follows very di!erent rules compared to 1-yeardelayed branching and, these two types of branching should not therefore be mixed in a given state. The main outcome of this study is that the recurrent class is structured by the variable &&number of #owers''. The number of #owers increases from 1 to 2 and from 2 to 3 #owers but almost never directly from 1 to 3 and, conversely,
FIG. 24. Apricot tree. Growth unit of cultivar &&Lambertin'' where the nature of the axillary production (0: latent bud; 1: 1-year-delayed short shoot; 2: 1-year-delayed long shoot; 3: immediate shoot) and the number of associated #owers were recorded for each successive node.
decreases from 3 to 2 and from 2 to 1 but almost never directly from 3 to 1. This result can be checked precisely by estimating the one-step transition probabilities from the sub-sequences *for the variable &&number of #owers''*corresponding to the recurrent class extracted by the
BRANCHING/FLOWERING SEQUENCES
509
FIG. 25. Apricot tree: structure of the estimated hidden semi-Markov chain (see the legend of Fig. 16 for the drawing conventions). Only the transitions with probability greater than 0.03 are represented. The dotted arcs correspond to the less probable transitions while the dotted vertices correspond to the less probable states.
Viterbi algorithm (GueH don & Cocozza-Thivent, 1990). This dynamic programming algorithm enables to extract the optimal state sequences corresponding to the observed sequences and therefore to remove the initial phases corresponding to the two initial transient states. We thus obtain pL "0.014 and pL "0.038. This result is also expressed in the combination of the transition probabilities between states 4, 5 and 6 (Fig. 25) and the observation probabilities:
state 4 (&&1 #ower''): no-#ower: 0.29, 1 #ower: 0.65, 2 #owers 2: 0.05, 3 #owers: 0.01; state 5 (&&2 #owers''): no-#ower: 0.01, 1 #ower: 0.13, 2 #owers 2: 0.85, 3 #owers: 0.01; state 6 (&&3 #owers''): no-#ower: 0.01, 1 #ower: 0.01, 2 #owers 2: 0.21, 3 #owers: 0.77. It should be noted that 1 and 3 #owers cannot be observed together in a single state. For these three states, the observation distributions for the variable &&type of axillary production'' are far less contrasted with a majority of 1-year-delayed short shoots: state 4 (&&1 #ower''): latent bud: 0.23, short shoot: 0.63, long shoot: 0.14; state 5 (&&2 #owers''): latent bud: 0.11, short shoot 2: 0.82, long shoot: 0.07;
state 6 (&&3 #owers''): latent bud: 0.09, short shoot 2: 0.85, long shoot: 0.06. For the sub-sequences corresponding to the recurrent class, the average number of #owers associated with 1-year-delayed short shoots is 1.85 with standard deviation 0.96 while the average number of #owers associated with 1-year-delayed long shoots is 1.47 with standard deviation 0.92. The numbers of #owers associated with these two categories of 1-year-delayed shoots are signi"cantly di!erent (Wilcoxon} Mann}Whitney statistic 5.38 instead of 2.58 for the 0.01 level). This di!erence may be interpreted as an inhibitory e!ect of the di!erentiated #owers on the vegetative development of the corresponding o!spring shoot (note that #ower di!erentiation occurs before the vegetative development of the o!spring shoot). 5.5. FLOWERING OF VANILLA STEMS
The experimental sample is made of 148 stems of vanilla (
510
Y. GUED DON E¹ A¸.
FIG. 26. Apricot tree: intensity point of view. (a) Type of axillary production: ( ) observed latent bud; ( ) theoretical latent bud; ( ) observed short shoot; ( ) theoretical short shoot; ( ) observed long shoot; ( ) theoretical long shoot; ( ) observed immediate shoot; ( ) theoretical immediate shoot. (b) Number of #owers: ( ) Observed no-#ower; ( ) theoretical no-#ower; ( ) observed 1 #ower; ( ) theoretical 1 #ower; ( ) observed 2 #owers; ( ) theoretical 2 #owers; ( ) observed 3 #owers; ( ) theoretical 3 #owers.
probability of #owering as a function of the node rank decreases progressively to 0 (observed probabilities in Fig. 28). Therefore, modeling by a
two-state ergodic Markov chain is not appropriate since the transient phase corresponding to the change from the initial state distribution to
BRANCHING/FLOWERING SEQUENCES
511
the stationary state distribution is very long. We chose to model these non-stationary binary sequences using a speci"c type of non-homogeneous Markov chain, that is Markov chains where the transition probabilities are not constant but related to the index parameter. Since the way by which the non-homogeneity is introduced is quite speci"c to this example, the resulting model is considered as a variant of time-homogeneous Markov chains rather than a representative of a new generic class of models with a potentially broad range of applications. Let f (t) denote the GH observed frequency of pairs of states, chosen among non-#owering (0) and #owering (1), where state i at time t is followed by state j at time t#1 and f . (t)" f (t) is the observed frequency of G H GH state i at time t. The estimate of p (t) denoted by GG pL (t) can be directly derived from the maximum GG likelihood estimate of p [eqn (11)]: GG f (t) pL (t)" GG . GG f (t) G For the non-#owering state, the estimate shows an increase from pL (0)K0.8 (which cor responds to a mean sojourn time in the non#owering state close to 5) to an asymptote at 1 (which corresponds to an in"nite mean sojourn time in the non-#owering state) (Fig. 29). It should be recalled that the mean sojourn time k G is given by 1 k" G 1!p GG (mean of a &&1-shifted'' geometric distribution of parameter 1!p de"ned on 1, 2, 2 ). GG This &&time-indexed'' estimate is only relevant for the non-#owering state since the #owering state corresponds to a rare event (350 occurrences in the 148 sequences compared with 3549 occurrences for the non-#owering state). In order to obtain a simple parameterization, we chose the monomolecular (or Mitscherlich) function p (t)"a!b exp(!ct) GG FIG. 27. Vanilla stem: the absence (0) or the presence (1) of an axillary #ower was recorded for each successive node from the apex.
with c'0. This function rises from the values a!b when t"0 and steadily approaches an asymptote in a as t becomes large. This nonlinear function
512
Y. GUED DON E¹ A¸.
FIG. 28. Vanilla stems: intensity point of view. ( observed #owering; ( ) theoretical #owering.
) observed non-#owering; (
FIG. 29. Vanilla stems: self-transition in the non-#owering state as a function of the node rank. (䊏) Observed; ( ) theoretical.
(Fig. 29) is estimated by regression (Bates & Watts, 1988) from the estimate *pL (t) taking GG into account the attached frequencies f (t). As G previously described in Section 4.2, the validation relies on a "t principle. A speci"city of nonhomogeneous Markov chains lies in the distributions of the recurrence time and sojourn time in the non-#owering state which cannot be computed from model parameters because of their conditional distribution nature [eqns (3) and (4)].
) theoretical non-#owering; (
)
However, both the state distribution as a function of the index parameter and the counting distributions can be computed from the estimated parameters of the non-homogeneous Markov chain and can be "tted to the corresponding empirical distributions extracted from the data (Figs 28 and 30). The disagreement in the "t of the counting distributions (Fig. 30) highlights a weakness of the estimated model. The model assumes that a certain proportion of the sequences are composed of a single run of non#owering nodes (and consequently 0 runs and 0 occurrences of the #owering state). But there are no purely non-#owering sequences in the experimental sample. Here it should be noted that this diagnosis can only be made on the basis of counting distributions which illustrates the complementarity of the intensity, interval and counting points of view. 6. Discussion 6.1. STATISTICAL MODELING CONTRIBUTION TO THE UNDERSTANDING OF PLANT GROWTH AND FUNCTIONING
As illustrated by the above examples, the main objective of statistical modeling is to highlight
BRANCHING/FLOWERING SEQUENCES
FIG. 30. Vanilla stems: counting point of view. (
) Observed; (
remarkable patterns in plant branching or axillary #owering. In our context, it is not necessary to suppose that the sequences under study are generated by an actual random mechanism. Rather, the model is viewed as a useful tool with which the information contained in the data can be organized and summarized. Due to the inherent complexity of sequences, one of the main results expected from statistical modeling is to throw light on behaviors or regularities not directly apparent in the data. The extraction of some kind of &&hidden'' behaviors is notably illustrated by the following four examples: E E E
the noise component for the branching of second-order Tassili cypress axes, the high proportion of isolated o!spring shoots on apple tree trunk annual shoots, the &&remanent'' character of the #owering along apricot tree growth unit sequences with progressive evolution of the number of #owers,
E
513
) theoretical.
the fact that there is at least one in#orescence on each vanilla stem and that this behavior cannot be obtained as a byproduct of trend modeling by a non-homogeneous two-state Markov chain.
Generally, the application of a small number of statistical models forms the core of the statistical analysis of a set of sequences. But the key results obtained in this way often prompt new questions and new investigations into the data. Hence, a complete analysis is often the result of multiple stages relying on various categories of statistical tools as illustrated by the above examples. Another issue concerns the biological objectives which may be more or less precise. Hence, the expected result of statistical investigations may be a precise answer to a biological question but also provides new insights into a biological mechanism that generate the development of new hypotheses or the formulation of new questions.
514
Y. GUED DON E¹ A¸.
For the biological interpretation of the results of sequence analysis, two main scales should be distinguished. E E
local scale corresponding to a single position or a few successive positions, global scale corresponding to the entire sequence.
The branching of the Tassili cypress secondorder axes (with the remarkable patterns 101, 2001 and 2002), the interaction between the #owering and the vegetative development at each node of apricot tree growth units, and the high proportion of isolated o!spring shoots on apple tree trunk annual shoots can be interpreted at a local scale. The local patterns exhibited by these three examples can be interpreted as an inhibitory e!ect at the time of the di!erentiation and activation of the axillary meristems. In the case of the apple tree, it should be noted that the local branching pattern can only be revealed by an appropriate statistical analysis and, to the best of our knowledge, has not been previously discussed in the literature. At the more macroscopic scale corresponding to the entire sequences, successions of well-contrasted zones should be distinguished from smoothed changes along the sequences. The former situation is mainly illustrated by the apple tree trunk annual shoots while the latter is illustrated by the vanilla stems. It should be pointed out that the change of phyllotaxis of the Tassili cypress "rst-order axes should be seen as a transition between two contrasted zones (since it may be shown that the branching patterns di!er between the phyllotaxis 4 zone and the phyllotaxis 3 zone). This scale of interpretation, even if the exhibited patterns are remarkable by their strong organization, is more di$cult to tackle than the local scale because of the diversity of the underlying biological mechanisms and the complexity of their interactions over time and space. For branching (or #owering) sequences whose index parameter is the node rank, the growth dynamics of the parent shoot (related to the variation in meristematic activity referred to as plastochrone) is always of primary importance in the establishment of the branching (or #owering) pattern. A way to further investigate this role played
by growth dynamics would be to measure as a supplementary variable the lengths of the internodes of the parent shoots since this often appears to be a reliable marker of stem elongation kinetics. The "nal branching sequence is often the result of a succession of branching phases (immediate branching and 1-year-delayed branching for instance) which may follow very di!erent rules. In particular, the winter resting phase leads to a global reorganization of in#uences at the shoot level, and 1-year-delayed branching is the result of a complex combination of factors (distance to roots, distance to the apex, wood storage capacity; see Champagnat, 1965). Hence, it is very di$cult to sort out the respective in#uences of the di!erent underlying biological mechanisms from the a posteriori observation of the "nal branching sequences. This also entails the high diversity of branching expressions which may be observed across species. In some cases, the patterns highlighted by sequence analysis may be directly related to some simple physiological mechanisms. For instance, the local inhibitory e!ect between axillary meristems on apple tree trunk annual shoots, revealed by the high proportion of isolated o!spring shoots (see Section 5.3), may be explained by the emission of growth inhibitors (such as indoleacetic acid, IAA). It can be assumed that these growth inhibitors di!use from the newly activated meristem to the neighboring axillary meristems. The basic mechanism is similar to that of apical dominance illustrated by many experiments (Fulford, 1965; for a recent general overview, see Lyndon, 1998). Another example is provided by the #owering of vanilla stems which is highly related to peroxydase activity (FoucheH , 1992). Peroxydases are considered as a #owering inhibitor (Greppin, 1986). Hence, the #owers are located in zones of low peroxydase activity while the non-#owering zones correspond to areas of high peroxydase activity. The presence of some #owers far from the apical part may be caused by the inverted peroxydase gradient in the case of shading (decreasing from the apex to the base instead of increasing in the case of sunlit stems). Morphological observations may reveal speci"c behaviors caused by underlying physiological mechanisms. Thus, experiments relying on plant
BRANCHING/FLOWERING SEQUENCES
architecture and morphology may help in the design of relevant physiological experiments focusing on more microscopic scales over both time and space. 6.2. EXPERIMENTAL DESIGN
The results obtained depend strongly on the design of the experiment. This argument can also be reversed and some preliminary results of analysis may help to adjust the design of an experiment. De"ning relevant level(s) of organization and the corresponding botanical entities to be observed is above all a matter of experience. The di$culties are numerous. E
E
The small size of the internodes (and consequently their large number) may impose di$cult experimental constraints (for instance on pines and more generally on many conifers). This potential di$culty is notably illustrated by the Tassili cypress second-order axes where the order of magnitude of internode length is 2 mm. A solution consists in de"ning an equally spaced index parameter not related to biological markers. For instance, the Tassili cypress second-order axes should have been described as a succession of segments of given length where the number of o!spring shoots in each segments was counted. However, this approach would not have revealed the remarkable branching patterns. This type of sequence description with an arti"cial index parameter is also applied to the #owering of apricot trees (Clanet & Salles, 1974). Another possible solution consists in de"ning arti"cial entities corresponding to inter-o!spring shoot segments. Hence, the position of successive o!spring shoots constitutes an unequally spaced index parameter from which sequences can be built. The morphological markers corresponding to a given botanical entity are more or less stable and easily discernible (Caraglio & BartheH leH my, 1997). For instance, in some species, the limit between two successive growth units is well marked if it corresponds to a real resting phase (short internodes and scales or cataphylls) while this limit may be di$cult to identify if it corresponds to a simple slowdown of growth. The progressive wearing away of morphologi-
515
cal information due to natural pruning, cambial growth and bark maturation restricts the possibilities of a posteriori observation of branching/#owering sequences to the last few years (Godin & Caraglio, 1998). In sequence description, the variables qualify either the parent entities belonging to the sequence (internode length for example) or the o!spring entities (type of o!spring shoot for example). This latter case can be further extended by qualifying not only the "rst-order borne entities but more generally the entire borne branched systems (possible variables may be the maximum branching order, the total number of #owers in the branched system, etc.). Another issue concerns the choice of the sample size. In particular, the sample size should be chosen in relation to the dynamic characteristics of sequences. In the non-stationary case, the sample size is basically the number of sequences while, in the stationary case, the sample size is the cumulative length of the sequences. Our focus mainly concerns the caulinar shoot system. But some of the proposed methods can be used for analysing the root system. In this case, a potential di$culty lies in the absence of morphological markers such as the node to de"ne equally spaced index parameters. Hence, one can either count the number of o!spring roots in successive segments of given length or de"ne an unequally spaced index parameter as the position of successive o!spring roots. 7. Concluding Remarks The analysis tools discussed in this paper can be useful for agronomic applications. This kind of objective is notably illustrated in Godin et al. (1999) where a hidden semi-Markov chain built from the branching structure of apple tree trunk annual shoots is used as a device of discrimination within a hybrid family. Hence, the objective of the statistical modeling was not to obtain a fairly good "tting of the data sample but rather to identify a discrimination rule to separate the initial sample into two sub-samples corresponding to each parent. The computational and statistical approach presented for a part in this paper for analysing
516
Y. GUED DON E¹ A¸.
plant architectures together with a dedicated database component is fully implemented in the AMAPmod software (Godin et al., 1997, 1999, http://www.cirad.fr/presentation/programmes/ amap/logiciels/amap}mod). The aim of the AMAPmod project is to propose a formal and computational approach for plant architecture analysis relying on an explicit account of the structural information at each step in the analysis process. The two main components of this project are a model of representation of plant architecture, which is the core of a data-base component (Godin & Caraglio, 1998), and a dedicated computational and statistical approach for providing insights into the underlying biological processes that govern the establishment of structures within plants; see GueH don et al. (2001) for a more exhaustive presentation of this approach. The model of representation takes account of plant architectures measured at di!erent scales (node, annual shoot and axis for example) and di!erent dates, whatever the type of features used*morphological such as the type of leaf chosen between cataphyll (or scale leaf ) and foliage leaf, geometrical such as a length or a diameter, physical such as a mechanical stem property, 3D positioning*attached to the botanical entities. One key objective of the computational and statistical approach not highlighted in this paper is to provide various approaches for structure comparisons that may also be useful for agronomic applications; see an illustration in GueH don & Costes (1999). We would like to thank Daniel Auclair for his careful reading of this paper, Patrick Heuret for fruitful discussions and Franc,ois Houllier for his helpful comments. REFERENCES AGRESTI, A. (1984). Analysis of Ordinal Categorical Data. New York: Wiley. BALDI, P. & BRUNAK, S. (1998). Bioinformatics: ¹he Machine ¸earning Approach. Cambridge, MA: MIT Press. BARTHED LED MY, D. (1991). Levels of organization and repetition phenomena in seed plants. Acta Biotheor. 39, 309}323. BATES, D. M. & WATTS, D. G. (1988). Nonlinear Regression Analysis and its Applications. New York: Wiley. BAUM, L. E., PETRIE, T., SOULES, G. & WEISS, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41, 164}171.
BILLINGSLEY, P. (1961). Statistical methods in Markov chains. Ann. Math. Stat. 32, 12}40. BROCKWELL, P. J. & DAVIS, R. A. (1996). Introduction to ¹ime Series and Forecasting. New York: Springer. BUG HLMANN, P. & WYNER, A. J. (1999). Variable length Markov chains. Ann. Stat. 27, 480}513. BURNHAM, K. P. & ANDERSON, D. R. (1998). Model Selection and Inference. A Practical Information2¹heoretic Approach. New York: Springer. CARAGLIO, Y. & BARTHED LED MY, D. (1997). Revue critique des termes relatifs a` la croissance et a` la rami"cation des tiges des veH geH taux vasculaires. In: ModeH lisation et Simulation de l1Architecture des
517
BRANCHING/FLOWERING SEQUENCES
GODIN, C. & CARAGLIO, Y. (1998). A multiscale model of plant topological structures. J. theor. Biol. 191, 1}46. GODIN, C., GUED DON, Y., COSTES, E. & CARAGLIO, Y. (1997). Measuring and analysing plants with the AMAPmod software. In: Plants to Ecosystems2Advances in Computational ¸ife Sciences (Michalewicz, M. T., ed.), Vol. I, pp. 53}84. Melbourne, Australia: CSIRO Publishing. GODIN, C., GUED DON, Y. & COSTES, E. (1999). Exploration of a plant architecture database with the AMAPmod software illustrated on an apple tree hybrid family. Agronomie 19, 163}184. GREPPIN, H. (1986). Peroxydase as tool in the study of the #owering process. In: Molecular and Physiological Aspects of Plant Peroxydases (Greppin, H., Penel, C. & Gaspar, T., eds). UniversiteH de Gene`ve, Gene`ve. GUED DON, Y. (1998). Hidden semi-Markov chains. A new tool for analyzing nonstationary discrete sequences. In: Proceedings of the Second International Symposium on Semi-Markov Models: ¹heory and Applications (Janssen, J. & Limnios, N., eds), Compie`gne, France. GUED DON, Y. (1999). Computational methods for discrete hidden semi-Markov chains. Appl. Stochastic Models Business Ind. 15, 195}224. GUED DON, Y. & COCOZZA-THIVENT, C. (1990). Explicit state occupancy modelling by hidden semi-Markov models: Application of Derin's scheme. Comput. Speech ¸anguage 4, 167}192. GUED DON, Y. & COSTES, E. (1999). A statistical approach for analyzing sequences in fruit tree architecture. In: Proceedings of the Fifth International Symposium on Computer Modelling in Fruit Research and Orchard Management (Wagenmakers, P. S., van der Werf, W. & Blaise, Ph., eds), Wageningen, The Netherlands, Acta Hortic. 499, 281}288. GUED DON, Y., BARTHED LED MY, D. & CARAGLIO, Y. (1999). Analyzing spatial structures in forest tree architectures. In: Empirical and Process-based Models for Forest ¹ree and Stand Growth Simulation (Amaro, A. & TomeH , M., eds), pp. 23}42. Lisbon: Edic,o es Salamandra. GUED DON, Y., BARTHED LED MY, D., CARAGLIO, Y. & COSTES, E. (2001). Botanical sequence analysis: a computational and statistical approach to plant architecture. Technical Report CIRAD, Programme ModeH lisation des Plantes. GUTTORP, P. (1995). Stochastic Modeling of Scienti,c Data. London: Chapman & Hall. HALLED , F. & MARTIN, R. (1968). Etude de la croissance rythmique chez l'heH veH a Hevea brasiliensis Mull.-Arg. EuphorbiaceH es-crotonomK deH es. Adansonia 8, 475}503. HALLED , F. & OLDEMAN, R. A. A. (1970). Essai sur l1Architecture et la Dynamique de Croissance des
LYNDON, R. F. (1988). ¹he Shoot Apical Meristem: Its Growth and Development. Cambridge: Cambridge University Press. MACDONALD, I. L. & ZUCCHINI, W. (1997). Hidden Markov and Other Models for Discrete-valued ¹ime Series. London: Chapman & Hall. MCLACHLAN, G. J. & KRISHNAN, T. (1977). ¹he EM Algorithm and Extensions. New York: Wiley. PORITZ, A. B. (1998). Hidden Markov models: A guided tour. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, New York, pp. 7}13. RABINER, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257}285. RAVEN, P. H., EVERT, R. F. & EICHORN, S. E. eds. (1999). Biology of Plants, 6th edn. New York: W. H. Freeman and Company, Worth Publishers. DE REFFYE, P., ELGUERO, E. & COSTES, E. (1991). Growth units construction in trees: a stochastic approach. Acta Biotheor. 39, 325}342. RON, D., SINGER, Y. & TISHBY, N. (1996). The power of amnesia: learning probabilistic automata with variable memory length. Machine ¸earning 25, 117}149. SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Stat. 6, 461}464. SIEGEL, S. & CASTELLAN, N. J. (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd edn. New York: McGraw-Hill. WEINBERGER, M. J., RISSANEN, J. J. & FEDER, M. (1995). A universal "nite memory source. IEEE ¹rans. Inform. ¹heory 41, 643}652. WHITE, J. (1979). The plant as a metapopulation. Ann. Rev. Ecol. System 10, 109}145.
APPENDIX A De5nition of (Hidden) Markovian Models A.1. (HIGH-ORDER) MARKOV CHAIN DEFINITION
A "rst-order Markov chain +S , may be deR "ned as a sequence of discrete random variables with the property that the conditional distribution of S given S , 2 , S depends only on the R> R value of S but not further on S , S , 2 , S : R R\ P (S "s "S "s , 2 , S "s ) R> R> R R "P (S "s "S "s ). R> R> R R
(A.1)
A J-state "rst-order time-homogeneous Markov chain is de"ned by the following parameters: E
E
initial probabilities n "P (S "j) with H n "1, H H transition probabilities p "P (S "j"S "i) GH R> R with p "1. GH H
518
Y. GUED DON E¹ A¸.
These transition probabilities can be arranged as a J;J matrix P"(p ; i, j"0, 2 , J!1) with GH all rows summing to one and referred to as the transition probability matrix of the process. The row of P, (p , 2 , p ) constitutes the G G(\ transition distribution attached to state i &&memory''. In the case where all the p are strictly GH positive, the number of independent transition probabilities is thus J (J!1) (due to the unit row sum constraints). In the case of an r-th-order Markov chain +S ,, R the conditional distribution of S given R> S , 2 , S depends only on the values of R S ,2, S but not further on S , 2 , S : R R\P> R\P P (S "s "S "s , 2 , S "s ) R> R> R R "P (S "s "S "s , 2 , S "s ). R> R> R R R\P> R\P> This is a direct generalization of the Markov property (A.1). A J-state r-th-order Markov chain has JP(J!1) independent transition probabilities. For example, the transition probabilities of a second-order Markov chain are given by p "P (S "j"S "i, S "h) with p "1. FGH R> R R\ FGH H These transition probabilities can be arranged as a J;J matrix where the row (p , 2 , p ) FG FG(\ corresponds to the transition distribution attached to the &&state h, state i '' memory. A.2. STRUCTURAL PROPERTIES OF A MARKOV CHAIN
A "rst-order Markov chain can be interpreted as a directed graph, i.e. a set of vertices representing the states and a set of arcs (or directed edges) representing the possible transitions between states (such that p '0). Let us introduce the GH following de"nitions of graph theory. E
E
A path is a sequence of arcs such that for every pair of consecutive arcs, the terminus of the "rst arc and the origin of the second arc coincide. A circuit is a closed path such that the origin of its "rst arc coincides with the terminus of its last arc.
E
A strongly connected component of a graph is a subgraph containing the maximum number of vertices such that, for every pair of distinct vertices i and j, there exists a path from i to j as well as one from j to i. The two distinct vertices i and j are then said to be mutually accessible. A necessary and su$cient condition for a subgraph to be strongly connected is that there exists a circuit containing all of its vertices.
The concept of mutual accessibility or communication is an equivalence relation. Hence the states can be partitioned into equivalence classes (or the vertices into strongly connected components). On the basis of the structural information contained in this graph of possible transitions, the states can be classi"ed into recurrent and transient states. This structural property is in fact de"ned at the class level which means that all the states of a given class are either recurrent or transient. A state i is said to be recurrent if, starting from state i, the probability of returning to state i after a "nite number of transitions is one. In other words, state i is said to be recurrent if there is no path starting from vertex i that leaves the strongly connected component vertex i belongs to. A non-recurrent state is said to be transient. If a recurrent class contains a single state, this state is said to be absorbing and the only possible transition from this state is the self-transition. In other words, state j is said to be absorbing, if after entering this state, the process stays there forever. A Markov chain made of a single recurrent class is said to be irreductible (all the states of an irreductible Markov chain communicate with each other). A.3. SEMI-MARKOV CHAIN DEFINITION
One drawback with "rst-order Markov chains is the in#exibility in describing the time spent in a given state which is geometrically distributed with parameter 1!p : HH d (u)"P (S Oj, S "j, H R>S> R>S\T v"0, 2 , u!2"S "j, S Oj ) R> R
BRANCHING/FLOWERING SEQUENCES
"P (S Oj"S "j) R>S> R>S S\ ; P (S "j " S "j ) R>S\T R>S\T\ T "(1!p ) pS\ . HH HH Even for higher-order (in practice limited at 2 or 3) Markov chains, the implicit state occupancy distributions have geometrical tails from a sojourn time value equal to the order. A possible generalization lies in the class of semi-Markov chains, in which the process moves out of a given state according to a Markov chain with self-transitions in non-absorbing states p "0 HH and where the time spent in a given non-absorbing state is a discrete nonnegative random variable with an arbitrary distribution. A semi-Markov chain is constructed from an embedded "rst-order Markov chain. This J-state "rst-order Markov chain is de"ned by the following parameters: initial probabilities n "P(S "j) with H n "1, H H E transition probabilities: *non-absorbing state i: for each jOi, p " GH P (S "j " S Oi, S "i) with O p R> R> R H G GH "1 and p "0, GG *absorbing state i: p "P (S "i " S "i) GG R> R "1 and for each jOi, p "0. GH This embedded "rst-order Markov chain represents transitions between distinct states except in the absorbing state case. An explicit occupancy (or sojourn time) distribution is attached to each non-absorbing state of the embedded "rst-order Markov chain: E
d (u)"P (S Oj, S "j, H R>S> R>S\T v"0, 2 , u!2"S
"j, S Oj), R> R
u"1, 2, 2 . The whole ("rst-order Markov chain#state occupancy distributions) constitutes a semiMarkov chain. If the process starts out at t"0 in a given nonabsorbing state j, the following relation is
519
veri"ed: P (S Oj, S "j, v"1, 2 , t)"d (t) n . R R\T H H
(A.2)
Relation (A.2) means that the process enters a &&new'' state at time 0 [this can be seen as a direct consequence of convention (5)]. The mechanism associated with a semi-Markov chain can be described as follows: suppose that between two consecutive times, a transition occurred between state i and state j with probability p . The process remains in state j for GH a period u determined randomly by the corresponding state occupancy distribution +d (u); H u"1, 2, 2,. Then the process moves to another state according to the transition distribution (p , 2 , p , 2 , p ). H HI H(\ A.4. HIDDEN MARKOVIAN MODELS
Both Markov chains and semi-Markov chains can be used as building blocks of families of two-level stochastic processes i.e. pairs of stochastic processes +S , X , where the &&output'' R R process +X , is related to the &&state'' process +S , R R by a probabilistic function or mapping denoted by f (hence X "f (S )). Since the mapping f is R R such that f (i)" f ( j) may be satis"ed for some di!erent i, j, that is a given output may be observed in di!erent states, the state process +S , R is not observable directly but only indirectly through the output process. The non-observable state process +S , may be either a Markov chain R or a semi-Markov chain while the output process +X , may either be discrete or continuous, R univariate or multivariate. In the following we will restrict our attention to discrete output processes. The discrete output process +X , is related R to the state process +S , by the observation (or R state dependent) probabilities b (y)"P (X "y"S "j) with b (y)"1. H R R H W These observation probabilities can be arranged as a J;N matrix with all rows summing to one. The de"nition of the observation probabilities expresses the assumption that the output process at time t depends only on the state process at time t. Note that X is considered univariate for R
520
Y. GUED DON E¹ A¸.
convenience: the extension to the multivariate case is straightforward since, in this latter case, the elementary observed variables X are asRC sumed to be conditionally independent given the state S "s . R R
In the case of a hidden Markovian model +S , X ,"+S , X , 2 , X , 2,, the characterR R R R RC istic distributions (see Section 4.1) can be de"ned both for the state process +S , and for each eleR mentary output process +X ,. RC