ELSEVIER
Journal of Statistical Planning and Inference 57 (1997) 93-107
Graphical profiles as an aid to understanding plant breeding experiments
Kaye E. Basford a,*, John W. Tukey b,1
a Department of Agriculture, The University of Queensland, Brisbane Qld 4072, Australia
b 408 Fine Hall, Princeton University, Princeton, NJ 08544-1000, USA
Received 1 September 1994; revised 1 May 1995
Abstract
In many plant improvement programs, a considerable number of genotypes (more strictly lines) are grown in a range of environments (which may refer to different locations within the same or different years). The outcomes (or responses) will usually be described by several attributes measured on each genotype within a particular environment. Often, comparison of the patterns of genotype response across environments is more useful than the traditional comparison of individual responses. Successive graphical presentations are used to summarize the information contained in such plant breeding experiments. They include overlap plots, their corresponding semigraphical tables, and profiles across environments for individual genotypes and groups of genotypes. The interpretation of these displays, as an aid to understanding these experiments, is illustrated and discussed.
AMS classification: 62-07; 62-09 Keywords: Overlap plots; Aperture summaries; Response patterns
Introduction
Statisticians have been urged 'to pay more attention to the strategy of dealing with data' (Tukey, 1986a). Here we are focussing on the strategy of analysing a specific plant breeding trial. Such experiments often include a considerable number of
* Corresponding author. Tel.: +61 7 3365 2810; fax: +61 7 3365 1177; e-mail: k.e.basford@mailbox.uq.oz.au.
1 The contribution by John Tukey was prepared in part in connection with research at Princeton University, supported by the Army Research Office, Durham, DAAL03-91-G-0138, and facilitated by the Alfred P. Sloan Foundation.
0378-3758/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved
PII S0378-3758(95)00038-9
K.E. Basford, J.W. Tukey / Journal of Statistical Planning and Inference 57 (1997) 93-107
genotypes (more strictly lines) grown in a range of environments (which may refer to different locations within the same or different years). The outcomes will usually be described by several attributes measured on each genotype in each environment. Often the comparison of the patterns of genotype response across environments will be more useful than the traditional comparison of individual responses. We emphasize the advantages of successive graphical presentations in summarizing the information contained in the data and discuss how they aid in understanding the responses. Thus, there are two parts to such a process:
• finding the appearances, and
• communicating them,
with the second one being just as hard as the first.
We discuss a specific data set (Section 1), Benjamini and Hochberg's FDR procedure (Section 2, essential for a useful balance between optimism and pessimism), the use of overlap bars, or better, the corresponding semigraphical tables (Section 3, equally essential, distinguishing m(m - 1)/2 differences as to confidence in direction with only m symbols), equipping a few profiles with apertures and center bars or center gaps (Section 4), grouping genotypes to allow effective use of profile displays with limited numbers of profiles (Section 5) and a brief word about smoothing profiles (Section 6).
1. Soybean data
The data set being considered here is from an experiment that was first reported by Mungomery et al. (1974). Fifty-eight soybean lines were evaluated at four locations in southeastern Queensland, Australia in 1970 and 1971. Although not strictly correct, we shall refer to these soybean lines as genotypes to avoid any confusion with the geometric meaning of the word 'lines'. The locations, Redland Bay, Lawes, Brookstead and Nambour, were all within 150 km of Brisbane, and covered a range of climatic, cultural and edaphic (soil and topographical) conditions. The experiment was described as a randomized complete block design with two replicates in each location in each year. However, because of differences in maturity and expected growth pattern, the 58 genotypes were actually randomized in two distinct (adjacent) subsets (43 + 15 = 58) within each block. The differences in relative maturity (average days to flowering) among the genotypes are displayed in Exhibit 1 where there is an evident cleavage between
• the first 43 genotypes (40 local selections from a cross, their 2 parents, and one other which was used as a parent in earlier trials) which have ranks 16 to 58, and
• the other 15 (of which 12 were from the US and none were as late as any of the first 43 genotypes) which have ranks 1 to 15.
In this experiment, chemical and agronomic attributes were observed, including seed yield (t/ha), plant height (m), lodging (rating scale 1-5), seed size (g/100 seeds), seed protein percentage and seed oil percentage. Lodging refers to the tendency of the
plant to fall over (1 = upright, 5 = prone), which adversely affects yield and ease of harvesting. As is customary, each year-by-location combination will be referred to as an environment. We have between 5000 and 6000 data values here (counting all six attributes), quite a small number in comparison with international variety trials, where 25 000 data values are not uncommon.
1.1. Preliminary considerations
With regard to the choice of expression in which the data should be analysed, i.e., raw, square root, logarithm, etc., the problems are not unusual and we have looked at various ways of suggesting re-expression. For this paper, we only present analyses of the raw data, without prejudice as to whether or not re-expression might help. Given the experimental details, we are led to consider two possible covariates, days to maturity and days to flowering, in our analysis of the six measured attributes. Days to maturity (when 80% of the pods are ripe) is available for six of the eight environments (as it was not collected in Brookstead and Nambour in 1970). Days to flowering (when the first flower appears) is available in four of the six environments for which days to maturity is available (Lawes, 1970; Nambour, 1971; Redland Bay, 1970, 1971). However, days to flowering is measured more accurately (because it is easier to get right) than days to maturity. Also, the usual maturity group classifications, see Mungomery et al. (1974), are based on days to flowering, not on days to maturity. Thus, it seems best to use days to flowering averaged over those environments in which it was recorded. Exhibit 1 displays the plot of days to flowering against its rank order, and this exhibit can be used to partition the genotypes into three maturity classes, E, M1 and M2. Class E contains the 15 genotypes of overseas origin with shorter days to flowering (early maturing genotypes).
Class M1 contains the 19 genotypes with the shortest days to flowering of the local cultivars, while Class M2 contains the remaining 24 genotypes (of the 43) with the longest days to flowering. (Clearly, the difference between E and M1 is much stronger than that between M1 and M2.) The latter two classes (jointly referred to as Class M) constitute the main body of soybean genotypes under investigation. These two different classes may not respond to the environments in the same way, e.g., if additional rain falls or specific stresses occur between the flowering or maturity times of the classes. For most of what we do, then, it will be desirable to look separately at the 43 genotypes (or perhaps fractions of them) and at the 15 genotypes, rather than looking at all 58 genotypes at once. Judgements, for example, of the degree to which experimental results have shown clear differences are likely to be different for the more homogeneous 43 than for the more heterogeneous 58. If, in some analysis, the 43 should be nearly unresolved, while the 58 include clear differences, attending only to the latter fact could mislead us considerably, even leading us to believe in clearly established differences among the 43.
[Exhibit 1 plot: average days to flowering (vertical axis) against rank of days to flowering (horizontal axis), with maturity classes E, M1 and M2 marked.]
Exhibit 1. Reordering of genotypes (soybean lines are referred to as genotypes here and in all subsequent exhibits to avoid confusion with the geometric meaning of lines) based on average days to flowering measured in Lawes (1970), Nambour (1971) and Redland Bay (1970, 1971) with maturity classes denoted by E, M1 and M2. Ranks 1 to 15 are genotypes 44 to 58, ranks 16 to 58 are genotypes 1 to 43. Genotypes 41 and 43 (o) are the parents of genotypes 1 to 40, while genotype 42 (marked separately) was another parent in previous trials.
2. Within environment analyses for single attributes
Analysis of variance within each environment indicated that there were highly significant differences among the genotypes for each attribute using blocks (replicates) × genotype interaction as the error term. We shall not discuss these further except as they feed a multiple comparison procedure which can be used to compare individual means.
2.1. Multiple comparison procedures
Dealing with multiplicity becomes increasingly painful as the number of things to be compared increases. There are, for example, 1653 = 57(58)/2 pairwise comparisons among 58 genotype means. Neither the usual least significant difference (LSD) nor Tukey's honestly significant difference (HSD), corresponding to 5%-individual and 5%-simultaneous, respectively, is satisfactory. 5% per individual comparison implies an average of 1653/20 = 83 incorrect statements about the pairwise comparisons among 58 means. Unless many more than 83 differences have well-established directions, this seems intolerably many. 5% per family (5%-simultaneous) implies that we spend only 5%/1653 = 0.003% error rate on each comparison, which results in finding significance for the difference far too rarely. Benjamini and Hochberg (1995) have proposed an intermediate choice; one that seems a reasonably satisfactory compromise; one that can be described in an
understandable way. The essential idea is to relate the number of false positive statements to the total number of positive statements, true or false combined. If, for example, we make only 10 positive statements about any of the 1653 comparisons of pairs among 58, we do not want to accept even a modest number of false positive statements. We might live with one, although an average number less than one seems a little better. If we find ourselves working with 1500 positive statements out of the 1653, we can reasonably stand 50 of them being false positives, since this would be only one in 30, and we are used, absent questions of multiplicity, to accepting a 5% chance, 1 in 20, as reasonable.
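The arithmetic behind these error rates can be checked with a short Python sketch. The function names and the dictionary keys are illustrative choices, not part of the original analysis; the numbers 58, 1653, 50 and 1500 are those quoted in the text.

```python
from math import comb


def multiplicity_summary(n_means, alpha=0.05):
    """Error-rate arithmetic for all pairwise comparisons among n_means means."""
    m = comb(n_means, 2)  # number of pairwise comparisons
    return {
        "comparisons": m,
        # average number of incorrect statements at alpha per individual comparison
        "expected_false_individual": alpha * m,
        # error rate left for each comparison when alpha is shared by the family
        "rate_per_comparison_family": alpha / m,
    }


def false_positive_fraction(n_false, n_positive):
    """Fraction of positive statements that are false positives."""
    return n_false / n_positive


summary = multiplicity_summary(58)
print(summary["comparisons"])                       # 1653
print(round(summary["expected_false_individual"]))  # 83
print(false_positive_fraction(50, 1500))            # 50/1500, i.e. one in 30
```

The per-family figure is the simple division 5%/1653 used in the text; the exact HSD critical value would come from the studentized range and is not computed here.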
2.2. Benjamini and Hochberg multiple comparisons
The Benjamini and Hochberg procedure follows this line of thought and plans to control, at or below 5%, the average value, over many experiments, of the false discovery rate (FDR), given by the number of false positive statements divided by the number of positive statements. It is not trivially clear that any procedure will have the desired property, but we are fortunate that Benjamini and Hochberg have found, and expounded, a procedure that is not too hard to use, and does what is desired. In accordance with Benjamini and Hochberg (1995), let $m$ be the number of comparisons to be made (here $m = 1653$). Let $i = 1, \ldots, m$ be the rank of the absolute value of the t-statistic for the comparison concerned when ordered from largest to smallest, so that the observed p-values, $P_{(i)}$ for the $i$th comparison, are weakly increasing from $i = 1$ to $i = m$, i.e., they are ordered from smallest to largest. Then, we can be confident of the observed direction of the $i$th comparison when, beginning with the $m$th comparison, $i$ is the first integer satisfying
$$P_{(i)} \le P_{\mathrm{FDR}(i)} = \frac{i\alpha}{2m}.$$
We declare a B-confident direction for all comparisons (and only those) for which $j \le i$. Let the rank of this threshold comparison be $i_c$ (here $i_c = 87$). Because we are using a common standard error and equally replicated means, we can determine the BSD-value (Benjamini and Hochberg significant difference) by taking the corresponding t-value for the threshold comparison, $t_c$, and calculating
$$\mathrm{BSD} = |t_c| \sqrt{2 \cdot \mathrm{EMS}/n},$$
where EMS is the error mean square and $n$ is the number of values in the means being compared. (This is, of course, equal to the absolute value of the difference in the means which produced $t_c$.) Then for any absolute difference greater than this, we can assert B-confidence in direction at the $\alpha$-level. Hence, the BSD is interpreted in the same way as a least significant difference or a critical studentized range (its behaviour is intermediate between them). Benjamini and Hochberg (1995) provide more detail on their multiple comparison procedure. The population of experiments being averaged over to control the false
discovery rate (at or below 5%) in our case is that for the same 58 genotypes in a different collection of environments.
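The step-down search for the threshold rank and the BSD calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the p-values are invented for the example (they are not from the soybean data), and the threshold is taken exactly as written in the text, i.e. i times alpha divided by 2m.

```python
from math import sqrt


def threshold_rank(p_values, alpha=0.05):
    """Largest rank i with p_(i) <= i * alpha / (2 * m); 0 if there is none."""
    m = len(p_values)
    ordered = sorted(p_values)  # p_(1) <= ... <= p_(m)
    for i in range(m, 0, -1):   # begin with the m-th comparison and step down
        if ordered[i - 1] <= i * alpha / (2 * m):
            return i
    return 0


def bsd(t_c, ems, n):
    """BSD = |t_c| * sqrt(2 * EMS / n) for equally replicated means."""
    return abs(t_c) * sqrt(2.0 * ems / n)


# Hypothetical p-values for m = 5 comparisons (not taken from the soybean data).
p_values = [0.2, 0.001, 0.6, 0.004, 0.03]
i_c = threshold_rank(p_values)
print(i_c)  # 2: comparisons ranked 1 and 2 receive a B-confident direction
```

Because the search starts at the largest rank, a comparison can be declared B-confident even when its own p-value exceeds its own threshold, provided a higher-ranked comparison meets its threshold; for example, `threshold_rank([0.02, 0.024])` returns 2.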
3. Overlap plots
While the graphical and semigraphical procedures that follow can equally easily be based upon any choice of significant difference, we find other choices (than BSD) either too radical or too conservative. We will use 5% BSD values to drive both graphical and semigraphical description. Graphical descriptions involving 1653 different separate elements would be unbearable, so we need to restrict attention to approaches that plot (or tabulate) one thing per value compared. If we are to focus on differences, rather than on individual values, this means using overlap-underlap bars, that is, bars for which failure to overlap corresponds to confidence about direction. If failure to reach B-confidence about direction corresponds to overlapping bars, then the bars should extend BSD/2 up and down from each point estimate. This is because, as noted earlier, we are using a common standard error and equally replicated means. Exhibit 2 contains the overlap plot (of observed mean ± (5% BSD)/2) for a particular attribute-environment combination with the genotypes ordered by the original genotype number (panel 2(a)) and also with the genotypes reordered by days to flowering (panel 2(b)). We have found it even more useful to sort the genotypes on increasing order of their point estimates taken modulo 5% BSD (the length of the bars), with ties (incompletely, perhaps) broken by the point estimates themselves (panel 2(c)). The staircase to which the genotype belongs depends on the corresponding integer part, e.g., 0 for the lowest, 1 for the next, etc.
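The construction of the overlap bars and the 'modulo 5% BSD' sort can be sketched in Python. The helper names are illustrative assumptions; the BSD value of 1.860 and the five mean yields are those reported for Lawes 1970 in Exhibits 2 and 3.

```python
BSD = 1.860  # 5% BSD quoted in Exhibit 2 for mean yield (t/ha), Lawes 1970

# A few genotype means (t/ha) taken from Exhibit 3.
means = {51: 0.587, 52: 0.775, 20: 0.998, 31: 2.562, 48: 4.381}


def overlap_bar(mean, bsd=BSD):
    """Bar extending BSD/2 below and above the point estimate."""
    return (mean - bsd / 2, mean + bsd / 2)


def bars_overlap(bar_a, bar_b):
    """True when the two bars share at least one point."""
    return bar_a[0] <= bar_b[1] and bar_b[0] <= bar_a[1]


def staircase_and_residue(mean, bsd=BSD):
    """Integer part (staircase index) and residue of mean modulo BSD."""
    staircase = int(mean // bsd)
    return staircase, mean - staircase * bsd


# Sort on residue modulo BSD, breaking ties by the point estimate itself.
order = sorted(means, key=lambda g: (staircase_and_residue(means[g])[1], means[g]))
print(order)  # [51, 48, 31, 52, 20]
```

With these values, the bars for genotypes 51 and 31 fail to overlap (their means differ by more than the BSD), while those for genotypes 51 and 52 overlap.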
3.1. Semigraphical table form
For us, a semigraphical table is one in which the detailed positioning of entries conveys information otherwise naturally conveyed by a graphic display. The major defect with Exhibit 2(c) is an inadequate ability to include useable labels for each of the 58 overlap-underlap bars; only genotypes 20, 48, 51, 52 and 58 were identified. Exhibit 3 shows the result of converting this panel into semigraphical form. The three staircases of overlap bars (inadequately labelled) become three columns each of numerical entries (giving both genotype number and quantitative response). Nonoverlap becomes differing by more than a shift of one column at a fixed height: 'more' either because of an appropriate difference in height or because of a difference of more than one column. The major consequence of the 'modulo 5% BSD' sort order in panel 2(c) was that whenever the value corresponding to the upper end of one bar is near that
[Exhibit 2 plot: three panels of overlap bars for yield (t/ha); the horizontal axes are (a) genotype number, (b) rank on days to flowering, and (c) residue modulo 5% BSD.]
Exhibit 2. Overlap plots based on the 5% BSD value of 1.860 (the length of the bars) for the 58 genotypes for mean yield (t/ha) in Lawes 1970 (a) ordered by genotype number, (b) ordered by mean days to flowering, and (c) ordered by residue modulo 5% BSD with the staircase to which the mean belongs depending on the corresponding integer part.
for the lower end of another, the lateral separation of the bars is small. In Exhibit 3, the consequence is that the two corresponding genotypes (for which the bars are close to being one above the other) appear in adjacent columns, at nearly the same level.
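The reading rule for the semigraphical table can be sketched in Python: two entries differ B-confidently when they are more than one column apart, or exactly one column apart with the entry in the higher column also sitting higher up. The function names are illustrative assumptions, and heights are taken to be the residues modulo the 5% BSD; the mean yields are read from Exhibit 3.

```python
BSD = 1.860  # 5% BSD used for Exhibit 3


def table_position(mean, bsd=BSD):
    """Column (staircase index) and height (residue modulo BSD) of an entry."""
    column = int(mean // bsd)
    return column, mean - column * bsd


def confident_by_position(mean_a, mean_b, bsd=BSD):
    """Read B-confidence off the table layout rather than off the difference."""
    low, high = sorted((mean_a, mean_b))
    col_low, height_low = table_position(low, bsd)
    col_high, height_high = table_position(high, bsd)
    shift = col_high - col_low
    # more than one column apart, or one column apart and higher up
    return shift > 1 or (shift == 1 and height_high > height_low)


# Mean yields (t/ha) of genotypes 51, 31, 39 and 48 from Exhibit 3.
print(confident_by_position(0.587, 2.562))  # True: difference 1.975 exceeds 1.860
print(confident_by_position(0.587, 2.436))  # False: difference 1.849 falls short
print(confident_by_position(0.587, 4.381))  # True: two columns apart
```

For the values tried here the positional rule agrees with the direct test that the absolute difference in means exceeds the BSD, which is the property the table layout is designed to preserve.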
Exhibit 3. Confidence-no confidence information about directions of differences^a between mean yields in t/ha (in parentheses) of genotypes in Lawes (1970) (corresponding to Exhibit 2(c), using the 5% BSD value of 1.860)
[Exhibit 3 table: printed in two panels, the second marked as a continuation^b; entries are listed here by column, in decreasing order of mean yield.]
Left column: 29 (1.849), 17 (1.722), 11 (1.663), 16 (1.655), 40 (1.556), 12 (1.472), 44 (1.456), 46 (1.440), 19 (1.434), 18 (1.432), 30 (1.402), 21 (1.373), 43 (1.370), 58 (1.227), 20 (0.998), 52 (0.775), 51 (0.587)
Middle column: 50 (3.375), 45 (3.259), 26 (3.211), 56 (3.170), 57 (3.119), 32 (3.003), 10 (2.985), 4 (2.877), 28 (2.775), 49 (2.761), 53 (2.748), 54 (2.734), 14 (2.721), 7 (2.699), 24 (2.643), 9 (2.567), 3 (2.567), 31 (2.562), 8 (2.457), 39 (2.436), 6 (2.409), 23 (2.406), 34 (2.401), 33 (2.395), 5 (2.392), 1 (2.387), 35 (2.341), 38 (2.341), 25 (2.315), 55 (2.307), 47 (2.298), 2 (2.282), 42 (2.242), 27 (2.239), 15 (2.220), 22 (2.185), 37 (2.166), 36 (1.981), 41 (1.967), 12 (1.965)
Right column: 48 (4.381)
4. Profiles across environments and B-confidence
Plant breeders want to know about the response pattern of the genotypes across environments for each attribute. Such response or performance patterns will be called
a For example, genotype 51 (with a mean yield of 0.587) is B-confidently less than genotype 31 (with a mean yield of 2.562), as well as those above genotype 31 in the middle column and the single genotype in the right hand column.
b The right hand panel is a continuation from the bottom of the left hand panel.
profiles across environments. For reasonable transfer of information, we would want to present, in one picture, profiles for perhaps no more than five or six genotypes (or groups of genotypes). We are likely to need to convey some confidence-like, or overlap-like, information about these profiles. We can proceed using the 5% BSD overlap bars. Consider the mean yield data (from plots in two replicates) for genotypes 1 to 4 in Lawes 1970:

Genotype         Yield    5% BSD bar
1                2.387    1.427 to 3.347
2                2.282    1.322 to 3.242
3                2.567    1.607 to 3.527
4                2.877    1.917 to 3.837
Outer summary             1.322 to 3.837
Inner summary             1.917 to 3.242
We now have four overlap bars associated with four genotypes and one environment. With eight (in other data sets, more) environments, we cannot afford more than one, or possibly two, uncertainty indicators associated with an environment, and we cannot afford confusion due to overlapping. We do not know how to meet such a requirement in general, but where the profiles are selected to be similar, as where we plan to work, we can do fairly well. Our first step mirrors a general approach to confidence interval presentation in which, rather than showing the interval where the typical value (estimated parameter) is supposed to fall, we show the complement of the interval, putting ink everywhere we are confident that the typical value does not fall. We can do this by drawing arrows from the top and bottom of the figure; arrows that stop where the interval would end. We now do a similar thing for overlap bars rather than confidence intervals, as shown in the next to left panel of Exhibit 4. We want to summarize our four overlap bars as compactly and usefully as we can. Two kinds of summaries are natural: the common part of the four bars which we shall call the inner (hollow) bar, and the most extreme values of any of the four which are at arrow ends. Panel C in Exhibit 4 shows the result for the example. Here the conversion of bars into apertures, between panels A and B, makes the construction clearer and the result less messy. These results are combined with the (imaginary) profiles for the four genotypes in panels D and E. This is extended to all eight environments in Exhibit 5 (with actual profiles), where the interpretation of overlapping of inner bars for two different environments is that the directions of none of the (4 × 2)(4 × 2 - 1)/2 = 28 differences is established.
Nonoverlapping of two apertures, by contrast, signifies that the directions of all (4) (4) = 16 differences between a profile in one environment and a profile in the other are established to take the same direction. Something that does happen in Exhibit 5, at two of the environments, is an inner gap, shown by dashed horizontal lines, where the low end of the high overlap bar
[Exhibit 4 plot: five panels (A to E) showing yield (t/ha) for genotypes 1 to 4. Panel key: A, confidence bars; B, corresponding apertures; C, outer aperture and inner bar; D, outer aperture and (imaginary) profiles; E, outer aperture, (imaginary) profiles and inner bar.]
Exhibit 4. Mean yield (t/ha) for genotypes 1 to 4 in Lawes (1970).
[Exhibit 5 plot: profiles of yield (t/ha) across the eight environments (codes B, R, R', N, L, N', B', L'), with a second vertical scale marked at 60%, 80%, 120% and 140% of the median.]
Exhibit 5. Mean yield (t/ha) and supplemented B-summaries for genotypes 1 to 4 (symbols defined in Exhibit 4) in eight environments (see text for environment code). Apertures correspond to overlap bars and are about 70% of the length of B-confidence intervals.
exceeds the high end of the low overlap bar (and hence no inner bar), indicating that the direction of difference for at least one pair of profiles at this environment is well established. In Exhibit 5, the order of the environments on the horizontal axis is based on the mean for that environment over all 58 genotypes (more detail below). The convention used in labelling the environments is the first letter of the location with a prime denoting 1971 (and absence of a prime denoting 1970). Two scales are presented on the vertical axis: the original one (on the left-hand side) and another scale representing percentages of the median across environments of the mean yields in each environment (on the right-hand side). In practice, we can also follow this procedure and present such aperture profiles for groups of genotypes as well as for individual genotypes.
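The interpretation of overlapping inner bars, nonoverlapping apertures and inner gaps can be checked by brute force in Python. This is a sketch under stated assumptions: the overlap bars below are hypothetical values chosen for illustration (not read from Exhibit 5), and the helper names are our own.

```python
from itertools import combinations, product


def outer(bars):
    """Aperture limits: most extreme ends of any bar."""
    return (min(lo for lo, _ in bars), max(hi for _, hi in bars))


def inner(bars):
    """(low, high) of the common part; low > high signals an inner gap."""
    return (max(lo for lo, _ in bars), min(hi for _, hi in bars))


def disjoint(a, b):
    """True when two intervals fail to overlap."""
    return a[1] < b[0] or b[1] < a[0]


# Hypothetical overlap bars for four profiles at three environments.
env_1 = [(1.0, 3.0), (1.2, 3.2), (1.5, 3.5), (1.8, 3.8)]
env_2 = [(2.0, 4.0), (2.2, 4.2), (2.4, 4.4), (2.6, 4.6)]
env_3 = [(4.0, 6.0), (4.1, 6.1), (4.5, 6.5), (4.2, 6.2)]

# Inner bars of env_1 and env_2 overlap: none of the 28 directions is established.
pairs_12 = list(combinations(env_1 + env_2, 2))
none_established = not any(disjoint(a, b) for a, b in pairs_12)

# Apertures of env_1 and env_3 do not overlap: all 16 directions are established.
pairs_13 = list(product(env_1, env_3))
all_established = all(disjoint(a, b) for a, b in pairs_13)

print(len(pairs_12), none_established)  # 28 True
print(len(pairs_13), all_established)   # 16 True
```

An inner gap at a single environment shows up as `inner(bars)` returning a low end above its high end, for example for the two bars (1.0, 3.0) and (3.2, 5.2).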
5. Grouping the data
When we move from single values to profiles, we cannot look at (and comprehend) many different profiles at once. Something like four to six is all we can compare and have a good transfer of information from eye to brain. For the 58 (or 43) genotypes, this means that we must group (or aggregate) the genotypes into perhaps 6 to 8 fractions (or combine in stages). Then we can display summary profiles of these fractions in one picture and the profiles for the individual genotypes within these fractions in other pictures. Because of the major differences in maturity displayed in the preliminary analyses, we decided to group the genotypes separately within the three maturity classes, i.e., E, M1, and M2. The groups thus formed show, as we see from Exhibit 6, rather detailed compatibility with days to flowering, whose values were not used in constructing the groups (but were used to order groups and subgroups). Thus, the decision to work separately within each of the three classes is unlikely to have had any bad effects. Our purpose in grouping is to ease and improve the transfer of insight, and information, to the viewer of the profiles. We do not care whether or not the groups of soybean genotypes formed are 'real', either actually or demonstrably. Any of various grouping procedures would be likely to serve us almost as well. Accordingly, and to save space, we shall omit all details, noting only that we did find new coordinates that were formally orthonormal and used Euclidean distance.
5.1. Graphical presentation
The first decision in presenting these profiles across environments (for each attribute) is the order (and spacing) of the environments on the horizontal axis. We have chosen to use such summary behaviour as the overall mean across all 58 soybean genotypes within each environment. This was originally put forward by Yates and Cochran (1938), popularized in the joint linear regression literature by Finlay and
Maturity class   Group   Subgroup   Genotypes         Median days to flowering
E                1       y          51 58             33.3
E                1       z          44 46 52          40.5
E                2       x          54 55 56 57       39.6
E                2       y          45 47 49 53       43.4
E                2       z          48 50             44.6
M1                                  42                55.3
M1                       w          26 27 35          56.0
M1                       x          39 40 41          58.6
M1                       y          15 24 38          60.1
M1                       z          2 25              61.9
M1                                  4 9 10            60.9
M1                                  3 5 6 7           61.4
M2                                  1 14 16 28 33     68.0
M2                                  23 31 34          69.4
M2                                  32 37             70.2
M2               7       v          19 29             69.1
M2               7       w          20 22 36          72.0
M2               7       x          8 30 43           73.3
M2               7       y          11 12 13 21       74.6
M2               7       z          17 18             75.2

Exhibit 6. First and second stage grouping of genotypes. The embolded genotypes, 41 and 43, are parents of local selections 1 to 40, while the italicized genotype 42 was the other parent used in previous trials.
Wilkinson (1963), Eberhart and Russell (1966) and Perkins and Jinks (1968), and also used in the initial pattern analysis literature by Byth et al. (1976). To illustrate our displays, we shall consider mean yield. Given that the environment means are reasonably distinct, we shall use these directly on the horizontal axis, as an alternative to equal spacing (as was used in Exhibit 5). To gain an overview of the differences among the groups, we present the profiles of the group means in the southeast corner of Exhibit 7. Solid lines have been used for the groups in maturity class E, dotted lines for those in maturity class M1 and dashed lines for those in maturity class M2. Then to gain an appreciation of the variability within these groups, we have plotted the profiles of the individual soybean genotypes within each group in the seven other panels of Exhibit 7. It is fairly clear that different patterns of response are being identified among the different groups.
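Placing environments on the horizontal axis at their overall mean can be sketched in Python as follows. The yields are hypothetical values for three genotypes in three environments (not the soybean data), and the function names are illustrative.

```python
# Hypothetical mean yields (t/ha): one list per environment, one value per genotype.
yields = {
    "L": [2.0, 2.2, 2.1],
    "B": [1.4, 1.6, 1.5],
    "N'": [2.4, 2.6, 2.5],
}


def environment_means(yields):
    """Overall mean across genotypes within each environment."""
    return {env: sum(values) / len(values) for env, values in yields.items()}


def profile(yields, genotype_index):
    """(environment mean, genotype yield) pairs, ordered along the horizontal axis."""
    means = environment_means(yields)
    points = [(means[env], yields[env][genotype_index]) for env in yields]
    return sorted(points)


env_order = sorted(yields, key=environment_means(yields).get)
print(env_order)  # ['B', 'L', "N'"]
```

Each profile is then the broken line joining the points returned by `profile`, so that both the order and the spacing of the environments follow the environment means rather than an equal-spacing convention.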
[Exhibit 7 plot: eight panels (Group 1 to Group 7, and Groups 1 to 7 together), each plotting yield (t/ha) against overall environment mean yield (horizontal axis, about 1.5 to 2.5).]
Exhibit 7. Profiles of mean yield (t/ha) for individual genotypes within each group and profiles of mean yield for groups against overall environment mean. Numbers in quotation marks are mean days to flowering. The ordering of the environments on the horizontal axis is given in Exhibit 5, and the maturity class E, M1 and M2 has been denoted by solid, dotted, and dashed lines, respectively.
6. Smoothing the profiles
Our purpose in smoothing is mainly to ease and improve the transfer of insight about the behaviour underlying the data. We expect there will be situations where
prediction for a new environment, for which only the horizontally plotted quantity is given, will be better from the smoothed broken line (pieces of profile) than from the observed broken line, but we know better than to promise this. Our smoothing needs to be robust, and so cannot always have the smooth of a sum equal to the sum of the smooths. This means we can take advantage of subtracting a 'smooth portion', smoothing the resulting residuals, and adding back the 'smooth portion'. We did this (with a median-based robust smoother), but do not have space here for details. However, we have illustrated the result for Group 7 (Exhibit 8) where we have ordered the environments on the mean yield within Group 7 on the horizontal axis.
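The idea of subtracting a smooth portion, smoothing the residuals and adding the portion back can be illustrated in Python. This is only a sketch under loud assumptions: the paper does not give the smoother, so a running median of three stands in for the median-based robust smoother, the horizontal-axis values are used as the smooth portion, and the profile values are hypothetical.

```python
from statistics import median


def running_median3(values):
    """Running median of three; the two end values are copied unchanged."""
    if len(values) < 3:
        return list(values)
    middle = [median(values[i - 1:i + 2]) for i in range(1, len(values) - 1)]
    return [values[0]] + middle + [values[-1]]


def smooth_profile(y, smooth_portion):
    """Subtract a smooth portion, smooth the residuals, add the portion back."""
    residuals = [yi - si for yi, si in zip(y, smooth_portion)]
    smoothed = running_median3(residuals)
    return [si + ri for si, ri in zip(smooth_portion, smoothed)]


# Hypothetical profile; the horizontal-axis values serve as the smooth portion.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [0.6, 1.0, 2.4, 2.1, 2.6]
print([round(v, 3) for v in smooth_profile(y, x)])  # [0.6, 1.1, 1.6, 2.1, 2.6]
```

The running median also shows why the smooth of a sum need not equal the sum of the smooths: smoothing [0, 1, 0] and [1, 0, 0] separately and adding gives [1, 0, 0], whereas smoothing their sum [1, 1, 0] leaves it unchanged.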
[Exhibit 8 plot: two panels, (a) unsmoothed profiles and (b) smoothed profiles, for subgroups 7v, 7w, 7x, 7y and 7z; vertical axis yield (t/ha), horizontal axis mean yield for Group 7 (t/ha), with environments marked R, B, R', L, B', N, L', N'.]
Exhibit 8. Unsmoothed and smoothed profiles of median yield (t/ha) for five subgroups, v to z, in Group 7 against the mean yield for Group 7 (see text for environment code).
7. Comments
The investigation and results presented here provide a partial account of a larger treatment of data analytic approaches to plant breeding trials (Basford and Tukey, 1997).
References
Basford, K.E. and J.W. Tukey (1997). Data Analytic Procedures for Plant Breeding Trials (in preparation).
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289-300.
Byth, D.E., R.L. Eisemann and I.H. DeLacy (1976). Two-way pattern analysis of a large data set to evaluate genotypic adaptation. Heredity 37, 215-230.
Eberhart, S.A. and W.A. Russell (1966). Stability parameters for comparing varieties. Crop Sci. 6, 36-40.
Finlay, K.W. and G.N. Wilkinson (1963). The analysis of adaptation in a plant breeding programme. Australian J. Agric. Res. 14, 742-754.
Mungomery, V.E., R. Shorter and D.E. Byth (1974). Genotype x environment interactions and environmental adaptation. I. Pattern analysis: application to soya bean populations. Australian J. Agric. Res. 25, 59-72.
Perkins, J.M. and J.L. Jinks (1968). Environmental and genotype-environmental components of variability, III. Multiple lines and crosses. Heredity 23, 339-356.
Tukey, J.W. (1986a)². Sunset salvo. Amer. Statist. 40, 72-76.
Yates, F. and W.G. Cochran (1938). The analysis of groups of experiments. J. Agric. Sci. 28, 556-580.
² Letters used with years on John Tukey's publications correspond to bibliographies in all volumes of his collected works.