Binning Clones by Hybridization with Complex Probes: Statistical Refinement of an Inner Product Mapping Method

GENOMICS 41, 141–154 (1997) GE974652 ARTICLE NO. Binning Clones by Hybridization with Complex Probes: Statistical Refinement of an Inner Product Ma...

Download PDF

253KB Sizes 2 Downloads 27 Views

Report

PDF Reader
Full Text

GENOMICS

41, 141–154 (1997) GE974652

ARTICLE NO.

Binning Clones by Hybridization with Complex Probes: Statistical Refinement of an Inner Product Mapping Method CHRIS ANDREWS,* B. DEVLIN,*,†,1 MARK PERLIN,‡

AND

KATHRYN ROEDER*

*Department of Statistics and ‡Department of Computer Science, Carnegie Mellon University; and †Department of Psychiatry, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania 15213 Received June 18, 1996; accepted January 24, 1997

Molecular methods that use long-range information to solve genomics problems (i.e., top-down strategies) efficiently have become increasingly prominent in the genomics literature. One such method, an implementation of inner product mapping (IPM), uses noisy, longrange radiation hybrid (RH)/YAC overlap data and relatively noise-free RH/STS overlap data to localize clones to specific chromosomal regions. Because the molecular data are rarely noise-free, statistical models tailored to the top-down molecular methods make the methods far more effective. We develop two statistical models for IPM (or any other top-down strategy of similar form), a parametric logit model and a nonparametric order-restricted model, and show how these models can be implemented within a hierarchical Bayes framework. Using these models, we refine the chromosome 11 map reported in M. Perlin et al. (1995, Genomics 28: 315–327). Our analyses improve the IPM map, both in terms of successful localization of clones and in terms of the confidence with which they are localized. q 1997 Academic Press

INTRODUCTION

Top-down strategies, which divide a large, difficult problem into smaller, simpler subproblems or which use long-range information to simplify the problem (Aho et al., 1983), play an increasingly prominent role in genomics research. The process by which the sequence-tag site (STS) framework map of the human genome was built (Hudson et al., 1995) provides an excellent illustration of a top-down strategy. The goal of this mapping effort was to locate STS markers on specific human chromosomes and, within chromosomes, determine their order and approximate locations. To map about 15,000 STS markers, two longrange resources were employed: radiation hybrid (RH) mapping of monomorphic STSs and genetic mapping of polymorphic STSs. Together, these methods mapped about 70% of the markers, which then function as a framework on which to hang new markers. STS-con1 To whom correspondence should be addressed. Telephone: (412) 268-8973. Fax: (412) 268-7828. E-mail: [email protected].

tent mapping (Green and Green, 1991) with megaYACs (Bellanne-Chantelot et al., 1992) was used to establish local proximity groups for new STS markers and to locate these groups relative to the premapped framework markers. Without the top-down long-range information, the bottom-up information provided by STScontent mapping would yield little more than rough, local clusters. The STS framework map will be extremely useful for cloning disease genes and as a foundation for largescale sequencing efforts. However, the map itself will be insufficient to complete the next goal of the human genome project, determining the entire nucleotide sequence, even if it achieves a much higher STS density. Other sequencing resources will be needed, particularly smaller, sequenceable clones such as BACs or PACs localized to specific genomic regions. Localization of BACs or PACs could be achieved by any of a large number of methods. One top-down method that could accomplish this task, dubbed IPM for inner product mapping (Perlin and Chakravarti, 1993), was used by Perlin et al. (1995) to develop a YAC map of chromosome 11. This method shares features with the STS-framework mapping methodology in that each uses long-range information for efficient localization and each utilizes both noisy and relatively noisefree data. A fundamental difference between the methods is the definition of the bins and what is binned. For the framework map, the bins are defined by chromosomal breaks, specifically recombinant, RH, and megaYAC breaks, and it is the STSs themselves that are binned (Cox et al., 1994). For IPM, bins are chromosomal regions surrounding STSs, and it is the clones themselves that are binned. While different in philosophy, these two methods are clearly complementary. IPM uses two kinds of data to achieve its binning. For a set of STSs and RHs, RH versus PCR STS comparisons (Cox et al., 1990) generate data to position the fragments of each RH; statistically, each experiment yields an indicator describing whether the STS and RH overlap. For a library of clones and the same set of RHs, the RHs are serially hybridized against the set of clones (Monaco et al., 1991) to generate data that position RHs relative to each clone; statistically, each

141

AID

GENO 4652

/

6r2f$$$261

03-15-97 00:54:29

0888-7543/97 $25.00 Copyright q 1997 by Academic Press All rights of reproduction in any form reserved.

gnmal

142

ANDREWS ET AL.

into a series of far smaller, simpler experiments can make a molecular task more feasible, top-down strategies pave the way for more efficient and stable statistical analyses by reducing the problem to a simpler set of subproblems. In this report, we refine the chromosome 11 map reported in Perlin et al. (1995). To accomplish this task, we develop statistical methods to complement IPM or any other top-down method of similar structure. The objective of these methods is to assign YAC (or smallerinsert bacterial) clones to local chromosomal regions centered by STS locations (‘‘STS bins’’) based on the overlap characteristics of a set of RHs with both the clones and the STSs. The likelihood model is incorporated into Bayesian hierarchical models (Berger, 1985), which provides a convenient framework for expressing the likelihood results in terms of the probability that a given clone is from each STS bin. Powerful computational methods that are available within this Bayesian framework are obtained to determine these probabilities. Finally, the clone map is evaluated using independent, clone–clone hybridization data (Qin et al., 1996). FIG. 1. Pictoral representation of IPM method. (a) An example of the true clone locations along the chromosome. Note that clones 2 and 4 do not overlap any STS. (b) The two idealized data tables: table A records the overlap (1) or nonoverlap (0) results of the YAC vs RH comparisons; table B records the (1/0) results of the RH vs STS comparisons. (c) The clone localization table obtained by comparing rows of table A and columns of table B. Note that clones 2 and 4 are correctly localized to STS bins near their true locations.

experiment ideally yields an indicator describing whether the clone and RH overlap. The key idea of IPM is straightforward when these indicators of overlap are considered: If the set of RHs that overlap an STS also appear to overlap a specific clone, then the clone’s true location on the chromosome is most likely proximate to that STS (Fig. 1). The molecular results are less than ideal, however. Although the STS versus RH data are almost noisefree, there is usually substantial uncertainty about whether a clone and an RH overlap. In this instance, an experimental result for clone versus RH overlap is commonly scored on a scale such as from 02, for unlikely to overlap, to 2, for very likely to overlap. The IPM study of Perlin et al. (1995) successfully used L2 minimization (L2-min) to compare the RH scores of clones and STSs, in part to overcome this data uncertainty. This L2-min implementation has limitations, however, which include limited probabilistic interpretation of the map and less than maximal efficiency. These limitations can be overcome naturally by building a statistical model that dovetails with IPM’s topdown strategy. In fact, there can be a synergism between top-down strategies and statistical methods that is very appealing. Top-down strategies can turn potentially intractable statistical problems into problems that are easily solved by powerful statistical methods. In the same way that changing a single, very complex experiment

AID

GENO 4652

/

6r2f$$$262

03-15-97 00:54:29

MOLECULAR METHODS With a very dense STS map, clone binning can be achieved by directly comparing each clone against each STS. Instead, for greater experimental efficiency, Perlin et al. (1995) implemented an IPM approach (Perlin and Chakravarti, 1993) to obtain the desired binning. Specifically, Perlin et al. (1995) binned chromosome 11-specific YAC clones using two datasets: a dataset containing an RH signature for each clone, obtained by DNA hybridization studies, and a dataset containing an RH signature for each mapped STS, obtained by PCR comparison. These data, the clone versus RH table and the RH versus mapped STS table, were then computationally combined to indirectly infer each clone’s location relative to the mapped STSs. Because we analyze the same data that Perlin et al. (1995) used to produce their chromosome 11 map, we only outline their molecular methods. Refer to the original paper for more details. The clone versus RH table. The clones Perlin et al. (1995) mapped were the Roswell Park Cancer Institute (RPCI) library of 1728 nonchimeric chromosome 11 YACs, which have a 350-kb average insert size (Qin et al., 1996). To produce the clone versus RH table, they robotically spotted Alu-PCR products of chromosome 11 YACs onto nylon membranes, then serially hybridized these against Alu-PCR products from RHs, which are described below. The results of these experiments were scores, at one of five levels, indicating the probability of RH/clone overlap. Only the 1295 clones that obtained at least one probable RH/clone overlap were used in our analyses. The RH versus mapped STS table. The complex probes used by Perlin et al. (1995) were a collection of 73 chromosome 11-derived radiation hybrids (Richard et al., 1991, 1993), having an 8-Mb average interbreak distance, an average retention probability of 0.25, and a minimum retention probability of 0.02. These RHs had been previously characterized by a set of 506 chromosome 11-specific STSs (James et al., 1994), demonstrating 240 unique RH bins and an average interbin distance of 625 kb. The binary comparison scores of the 73 RHs versus the 240 bins had been used to map the STSs (James et al., 1994), and also served to localize the chromosomal fragments of each RH relative to the mapped STSs. This methodology differs from the more typical protocol by which simple probes are hybridized against more complex immobilized targets. To reduce the number of experiments significantly, IPM draws on Lehrach’s idea (Monaco et al., 1991) of reversing the hybridization direction and using complex probes derived from inter-Alu-PCR (Nelson et al., 1989) products of RH DNA. When such complex DNA probes are used, there are at least four potential problems. First,

gnmal

STATISTICAL METHODS FOR BINNING CLONES the inter-Alu amplification of the complex DNA may be incomplete due to PCR competition between its many inter-Alu sites; this leads to underrepresentation of the inter-Alu sites in the probe and can cause false-negative scores. Second, because any one target has few inter-Alu sites, and the labeled probe may have thousands of sites (only a few of which are specific for the target), there is a potential signal-to-noise problem in detecting true hybridizations. Third, because a target can present a variable number of sites, and the complex probe may show large variation in amplification, labeling, and binding of these sites, the observed hybridization intensities can exhibit large variation. Fourth, repetitive genomic DNA in the complex probe may cause false-positive scores. None of these obstacles is insurmountable, as the results of Perlin et al. (1995) show: 1. Because the false-negative rate exceeded 50%, the inter-Alu competition did appear to reduce inter-Alu site representation in the complex probe. However, computational methods separated the signal from the noise in these data. 2. The inherently low signal-to-noise ratio required the use of larger amounts of radioactive label to provide useful data. 3. The variability of inter-Alu hybridization generated a wide range of intensities on the autoradiographs. By using a quantitative as opposed to a binary scoring scale, however, the data were described adequately. In these data, both true and false DNA overlaps could produce identical hybridization signals; thus, the observed intensity can only be interpreted probabilistically. 4. The repetitive DNA was suppressed by competing the labeled probe with an excess of Cot-1 DNA and by including placental DNA in the prehybridization solution. Since the hybridization data has a low false-positive rate, genomic repeats did not appear to pose a substantial problem.

STATISTICAL ANALYSES

The objective is to estimate the (posterior) probability that a clone should be assigned to a particular STS bin (that is, the short chromosomal region surrounding an STS). To obtain this probability, two items are required: a prior probability distribution for the bin of the clone and the likelihood of each bin based on the data (Berger, 1985). For this application and most applications of this kind, there is little a priori information about the clone’s true position in the genome. For this reason we use distributions that give each bin equal weight, a priori. Where other priors are required, we use distributions that are dominated by the likelihood and hence have little influence on the outcome of the analysis. Bayesian models have been used successfully for other genomic applications. Lange and Boehnke (1992), for example, used Bayesian methods for RH mapping. Although such models are analytically daunting, they are still tractable by applying powerful computational machinery. Because the models we propose are too complex for direct analysis, we use a Markov chain Monte Carlo algorithm (Geyer, 1989; Gelfand and Smith, 1990; Smith and Roberts, 1993) to generate an estimate of the posterior distribution of the bin of each clone. Where appropriate, each of the following subsections begins with a general explanation of the methods described therein and then proceeds to give technical supporting matter. The main ideas of the methodology should be apparent without thorough consideration of the technical matter.

AID

GENO 4652

/

6r2f$$$262

03-15-97 00:54:29

143

Basic Notation For notation, let i Å 1, . . . , I index the clones; j Å 1, . . . , J index the STSs; and k Å 1, . . . , K index the RHs. For the chromosome 11 dataset, I Å 1295, J Å 240, and K Å 73. Also, let tables A, B, and C represent the true state of nature: Aik Å 1 if clone i and RH k overlap, and 0 otherwise; Bkj Å 1 if RH k and STS j overlap, and 0 otherwise; Cij Å 1 if Aik Å Bkj for each k, and 0 otherwise (Fig. 1). We wish to estimate table C from estimates of tables A and B. For each i we assume Cij Å 1 for exactly one value of j and we can therefore write Ci Å j to denote ‘‘clone i is in STS bin j.’’ This assumption is reasonable, provided that clones are sufficiently small, STSs are not too dense, and RHs are sufficiently large. These techniques are robust to minor violations of these assumptions. Note that Ci Å j does not imply that clone i and STS j overlap. It merely implies that any RH that overlaps with clone i also overlaps with STS j and vice versa. Observable Variables The analyses of the chromosome 11 dataset assume that table B is measured error free, except for 80 missing values. This simplifying assumption is reasonable because the RH versus STS overlap experiments are very reliable. A value in B is referred to as the ‘‘overlap status’’ between an RH and an STS. We handle the missing values using standard statistical techniques (Tanner and Wong, 1987) described in the appendices. Table A cannot be determined without error due to uncertainty in scoring the overlap between a clone and an RH. For each clone i and each RH k, the experimenter estimates the possibility of overlap using a discrete scale ranging from 1 (very unlikely) to L (very ˆ , is likely). An entry in this I 1 K table of estimates, A referred to as the ‘‘overlap score’’ (or just ‘‘score’’) between a clone and an RH. Modeling this scoring process is a key feature of our analysis. ˆ has no missing For the chromosome 11 dataset, A observations, and each score is one of five levels. Because very few of the highest scores were observed, the fourth and fifth categories are collapsed into a single category (so L Å 4). The average score (over the K RHs) for a given clone varies widely by clone for these data. Some of the variation is due to the number of times the clone overlaps the RHs. In the chromosome 11 dataset, the RH coverage of the chromosome is not uniform: 61 of the 73 RHs overlap the STS bin 89 (centromere); only 12 overlap STS bin 196. However, this nonuniform coverage does not explain all the variability in the average clone scores (see Molecular Methods). Thus we build a probability model with parameters that vary by clone, thereby accounting for the clonal variation and making the model more accurate. For a given clone the probability of achieving a given score is assumed to be the same for any RH that overlaps that clone. The same assumption is made for RHs that do not overlap the clone.

gnmal

144

ANDREWS ET AL.

ˆ contains overFIG. 2. Data for a hypothetical example. Table A lap score for clone i (rows) vs RH k (columns). It is a noisy estimate of table A in Fig. 1. Table B contains the overlap status between RH k (rows) and STS j (columns). It is observed without error.

To fix ideas, consider a small, hypothetical example, with I Å 5 clones, K Å 3 RHs, J Å 8 STSs, and overlap ˆ ik) ranging between 1 and 4 (Fig. 2). Upon scores (A examination of these tables it seems likely that clone 1 lies in bin 2, clone 2 lies in bin 3, clone 3 lies in bin 4, clone 4 lies in bin 5, and clone 5 lies in bin 7 (i.e., C1 Å 2, C2 Å 3, C3 Å 4, C4 Å 5, and C5 Å 7). This binning ˆ correis inferred because the higher scores in a row of A spond roughly with the 1’s in the inferred column of B. Our model formalizes this inferential binning procedure. L2-min The IPM implementation of Perlin et al. (1995) uses an inner product to measure the agreement between the RH signature of a clone and the RH signature of ˆ and B are each STS. The values in the matrices A rescaled prior to the inner product analysis. Normalizing each column of B controls for unequal retention of RHs across the chromosome. In particular, when few RHs are retained at an STS bin, those RHs are given more weight in the inner product analysis. Each row ˆ is normalized so that the resulting IPM scores are of A between 01 and 1. ˆ and a The inner product of a normalized row of A normalized column of B is a measure of how well a particular clone fits in a particular STS bin. A clone is localized to the STS bin where the largest score is achieved (the ‘‘IPM-max-height’’). As shown in the Appendix of Perlin et al. (1995), this maximization of the renormalized inner product corresponds to a minimization of the least-squares distance between the RH signature of a clone and the RH signatures of the STS bins. This least-squares distance minimization approach is a standard technique (e.g., regression) that is simple and computationally efficient.

with the given STS for each score versus the given clone. The pattern in this table determines how well the ˆ matches the jth column of B. If the clone ith row of A has been placed in the correct STS bin, high scores in ˆ should be concentrated in the RH/STS overlap row A and low scores should be concentrated in the RH/STS nonoverlap row. For example, Table 2 displays two such tables from the chromosome 11 dataset, one when clone 1B5 is placed in bin 103 (Table 2a) and the other when clone 1B5 is placed in bin 31 (Table 2b). Note that Table 2a has most observations in the high-score columns of the overlap row and in the low-score columns of the nonoverlap row. Table 2b does not exhibit this pattern. These patterns suggest that clone 1B5 is more likely to belong in bin 103 than in bin 31. Technical matter. Formally, we build this table by first assuming that clone i belongs in STS bin j. Each RH k is scored against clone i and typed against STS ˆ ik), k Å 1, . . . , j giving K bivariate observations: (Bkj , A K. For those RHs that overlap STS j, the RH/clone scores are exchangeable (the index k contributes no information); likewise for the RHs that do not overlap STS j. Thus we can summarize the data obtained from ˆ and column j of B in a 2 1 L table (Table 1). row i of A The rows of the table are indexed by the overlap status of the RH fragments with STS j, and the columns are indexed by the RH/clone i score. The number of RH fragments with overlap status t Å 0, 1 and with score (ij ) ˆ ik Å l), where l Å 1, . . . , L is ntl Å (KkÅ1Ind(Bkj Å t and A Ind() is the indicator function. Likelihood Model The general structure of the likelihood model follows from the overlap tables. The row margin of an overlap table is determined by the placement of a clone in an STS bin (table B is considered fixed and known). Thus the likelihood model is based on the conditional probability of each score given each overlap status, yielding the likelihood that clone i is from STS bin j. Two competing models are formulated for the parameters of the likelihood model, one nonparametric and one parametric. The nonparametric model, which we call the order-restricted model, assumes only that the probability of a high score is greater if the clone and the RH overlap than if they do not. The parametric TABLE 1

Summarizing the Data

Observed Overlap Table

ˆ and B can be summarized in The data from tables A a way that is amenable to likelihood analysis. For a ˆ given clone and STS bin, the relevant data in tables A and B can be summarized in a table with two rows, for whether or not the STS overlaps with an RH, and with L columns, for the L possible overlap scores for the clone versus an RH (Table 1). Because each RH contributes one observation to the table, the cells of the table contain the number of RHs that did (or did not) overlap

AID

GENO 4652

/

6r2f$$$263

03-15-97 00:54:29

Score Overlap 0 1 Total

1

2

(ij) 01 (ij) 11 (i) /1

(ij) 02 (ij) 12 (i) /2

n n n

n n n

. . .

L

. . . . . . . . .

(ij) 0L (ij) 1L (i) /L

n n n

Total (j)

n0/ (j) n1/ K

Note. Observed overlap table constructed for clone i in STS bin j. Rows indexed by overlap status. Columns indexed by overlap score.

gnmal

145

STATISTICAL METHODS FOR BINNING CLONES

TABLE 2 Two Overlap Tables for Clone 1B5 (a) In bin 103

(b) In bin 31

Score

Score

Overlap

1

2

3

4

Total

Overlap

1

2

3

4

Total

0 1 Total

42 4 46

7 9 16

0 6 6

0 5 5

49 24 73

0 1 Total

37 9 46

7 9 16

4 2 6

5 0 5

53 20 73

Note. Observed overlap tables for clone 1B5 in bins (a) 103 and (b) 31. Rows indexed by overlap status. Columns indexed by overlap score.

model, which is known as the logit model, makes a stronger assumption about the relationship between scores and overlap. Computer programs in C that execute the analyses described in this paper are available from the authors (see Acknowledgments for contact information). Technical matter. For each likelihood model, define ˆ ik Å the following parameters for the probability that A l given that Aik Å t. Let (i) tl

u Å P(AO ik Å lÉAik Å t) Å P(AO ik Å lÉBkj Å t, Ci Å j)

clone and the RH overlap than when they do not. For example, in Table 2a, bin 103 is a good candidate for placement of clone 1B5 because 5/24 ú 0/49, 11/24 ú 0/49, and 20/24 ú 7/49. Technical matter. More formally, in addition to the obvious restrictions (LlÅ1utl(i) Å 1 and utl(i) § 0, we assume that the following inequality holds for each l Å 2, . . . , L and each i Å 1, . . . , I: L

(i) Å P(AO ik § lÉAik Å 1) ∑ u1l=

l=Ål

L

for each i Å 1, . . . , I, k Å 1, . . . , K, l Å 1, . . . , L, and t Å 0, 1. The second equality holds as Aik Å Bkj for all k if Ci Å j. (In fact, this will not be true in practice for all k, but the limited number of discrepancies has little or no impact on the estimates because of redundant coverage.) The likelihood function for the ith table is then the product of two multinomial likelihoods, one for the RH/ STS overlaps (t Å 1) and one for the RH/STS nonoverlaps (t Å 0), L

L

(i) n0l (i) n1l Li Å P(AO iÉB, Ci, u(i)) } ∏ [u0l ] ∏ [u1l ] , (ij )

lÅ1

(ij )

[1]

lÅ1

where u(i) is a vector with components utl(i) for t Å 0, 1 and l Å 1, . . . , L. The clones are measured independently so the full likelihood function is the product over i of the individual clone likelihoods, I

1

L

L Å P(AO ÉB, C, u) } ∏ ∏ ∏ [utl(i)]ntl , (ij )

[2]

iÅ1 tÅ0 lÅ1

where u is a vector with components u(i). Order-Restricted Model For this nonparametric model, we impose a minimal set of restrictions. In particular, the parameters u(i) ˆ tend to be must be restricted so that high scores in A associated with overlaps and low scores with nonoverlaps. For the order-restricted model, this restriction is enforced by requiring that the probability of an RH/ clone experiment scoring at least l be larger when the

AID

GENO 4652

/

6r2f$$$263

03-15-97 00:54:29

(i) § P(AO ik § lÉAik Å 0) Å ∑ u0l= .

[3]

l=Ål

We give the label V to the region of u(i) values that satisfy all these conditions. A likelihood method places the clone in the bin that achieves the maximum likelihood possible over all STS bins. To find this maximum, we first find the maximum likelihood within each bin (i.e., maximize over u(i) for each bin j). For clone i in bin j, the vector of observed (ij) (ij) proportions (uH (ij) tl Å ntl /nt/ ) is the maximum likelihood (i) estimate of u if it lies in V. In this case the maximum log likelihood for clone i in bin j [log of Eq. [1]] is log Li Å n0/ ∑ uH 0l log uH 0l / n1/ ∑ uH 1l log uH 1l , [4] (ij )

(ij )

l

(ij ) tl

(ij ) (ij ) t/ tl

(ij )

(ij )

(ij )

(ij )

l

because n Å n uH . Equation [4] is minimized when the row probabilities are even (1/L) and is maximized when all the probability is in one cell. Thus the likelihood favors bins for which the overlap table has unequal cell probabilities that satisfy the order restriction. For example, the order-restricted model favors bin 86 over bin 71 for clone 1H10 (Table 3). For both binnings the vector of observed proportions is in V, but the row probabilities in Table 3a are more uneven than those in Table 3b. When uH (ij), the vector of observed proportions for a particular clone placement, is not in V, the maximum likelihood estimate of u(i) for bin j is the point in V ‘‘closest’’ to uH (ij). The likelihood naturally penalizes for lack of fit. This penalty will make placements such as bin 31 for clone 1B5 (Table 2b) improbable.

gnmal

146

ANDREWS ET AL.

TABLE 3 Two Overlap Tables for Clone 1H10 (a) In bin 86

(b) In bin 71

Score

Score

Overlap

1

2

3

4

Total

Overlap

1

2

3

4

Total

0 1 Total

30 4 34

3 2 5

2 0 2

3 29 32

38 35 73

0 1 Total

23 11 34

3 2 5

1 1 2

14 18 32

41 32 73

Note. Observed overlap tables for clone 1H10 in bins (a) 86 and (b) 71. Rows indexed by overlap status. Columns indexed by overlap score.

Along with the likelihood, a prior distribution on each parameter is needed to complete the Bayesian model. For the order-restricted model ‘‘reference priors’’ were chosen for all of the parameters in the model because little or no information was available about these parameters prior to the collection of the data. A reference prior is a prior that is dominated by the likelihood function and does not favor any particular values of the parameter (Lee, 1989). For these priors, the results are quite similar to a likelihood analysis. A uniform prior on the region V is used for u(i). The resulting density function of u(i) is constant and positive on the permissible values of the parameter and 0 elsewhere. With no information about the location of a clone prior to the experiment, it is reasonable to assign equal probability to each of the J possible locations. Thus, P(Ci Å j) Å 1/J for each i and j. The reference prior for a missing value of the B matrix is the Bernoulli(1/2) distribution: P(Bkj Å 0) Å P(Bkj Å 1) Å 1/2. These priors are not necessarily ideal and could be improved using additional information. For example, if certain STS bins are larger or if some regions of the chromosome produce clones that are more likely to be retained in the experiment, that information could be included in the prior for Ci . The prior for Bkj could potentially be improved by accounting for the geometry of RHs, but in our application this alteration is of little value due to the large amount of data. To implement the Markov chain Monte Carlo algorithm, the conditional distributions of each parameter given all the remaining parameters and the data are needed. See Appendix I for further details concerning implementation of the order-restricted model.

Recall from the Basic Notation section that the choice of STS bin determines the values of Aik , k Å 1, . . . , K, because Aik Å Bkj if clone i is in bin j. For illustration, we use the data in Table 2 to display the log odds of overlap given each score and the best-fitting line on the logit scale (Fig. 3). When clone 1C1 is correctly placed in bin 103 (Fig. 3a), the data match the best-fitting line quite closely. Provided the logit has a positive slope, the probability of overlap is larger for high scores than for low scores. Consequently, it is sensible to restrict the slope of the logit model to be positive. Technical matter. The logit model in Eq. [5] predicts the row variable (overlap status), conditional on the column variable (overlap score). The likelihood described in Eq. [2] is for a model that predicts the column variable, conditional on the row variable. Bayes’ rule is employed to reverse the conditional probabilities of the logit model (see Appendix II). Although clones tend to differ, they also have features in common. With a random effects approach (Searle, 1971) the probability model can capture the common features of clones while still allowing for variability among clones. This aspect of the model is incorporated in the prior distributions. In keeping with conventional random effects models, the logit parameters

Logit Model The second proposed model for the parameters of the overlap table is parametric and imposes more structure across scores (l) and across clones (i) on the probability ˆ ik Å l given that Aik Å t. The logit model assumes that A that the probability of overlap between an RH and a clone increases as the score (l) between the RH and the clone increases. We use the logit function [logit(P) Å log odds(P) Å log(P/(1 0 P))] to model this increase: logit(P(Aik Å 1ÉAO ik Å l)) Å bi0 / bi1l.

AID

GENO 4652

/

6r2f$$$263

[5]

03-15-97 00:54:29

FIG. 3. Fitted logit for clone 1B5 in STS bins (a) 103 and (b) 31. Data is from Table 2 with 0.1 added to all cells to avoid 0 cell counts.

gnmal

STATISTICAL METHODS FOR BINNING CLONES

147

localized to bin 20 because the posterior probability, according to the logit model, for that bin is 0.71. Clone 2E7 (Fig. 4b) is localized to bin 41 but with less confidence. A clone map is built by localizing each clone to the STS bin with the maximum posterior probability for that clone. Because the priors in our model are dominated by the likelihood, the maximum posterior probability of clone location will not differ substantially from a maximum likelihood estimate. Evaluating a Map

FIG. 4. Posterior distributions from logit model for four clones: (a) 1A2, excellent; (b) 2E7, intermediate; (c) 7C5, poor; and (d) 15H3, ambiguous.

for the I clones are a priori assumed to follow a bivariate normal distribution with unknown mean g and unknown covariance matrix S,

bi Ç N2(g, S). This is a typical assumption for any random effects logit analysis (Zeger and Karim, 1991). The random effects logistic model is considerably more complex than the order-restricted model, but some generalizations can be made about where the method will place a clone based on the form of the (ij) summary table. The observed logits (log(n(ij) 1l /n0l )) must correspond roughly to a line that has a positive slope. If multiple STS bins provide good fits to the data, the logit model favors those bins with steeper logit slopes. However, the random effects model forces a certain amount of commonality of logits for clones in their favored placements. For instance, suppose that the ob(ij) served logits (log(n(ij) 1l /n0l )) for STS bin j correspond quite closely to a line with a slope considerably greater than those observed for other clones. A random effect model is likely to favor STS bin j* over j if its observed (ij=) logits (log(n(ij=) 1l /n0l )) fit reasonably closely to a line with a slope similar to that found for other clones. To ensure that high scores are associated with overlap and low scores with nonoverlap, the mean logit slope is constrained to be positive (g1 ú 0). Given that there is no other a priori information on the value of the remaining parameters, we use standard reference prior distributions on the rest of the parameter space. Further discussion of the model formulation, the prior distributions, the complete conditionals, the starting values, and other implementation details for the logit model are included in Appendix II. Obtaining a Map To estimate the posterior probability that clone i belongs in STS bin j, we use Markov chain Monte Carlo simulation. The STS bin with the maximum posterior probability is the best choice for the location of the clone. For example, clone 1A2 (Fig. 4a) is confidently

AID

GENO 4652

/

6r2f$$$263

03-15-97 00:54:29

To evaluate the performance of each mapping method, we use the RPCI data of clone-by-clone comparisons. Using PCR products from inter-Alu sites, the RPCI data consist of hybridization probings of RPCI YACs against a gridded YAC library to produce groups of proximate YACs. Of the YAC clones we analyze, only a subset were assessed by RPCI. Because the map positions of these YAC proximity groups are largely unknown, they cannot be used to directly assess the mapping methods; however, they can provide an indirect assessment in conjunction with a statistical analysis based on maximum likelihood. A key assumption of the maximum likelihood analysis is that the elements of a RPCI cluster belong in the same STS bin or nearby STS bins. Given this assumption, the principle behind the maximum likelihood method is simple: if the mapping method is effective, the clusters it forms based on placing YACs in the same or nearby bins should be very similar to the YAC clusters formed by the RPCI hybridization experiments. We wish to estimate this degree of similarity. We employed two likelihood methods to evaluate the degree of similarity: the method described in Perlin et al. (1995) and a refinement of that method. Because the two methods produced essentially identical results, we report only the results using the Perlin et al. (1995) method. CHROMOSOME 11 MAP

One of the key features of our likelihood-based analysis is that it assigns to each clone a probability of membership for each STS bin. Taken over all STS bins, the set of probabilities is called the posterior probability distribution. These probability distributions, one for each clone, determine in large part the probability of correctly classifying a clone to its STS bin. Assuming the statistical modeling assumptions are approximately correct, these probabilities can be interpreted as the probability that the clone belongs in specific bins, given the data at hand. As an example we reproduce these distributions for four clones (Fig. 4), estimated using the logit model, which represent results having a range of classification properties: excellent, intermediate, poor, and ambiguous. Our level of confidence in binning appropriately declines as the maximum probability decreases. In general we take the bin with the maximum poste-

gnmal

148

ANDREWS ET AL.

TABLE 4 Estimated Performance of Three Mapping Methods Mapping methods Number of clones mapped

Logit

Order-restricted

L2-min

300 400 500 600 700 800

0.90 0.86 0.86 0.82 0.80 0.76

0.86 0.83 0.81 0.80 0.76 0.71

0.88 0.85 0.80 0.74 0.72 0.69

Note. Estimated performance of the three mapping methods. Values in the table are maximum likelihood estimates of the probability of successfully mapping each member of a group of clones.

rior probability of membership as the estimated location for the clone. Thus these probabilities are critical. To construct a map for each method, we bin only the 800 clones that have the highest maximum posterior probabilities or L2-min scores (henceforth, the ‘‘top 800’’). The posterior probabilities corresponding to the top 800 clones are §0.1209 and §0.1606 for the logit and order-restricted models, respectively; for the L2min method, it is §0.3059. Notice that the cutoff value for the logit model excludes the poorly mapped clone in Fig. 4c from further consideration. The top 800 clones differed somewhat by method. To examine the mapping properties of clones with different maximum posterior probabilities or L2-min scores, we first arranged the clones in descending order based on their maximum posterior probability or L2-min score (i.e., largest to smallest value); then we subdivided the top 800 clones into groups, from the top 300 to the top 800, by hundreds. We evaluate the performance of the binning methods for the top 800 clones by using the subdivisions just described and a maximum likelihood method described in Perlin et al. (1995). For any subdivision of the top 300 to top 800 clones, it is apparent that the logit model performs better than either the order-restricted model or the L2-min method (Table 4). The comparison is more complex when the order-restricted model is compared to the L2-min method. For the top 300 and 400 clones mapped, the L2-min method yields better classification than the order-restricted model, but performance is reversed thereafter. Apparently sensible classification methods such as the L2-min perform quite well when the data are close to the idealized structure, but to successfully position clones using noisier data requires more rigorous statistical models. In addition to predicting the appropriate bin more frequently, the statistical models are also more definitive in their placement of the clones than the L2-min method. Both the logit and the order-restricted model have no ties and have fewer near-ties in the posterior probability of clone membership to STS bins than the L2-min method does for its scoring method. Again considering the top 800 clones, there are 48 clones with ties for the L2-min model. If we consider other peaks

AID

GENO 4652

/

6r2f$$$264

03-15-97 00:54:29

within 5% of the maximum, there are 49, 52, and 359 clones with near-ties for the logit, order-restricted and L2-min models, respectively. Thus the statistical methods yield more interpretable results in several ways: they yield posterior probabilities of clone membership to each STS bin; they are more accurate, in general; and they are precise, yielding few ties and near-ties for clone location and overall a more refined location for each clone. We use the classification derived from the logit and order-restricted models to refine the chromosome 11 map produced by Perlin et al. (1995). Rather than reproduce the entire map, we illustrate a small, continuous section of it, from STS bin 172 to STS bin 181. This section is selected because it contains the CRYA2 gene (centered approximately at bin 177), and 13 of the YAC clones are known to span the gene. For the top 800 clones, all three mapping methods agree on the bin location for the majority (63/100) of the clones mapped to this region by at least one of the methods (Table 5). In addition, most of the clones in this ‘‘agreement’’ table are mapped with high posterior probabilities. Of the 13 clones forming the CRYA2 conTABLE 5 Map Agreement Table Bin

Clone (logit, order-restricted, L2-min)

172

1E4 11F6 12B1 15A11

(96, (95, (99, (99,

98, 88, 97, 99,

79) 63) 65) 95)

6B9 12A1 12C2 15E1

(88, (98, (47, (93,

92, 73, 71, 86,

79) 75) 48) 72)

6E9 12A3 12C3 16A12

(65, (64, (16, (82,

55, 31, 93, 52,

63) 46) 39) 84)

173

4A1 11G3 15A12 19G8

(84, (36, (85, (99,

87, 57, 94, 96,

50) 38) 57) 68)

6C11 (96, 89, 83) 13G9 (71, 32, 57) 15G12 (99, 99, 95)

11A8 (35, 23, 38) 15A9 (50, 50, 85) 16G11 (94, 99, 89)

174

2D6 (76, 58, 37) 12H9 (74, 54, 52)

6C10 (73, 92, 74) 13A10 (88, 84, 79)

8F7 (72, 82, 82) 14A7 (99, 98, 91)

175

1G11 (68, 63, 66) 5B2 (75, 68, 60) 19C2 (96, 99, 97)

4A3 (67, 59, 44) 8A12 (96, 94, 83)

4A8 (97, 99, 99) 18E8 (70, 73, 83)

176 177

2E5 7D5 14F9 17F11 18G8

(88, (58, (26, (71, (99,

94, 86, 00, 68, 98,

78) 71) 43) 83) 79)

3F3 12A5 15E3 17F12 19F7

(97, (39, (99, (96, (49,

99, 23, 99, 96, 48,

83) 35) 99) 91) 83)

4H7 12A6 15E4 17H4

(43, (90, (97, (99,

50, 96, 90, 99,

65) 93) 86) 87)

178

7C6 (63, 60, 57)

18B5 (39, 41, 84)

179

2G7 (56, 36, 86) 12E11 (76, 75, 83)

11B9 (68, 62, 68) 17C10 (45, 37, 63)

12D1 (14, 00, 32) 18F12 (17, 20, 46)

180

14E10 (81, 84, 77)

181

7B2 (49, 44, 37) 12D3 (49, 43, 53)

7B4 (20, 12, 42) 12F11 (56, 73, 74)

7F1 (19, 31, 00)

Note. After each bin are listed the clones that all three methods map to that bin. Also included are clones that are mapped by two of the methods to the bin and not mapped at all by the third method (indicated by ‘‘00’’). After each identification tag is the posterior probability or score (1100) for each method (logit, order-restricted, and L2-min).

gnmal

149

STATISTICAL METHODS FOR BINNING CLONES

TABLE 6 Map Disagreement Table RPCI 1E2 2G8 3B7 4F7 4G9 4H1 5D12 5G10 6A1 6G12 8H7 9D2 9F2 12A2 12B3 12C1 12C5 12E3 12E6 12E10 12F10 12H10 13H12 14A1 14A8 14F5 15H6 18A3 18A5 18A12 18B1 18B9 18C6 18C12 18D11 19C10 19G12

Logi 192 174 173 179 173 172 46 174 177 174 172 nm 172 173 nm 179 nm nm nm 179 175 181 177 177 174 172 181 177 172 172 179 175 179 173 172 174 179

Order-restricted

(12) (18) (84) (35) (45) (57) (12) (39) (30) (66) (62) (00) (33) (42) (00) (31) (00) (00) (00) (36) (43) (57) (50) (50) (44) (50) (64) (43) (29) (24) (46) (16) (34) (38) (13) (39) (15)

172 174 173 179 173 172 nm 71 181 174 172 175 172 173 172 173 172 172 173 179 174 181 178 178 174 173 181 176 173 nm 179 235 20 173 172 175 26

(28) (22) (72) (26) (96) (96) (00) (40) (29) (93) (80) (16) (37) (35) (48) (30) (46) (20) (25) (54) (43) (42) (50) (49) (29) (50) (28) (50) (25) (00) (63) (29) (28) (38) (21) (31) (17)

L2-min nm 126 174 177 174 175 174 174 177 126 174 nm 177 172 nm 179 nm 146 131 178 174 182 177 177 173 172 23 181 177 232 178 174 20 172 134 177 179

(00) (49) (85) (56) (72) (65) (33) (43) (42) (50) (55) (00) (32) (63) (00) (44) (00) (44) (57) (63) (61) (64) (88) (97) (64) (84) (64) (77) (49) (47) (86) (58) (75) (51) (37) (70) (59)

Note. The clones that are mapped to the region 172–181 by at least one of the three methods. Columns 2–4 indicate where each method maps the clone and with what posterior probability or score (1100). ‘‘00’’ indicates that the clone is not mapped by the method (i.e., not in the top 800).

tig, 12 map to the region. The 13th clone is not in the top 800 for the logit model and therefore is not considered. It is in the top 800 for the other two methods, but does not map to the region. The true locations for the remaining clones (37/100), which are mapped to the region by at least one of the three methods, is somewhat ambiguous (Table 6). Not only do the mapping methods exhibit some disagreement on bin membership for these clones, the posterior probabilities or L2-min scores are frequently unimpressive. The statistical mapping methods agree in the placement far more often (16/37) than either does with the L2-min placements (7/37 for the logit model and 2/ 37 for the order-restricted model). While a consolidated map could be based on the results of the statistical methods alone, we choose to use that information in conjunction with the L2-min method and the RPCI clone/clone hybridizations to pro-

AID

GENO 4652

/

6r2f$$$264

03-15-97 00:54:29

duce a final map. This map is not based on any formal analysis, but simply on visual inspection of the data. In addition to the consensus map (Table 5), we bin 19 other clones to this region (Table 7). To bin these additional clones, we rely on evidence from the RPCI data that these clones cluster with other clones already mapped to the region with high probability (i.e., clones from Table 5), placing the clones at a consensus locality based on the results of the mapping methods. One of the clones, 4F7, agrees well with the RPCI data both in bin 179 (where the statistical methods place it) and in bin 177 (where the L2-min method places it), and thus it appears twice in Table 7. Of the 16 clones in the ‘‘disagreement’’ table (Table 6) mapped to the same bin by the statistical methods, 8 could be confirmed by the RPCI data. This result, which is a minimum estimate of performance due to the partial information available in the RPCI data, strongly suggests that using the statistical methods alone is an excellent mapping strategy. The entire map can be viewed and downloaded by anyone accessing our website (address under Acknowledgments). Because the entire probability distribution for clone membership to STS bins is often helpful when localizing clones, we also provide the entire posterior distribution for the top 800 at the website. In addition, we provide the consensus map, again using the information obtained from the RPCI data. DISCUSSION

One approach to the problem of sequencing the human genome is to map clones to their original genomic locations, then determine a minimum tiling path of clones, and finally sequence that subset of clones. Ideally the clones used for this top-down strategy would be short DNA fragments such as BACs or PACs. As yet, no top-down strategy has been developed in sufficient generality to sequence the genome. An important tool available for sequencing the human genome is a fairly dense STS framework map TABLE 7 Map Supplement Bin 172 173 174 175 176 177 178 179 180 181

Clone 1E2, 2G3 3B7, 4G9, 12A2, 14F5 2G8, 6G8 12F10, 19C10 4F7, 6A1 13H12, 14A1 4F7, 18B1, 18C6, 19G12 15H6 12H10

Note. Additional clones placed in the region of STS bins 172–181 using information from RPCI clone/clone hybridization data. These clones cluster well with the unambiguously mapped clones in the agreement table (Table 5).

gnmal

150

ANDREWS ET AL.

(Hudson et al., 1995). Because the map is evolving at a rapid rate, a denser version should be available in the near future. While the map will undoubtedly be useful for sequencing the genome, how it will be used remains to be seen. Clearly the framework map could be used in conjunction with STS content mapping (Green and Green, 1991) to localize the DNA fragments. Such an approach, however, would require an enormous number of PCR experiments. Therefore it may not be the most efficient strategy for localizing and choosing clones to sequence. One alternative strategy is to use IPM in conjunction with the framework map to localize the clones. The IPM strategy combines noisy long-range data (RH/ clone hybridization data) with almost noise-free shortrange data (RH/STS overlap data) to infer, indirectly, clone locations. This indirect approach has the notable benefit of significantly reducing the number of laboratory experiments required to map clones relative to direct methods such as STS-content mapping. Thus IPM solves the clone-mapping problem at lower cost. Moreover, by eliminating the need for direct clone and STS comparisons, it enables clone mapping even when clones and STSs do not overlap, as will often be the case for YAC clones mapped against relatively sparse STS maps and for small insert (bacterial) clones mapped against even relatively dense STS maps. YACs are far larger than either BACs or PACs, which are more desirable substrates for sequencing. Nonetheless, in principle there is no reason why the IPM strategy cannot be effective for localizing smaller clones, although some redesign of the experimental methods may be required. In fact, recent research on wholegenome RHs by Perlin (unpublished) suggests that the IPM strategy is implementable at this scale. Our results show that statistical methods can be tailored to the IPM top-down strategy, providing accurate and interpretable results (in this instance) for clone binning. Using these methods, we refine the chromosome 11 map (Tables 5, 6, and 7) first presented by Perlin et al. (1995). Our analyses suggest that the improvement is substantial, not only in terms of the localizing the clones (Table 4) but also in terms of the confidence with which they are localized. Moreover, the availability of maps based on different statistical models, together with supplemental information from RPCI, yields a highly accurate clone map (Tables 5 and 7). While the binning of clones to local regions is an important step in the process of building a minimal tiling path of clones, it is obviously not a complete answer. Nevertheless, by using the binning information in conjunction with information on clone length and clone overlap (Qin et al., 1996), it should be possible to build such tiling paths very efficiently. Statistical methods to address this problem is one focus of our current genomics research. Statistical Methods Based on our analyses of the chromosome 11 data, it appears that the logit model is better at assigning

AID

GENO 4652

/

6r2f$$$264

03-15-97 00:54:29

clones to STS bins than either the order-restricted model or the L2-min method (Table 4). When the orderrestricted method is compared to L2-min, the L2-min method appears to perform better for ‘‘high-confidence’’ clones (i.e., the top 400 clones), but the performance is reversed for lower-confidence clones (Table 4). Much of this behavior can be explained by the congruity between the data and the assumptions imposed by the models. For instance, the logit model imposes the assumption that, when the clone is placed in the correct STS bin, the probability that an STS and an RH overlap increases linearly on the logit scale with increasing RH/YAC overlap score (for the same RH). There is no rigorous molecular motivation for this assumption. However, the logit model frequently provides a good summary for data of this form (Agresti, 1990), and the assumption of increasing probability of RH/STS overlap with increasing RH/YAC overlap score certainly makes sense from the molecular perspective. L2-min imposes a different linearity assumption, namely that the probability that an STS and an RH overlap increases linearly with increasing RH/YAC overlap score (Perlin et al., 1995). While this assumption and that for the logit model may seem almost identical, it is well known that the error structure (constant variance) for the L2-min model is not optimal for the kind of data analyzed. The order-restricted model imposes only a very weak assumption about the form of the data. It is a nonparametric model, whereas the logit and L2-min models are parametric models (imposing linearity assumptions). Without an immense amount of data, parametric models typically perform better than nonparametric models when the parametric model assumptions roughly conform to the true process generating the data. However, as the disparity between assumptions and the process generating the data increase, the performance of the nonparametric model is expected to improve. We suspect that this scenario explains the performance of L2min versus the order-restricted model: the assumption imposed by L2-min is adequate for high-confidence clones but is less adequate for lower-confidence clones. An important feature of the logit model is its random effects structure. This structure forces the slope from a particular 2 1 L table (e.g., Table 2) to shrink toward a common slope, that being the average slope over all such tables. Shrinkage in this setting is well known to increase the accuracy of the estimated slopes (Efron and Morris, 1973) and therefore should increase the accuracy of the classification. In fact, when we compared the random effects model to the standard maximum likelihood logit model using the chromosome 11 data, we found the classification performance of the random effects model to be superior. It is unnecessary for a researcher to choose between the logit and the order-restricted methods. While the logit model outperforms the order-restricted model for the data we have analyzed (Table 4), these results speak only to the average performance. The results from the logit model are critically dependent on its lin-

gnmal

151

STATISTICAL METHODS FOR BINNING CLONES

earity assumption—which appears to hold generally, but may not be valid for a specific clone. Thus, for mapping clones to STS bins, it should be quite useful to examine the results of both methods, particularly if the methods localize clones to very different bins. In fact, to produce the final map (Tables 5 and 6 and website), we used information from all three methods and the Roswell Park clone/clone hybridization data. Future Developments In this report we have illustrated some statistical methods, in particular Bayesian hierarchical models, for one step in the complex process of sequencing a large genome. Theoretically, it should be possible to build a hierarchical model for the entire process. To do so, the kind of errors introduced at each level of the process must be characterized and modeled (e.g., errors in base-calling and errors in STS framework maps). The resulting hierarchical model would undoubtedly improve the accuracy of any resulting sequence and probably the speed by which the sequence is determined. The model would be complex—perhaps too complex to analyze exactly—but approximations can be applied to produce good solutions. An example of how more complex models can be built follows naturally from the methods we just developed. In this brief example, we describe how to build upon the hierarchical Bayes binning methods to refine the assignment of clones to bins while also refining the order of STSs in an STS framework map (see Mukhopadhyay et al. [1994] for a preliminary approach to this problem). We assumed early in our analyses that each clone should be classified to exactly one STS bin. If two STSs are close together, it is likely that a clone in that region will overlap both of them rather than just one. With our techniques the clone is likely to fit well into both STS bins. For STSs separated by greater distances, the clones will most likely fall between these STSs. In fact, the clones might not be close to either STS. These observations suggest another means of creating STS bins. Define bins based on the region between adjacent pairs of STS bins rather than on the region surrounding individual STSs. Define B by

Bkj Å

1

if a fragment of RH k and STS j overlap,

1

if a fragment of RH k and STS j / 1 overlap,

0

otherwise.

When we applied this alternative model to the chromosome 11 data, the performance of the new model was mixed. Some clones that had been previously mapped to two STS-centered bins, each with moderate posterior probability, were now mapped with high posterior probability to a single inter-STS bin. However, many clones that previously had been mapped very well were no longer mapped well.

AID

GENO 4652

/

6r2f$$$264

03-15-97 00:54:29

An obvious drawback to this binning method is that it requires the STSs be ordered accurately on the chromosome (i.e., STS j is adjacent to STS j / 1). Clearly the bins make little sense when the ordering is inaccurate. In fact, the ordering is unlikely to be perfect for these or any other data available at this time (Lunetta and Boehnke, 1994). Nonetheless, this drawback may not be as much of an impediment as it appears. Lange and Boehnke (1992) present a Bayesian model to determine the order of STSs on a chromosome from RH/STS overlap experiments. This model, possibly with some minor revisions, can be combined with the analyses we just outlined to form a more complex Bayesian hierarchical model. Based on this more complex model, it should be possible to improve (iteratively) the binning of clones while also refining the order of the STSs. APPENDIX I: ORDER-RESTRICTED MODEL

A Markov chain Monte Carlo (MCMC) simulation iteratively samples from the conditional distribution of each parameter given the data as well as all the other parameters. This conditional distribution is known as the complete conditional distribution. For notational convenience we refer to the set of conditioning events as ‘‘the rest.’’ The complete conditionals are derived from the likelihood and the prior distributions using Bayes’ rule. Because all prior distributions for the order-restricted model are constant over the parameter space, the complete conditionals are proportional to the likelihood on that region. Specifically, the complete conditional distributions for the order-restricted model are (ij *)

∏1tÅ0 ∏LlÅ1 [utl(i)]ntl P(Ci Å j*Érest) Å J (ij ) , (jÅ1 ∏1tÅ0 ∏LlÅ1 [utl(i)]ntl j* √ {1, . . . , J}; P(Bkj Å t*Érest) Å

∏{i:CiÅj} u

(i) t*AO ik

(

1 tÅ0

(i) ∏{i:CiÅj} utA O ik

, t* √ {0, 1};

and f(u(i)*Érest) Å

∏1tÅ0 ∏LlÅ1 [u(i)*]

(ij)

ntl

*V ∏1tÅ0 ∏LlÅ1 [u(i)] du(i) n(ij)

, u(i)* √ V,

tl

where V is as defined by Eq. [3] under Statistical Analyses. The parameter space for each Ci and missing Bkj is discrete so the denominators of the complete conditionals are found by direct computation. However, to avoid computing the many-dimensional integral in the denominator of the complete conditional of u(i), we employ a Metropolis–Hastings step (Tierney, 1994; Hastings, 1970). The starting values of the parameters come from random sampling from prior distributions and maximum likelihood estimates. First we obtain starting values for the missing values of the B matrix by random sampling from a Bernoulli(1/2) distribution. Next approximate

gnmal

152

ANDREWS ET AL.

maximum likelihood estimates for u(i) are found for each clone i in each bin j. The estimate found is close to the maximum likelihood estimate and is guaranteed to be in the region V. To compute the starting value, (i) , and for l Å 1, . . . , L and t Å 0, 1, let htl Å (Ll=Ålutl= estimate them by

S

hP 0l Å min and

S

hP 1l Å max

D

(ij )

(ij )

(ij )

(ij )

(Ll=Åln0l= (Ll=Åln/l= , n(j0/) K

D

(Ll=Åln1l= (Ll=Åln/l= , . n(j1/) K

Last, utl(i) Å hP tl 0 hP tl/1 , where hP tL/1 Å 0. Each clone’s initial placement (Ci) is the bin where its approximate maximum likelihood occurs. The sampling order is each u(i), the missing values of the B matrix, and then C. Results are based on the last 8000 iterations of a 10,000-iteration simulation. APPENDIX II: LOGIT MODEL

Bayes’ Rule The logit model as described under Statistical Analyses is not completely specified because the conditional probability in Eq. [5] must be reversed via Bayes’ rule. To employ Bayes’ rule, the marginal column and row ˆ ik Å probabilities must be specified. These are pil Å P(A l) and qit Å P(Aik Å t), respectively. Note that qit is a known quantity determined by the row margins, which are fixed for each STS bin. Now use Bayes’ rule to rewrite utl(i) in terms of the parameters in this model:

utl(i) Å P(AO ik Å lÉAik Å t) Å

P(Aik Å tÉAO ik Å l)P(AO ik Å l) P(Aik Å t)

Å

[g(l, bi)]t[1 0 g(l, bi)]10tpil , qit

[6]

exp(bi0 / bi1l) 1 / exp(bi0 / bi1l)

is derived from Eq. [5]. The full likelihood function can then be obtained directly by substitution of Eq. [6] into Eq. [2]. Because the row margins are fixed for a given binning C, the likelihood for this dataset is of the same form as the product of I case–control studies (Breslow and Day, 1980) times the product of I multinomials. It is well known that the logit model is not fully identifiable under case–control sampling (i.e., more than one set of parameters yields the same maximum likelihood). This implies that there is a set of values of (bi ,

AID

GENO 4652

/

6r2f$$$265

Implementation Details The logit model described in this appendix is reparametrized slightly, relative to that in the body of the paper, so that the sampling distributions are easier to evaluate. Specifically, the vectors bi are rewritten g / bi . Then Eq. [5] becomes [after recentering the values ˆ to sl Å 0(L 0 1)/2, . . . , (L 0 1)/2 from l Å 1, . . . , in A L)] logit{P(Aik Å 1ÉAO ik Å l)} Å g0 / bi0 / (g1 / bi1)sl . The distribution of the bi , which can be inferred from the distribution of the bi , is bivariate normal bi Ç N2(0, S). Listed below are the remaining variables in the logit model and their prior distributions including the two table marginals that we estimated from their observed values.

where g(l, bi) Å P(Aik Å 1ÉAO ik Å l) Å

pi1 , . . . , piL01) that yields identical probability distriˆ (Roeder et al., 1996). This is not a hinbutions for A drance here because interest is only on the likelihood of the tables themselves, not the particular parameters that define the tables. Nevertheless, due to the lack of identifiability, the model must be constrained for computational stability. To make the model fully identifiable, only one of the L parameters bi0 , pi1 , . . . , piL01 needs to be specified for ˆ has significant information each i. However, because A about the values of pi , each component of pi is estimated ˆ . These from the observed marginal values from table A values are then assumed to be fixed (as opposed to random quantities) for the analysis [see item (e) below]. This assumption, while not strictly accurate, leads to stable performance and enhanced speed of convergence for the MCMC algorithm. Furthermore, it does not noticeably affect the final conclusions drawn from the data.

03-15-97 00:54:29

(a) Ci Ç Uniform on {1, . . . , J}. (b) Bkj Ç Bernoulli(1/2) for missing values in B. (c) g } 1 on R 1 R/. The mean slope parameter must be positive. (d) S } ÉSÉ03/2 on the space of all positive definite 2 1 2 matrices. (i) (e) pil Å (n/l / 1)/(K / L), the posterior mode if the prior distribution for pi were flat. (j) /K, where Ci Å j. (f) qit Å nt/ All of the priors were chosen to be reference priors, hence the likelihood dominates the prior and no particular values are favored a priori. The prior distributions for Ci and Bkj are also ‘‘proper,’’ meaning the sum over all possible outcomes is 1. By contrast, the standard choice of reference priors for g and S are ‘‘improper,’’ meaning that the integral over all possible values is infinite. The specified hierarchical structure and the prior

gnmal

153

STATISTICAL METHODS FOR BINNING CLONES

distributions result in the following complete conditionals where utl(i)(g, bi) is defined in Eq. [6]: (ij *)

∏1tÅ0 ∏LlÅ1 [utl(i)(g, bi )]ntl P(Ci Å j*Érest) Å J (ij ) , (jÅ1 ∏1tÅ0 ∏LlÅ1 [utl(i)(g, bi )]ntl

data and good starting values, this region of the parameter space will not be visited. The chromosome 11 dataset is large enough and the starting values found from maximum likelihood methods are good enough to avoid the trouble area. Under these conditions we obtain (in practice) a proper posterior.

j* √ {1, . . . , J}; (i) ∏{i:CiÅj} [ut*A O ik(g, bi )]

P(Bkj Å t*Érest) Å

(i) (1tÅ0 ∏{i:CiÅj} [utA (g, bi )] O ik

ACKNOWLEDGMENTS

, t* √ {0, 1};

f(g*Érest) Å

∏

I iÅ1

∏

1 tÅ0

∏

L lÅ1

(ij )

(i) tl

[u (g*, bi )]ntl

*R1R/ ∏IiÅ1 ∏1tÅ0 ∏LlÅ1 [utl(i)(g, bi )]ntl dg (ij)

,

This research was supported in part by a doctoral fellowship from AT&T (C.A.), NSF Grants DMS9496219 (K.R.) and DMS9508427 (B.D. and K.R.), and NIH Grants HG00348 (B.D.) and HG00856 (M.P.). The address of the web site that contains the map of chromosome 11 and C programs to implement these methods is www.stat. cmu.edu/Çroeder/map/map.html.

I

g* √ R 1 R/; S01*Érest Ç Wishart2(I, ( ∑ bib*i )01);

REFERENCES

iÅ1

and

S

exp 0

D

1 b** (01b*i ∏1tÅ0 i 2 (ij )

f(b*i Érest) Å

*

R2

1 ∏LlÅ1 [utl(i)(g, b*i )]ntl 1 exp 0 b*i (01bi ∏1tÅ0 2

S

D

, b*i √ R2.

(ij )

1 ∏LlÅ1 [utl(i)(g, bi )]ntl dbi The exact distribution of (01 is known and can be sampled directly. The parameter spaces for each Ci and missing Bkj are discrete so the constants of proportionality are found by summation. The distributions of g and bi are continuous so it is more feasible to use a Metropolis–Hastings step than to evaluate the twodimensional integrals in the denominators. Again, starting values of the parameters are obtained by random sampling from prior distributions and maximum likelihood estimates. First we obtain starting values for the missing values of the B matrix by random sampling from a Bernoulli(1/2) distribution. For each i, J estimates of bi are obtained by fitting a logit model with clone i placed in each STS bin using maximum likelihood methods. Clone i is placed in the bin where its maximum likelihood logit score occurs. The fitted value of bi that corresponds with the binning that maximizes the likelihood is retained. The starting value for g is the mean of the fitted bi’s and finally bi Å bi 0 g. With starting values for all parameters but ( specified, the initial value for ( can be sampled directly from its complete conditional. The sampling order is (, the missing values of the B matrix, C, g, and then bi . Results are based on the last 8000 iterations of a 10,000-iteration simulation. The marginal posterior distribution of ( is improper for this hierarchical model. The density function increases without bound as É(É r 0. This region can cause poor performance in a MCMC chain if the parameter becomes trapped here. However, with sufficient

AID

GENO 4652

/

6r2f$$$265

03-15-97 00:54:29

Agresti, A. (1990). ‘‘Categorical Data Analysis,’’ Wiley, New York. Aho, A. V., Hopcroft, J. E., and Ullman, J. D. (1983). ‘‘Data Structures and Algorithms,’’ Addison-Wesley, Reading, MA. Bellanne-Chantelot, C., Lacroix, B., Ougen, P., Billault, A., Beaufils, S., Bertrand, S., Georges, S., Glibert, F., Gros, I., Lucotte, G., Susini, L., Codani, J.-J., Gesnouin, P., Pook, S., Vaysseix, G., Lu-Kuo, J., Ried, T., Ward, D., Chumakov, I., Le Paslier, D., Barillot, E., and Cohen, D. (1992). Mapping the whole genome by fingerprinting yeast artificial chromosomes. Cell 70: 1059–1068. Berger, J. O. (1985). ‘‘Statistical Decision Theory and Bayesian Analysis,’’ 2nd ed., Springer–Verlag, New York. Breslow, N. E., and Day, N. E. (1980). ‘‘Statistical Methods in Cancer Research,’’ Vol. 1, ‘‘The Analysis of Case–Control Studies,’’ IARC, Lyon. Cox, D. R., Burmeister, M., Price, E. R., Kim, S., and Myers, R. M. (1990). Radiation hybrid mapping: A somatic cell genetic method for constructing high-resolution maps of mammalian chromosomes. Science 250: 245–250. Cox, D. R., Green, E. D., Lander, E. S., Cohen, D., and Myers, R. M. (1994). Assessing mapping progress in the human genome project. Science 265: 2031–2032. Efron, B., and Morris, C. (1973). Stein’s estimation rule and its competitors—An empirical Bayes approach. J. Am. Stat. Assoc. 68: 117–130. Gelfand, A. E., and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85: 398–409. Geyer, C. (1989). Practical Markov chain Monte Carlo. Stat. Sci. 4: 473–482. Green, E. D., and Green, P. (1991). Sequence-tagged site (STS) content mapping of human chromosomes: Theoretical considerations and early experiences. PCR Methods Appl. 1: 77–90. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109. Hudson, T. J., Stein, L., Gerety, S. S., Ma, J., Castle, A. B., Silva, J., Slonim, D. K., Baptista, R., Kruglyak, L., Xu, S.-H., et al. (1995). An STS-based map of the human genome. Science 270: 1945–1954. James, M. R., Richard, C. W., III, Schott, J.-J., Yousry, C., Clark, K., Bell, J., Hazan, J., Dubay, C., Vignal, A., Agrapart, M., et al. (1994). A radiation hybrid map of 506 STS markers spanning human chromosome 11. Nature Genet. 8: 70–76. Lange, K., and Boehnke, M. (1992). Bayesian methods and optimal experimental design for gene mapping by radiation hybrids. Ann. Hum. Genet. 56: 119–144. Lee, P. M. (1989). ‘‘Bayesian Statistics: An Introduction,’’ Arnold, London. Lunetta, K., and Boehnke, M. (1994). Multipoint radiation hybrid

gnmal

154

ANDREWS ET AL.

mapping: Comparisons of methods, sample size requirements, and optimal study characteristics. Genomics 21: 92–103. Monaco, A. P., Lam, V. M. S., Zehetner, G., Lennon, G. G., Douglas, C., Nizetic, D., Goodfellow, P. N., and Lehrach, H. (1991). Mapping irradiation hybrids to cosmid and yeast artificial chromosome libraries by direct hybridization of Alu-PCR products. Nucleic Acids Res. 19: 3315–3318. Mukhopadhyay, N., Gorin, M. B., and Perlin, M. W. (1994). Ordering STSs via YAC hybridizations to radiation hybrids using inner product mapping. Am. J. Hum. Genet. 55(Suppl.): A266. Nelson, D. L., Ledbetter, S. A., Corbo, L., Victoria, M. F., RamirezSolis, R., Webster, T. D., Ledbetter, D. H., and Caskey, C. T. (1989). Alu polymerase chain reaction: A method for rapid isolation of human-specific sequences from complex DNA sources. Proc. Natl. Acad. Sci. USA 86: 6686–6690. Nowak, N., Qin, S., Zhang, J., Sait, S., Higgins, M., Cheng, Y., Li, L., Munroe, D., Evans, G., Housman, D., and Shows, T. (1994). Generating a physical map of chromosome 11. Am. J. Hum. Genet. 55(Suppl.): A267. [Abstract] Perlin, M. W., and Chakravarti, A. (1993). Efficient construction of high-resolution physical maps from yeast artificial chromosomes using radiation hybrids: Inner product mapping. Genomics 18: 283–289. Perlin, M. W., Duggan, D. J., Davis, K., Farr, J. E., Findler, R. B., Higgins, M. J., Nowak, N. J., Evans, G. A., Qin, S., Zhang, J., Shows, T. B., James, M. R., and Richard, C. W., III (1995). Rapid construction of integrated maps using inner product mapping: YAC coverage of human chromosome 11. Genomics 28: 315–327. Qin, S., Nowak, N. J., Zhang, J., Sait, S. N. J., Mayers, P., Higgins, M. J., Cheng, Y., Li, L., Munroe, D. M., Gerhard, D. S., Weber,

AID

GENO 4652

/

6r2f$$$266

03-15-97 00:54:29

B. H., Bric, E., Housman, D. E., Evans, G. A., and Shows, T. B. (1996). A high-resolution physical map of human chromosome 11. Proc. Natl. Acad. Sci. USA 93: 3149–3154. Qin, S., Zhang, J., Isaacs, C. M., Nagafuchi, S., Jani, S. S., Abel, K. J., Higgins, M. J., Nowak, N. J., and Shows, T. B. (1993). A chromosome 11 YAC library. Genomics 16: 580–585. Richard, C., III, Boehnke, M., Berg, D., Lichy, J., Meeker, T., Myers, R., and Cox, D. (1993). A radiation hybrid map of the distal short arm of human chromosome 11 containing the Beckwith–Weidemann and associated embryonal tumor disease loci. Am. J. Hum. Genet. 52: 915–921. Richard, C., III, Withers, D., Meeker, T., Maurer, S., Evans, G., Myers, R., and Cox, D. (1991). A radiation hybrid map of the proximal long arm of human chromosome 11 containing the multiple endocrine neoplasia type 1 (MEN-1) and bcl-1 disease loci. Am. J. Hum. Genet. 49: 1189–1196. Roeder, K. R., Carroll, R. J., and Lindsay, B. G. (1996). A semiparametric mixture approach to case–control studies with errors in covariables. J. Am. Stat. Assoc. 91: 722–732. Searle, S. R. (1971). ‘‘Linear Models,’’ Wiley, New York. Smith, A. F. M., and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Stat. Soc. B 55: 3–24. Tanner, M. A., and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 82: 528– 540. Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Stat. 22: 1701–1762. Zeger, S., and Karim, M. R. (1991). Generalized linear model with random effects: A Gibbs sampling approach. J. Am. Stat. Assoc. 86: 79–86.

gnmal