GENOMICS 12, 435-446 (1992)
CPROP: A Rule-Based Program for Constructing Genetic Maps
STANLEY LETOVSKY* AND MARY B. BERLYN†
*Letovsky Associates, 286 West Rock Avenue, New Haven, Connecticut 06515; and †Department of Biology and School of Forestry and Environmental Studies, Yale University, New Haven, Connecticut 06520
Received July 12, 1991; revised October 8, 1991
Gene mapping assigns chromosomal coordinates to genetic loci based on analysis of fragmentary ordering and metric data. In assembling genetic maps, geneticists use rules of inference to derive new facts about order and distance between loci from experimentally derived conclusions about order and distance. They construct comprehensive maps by merging related sets of data and resolving conflicts between them. In this article we describe software that formalizes and automates some of these rules of inference to yield a useful map construction utility called CPROP. © 1992 Academic Press, Inc.
1. INTRODUCTION
Gene mapping assigns chromosomal coordinates to genetic loci based on analysis of fragmentary ordering and metric data. In assembling genetic maps, geneticists use rules of inference to derive new facts about order and distance between loci from experimentally derived conclusions about order and distance. The software we describe here formalizes and automates some of these rules of inference to yield a useful map construction utility called CPROP.¹ CPROP was developed to assist in the construction of maps of the Escherichia coli genome. In the past this task has been performed by hand analysis of published genetic and physical data (Taylor and Adelberg, 1960; Taylor and Thoman, 1964; Taylor and Trotter, 1967; Bachmann et al., 1976; Bachmann and Low, 1980; Bachmann, 1983, 1990), but the recent computerization of E. coli mapping data in the E. coli Genetic Stock Center (CGSC) database created an opportunity to provide automated support for the data analysis required for mapmaking. Nothing about CPROP is specific to bacterial genetics, however; it appears to be broadly applicable to other organisms. CPROP has a very general notion of genetic data that allows it to accept information from a variety of experimental sources and at different levels of resolution: for example, distances from sequencing, restriction maps, or cotransduction data.

¹ The name CPROP (pronounced See-prop) is a contraction of constraint propagator, the term for the type of inference algorithm used by the program. See Section 3 for details.

CPROP accepts as input a set of constraints on the positions and order of loci derived from experimental data. It exhaustively infers logical consequences of those constraints to derive the tightest possible bounds on positions, as well as the most complete partial orderings of the markers. It also detects and helps identify inconsistencies in the data. The CGSC database has been extended with capabilities for storing various types of experimental data, automatically converting them into the constraint format required by CPROP, retrieving and selecting datasets to be analyzed by CPROP, and generating diagrams and storing the resulting maps.

Our formalization of the map assembly problem draws on ideas from the artificial intelligence (henceforth AI) literature on planning and temporal reasoning, which describes approaches to the problem of task scheduling, i.e., scheduling a series of tasks to be performed subject to duration, order, and deadline constraints (Sacerdoti, 1977; Dean, 1985). Task scheduling shares with genetic mapping the concern with constructing a one-dimensional "map" that satisfies given constraints. Some details vary; for example, in the genetic problem one frequently finds local maps that lack a global orientation. To represent and reason about such locally ordered clusters requires some novel ideas. On the other hand, many of the complexities of temporal reasoning such as deadlines or the persistence of facts are absent from the genetic case.

One of the key ideas in planning research is the use of a task network representation which explicitly represents the constraints on a set of tasks (Sacerdoti, 1977; Dean, 1985). For example, the fact that task A must occur before task B may be represented by a directed link between two nodes labeled A and B. A network of such nodes and links can compactly denote an exponentially larger set of schedules without explicitly enumerating them.
Further planning activities may result in new constraints being added to the network, thereby reducing the set of schedules denoted by it. Typically, the conversion from a task network to a complete schedule occurs only when the plan is ready to be executed; this is called least commitment planning because tasks are never ordered until they have to be. We adopt an analogous approach in our representation of genetic maps: we present
a totally ordered sequence of loci only if the data unambiguously support a total ordering. We believe it is more useful in general to show an incomplete ordering that faithfully reflects the available data and that makes explicit where the uncertainty is. We can contrast this with approaches based on maximum likelihood, which enumerate complete orderings in order of likelihood²; to continue the planning analogy, maximum likelihood methods are like planners that generate and evaluate complete schedules. The full information content of the data is represented in the distribution of likelihoods over all possible maps of the loci. Such a likelihood distribution is not something that is readily communicated or understood, however. By contrast, CPROP constructs a relatively compact representation that makes clear what information is and is not available about the spatial relationships among the loci.
2. REPRESENTATION OF MAPS
Despite the variety of experimental types and the forms of primary data, including sequences, restriction fragment data, recombination frequencies, transduction data, and times of entry, genetic mapping experiments ultimately yield two main kinds of facts. The first is a distance assertion: it says that the distance between two markers³ A and B is some value. The second says that A is before B in some set of locally ordered markers, whose global orientation with respect to the chromosome (or any other frame of reference) may be known or unknown. We refer to these two statement types as metric and ordering assertions, respectively.

² Although we can compare CPROP's representation of map information with that used in maximum likelihood analysis, the primary tasks of these techniques differ. Maximum likelihood methods are used to derive map conclusions from pedigree data, whereas CPROP is used to integrate map information from numerous disparate sources. CPROP uses mostly derived, or secondary, data; statistical analysis of the data occurs prior to input and is used to provide CPROP with uncertainty measures. CPROP is intended primarily for assimilating information into a developing map database. Issues of coupling CPROP to primary analysis software are discussed in Section 6.

³ Note that we consider markers to be geometric points; if the map being developed is expected to contain data of such high resolution that genes are viewed as intervals, then the left and right endpoints of these intervals can be represented as distinct points.

To represent experimental uncertainty explicitly in the measurement of intermarker distances, we use uncertainty intervals rather than single numbers. An uncertainty interval can be represented by a pair of numbers: a lower bound and an upper bound on the true distance. Uncertainty intervals are similar to the standard statistical presentation of data as a mean ± a standard error term, except for one detail: CPROP in principle requires not the 95 or 99% confidence intervals typically used in statistical analysis, but 100% confidence intervals. Although in theory a 100% confidence interval would have to be infinite, or for a linkage group, the entire length of the chromosome, in practice we use a rather smaller uncertainty interval and rely on CPROP to catch any inconsistencies that may arise as a result.⁴

Our current application assigns uncertainties to particular types of bacterial genetic measurements in a partially ad hoc manner, for two reasons. First, bacterial genetic data are not routinely subjected to statistical analysis, but rather are often based on a single, large sampling of each population. Sampling variation is overshadowed by biological variation (for example, background effects on recombination frequency) that is difficult to define and quantitate. Input data are drawn primarily from such published experiments, which do not include standard errors. Second, even given a standard error, the decision to treat a 95, or 99, or 99.9% confidence interval as a nominal 100% confidence interval is inherently ad hoc. By default, we assign a relative error of 20% for bacterial cotransduction measurements. However, in cases in which several experiments have provided measurements of the same distance, these measurements may yield a true mean and standard error, which can then be used to derive a nondefault uncertainty interval. This is a preprocessing step, however; it is not at present part of the operation of CPROP. We use error terms for other sources of data based on the resolution of the technique. (For distances obtained from DNA sequencing of regions, the errors are considered nearly negligible.) The assignment of appropriate error intervals based on biological considerations is probably the most critical factor for obtaining meaningful results. Using intervals that are too small will cause contradictions; overly generous error estimates will lead to unnecessary ambiguity in the map.⁴

Ordering information is represented as pairwise constraints between loci, relative to a specified frame of reference. We call an ordering frame of reference a local ordering window or LOW. An ordering constraint says that locus A is before locus B in LOW i.
The set of constraints associated with each LOW defines what is called a partial order. A partial order on a set assigns a relative order to some but not all pairs of elements in a set. For example, A before B and A before C is a partial order. A total order, or sequence, orders all pairs of elements. Note that a partial order cannot contain any cycles. LOWs are kept cycle-free by CPROP, which reports a contradiction if a cycle is introduced into a LOW.⁵

⁴ See discussion of contradictions in Section 4 for further details.

⁵ Note that CPROP does not associate any measure of uncertainty or likelihood with ordering information; ordering constraints are treated as if they were completely certain. This is consistent with standard genetic analysis, since in most cases ordering conclusions are stronger than metric ones. The inclusion and propagation of likelihood measures within CPROP would constitute a significant extension of the version presented here. Such an extension would need to solve a number of significant problems. Methods would be needed for assigning likelihoods to input constraints, propagating likelihoods to newly inferred constraints, and integrating likelihoods when the same constraints are repeatedly inferred by different combinations of rules and data. Finally, the control structure that determines the order in which inferences are drawn would have to be redesigned, perhaps in such a way as to infer new constraints in maximum likelihood order or to explore the (large) search space of internally consistent but mutually inconsistent constraint sets in some efficient manner.

Loci that are ordered in one local ordering window may be oppositely ordered or not ordered at all in another. To represent a constraint like B is between A and C, we create a local ordering window i with two assertions:

    In LOW i, A before B
    In LOW i, B before C.

To represent a second betweenness assertion, such as D is between C and E, we would create a new LOW j:

    In LOW j, C before D
    In LOW j, D before E.

Although the LOWs i and j share locus C, they cannot be combined, because we have no way of knowing whether before in i means the same as before in j, or if it means the opposite. In the former case the order would have to be ABCDE, whereas in the latter it could be AEBDC, EADBC, and so on. Thus, keeping constraints segregated into different LOWs allows us to explicitly represent uncertainty about how one set of ordering constraints is oriented relative to others.

In general, we can have both distance and ordering information relating a pair of markers. Thus CPROP's complete representation of a constraint is a tuple containing the following pieces of information:

    (locusA, locusB, LOW, LowerBound, UpperBound)

Loci and LOWs can be represented by any sort of unique symbol or ID number; integers are convenient for LOWs because negation can then be interpreted as the mirror image LOW; i.e., if A is before B in LOW #i, then B is before A in LOW #-i. Constraints can contain purely metric information, purely ordering information, or a mixture of both. A pure metric constraint is one where the LOW is initially null. A pure ordering constraint is one where the lower and upper bounds are null.

Translating experimental data into the constraint representation is usually straightforward. In bacteria the most complex case is cotransduction data; for these we have developed a program, called COTRANS, which automatically derives constraints in the format required by CPROP (Berlyn and Letovsky, manuscript in preparation). In general, distances between loci are translated into metric constraints with appropriate uncertainty intervals, while betweenness constraints (trios of loci for which the middle locus is known) are translated into a pair of ordering constraints in a newly generated local ordering window. Data from published genetic maps can also be included as ordering constraints or as distances or both.

The usefulness of the constraint representation derives from its ability to represent three different types of uncertainty:

• Metric uncertainty is captured by the uncertainty intervals used to represent distances between markers.
• Ordering uncertainty is captured by representing orderings as sets of pairwise constraints between markers, so that orders can be partial as well as total.
• Uncertainty in the relative orientation of different sets of partially ordered markers can be captured by representing each partial order as a distinct local ordering window.

3. MAP ASSEMBLY

The algorithm used to assemble maps in CPROP uses a form of mechanical inference called constraint propagation, which has been usefully applied to the task scheduling problem (Dean, 1985; Dechter and Pearl, 1987). Constraint propagation is an AI programming technique that is commonly used to prune the available alternatives in search problems. Constraint propagation systems implicitly represent the set of possible solutions to a problem as a set of constraints that any solution must satisfy, rather than by exhaustively representing the individual solutions. A simple example of this style of representation is the use of a partial order to represent the set of total orders compatible with it.

Constraint propagation systems draw derived constraints from a set of input constraints, using a set of inference rules. For example, for a partial order, one might use the transitivity of ordering property⁶ as an inference rule to derive new orderings. Different inference rules are appropriate in different kinds of constraint systems, but the general function of such rules is to make explicit aspects of the available information that might prune the set of solutions denoted by the representation. The task of a constraint propagation system is to compute the closure of a set of inference rules over the data set, i.e., to draw all possible inferences licensed by the rules. For this procedure to terminate, the number of inferences must be finite. They should also preferably have a computational cost at worst polynomial in the size of the dataset, or else the algorithm will be impractical on large problems. Some common inference rules and their associated closures are transitivity of ordering and transitive closure; substitution of equals for equals and congruence closure; universal instantiation and exhaustive forward chaining; and a variety of arithmetic and geometric rules and closures (McAllester, 1987; McDermott, 1983; Davis, 1981).

⁶ If A is before B and B is before C, then A is before C.

Typically, constraint propagation systems are implemented in an incremental manner, so that adding a new constraint requires only small amounts of inference to reestablish the inferential closure. Some systems also support incremental retraction of constraints; this requires maintenance of data dependencies (Charniak et al., 1980) (also called truth maintenance) showing which input constraints support which derived constraints, so that the derived constraints can be retracted when the input constraints supporting them are retracted.

[Figure 1: diagrams of the inference rules. Panels: Merge LOWs; Transitive Closure of Ordering; Interval Intersection; Triangle Equality; Derived Order; Disjoint Order; Clipping. Legend: Order; Distance Uncertainty; Equality; Above implies below.]

FIG. 1. CPROP inference rules.

3.1 Inference Rules

This section describes the inference rules used in CPROP. See Fig. 1 for visual intuitions. Note that in our constraint representation, each local ordering window incorporates an arbitrary choice of the ordering relation before. It should be the case that reflection of any of the input LOWs has no effect on CPROP's output other than a possible reflection of some of the output LOWs. This property can be achieved either by maintaining two mirror versions of each LOW or by having a single version of each LOW and implementing mirror image versions of the inference rules where necessary. In successive implementations of CPROP, we have tried both approaches; the former appears to be simpler.

Transitive closure of ordering. This rule says that if A is before B and B is before C in the same LOW, then conclude that A is before C in that LOW. The inferences produced by this rule are not very interesting in themselves, but they often allow other rules to fire, which may then produce more interesting conclusions.
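As a rough illustration of the constraint tuple, the mirror-image convention for LOW ID numbers, and the transitive closure rule, the following Python sketch may help. It is not CPROP's implementation: the names `Constraint`, `mirror`, and `transitive_closure` are hypothetical, and the naive fixed-point loop is chosen for clarity rather than efficiency.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Constraint(NamedTuple):
    """Hypothetical container for (locusA, locusB, LOW, LowerBound, UpperBound)."""
    locus_a: str
    locus_b: str
    low: Optional[int] = None      # None: purely metric constraint
    lower: Optional[float] = None  # None (with upper): purely ordering constraint
    upper: Optional[float] = None


def mirror(c: Constraint) -> Constraint:
    """Restate a constraint in the mirror-image LOW.

    'A before B in LOW i' becomes 'B before A in LOW -i'; distances are unchanged.
    """
    low = None if c.low is None else -c.low
    return Constraint(c.locus_b, c.locus_a, low, c.lower, c.upper)


# (LOW id, a, b) is read as "a is before b in that LOW".
Ordering = Tuple[int, str, str]


def transitive_closure(orderings: Set[Ordering]) -> Set[Ordering]:
    """Apply 'a before b, b before c => a before c' within each LOW until nothing changes.

    Raises ValueError if the rule would make a locus precede itself (a cycle).
    """
    closed = set(orderings)
    changed = True
    while changed:
        changed = False
        for low1, a, b in list(closed):
            for low2, b2, c in list(closed):
                if low1 != low2 or b != b2:
                    continue  # the rule only chains facts inside one LOW
                if a == c:
                    raise ValueError(f"cycle in LOW {low1} involving {a} and {b}")
                if (low1, a, c) not in closed:
                    closed.add((low1, a, c))
                    changed = True
    return closed


if __name__ == "__main__":
    facts = {(1, "A", "B"), (1, "B", "C"), (2, "C", "D")}
    print(sorted(transitive_closure(facts)))
    print(mirror(Constraint("A", "B", low=3)))
```

In this sketch the fact about C and D stays in LOW 2 and is never chained with the facts in LOW 1, which reflects the point that separate LOWs carry no shared orientation until a merge rule combines them.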
Merge local ordering windows. Two LOWs can be merged whenever they both order the same pair of markers. There are two variants of this rule, depending on whether the LOWs agree or disagree on the order. To merge two LOWs, the constraints associated with one LOW get their LOW ID#s changed to the ID# of the other LOW or its negation, depending on whether the two LOWs agree or disagree, respectively, about the order of the markers.

Interval intersection. Given different uncertainty intervals for the distance between loci A and B, the true distance should lie in the intersection of the intervals. This rule is the basis of the inequality data dependency system described in McDermott (1983). Whenever new distance constraints are derived by CPROP, they are immediately intersected with the current tightest bounds on the distance to yield tighter bounds, if possible. The system therefore always works with the current tightest bounds available on all distances.

Triangle equality. The triangle inequality, which says that any two sides of a triangle must have a combined length greater than or equal to the third side, is an equality in one dimension; it represents the additivity of genetic distances. All the remaining rules are distinct operational forms of this equality. The three labeled Triangle Equality in Fig. 1 derive one side of a one-dimensional triangle given two of the others, plus some ordering information. The first rule adds the two short sides to get the long side; the others subtract the left or right short side from the long side to get the other short side. Note that addition and subtraction here refer to operations on uncertainty intervals rather than on single numbers. Interval addition adds the corresponding upper and lower bounds. Interval subtraction is a bit more complex to ensure that uncertainty always increases as measurements are combined:

INTERVAL ADDITION: $[l_1, u_1] + [l_2, u_2] = [l_1 + l_2,\ u_1 + u_2]$

INTERVAL SUBTRACTION: $[l_1, u_1] - [l_2, u_2] = [l_1 - u_2,\ u_1 - l_2]$

Here $l_1$ and $u_1$ are the lower and upper bounds, respectively, of the first uncertainty interval, etc.

Derived order. Another variant of the triangle equality says that if all three sides of a triangle are known, then the order of the vertices can be inferred. Since our distances are only known to within uncertainty intervals, it is not enough to show that two sides add up to an interval that intersects the third; we must also show that no other pair of sides can do the same. Thus, if there is too much fuzz in the distances, this rule can fail to apply. If the rule does apply, a new LOW ID# is created to represent the order.

Disjoint order. Disjoint order applies to two unordered nodes whose distances to a common vertex are known, in the same direction, and whose uncertainty
intervals are disjoint. It is then possible to determine from the distances which of them is closer to the common vertex, and thereby to derive an order. Note that there is a mirror image variant of this rule, which is not shown.

Clipping. These rules restrict the uncertainty intervals of two distances from a common point, where the order of all three points is known. If the loci are A, B, and C, in that order, then the lower bound on the AC distance cannot be less than the lower bound on the AB distance; if it is, the rule clips it so that it is not. A similar rule applies to upper bounds, and both occur in mirror image forms, for a total of four rules.

The above rule set is not complete for constraint maps, in that it does not derive all orders, distances, and distance uncertainty reductions implied by the data. It has been shown that the problem of determining the satisfiability of a set of betweenness constraints, which is a subproblem of what CPROP computes, is NP-complete⁷ (Opatrny, 1979), whereas the function computed by CPROP has polynomial complexity. Our LOWMerge rule examines only pairwise consistency of LOWs, but N-way merges or conflicts, for N > 2, are also possible, and these can be detected only by enumeration. However, our experience suggests that CPROP is as complete as human judgment: an earlier version of the program lacked the clipping and disjoint order rules and produced maps that were inferior to those we were able to infer from the same data. With the addition of these rules, we have detected no further holes in its performance.

3.2 Assumptions

A series of assumptions, or idealizations, underlies the rules used by the constraint propagator. These include:

The triangle equality holds. The triangle equality assumes a one-dimensional Euclidean geometry, i.e., that distances add properly. This assumption may fail for a number of reasons.
Genetic maps are typically intended to reflect positions of genes on the wildtype (or other standard) chromosome, but they necessarily incorporate data obtained from a multitude of mutant strains. Differences in chromosome geometry produced by structural mutations in different strains, as well as other types of strain-specific or allele-specific variations in the relationship between recombination frequency and distance, could undermine the Euclidean assumption. The known structural mutations are dealt with by corrections to wildtype geometry prior to CPROP input and thus should not cause a contradiction. Other deviations affecting recombination must be handled by deletion or weakening of the constraints involved. Our database preserves the record of these changes and the links to the original experiments, strains, and alleles, so that patterns can be traced or changes made in the original conclusions when new evidence is introduced.

⁷ Thanks to Frank Olken for pointing this out.
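The interval arithmetic behind the triangle equality, together with one reading of the clipping rule from Section 3.1, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: intervals are plain (lower, upper) tuples, all function names are hypothetical, and the upper-bound clip (the AB distance cannot exceed the AC distance) is an interpretation of the "similar rule" for upper bounds, not something spelled out in the text.

```python
from typing import Tuple

# An uncertainty interval: (lower bound, upper bound) on a distance.
Interval = Tuple[float, float]


def interval_add(x: Interval, y: Interval) -> Interval:
    """[l1, u1] + [l2, u2] = [l1 + l2, u1 + u2]."""
    return (x[0] + y[0], x[1] + y[1])


def interval_sub(x: Interval, y: Interval) -> Interval:
    """[l1, u1] - [l2, u2] = [l1 - u2, u1 - l2].

    The width of the result is the sum of the two input widths,
    so uncertainty never shrinks when measurements are combined.
    """
    return (x[0] - y[1], x[1] - y[0])


def long_side(ab: Interval, bc: Interval) -> Interval:
    """Triangle equality, additive form: for loci in order A, B, C, AC = AB + BC."""
    return interval_add(ab, bc)


def short_side(ac: Interval, ab: Interval) -> Interval:
    """Triangle equality, subtractive form: for loci in order A, B, C, BC = AC - AB."""
    return interval_sub(ac, ab)


def clip(ab: Interval, ac: Interval) -> Tuple[Interval, Interval]:
    """Clipping for loci in order A, B, C (assumed reading of the rule).

    AC's lower bound is raised to at least AB's lower bound, and
    AB's upper bound is lowered to at most AC's upper bound.
    """
    ab_clipped = (ab[0], min(ab[1], ac[1]))
    ac_clipped = (max(ac[0], ab[0]), ac[1])
    return ab_clipped, ac_clipped


if __name__ == "__main__":
    ab, bc = (1.0, 1.25), (2.0, 2.25)
    print(long_side(ab, bc))               # bounds on AC
    print(short_side((3.0, 3.5), ab))      # bounds on BC
    print(clip((1.0, 5.0), (0.5, 3.0)))    # clipped AB and AC
```

When a structural mutation or strain-specific effect breaks the additivity assumption, the intervals produced by `long_side` or `short_side` in a sketch like this would fail to overlap the measured interval, which is how such a failure would surface as a contradiction.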
The ordering is ultimately linear. This assumption is, in fact, false for E. coli, in that the E. coli chromosome is a cycle, not a line, and so the constraint propagator is actually using a mathematically inappropriate theory. Revising the inference rules to reflect this fact would be difficult. The notion of order must be changed, because on a circle every point is both before and after every other point. There are two distances between every pair of points on a circle. The uncertainty in the circle's circumference must be somehow taken into account. Fortunately, it is possible to avoid all of these complexities by limiting the use of the constraint propagator to datasets that can be regarded as linear. This means that the set of ordering constraints cannot cycle, even implicitly. Any such cycling will be detected by the constraint propagator, which will report a contradiction. In practice, we can live with this restriction by analyzing only cycle-free subsets of the constraints in any single run of the constraint propagator.

3.3 Result Maps

The constraint propagator takes as input a set of constraints, which can include pure metric constraints, pure ordering constraints, and mixed constraints. There can be more than one constraint for a given pair of markers; this situation arises frequently because different crosses or experiments often produce information about the same marker pair. The input constraints are typically gathered by querying the database for all experimental data bearing on a set of markers of interest. The output of the constraint propagator is, like its input, a set of constraints. The output differs from the input in that it typically includes constraints between nodes that were not directly constrained in the input; it contains at most one constraint per pair of nodes; it ideally contains fewer, but larger, local ordering windows and, ideally, tighter distance bounds.
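Collapsing several input constraints on the same marker pair into the single tightest constraint reported in the output can be pictured with the interval intersection rule. The Python sketch below is illustrative only: the function names and the sample loci and distances are made up, and it ignores ordering information entirely.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Interval = Tuple[float, float]          # (lower bound, upper bound)
Measurement = Tuple[str, str, float, float]  # (locus, locus, lower, upper)


def intersect(x: Interval, y: Interval) -> Interval:
    """Return the overlap of two uncertainty intervals.

    Raises ValueError when they do not overlap, i.e. when no distance
    satisfies both constraints.
    """
    lower, upper = max(x[0], y[0]), min(x[1], y[1])
    if lower > upper:
        raise ValueError(f"metric contradiction: {x} and {y} do not overlap")
    return (lower, upper)


def tightest_bounds(measurements: Iterable[Measurement]) -> Dict[FrozenSet[str], Interval]:
    """Keep one interval per unordered marker pair, intersecting repeats."""
    bounds: Dict[FrozenSet[str], Interval] = {}
    for a, b, lower, upper in measurements:
        pair = frozenset((a, b))  # distance is symmetric, so (a, b) == (b, a)
        new = (lower, upper)
        bounds[pair] = intersect(bounds[pair], new) if pair in bounds else new
    return bounds


if __name__ == "__main__":
    # Hypothetical measurements: two experiments report the A-B distance.
    data = [("A", "B", 0.25, 0.5), ("B", "A", 0.375, 0.75), ("A", "C", 0.5, 0.75)]
    for pair, interval in tightest_bounds(data).items():
        print(sorted(pair), interval)
```

Keying on an unordered pair is a design choice for this sketch; a fuller version would also carry the LOW and ordering fields of each constraint.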
The output is presented to the user in two different ways:

Transitive reduction of the constraint set. CPROP generates the transitive closure of both distance addition and ordering, which tends to produce on the order of n² constraints for n markers. Many of these constraints are trivial consequences of other constraints. It is important that the system deduce these constraints, since sometimes they will lead to useful conclusions, but it is not necessary to confront the user with all of them. To see more clearly what kind of redundant information the system is capable of inferring, consider what happens when it is given as input a completely ordered set of markers, i.e., a single local ordering window and pairwise distance constraints between adjacent markers only. The results are no more constrained than the input, of course, but they include the distance bounds between every pair of markers. This illustrates the point that the constraint propagator is not guaranteed to do useful work: it exploits opportunities for constraint propagation if they exist in the data, but not all datasets provide such opportunities. For example, applying the constraint propagator to its own output leaves the result unchanged.

From the user's standpoint, a display of the distances between adjacent points is much more concise and informative than the set of all pairwise distances. A postprocessing step is therefore applied to the output constraints in which any constraint whose LOW and distance intervals are a trivial consequence of other constraints is eliminated: we compute the transitive reduction of the output with respect to the additive version of the triangle equality. Inferential reduction is the opposite of closure: it eliminates all constraints that are derivable by the rules from the remaining constraints. The resulting reduced set of constraints has size of order n for n totally ordered markers. See the constraints table in Fig. 2 for an example.

Reference map. Although the reduced constraint set contains all the information the system has about the marker geometry, it does not make the overall arrangement of the markers readily apparent. We have developed a more intuitive presentation, called a reference map, which contains somewhat less information than the constraint set, but in a more easily understood form. In contrast to constraint sets, reference maps are easily diagrammed. Figure 3 shows a reference map diagram generated by the system. Reference maps are also displayed in tabular form, as shown in Fig. 2. The reference map table for a local ordering window contains a row for every marker in the LOW. Each row contains the following information:

ORDER BOUNDS. The column labeled MinPos gives the smallest ordinal position in the sequence compatible with the constraints on the marker. The MaxPos column gives the largest such number. The MinPos and MaxPos values permit an intuitive presentation of the ordering possibilities compatible with a partial order. For example, the relative order of dld and zed704 in Fig. 2 is uncertain, as is the order of nfo and zee700.

COORDINATES RELATIVE TO A REFERENCE SITE.
Each site has a Lower and Upper field, which specify the bounds on the distance between that site and a single site in the LOW, called the reference. In Fig. 2 the reference was chosen to be cdd. Currently, we select as the default reference marker a site having minimal average uncertainty in its metric constraints; however, users can override this choice and select any marker they wish as the reference. Reference maps can lose both ordering and metric information that is present in the full constraint set. Metric information may be lost because the reference map throws out all pairwise distances that do not involve the reference. As a general rule, uncertainty intervals in a reference map will tend to increase in size with distance from the reference. Sometimes there is no choice of reference that gives minimal uncertainties for all sites. In Fig. 3 choosing gatA as the reference would have given a slightly tighter interval for udk (0.30 instead of 0.41) but other intervals would then have been larger. In this case the difference is small, but in other maps the tradeoffs involved in choosing a reference can be more substantial. Reference maps can lose ordering information, since the MinPos and MaxPos fields cannot represent orderings among sites whose ordinal ranges overlap. In practice, these information losses are a minor disadvantage compared to the greater accessibility of reference maps as a data presentation technique.

The output constraints derived by the constraint propagator can be stored, upon request, in a database table called the CMap, for constraint map or canonical map, since it is intended as the repository of all currently believed constraints. The CMap is ultimately expected to contain a reduced constraint set for all marker pairs that have been studied so far. Within the CMap, a special local ordering window (LOW#1) is used to represent the global frame of reference of the E. coli chromosome. As new data arrive, they are analyzed together with relevant data from the CMap, so that their consistency with previous assertions can be ascertained and the new data can be used to further tighten the constraints in the CMap.

CPROP RESULTS

Reference maps

LOW   Site     MinPos  MaxPos   Lower   Upper
49    his         1       1     -2.3    -2.1
      udk         2       2     -1.7    -1.3
      gatA        3       3     -1.1    -1.0
      dld         4       5     -0.8    -0.6
      zed704      4       5     -0.7    -0.6
      cdd         6       6      0       0
      mgl         7       7      .37     .4
      cirA        8       8      .57     .6
      nfo         9      10      .67     .72
      zee700      9      10      .68     .8
      fruK       11      11     1.02    1.07

Constraints

LOW   Before   After    Lower   Upper
49    his      cdd      2.19    2.3
      his      dld      1.39    1.65
      his      gatA     1.15    1.26
      his      udk       .54     .95
      his      zed704   1.52    1.63
      udk      cirA     1.91    2.35
      udk      gatA      .31     .61
      udk      mgl      1.71    2.15
      udk      zed704    .68    1.09
      udk      zee700   2.03    2.55
      gatA     cdd      1.04    1.15
      gatA     cirA     1.61    1.75
      gatA     dld       .24     .5
      gatA     zed704    .37     .48
      dld      cdd       .65     .8
      dld      cirA     1.22    1.39
      dld      mgl      1.02    1.19
      zed704   cdd       .67     .78
      zed704   mgl      1.04    1.18
      cdd      cirA      .57     .6
      cdd      fruK     1.02    1.07
      cdd      mgl       .37     .4
      cdd      nfo       .67     .72
      cdd      zee700    .68     .8
      mgl      cirA      .18     .2
      mgl      fruK      .65     .68
      mgl      zee700    .31     .4
      cirA     fruK      .45     .48
      cirA     nfo       .1      .12
      cirA     zee700    .11     .22
      nfo      fruK      .33     .38
      zee700   fruK      .25     .34

FIG. 2. CPROP results. The input to CPROP consisted of 89 constraints derived from cotransduction and restriction map data. The cotransduction constraints were generated by applying our COTRANS program to data from 28 selections on 10 multimarker crosses, described in Josephsen et al. (1983) and Middendorf et al. (1984). The restriction map constraints involve distances between three loci, based on unpublished data from B. Weiss (personal communication). These inputs resulted in over 206 inferences and 6 contradictions requiring resolution. (See discussion of contradictions in Section 4.) In the output, all 11 loci were grouped into a single LOW. The upper table shows a reference map for this LOW; the lower table shows the reduced output constraints.
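To illustrate how a reference map relates to the pairwise constraints, the following is a minimal sketch in Python. It is not CPROP's own code: it treats each constraint "A before B at distance [lower, upper]" as a bound on a difference of positions and tightens those bounds with a Floyd-Warshall shortest-path pass, a standard technique for difference constraints that is substituted here for CPROP's rule-based propagation. The function name `reference_map` and the dictionary layout are assumptions made for illustration; the sample values are five of the reduced constraints listed in Fig. 2.

```python
from itertools import product


def reference_map(constraints, reference):
    """Signed distance bounds of every site relative to `reference`.

    constraints: {(before, after): (lower, upper)}, meaning that
    pos[after] - pos[before] lies in [lower, upper].
    Returns {site: (lower, upper)} for pos[site] - pos[reference].
    (Hypothetical helper, not part of CPROP.)
    """
    sites = sorted({s for pair in constraints for s in pair})
    inf = float("inf")
    # d[a][b] is the tightest known upper bound on pos[b] - pos[a].
    d = {a: {b: (0.0 if a == b else inf) for b in sites} for a in sites}
    for (before, after), (lo, hi) in constraints.items():
        d[before][after] = min(d[before][after], hi)
        d[after][before] = min(d[after][before], -lo)
    # Floyd-Warshall: the intermediate site k varies slowest.
    for k, i, j in product(sites, repeat=3):
        if d[i][k] + d[k][j] < d[i][j]:
            d[i][j] = d[i][k] + d[k][j]
    return {s: (-d[s][reference], d[reference][s]) for s in sites}


# Five reduced constraints taken from Fig. 2 (LOW#49).
FIG2_SUBSET = {
    ("his", "cdd"): (2.19, 2.3),
    ("udk", "gatA"): (0.31, 0.61),
    ("gatA", "cdd"): (1.04, 1.15),
    ("udk", "mgl"): (1.71, 2.15),
    ("cdd", "mgl"): (0.37, 0.4),
}

example = reference_map(FIG2_SUBSET, "cdd")
for site, (lower, upper) in sorted(example.items(), key=lambda kv: kv[1]):
    print(f"{site:6s} {lower:6.2f} {upper:6.2f}")
```

With cdd as the reference, udk has no direct constraint in this subset, so its bounds come from chaining udk to gatA and gatA to cdd, giving roughly -1.76 to -1.35. This is in line with the udk row of the reference map in Fig. 2, and it shows why intervals tend to widen with distance from the reference.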
4. CONTRADICTION DETECTION AND DIAGNOSIS
In the normal course of operation the constraint propagator will exhaustively apply the inference rules to tighten the constraints. If the initial constraint set was overconstrained, so that no possible map can satisfy it, CPROP will generally detect a contradiction. There are three different types of contradictions.

Metric contradiction: A derived metric constraint on a pair of sites has no overlap with the currently believed metric bounds for that pair. This means that no intermarker distance is compatible with the constraints.

Ordering contradiction (cycling): The ordering B before A is added to a local ordering window that already contains A before B.

Derived order contradiction: All the distances along the sides of a marker triangle are known, and none of the markers can be in the middle.

Detection of such contradictions is straightforward. Metric contradictions are detected when tightening metric constraints; ordering contradictions are detected when adding new ordering constraints; and derived order contradictions are detected when examining marker triangles for possible derived orders. Contradiction detection is an important function of the constraint propagator, since it helps the user identify inconsistencies in the input constraints. It is crucial that the system not only detect contradictions but also assist the user in identifying which input constraints are likely to be at fault. The system cannot, even in principle, identify constraints as wrong; at best, it can show that two constraints, or two sets of constraints, lead to incompatible conclusions. It is up to the user to decide which set to believe or whether to relax the constraints by increasing the uncertainty intervals.

[Diagram of LOW#49 not reproduced: a map axis marked at 44, 45, 46, and 47 min, with an I-bar for each locus, labeled his(1), udk(2), gatA(3), dld(4,5), zed704(4,5), cdd(6), mgl(7), cirA(8), nfo(9,10), zee700(9,10), fruK(11).]

FIG. 3. Diagram generated from the reference map in Fig. 2. The reference locus is cdd, and the I-bars indicate the uncertainty interval for the distance of each locus from cdd. The global coordinates were obtained by assigning cdd a position of 46 min. The numbers next to each site indicate the relative order of the loci (i.e., the MinPos and MaxPos of Fig. 2). When the order is ambiguous, the range of possibilities is given.

To explain how CPROP reports contradictions, we need a vocabulary of relevant terms. A contradiction manifests as an interaction of two or three incompatible constraints, which are called the conflict set. Metric and ordering contradictions have two elements in the conflict set; derived orders have three.
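The three contradiction checks described above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not CPROP's implementation; the function names and the tuple representation of intervals are assumptions. The sample triangle uses the three conflict-set distances reported in Fig. 4 (fruK-mgl, mgl-cirA, cirA-fruK).

```python
def intersect(a, b):
    """Overlap of two closed intervals, or None if they are disjoint.

    A None result for a derived and a currently believed interval on the
    same pair of sites corresponds to a metric contradiction.
    """
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def ordering_contradiction(orderings, new):
    """True if adding `new` = (before, after) reverses a stored ordering.

    `orderings` is the set of (before, after) pairs in one local
    ordering window (assumed already closed under transitivity).
    """
    before, after = new
    return (after, before) in orderings


def derived_order_contradiction(ab, bc, ac):
    """True if none of the sites A, B, C can lie between the other two.

    ab, bc, ac are the distance intervals along the sides of the triangle.
    A site can be in the middle only if the sum of its two adjacent sides
    overlaps the interval for the opposite side.
    """
    def can_be_middle(opposite, side1, side2):
        total = (side1[0] + side2[0], side1[1] + side2[1])
        return intersect(opposite, total) is not None

    return not (
        can_be_middle(ac, ab, bc)      # B between A and C
        or can_be_middle(ab, ac, bc)   # C between A and B
        or can_be_middle(bc, ab, ac)   # A between B and C
    )


# Conflict set of Fig. 4: A = fruK, B = mgl, C = cirA.
fruK_mgl = (0.71, 0.80)
mgl_cirA = (0.18, 0.20)
cirA_fruK = (0.39, 0.48)
print(derived_order_contradiction(fruK_mgl, mgl_cirA, cirA_fruK))
```

For the Fig. 4 triangle every candidate middle site fails: for example, placing mgl between fruK and cirA would require a fruK-cirA distance of about 0.89 to 1.00, which does not overlap the stated 0.39 to 0.48.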
The constraints manipulated by the system are either primary, i.e., supplied as input, or derived by the constraint propagator using the inference rules. Each derived constraint has an explanation, which is a tree of support, having primary assertions as its leaves and the derived constraint as its root. Each internal node in such trees corresponds to a previously derived constraint resulting from the application of one of the system’s rules of inference. An explanation tree thus records the subset of the constraint propagator’s inferential activity which was relevant to producing a derived constraint. Maintenance of such inferential data dependencies is a standard AI programming technique used to support explanation and retraction of inferences (Charniak et al., 1980). The set of primary constraints at the leaves of a derived constraint’s explanation tree will be called its primary support. When CPROP finds a contradiction, it displays the conflict set, along with the primary support and an explanation tree for each derived assertion in the conflict set (see Fig. 4). In practice, we find the explanation trees useful only when they are fairly small; when they are large, they become incomprehensible. The primary support has emerged as the more easily understood and diagnostically useful display. However, when it is not obvious how a conflict set element was derived from the primary support, the explanation and the log must be studied. We continue to search for more intuitive ways of presenting and summarizing these explanation trees.

The frequency of contradictions is influenced by a number of factors, including the quality of the data and the correctness of the methods of inferring constraints from it. Contradictions can result if the metric uncertainty in the data is underestimated in the constraints. Also, for data sources with significant metric uncertainties, including most types of recombination data, the likelihood that the result is overconstrained (i.e., contradictory) is greater when there are more constraints for a given set of loci. In the dataset used for Fig. 3, there were 89 constraints on 11 loci. All metric constraints from cotransduction experiments were assigned an uncertainty of ±10%. Six contradictions were subsequently detected; 13 constraints were weakened to achieve a conflict-free map. These contradictions are not surprising since variability in data from different transductions and selections is, of course, characteristic of cotransduction data and other measures relating crossover frequency to distance.

Assigning map positions often requires subjective judgments to reconcile conflicting data. With CPROP, the user resolves conflicts interactively by weakening or eliminating constraints from the input dataset. An appropriate resolution is often suggested by examination of the input constraints or the experimental data they are based on. For example, when four input constraints on the same pair of sites are consistent, the rejection of a single conflicting fifth constraint may be an appropriate resolution.
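The "four consistent, one conflicting" situation in the example above can be made concrete with a short Python sketch. It is only an aid to the user's judgment, not a procedure that CPROP is described as performing: given several intervals for the same pair of sites, it finds a largest group that shares a common point and lists the rest as candidates for rejection. The function name and the interval values are hypothetical.

```python
def largest_consistent_subset(intervals):
    """Split intervals on one pair of sites into (kept, rejected) indices.

    `kept` is a largest group of closed intervals with a common point.
    If a group has a common point, the largest lower bound in the group
    is such a point, so testing each lower bound is sufficient.
    """
    best = []
    for point, _ in intervals:
        covering = [i for i, (lo, hi) in enumerate(intervals) if lo <= point <= hi]
        if len(covering) > len(best):
            best = covering
    rejected = [i for i in range(len(intervals)) if i not in best]
    return best, rejected


# Hypothetical input constraints on a single pair of sites (map units).
observed = [
    (0.17, 0.25),
    (0.18, 0.24),
    (0.20, 0.26),
    (0.19, 0.23),
    (0.37, 0.42),  # the odd one out
]
kept, rejected = largest_consistent_subset(observed)
print("mutually consistent:", kept)
print("candidates for rejection:", rejected)
```

Whether the outlying constraint should actually be dropped remains a judgment for the user, who may know, for instance, that it rests on a more reliable type of data than the others.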
In some cases a low population size for the cross underlying one member of a conflicting pair of cotransduction-derived constraints may argue for its elimination. In other cases, small uncertainty intervals fail to overlap only by a small distance, and unioning the intervals provides a good resolution. When the conflict involves different types of data, more weight is given to distances obtained from sequencing and restriction data. Of course, there are cases that do not in themselves suggest a preferred resolution. Increasing the uncertainty interval of each constraint or unioning the conflicting intervals may then be used to resolve the conflict. The variety and subjective nature of different resolutions, as well as the possible future ramifications of the decision, emphasize the importance of preserving the record and allowing reversal of an earlier remedy. Our database-embedded implementation of CPROP provides such capabilities.

5. IMPLEMENTATION
The constraint propagator could be most simply implemented on top of an inference engine that supports conjunctive forward chaining, maintenance of data dependencies, and arithmetic processing in the rules. In such a system a declarative notation of CPROP’s inference rules would be directly executable. It is likely that some of the many commercial inference engines would be adequate for this purpose, although it is not clear that they would be efficient. Our first implementation was designed to integrate easily into an existing relational database system, and it was implemented in an extended version of SQL. It was not notably efficient; it takes about 20 s to analyze the constraints from a single experiment, and as long as 3 h for a set of constraints on 20 loci derived from multiple experiments. Recently, we have reimplemented CPROP in C, so that the latter problem now runs in less than a second on a Sun workstation. The C version is implemented as a standalone program independent of our database, so that it can be easily used by others.* We still use our database system to retrieve and store constraints, experimental data, and the CMap and to provide a user interface to CPROP, but other users can integrate CPROP into their own data management environments as they choose.

* This version is available from the authors on request.

#    Source             Constraint                       Support
79   Triangle Sum       fruK --- cdd = [0.82,1.05]       (1,55,2,66)
80   Intersection       fruK --- cdd = [0.96,1.05]       (62,79)
81   Triangle Left      fruK --- mgl = [0.71,0.88]       (1,80,2,66)
82   Intersection       fruK --- mgl = [0.71,0.80]       (55,81)
83   Disjoint Order     zee700 < mgl in LOW#52           (25,59,1,82)
84   Triangle Right     zee700 --- mgl = [0.37,0.55]     (25,59,83,82)

*** Derived Order Contradiction!! ***

Conflict Set:
82   Intersection       fruK --- mgl = [0.71,0.80]       (55,81)
77   Rst#I7011          mgl --- cirA = [0.18,0.20]       (input)
60   COT-Dist#I4154     cirA --- fruK = [0.39,0.48]      (input)

Primary Support:
1    COT-CM#I5700       fruK < mgl in LOW#52
2    COT-CM#I5700       mgl < cdd in LOW#52
55   COT-Dist#I5700     fruK --- mgl = [0.65,0.80]
62   COT-Dist#I4436     cdd --- fruK = [0.96,1.18]
66   COT-Dist#I4872     cdd --- mgl = [0.17,0.25]

Explanation Tree:
82 Intersection                 fruK --- mgl = [0.71,0.80]
  55 COT-Dist#I5700             fruK --- mgl = [0.65,0.80]
  81 Triangle Left              fruK --- mgl = [0.71,0.88]
    1 COT-CM#I5700              fruK < mgl in LOW#52
    80 Intersection             fruK --- cdd = [0.96,1.05]
      62 COT-Dist#I4436         cdd --- fruK = [0.96,1.18]
      79 Triangle Sum           fruK --- cdd = [0.82,1.05]
        1 COT-CM#I5700          fruK < mgl in LOW#52
        55 COT-Dist#I5700       fruK --- mgl = [0.65,0.80]
        2 COT-CM#I5700          mgl < cdd in LOW#52
        66 COT-Dist#I4872       cdd --- mgl = [0.17,0.25]
    2 COT-CM#I5700              mgl < cdd in LOW#52
    66 COT-Dist#I4872           cdd --- mgl = [0.17,0.25]

FIG. 4. Contradiction. The figure shows a segment of a processing log in which a contradiction was encountered during one of the early CPROP runs in the construction of the map in Fig. 3. The numbered lines at the top are generated during constraint propagation; each describes an inference made by CPROP. The column labeled “Source” gives the name of the rule responsible for the inference; the column labeled “Constraint” shows the inferred constraint, either metric or ordering; and the column labeled “Support” gives the row numbers of the inferences on which the current inference was based. The derived-order contradiction occurs because none of the three sites fruK, mgl, and cirA can be in the middle, given the currently inferred distances between them. These distances are shown in the conflict set. Two of the conflict set elements are input constraints, while the third is inferred, and its primary support and explanation tree are displayed. Each row of the explanation is indented slightly more than the inference it supports; e.g., constraint 81 is supported by 1, 80, 2, and 66. In examining this conflict it was noted that other constraints for the cdd-mgl distance were quite variable, ranging from 0.17 to 0.42. The conflict was resolved by unioning all of these intervals.

Figure 5 shows a block diagram of the components of CPROP and their integration into the CGSC database. Boxes correspond to processes, and ellipses to data structures. Following the flow counterclockwise from the upper right corner, we see different types of experiments being analyzed in different ways to yield constraints in a common format for storage in the constraint database. The user of CPROP queries this database and retrieves a set of constraints that is supplied as input to CPROP. CPROP applies its inference rules exhaustively to these constraints to generate new constraints, which are fed back to the rules until quiescence is reached. CPROP’s outputs include the final set of output constraints, the reduced output constraints, and the reference maps, plus a processing log that records the
inputs, results, and all of the inferences. If the user chooses, the reduced constraints can be incorporated into the CMap portion of the constraint database, and so be available for inclusion in the input to future runs of CPROP. Any contradictions uncovered during the CPROP run are presented to the user for interactive conflict resolution, which can result in the removal and/or addition of constraints from the database. A record of the conflict and its resolution is stored as well. Note that the publicly available C implementation of CPROP corresponds only to what is in the box in the lower left of Fig. 5; the other components are integrated into the CGSC database.

[Block diagram not reproduced. Legible labels include "Experiment Type", "Inference Rules", "Output Constraints", and the user actions "formulates", "selects", and "controls".]

FIG. 5. Overview of CPROP embedded in a database environment.

The C implementation represents the constraints on a set of N loci using an N × N matrix. The rows and columns of the matrix are labeled with the locus names;
each cell contains the LOW ID# and uncertainty interval for the pair of loci that index it. The diagonal elements of this matrix are degenerate; they correspond to constraints between a locus and itself. The lower and upper triangles are mirror images of each other, i.e., identical except for negation of the LOW ID#. Propagation is applied to both the upper and the lower triangle, thus ensuring the appropriate mirror symmetries for the rules.

The constraint propagation algorithm maintains a queue of constraints that have been updated, which is initialized to the input constraints. For each updated constraint, every rule that might use that constraint as a precondition is checked to see if it can draw a new conclusion. If so, the conclusion is added to the map and the new or updated constraint is added to the queue. Propagation of a single updated constraint ij requires checking all triangles that may contain ij and all possible third loci k.

The time complexity of CPROP’s algorithm is O(n^5), where n is the number of sites. This is because, apart from merging of local ordering windows, it does work proportional to the number of triangles that can be formed among the sites, which is O(n^3). In the worst case this work may be repeated for every merge, and in the worst case there could be O(n^2) merges.

6. DISCUSSION

CPROP is intended to integrate mapping information from numerous sources. We are currently using it to construct regional maps of the E. coli K-12 linkage group, using data derived from sequences, restriction maps, crosses, and transductions. To date, we have applied CPROP to loci that fall within a few 5- to 10-min intervals on the E. coli chromosome. The data shown in Figs. 2 and 3 illustrate its use on one such set. We have not yet applied CPROP to the complete set of 1400 loci of E. coli. Nonetheless, we believe that CPROP’s utility with small regions alone will make it of interest to the broader mapping community. More experience is needed to determine the practicality of the tool with very large datasets.

In assembling maps from bacterial cotransduction data, we use another program called COTRANS,³ which analyzes the primary data and produces constraints as output, which are then fed into CPROP. Other analytical tools could be coupled to CPROP in a similar manner if their outputs can be expressed in CPROP’s constraint format. The most widely used analytical tool for eukaryotes is currently maximum likelihood analysis (see, e.g., Lander et al., 1987), and the question of how best to integrate maximum likelihood results into CPROP is a topic for further investigation. For relatively complete data (experimental populations), CPROP may provide either an alternative or a supplementary way of integrating the output of 2-point and 3-point analyses.
For incomplete, natural population data, the powerful multipoint maximum likelihood methods produce what might be called total maps, i.e., totally ordered loci separated by definite distances. This contrasts with CPROP’s manipulation of a more generalized representation that we call a partial map. A most-likely multipoint total map may contain incorrect orderings; rather than feeding a complete but potentially incorrect map into CPROP, it seems preferable to provide CPROP with some sort of maximum likelihood partial map. Any partial map subsumes (or generalizes) a set of total maps, and the likelihood of a partial map is the sum of the likelihoods of the total maps it subsumes. It follows that the null map of a set of loci, i.e., the partial map that contains no ordering information and has infinite metric uncertainties, would have a likelihood of unity because it subsumes all total maps. The null map is thus always the maximum likelihood partial map, but this trivial conclusion is not useful. A maximum likelihood-based partial mapping procedure would therefore have to provide a principled basis (other than maximization) for choosing the partial map that provides the best compromise between likelihood and specificity. Maximum likelihood methods are often slow due to the computational cost of evaluating the exponentially many total maps. Another form of coupling between CPROP and ML techniques may be possible in which CPROP is applied prior to ML methods to reduce the set of total maps that must be considered.

³ Manuscript in preparation.

7. CONCLUSIONS

We have described CPROP, a software tool for analyzing genetic data that automates certain types of qualitative and quantitative reasoning used in genetic analysis. We have integrated CPROP into a database for managing mapping data and conclusions; this database records any editing judgments that were made to resolve conflicts in the mapping data. CPROP detects and helps identify such conflicts, and the database at all stages preserves the user’s options to exclude or modify any rule-derived conclusions judged inappropriate. The system is designed to support accountability in the integration of data: raw and derived data are linked to references, and when conflicting data are detected and resolved, the record and rationale for the conflict resolution are preserved. For tasks that involve the synthesis of large amounts of data, such as the generation of a genetic map of an organism using data from many experiments, it is important to support accurate representation of uncertainty in the knowledge, easy integration of new data, and accountability of the conclusions.
Mechanization of the reasoning involved in map generation creates two opportunities for accountability: the map can be regenerated at any time, thereby showing that the program plus data do in fact imply the conclusions, and if the inference engine maintains data dependencies on its reasoning, the support for any particular conclusion can be readily examined.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation NSF-DIR9019995 and previously as a Supplement to NSF-BSR8807021.
REFERENCES

Bachmann, B. J. (1983). Linkage map of Escherichia coli K-12, edition 7. Microbiol. Rev. 47: 180-230.
Bachmann, B. J. (1990). Linkage map of Escherichia coli K-12, edition 8. Microbiol. Rev. 54: 130-197.
Bachmann, B. J., and Low, K. B. (1980). Linkage map of Escherichia coli K-12, edition 6. Microbiol. Rev. 44: 1-56.
Bachmann, B. J., Low, K. B., and Taylor, A. L. (1976). Recalibrated linkage map of Escherichia coli K-12. Bacteriol. Rev. 40: 116-167.
Charniak, E., Riesbeck, C., and McDermott, D. (1980). “Artificial Intelligence Programming,” Erlbaum, Hillsdale, NJ.
Davis, E. (1981). “Organizing Spatial Knowledge,” Master’s thesis, Yale University Department of Computer Science.
Dean, T. (1985). “Temporal Imagery: An Approach to Reasoning about Time for Planning and Problem Solving,” PhD thesis, Yale University Department of Computer Science.
Dechter, R., and Pearl, J. (1987). Network-based heuristics for constraint-satisfaction problems. Artif. Intelligence 34: 1-38.
Josephsen, J., Hammer-Jespersen, K., and Hansen, T. D. (1983). Mapping of the gene for cytidine deaminase (cdd) in Escherichia coli K-12. J. Bacteriol. 154: 72-75.
Lander, E., Green, P., Abrahamson, J., Barlow, A., Daly, M., Lincoln, S., and Newburg, L. (1987). Mapmaker: An interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics 1: 174-181.
McAllester, D. (1987). “ONTIC: A Knowledge Representation System for Mathematics,” PhD thesis, MIT.
McDermott, D. (1983). Data dependencies on inequalities. In “Proceedings of AAAI-83,” American Association for Artificial Intelligence, Washington, DC.
Middendorf, A., Schweizer, H., Vreemann, J., and Boos, W. (1984). Mapping of markers in the gyrA-his region of Escherichia coli. Mol. Gen. Genet. 197: 175-181.
Opatrny, J. (1979). The total ordering problem. SIAM J. Comput. 8(1): 111-114.
Sacerdoti, E. D. (1977). “A Structure for Plans and Behavior,” Elsevier North-Holland, New York.
Taylor, A. L., and Adelberg, E. A. (1960). Linkage analysis with very high frequency males of Escherichia coli. Genetics 45: 1233-1243.
Taylor, A. L., and Thoman, M. S. (1964). The genetic map of Escherichia coli K-12. Genetics 50: 659-677.
Taylor, A. L., and Trotter, C. D. (1967). Revised linkage map of Escherichia coli. Bacteriol. Rev. 31: 332-353.