Constrained Markov networks for automated analysis of G-banded chromosomes


Comput. Biol. Med. Vol. 23, No. 2, pp. 105-114, 1993. Printed in Great Britain. 0010-4825/93 $6.00 + .00. © 1993 Pergamon Press Ltd



CONSTRAINED MARKOV NETWORKS FOR AUTOMATED ANALYSIS OF G-BANDED CHROMOSOMES

C. GUTHRIE, J. GREGOR* and M. G. THOMASON

Department of Computer Science, University of Tennessee, Knoxville, TN 37996-1301, U.S.A.

(Received 17 March 1992; received for publication 23 November 1992)

Abstract: Automated analysis of chromosome band patterns using probabilistic Markov networks has been reported in previous work. Band patterns are represented as strings of symbols. Inferred from a set of learning strings, a Markov network is a model of intraband and interband relations in these strings. The inference is entirely data-driven and is accomplished using dynamic programming. This paper presents a new model of chromosome band patterns, the constrained Markov network, which is a special case of its predecessor. Substantial experimental evidence of the superiority of the new model over the old is given in terms of equal results in centromere finding and improved results in classification for the 22 autosomes. Furthermore, a method for simplification of constrained Markov networks is shown to be of considerable importance with respect to computational complexity.

Keywords: Centromere finding; Banded chromosome; Chromosome classification; Dynamic programming; Constrained Markov network; Pattern recognition

1. INTRODUCTION

Chromosome analysis has many applications, e.g. in prenatal amniocentesis examination, in evaluation of malignant diseases like the leukemias, and in monitoring biological effects of environmental mutagens. The normal human chromosome complement consists of 22 pairs of homologous chromosomes (autosomes) and one pair of sex chromosomes. The aim of chromosome analysis is to detect deviations from this so-called karyotype with respect to numerical and structural aberrations. The process involves labeling each chromosome according to its type (1-22, X, Y) in order to identify missing or extra chromosomes as well as distorted parts of individual chromosomes. This is facilitated by staining techniques that provide the chromosomes with type-specific band patterns [1].

This paper reports recent results in automated chromosome analysis based primarily on the structural information conveyed by the band patterns. Markov networks, which are special forms of finite-state Markov chains, constitute the central component of the system. Individual chromosome density profiles of band patterns are filtered and nonlinearly mapped into strings using symbols from a finite alphabet. A Markov network is automatically inferred for each chromosome type as a structural model of intra- and interband relations between successive symbols in a set of learning strings.

Previous work includes (i) classification of individual chromosomes [2,3,4], and (ii) centromere finding [5], i.e. locating the transition from a chromosome’s shorter p-arm into its longer q-arm. While promising experimental results have been achieved, studies have also revealed certain weaknesses of the network topology [6,7]. This paper presents a new model of chromosome band patterns, the constrained Markov network (CMN), and gives substantial empirical evidence of the superiority of the new model over the old.

* Author to whom correspondence should be addressed.


Fig. 1. Example of an unconstrained Markov network (UMN).

To contrast with the new model, the old model is referred to as unconstrained (UMN). Section 2 outlines basics of both UMNs and CMNs. Section 3 describes data material, band pattern representation and general experimental design. Sections 4 and 5 discuss experimental results in centromere finding and chromosome classification by both UMNs and CMNs. Section 6 discusses CMN pruning as a method of model simplification of considerable practical importance. Applied methods and specific experimental design considerations are also addressed. A conclusion is given in Section 7.

2. MARKOV NETWORK BASICS

The consistency and distinctiveness of the band pattern for each normal chromosome type [1] suggest structural models that describe intra- and interband relations. By nature, however, digitized images of banded chromosomes are noisy due to factors such as cell-to-cell variations in length, band widths, number of bands, and density levels. A realistic model must take these variations into consideration. Given a set of learning samples representing the sequential nature of the band pattern as strings of symbols, such models can be built, or inferred, from the data by merging the strings where symbols are alike and allowing alternatives elsewhere. The Markov network inference technique is a mechanism capable of automatically handling strings with common but noisy substrings.

2.1. Unconstrained Markov network (UMN)

A Markov network [8] is a structural model of a pattern class obtained through data-driven inference in which learning strings are optimally aligned one after the other by dynamic programming. For each string-to-network alignment, network modifications necessary to incorporate that string into the network are implemented. The inferred model is a directed graph with each node assigned a unique symbol from the string alphabet. A network has a unique starting node, a final absorbing node, no cycles before absorption, and arcs labeled with frequency counts for relative-frequency estimates of transition probabilities. The dynamic programming cost function applies the relative-frequency estimates and specifically computes an optimal alignment (for modifying the network to incorporate a specific string) to maximize the probability with which the updated Markov network will generate that string. This maximum probability is called the alignment probability, $P_{alg}$. Modifying a network involves incrementing the frequency counts of arcs traversed in the alignment, but it also may require adding nodes and/or arcs to incorporate the new learning string.
The aim of the inference process is to discover and reinforce recurrent structure by updating arc frequency counts so that substrings common to the set of learning strings are emphasized with high probability in a network’s realizations, while random substrings are given low probabilities. The phenomenon is apparent from Fig. 1, which shows the initial part of a UMN inferred from alignment of 100 chromosome p-arms of type 5 using the band pattern representation outlined in Section 3. Nodes without labels are empty nodes used for bookkeeping purposes only, and they output an unobservable symbol.
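The relative-frequency estimate mentioned above can be illustrated with a minimal sketch. It assumes that the probability of an arc is its frequency count divided by the total count of all arcs leaving the same node; the dictionary layout, the node names, and the counts are hypothetical and are not taken from Fig. 1.

```python
from collections import defaultdict


def transition_probabilities(arc_counts):
    """Turn {(source, target): count} into relative-frequency estimates.

    Each arc probability is its count divided by the summed counts of
    all arcs that leave the same source node (an assumed convention).
    """
    totals = defaultdict(int)
    for (source, _target), count in arc_counts.items():
        totals[source] += count
    return {
        arc: count / totals[arc[0]]
        for arc, count in arc_counts.items()
        if totals[arc[0]] > 0
    }


# Hypothetical counts after aligning 100 learning strings.
arc_counts = {
    ("start", "A"): 90,
    ("start", "B"): 10,
    ("A", "="): 90,
    ("B", "="): 10,
}
probs = transition_probabilities(arc_counts)
```

In this toy network the frequently traversed arc out of the starting node receives probability 0.9 and the rare alternative 0.1, which is the sense in which recurrent structure is reinforced and random substrings are given low probability.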


The entire network has 201 arcs and 116 nodes of which 94 are labeled with a symbol from the string alphabet. After inference, computation of string-to-network alignments for band pattern analysis or chromosome classification uses the same dynamic programming cost function as above, but no network modifications are carried out.

2.2. Constrained Markov network (CMN)

A constrained Markov network [9] is a special case of a UMN. The CMN topology is fixed to be a simple concatenation of stages, each of which gives a choice of a symbol from the string alphabet or the unobservable symbol. As with UMNs, transition probabilities are estimated by relative frequencies and used in the dynamic programming cost function for maximizing a string’s alignment probability. String-to-network alignment is computed stage-by-stage, as opposed to node-by-node, so network modification consists of updating arc frequency counts and possibly inserting new stages. Figure 2 shows the first six stages of the CMN resulting from alignment of the same strings used to infer the UMN in Fig. 1. For simplicity, zero-count arcs and nodes are excluded. Within its total of 31 stages, the entire network has 144 nodes, of which 91 are labeled with a symbol from the string alphabet and 33 are empty nodes used only to interconnect stages.

2.3. Computational complexity aspects

Because of the difference in topology, a UMN is computationally more expensive to use than a CMN. The computational complexity is proportional to the number of symbols in an aligned string times, respectively, the number of arcs for a UMN or stages for a CMN. A characteristic of UMNs is that they tend to be “bushy”, i.e. have relatively many arcs in an unsymmetric structure; CMNs tend to be “compact”, i.e. have relatively few stages, and their structure is always a uniform concatenation of stages.
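The stage-by-stage computation and its cost can be made concrete with a simplified sketch. It covers only the alignment of a string with a fixed CMN, i.e. the case after inference in which no stages are inserted and no counts are updated; it is not the inference procedure of [9]. The representation of a stage as a dictionary of counts, the use of `None` for the unobservable symbol, and the example counts are all assumptions made for illustration.

```python
import math

EMPTY = None  # assumed stand-in for the unobservable symbol


def stage_probs(stage_counts):
    """Relative-frequency estimates for the choices within one stage."""
    total = sum(stage_counts.values())
    return {sym: c / total for sym, c in stage_counts.items() if c > 0}


def alignment_log_prob(string, stages):
    """Largest log-probability with which the fixed stages emit `string`.

    Every stage is visited once, in order, and emits either one string
    symbol or the unobservable symbol. Returns -inf when the string
    cannot be generated without modifying the network.
    """
    n = len(string)
    best = [-math.inf] * (n + 1)  # best[j]: first j symbols emitted so far
    best[0] = 0.0
    for counts in stages:  # one pass per stage ...
        probs = stage_probs(counts)
        new = [-math.inf] * (n + 1)
        for j in range(n + 1):  # ... times one step per string position
            if EMPTY in probs and best[j] > -math.inf:
                new[j] = best[j] + math.log(probs[EMPTY])
            if j > 0 and string[j - 1] in probs and best[j - 1] > -math.inf:
                candidate = best[j - 1] + math.log(probs[string[j - 1]])
                if candidate > new[j]:
                    new[j] = candidate
        best = new
    return best[n]


# Hypothetical three-stage network.
stages = [{"A": 9, EMPTY: 1}, {"B": 5, "=": 5}, {EMPTY: 8, "a": 2}]
log_p = alignment_log_prob("AB", stages)
```

The two nested loops make the work proportional to the number of string symbols times the number of stages, which is the CMN cost stated above. Working with log-probabilities is a design choice of this sketch to avoid underflow on long strings.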
As an example, for the networks used in the experiments reported below, a UMN has on average 7 times as many arcs as the corresponding CMN has stages: the range is from 242 to 1017 for UMN arcs and from 35 to 150 for CMN stages.

3. EXPERIMENTAL PRELIMINARIES

Fig. 2. Example of a constrained Markov network (CMN).

3.1. Data material and band pattern representation

The chromosome data are obtained from a database of approximately 7000 G-banded human metaphase chromosomes in p-q orientation [10]. Overlapped and severely bent chromosomes are excluded. The digitized images are processed to obtain idealized, one-dimensional density profiles [11] that emphasize the band pattern along the chromosomes. A manually determined centromere position, which is defined as the transition between the p-arm and the q-arm, is coded into the data. The density profiles are mapped nonlinearly into strings composed of symbols from the alphabet {1,2,3,4,5,6}. These strings are difference coded using the alphabet = for 0, A for +1, a for -1, B for +2, ..., to represent signed differences of successive symbols in a left-to-right string sequence; e.g. string 134411 becomes ABA=c=a when difference coded. For more details on the profile-to-string processing, see references [2,3,5].
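The difference coding of Section 3.1 can be sketched as follows. One detail is an assumption: the worked example maps the six-symbol string 134411 to seven code symbols, which is reproduced here by padding the level string with a 0 at both ends. That padding is inferred from the example, not stated in the text, and the function name is hypothetical.

```python
def difference_code(levels, pad=0):
    """Difference-code a string of density levels such as "134411".

    '=' encodes a difference of 0, 'A', 'B', ... encode +1, +2, ...,
    and 'a', 'b', ... encode -1, -2, ...
    The level sequence is padded with `pad` at both ends (an assumption
    inferred from the worked example in the text).
    """
    values = [pad] + [int(ch) for ch in levels] + [pad]
    coded = []
    for previous, current in zip(values, values[1:]):
        diff = current - previous
        if diff == 0:
            coded.append("=")
        elif diff > 0:
            coded.append(chr(ord("A") + diff - 1))
        else:
            coded.append(chr(ord("a") - diff - 1))
    return "".join(coded)


coded = difference_code("134411")  # "ABA=c=a", as in the text
```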

3.2. General experimental design considerations

Two balanced datasets, $\alpha_i$ and $\beta_i$, are formed for chromosome type $i$, $1 \le i \le 22$, by assigning every other of the 200 mid-length samples from the database to each of the datasets in turn. The length of these 4400 strings ranges from 21 to 106 symbols with an average of 48.2 symbols, and the length of the p-arms ranges from 2 to 52 symbols with an average of 17.4 symbols. Error rates are measured as the mean of error rates obtained by inferring using dataset $\alpha_n$ and testing with dataset $\beta_m$, and vice versa. Within-class results are computed for $n = m$, and between-class results are computed for $n \ne m$.

4. CENTROMERE FINDING

4.1. Method and design

The fact that a chromosome has two arms, p and q, which meet at the centromere can be built into a Markov network by forced landmarking [12]. A forced landmark is a transition arc made to occur in every realization of the network. For the chromosome data, it is created by segmenting the learning strings into their p-arms and q-arms, inferring a separate network for each of the two sets of substrings, then concatenating those into one network with the joining point defining the forced landmark, i.e. the centromere. When a string is aligned with such a Markov network, the location in the string that aligns with the forced landmark becomes an estimate of the string’s centromere position. The centromere position estimate is obtained on the basis of optimal (maximum probability) string-to-network alignment.

Forced landmarking has been shown feasible for centromere finding in a pilot study using chromosomes of types 6-12 and UMNs [5]. Here, all 22 autosomes are used and results for UMNs and CMNs are compared. The chromosome database has recently been used also for assessment of a dedicated, image processing, centromere finding technique called shape profile [13]. The subset of results in that study [14] for the chromosome samples used here serves as a point of reference; however, chromosome type and orientation are assumed known when evaluating Markov network performance, whereas shape profile results are obtained without this information.

To allow for quantization effects, inevitable small inaccuracies in the manually determined centromere positions, and variations in string lengths, estimated centromere positions are accepted as correct if located within the range of ±2 symbols of the encoded position or a distance corresponding to ±5% of the string length.

4.2. Results and discussion

Proportions of correctly estimated centromere positions and mislocations are shown in Table 1.
On average, 96.5%, 97.5% and 95.0% of the centromeres are located correctly using UMNs, CMNs, and shape profiles, respectively. Mislocations are not distributed evenly over individual chromosome types or between methods. Though Markov network and shape profile results should be compared with great caution, as they are derived applying different constraints, it is noteworthy that forced landmarking deals equally well with all chromosome types, while types 21 and 22 show significantly poorer results for the image processing technique.

A detailed examination of the Markov network results reveals that the modest improvement of CMN over UMN derives from an additional 59 strings (distributed over 15 types) for which the estimated centromere position becomes correct and only 12 strings (distributed over 3 types) for which it becomes wrong. Markov network estimated centromeres obtained for UMNs and CMNs are located within -8/+16 and -5/+17 symbols, respectively, from the encoded position. For shape profile, -45/+27 are the corresponding numbers. A negative value means that the centromere is located too far to the left, i.e. in the p-arm, and a positive value refers to mislocation into the q-arm.

Table 1. Proportions of estimated centromere positions. Markov network results are based on known class and orientation, while shape profile results (courtesy of Piper [14]) are not. Correct = % located correctly; p-arm and q-arm = % errors located in the p-arm and q-arm

Type    Unconstrained MN                    Constrained MN                      Shape profile
        Correct     p-arm       q-arm       Correct     p-arm       q-arm       Correct     p-arm       q-arm
1       99.5        0.0         0.5         100.0       0.0         0.0         97.5        0.0         2.5
2       99.0        0.0         1.0         100.0       0.0         0.0         96.5        1.0         2.5
3       98.5        0.0         1.5         99.0        0.0         1.0         98.0        1.0         1.0
4       98.5        1.0         0.5         99.5        0.0         0.5         98.0        1.0         1.0
5       97.5        0.5         2.0         99.0        0.5         0.5         99.5        0.0         0.5
6       97.0        0.5         2.5         98.5        0.5         1.0         99.5        0.0         0.5
7       100.0       0.0         0.0         98.5        1.0         0.5         99.0        0.0         1.0
8       90.5        3.5         6.0         92.5        3.0         4.5         96.0        0.0         4.0
9       95.0        3.5         1.5         93.5        5.0         1.5         99.0        0.0         1.0
10      98.5        1.0         0.5         99.5        0.5         0.0         99.0        0.5         0.5
11      98.0        1.0         1.0         97.5        0.5         2.0         99.5        0.0         0.5
12      100.0       0.0         0.0         98.5        0.5         1.0         98.5        0.5         1.0
13      93.0        3.5         3.5         93.0        4.0         3.0         93.5        0.5         6.0
14      91.5        4.0         4.5         94.5        3.0         2.5         91.5        5.5         3.0
15      89.5        5.0         5.5         94.0        2.0         4.0         90.5        8.5         1.0
16      98.5        1.0         0.5         98.5        0.0         1.5         94.0        0.0         6.0
17      96.5        1.5         2.0         99.0        0.0         1.0         100.0       0.0         0.0
18      97.0        1.5         1.5         99.0        0.5         0.5         93.5        0.5         6.0
19      96.0        2.0         2.0         98.0        1.5         0.5         99.5        0.0         0.5
20      98.5        1.0         0.5         99.5        0.0         0.5         100.0       0.0         0.0
21      93.5        2.5         4.0         97.5        0.0         2.5         69.0        29.0        2.0
22      96.5        0.0         3.5         95.0        0.0         5.0         79.0        19.5        1.5
Avrg    96.5        1.5         2.0         97.5        1.0         1.5         95.0        3.1         1.9
± s.d.  3.1         1.5         1.7         2.5         1.5         1.5         7.5         7.3         1.9
Range   89.5-100.0  0.0-5.0     0.0-6.0     92.5-100.0  0.0-5.0     0.0-5.0     69.0-100.0  0.0-29.0    0.0-6.0

5. CHROMOSOME CLASSIFICATION

5.1. Method and design

Aligning a string with a Markov network for type $i$ produces the maximum probability, $P^{i}_{alg}$, with which the network will generate that string, if modified accordingly. Aligning a string with networks for different types produces a set of such probabilities on the basis of which a classification decision can be made. A normalized probability, $P^{i}_{nrm} = P^{i}_{alg} / P^{i}_{ref}$, has been found useful with UMNs to compensate somewhat for between-class variation in string lengths [2]. Reference probability $P^{i}_{ref}$ is the alignment probability of the most likely realization of the network for type $i$ with that network.

The simplest approach to classification, a maximum likelihood classifier, is used here. Other classification schemes, like nearest-neighbor rule and linear discriminant function, are also possible [4].

For the autosomes, classification is a 22 class problem. However, a length test may be used to select a subset of candidate classes to reduce the problem size. That is, each Markov network may be assigned a “string length acceptance range”, such that the network represents a candidate class for a string if, and only if, the length of the string falls within this range. In the experiments below for which a length test is applied, the acceptance range is the minimum-to-maximum length in a network’s set of learning strings extended by ±1 to take some natural variation and random noise into consideration.* This simple length test yields 33 and 32 sets of candidate classes for the $\alpha$ and $\beta$ datasets, respectively. On average, a string is aligned with 4.8 networks with a minimum of 1 network (for the very long and very short strings only) and a maximum of 8 networks.

* Cell-wise length normalization might yield a more appropriate range per specific cell, but classification of isolated chromosomes is computed here and cell-context is not taken into account.
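The length test and the maximum likelihood decision can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the alignment probabilities are taken as given inputs, the extension of the acceptance range is read as one symbol on each side, and the function names and example numbers are hypothetical.

```python
def acceptance_range(learning_lengths, margin=1):
    """Min-to-max learning string length, extended by `margin` per side."""
    return min(learning_lengths) - margin, max(learning_lengths) + margin


def normalized(p_alg, p_ref):
    """Normalized probability P_nrm = P_alg / P_ref."""
    return p_alg / p_ref


def classify(string_length, scores, ranges=None):
    """Maximum likelihood decision over the candidate classes.

    scores: {chromosome type: P_alg or P_nrm of the string for that type}
    ranges: optional {chromosome type: (low, high)} acceptance ranges;
            when given, only types whose range contains the string
            length are candidates (the length test).
    Returns the winning type, or None if no type is a candidate.
    """
    candidates = [
        t for t in scores
        if ranges is None or ranges[t][0] <= string_length <= ranges[t][1]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: scores[t])


# Hypothetical scores for one long string aligned with two networks.
scores = {1: 1e-20, 20: 1e-12}
ranges = {1: (90, 106), 20: (21, 40)}
without_test = classify(100, scores)       # picks type 20
with_test = classify(100, scores, ranges)  # picks type 1
```

In the example the raw score favors the short type 20, while the length test leaves only type 1 as a candidate. A test that excludes implausible candidate classes in this way also reduces the number of alignments that have to be computed.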


Performance of maximum likelihood classification is evaluated for forced landmark UMNs and CMNs using $P_{alg}$, $P^{lgt}_{alg}$, $P_{nrm}$, and $P^{lgt}_{nrm}$, where superscript lgt denotes use of the length test. Superscript $i$ is dropped for simpler notation.

5.2. Results and discussion

Classification results in Table 2 include average error rate and standard deviation per Denver group, each of which represents a subset of the chromosome types as indicated. For UMNs, the lowest average error rate, 8.5%, is achieved using the normalized alignment probabilities together with the length test ($P^{lgt}_{nrm}$). Without normalization ($P^{lgt}_{alg}$) the error rate is 1.5% higher, while without the length test ($P_{nrm}$) the increase is 11.2%. An unacceptable 41.2% average error rate results from using raw alignment probabilities ($P_{alg}$). Figure 3a illustrates the $P_{alg}$ confusion matrix with marker area being proportional to exact numbers; the many off-diagonal markers indicate non-zero error rates. The main observations are (i) there is considerable variation among the diagonal entries, i.e. among the correct classifications by type, and (ii) longer chromosome types are frequently misclassified as much shorter ones. For example, 90% of the misclassified Denver group A chromosomes are assigned to Denver groups E-G. This general trend is due to a combination of UMN topology and the relative-frequency cost function [7].

The picture is quite different for CMNs. The four classification experiments show about the same average performance with less than 1% variation across them. The lowest error rate, 5.6%, is obtained when applying the length test, with or without normalization. Giving rise to a negligible decrease in average error rate, the main effect of the length test is to narrow the range of the individual error rates by lowering the highest and leaving the lowest unchanged. Analysis of the confusion matrices reveals that misclassifications lie near the main diagonal and are approximately evenly distributed among types either shorter or longer than the correct ones. This is illustrated in Fig. 3b for the $P_{alg}$ confusion matrix.

For comparison of UMNs with CMNs as band pattern models, without other factors influencing classification results, raw alignment probabilities are meaningful. As noted above, the average error rate obtained using UMNs is 41.2%. The corresponding 6.2% for CMNs is virtually as good as with normalization and length test. The implication is that CMNs are superior to UMNs as band pattern models when used for string alignments by probability-maximizing dynamic programming.

Table 2. Classification scores for UMNs and CMNs (% correctly classified; average and range over the 22 types)

Model             Probability          Avrg    Range
Unconstrained MN  $P_{alg}$            58.8    30.5-84.5
Unconstrained MN  $P_{nrm}$            80.3    66.5-92.5
Unconstrained MN  $P^{lgt}_{alg}$      90.0    80.0-98.5
Unconstrained MN  $P^{lgt}_{nrm}$      91.5    77.5-98.0
Constrained MN    $P_{alg}$            93.8    81.5-99.0
Constrained MN    $P_{nrm}$            93.5    80.0-100.0
Constrained MN    $P^{lgt}_{alg}$      94.4    85.0-99.0
Constrained MN    $P^{lgt}_{nrm}$      94.3    85.5-100.0

Fig. 3. Stylized representation of classification confusion matrix for $P_{alg}$ using (a) UMNs or (b) CMNs.

6. CMN PRUNING

6.1. Method and design

Consider a stage in an inferred CMN. Let $p_e$ denote the probability of an unobservable symbol, i.e. $p_e$ is the relative-frequency estimate of making a transition to the empty node. If the stage has high $p_e$, it represents instances in which observable symbols in only a small percentage of the learning strings were aligned; by contrast, low $p_e$ indicates alignment of symbols in a large percentage of the learning strings. A reasonable premise holds that stages with high $p_e$ are due to variability or noise, rather than structure of recurrent substrings inferred from the learning data. Thus, it is possible that deleting a stage with high $p_e$ will actually make the CMN a purer model.

CMN pruning entails deleting from a network all stages with a value of $(1 - p_e)$ at or below a specific level.* For instance, the second stage of the partial CMN in Fig. 2 has $p_e = 0.98$, so pruning at level 0.02 would remove that stage and leave the other five stages of the network concatenated.

Pruning may effectively reduce network size and thereby offer faster computation, but centromere finding and classification performance might suffer. To investigate the impact of pruning, centromere finding and classification experiments are repeated for a number of pruning levels. Relative network size, centromere finding and classification error rates are recorded at each level.
Both experiments use all 22 autosomes. Centromere finding is again carried out for known class. Classification performance is tested for $P_{alg}$ as well as $P^{lgt}_{alg}$. All networks are pruned to the same level in all experiments.

* Note that UMNs cannot be pruned by this simple method because they are not constrained to the form of concatenated stages.
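The pruning rule of Section 6.1 amounts to a filter over stages, sketched below. The representation of a stage as a dictionary of counts with `None` as the unobservable symbol, and the example counts, are assumptions made for illustration. The observable fraction $(1 - p_e)$ is computed directly from the counts rather than as one minus a floating-point $p_e$, so that a stage with $p_e = 0.98$ compares as exactly 0.02.

```python
EMPTY = None  # assumed stand-in for the unobservable symbol


def observable_fraction(stage_counts):
    """Relative-frequency estimate of (1 - p_e) for one stage."""
    total = sum(stage_counts.values())
    return (total - stage_counts.get(EMPTY, 0)) / total


def prune(stages, level):
    """Keep only the stages whose (1 - p_e) lies above the pruning level.

    Stages at or below the level are deleted and the remaining stages
    stay concatenated in their original order.
    """
    return [stage for stage in stages if observable_fraction(stage) > level]


# Hypothetical network; the second stage has p_e = 0.98.
stages = [{"A": 100}, {EMPTY: 98, "B": 2}, {"=": 60, EMPTY: 40}]
pruned = prune(stages, 0.02)
relative_size = 100.0 * len(pruned) / len(stages)
```

Because alignment cost is proportional to the number of stages, the relative size of the pruned network gives a direct indication of the expected saving in computation.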


6.2. Results and discussion

Figure 4 shows average CMN pruning results for relative network size, centromere finding error rates, and classification error rates for $P_{alg}$ and $P^{lgt}_{alg}$. All four are plotted as a function of pruning level. A subset of the exact numbers is listed in Table 3. Individual chromosome type results do not show substantial discrepancies from the average results.

Steep slopes of the relative network size graph indicate that many stages are removed for a small increase in pruning level, or, in other words, many stages exist with the same value of $p_e$. Likewise, a flat slope shows that few stages exist at each pruning level. Thus, one-third of a typical CMN’s stages have high values of $p_e$ (0.9-1.0) and another third have low values (0.0-0.1). The remaining one-third of the stages are quite evenly distributed over a wide range of $p_e$ values (0.1-0.9) with few stages for each value. This pattern appears consistent for all the networks inferred for the chromosome data.

Reduction of the relative network size implies proportionally faster string-to-network computation, but acceptable error rates must be maintained. Relatively minor fluctuations in centromere finding and classification error rates imply that pruning has not yet damaged the network structure that distinguishes internally among p-arms and q-arms and externally among classes; an escalating error rate implies that significant structural information is being removed.

Centromere finding performance drops less than 1% until pruning level 0.5 is reached; this corresponds to removal of 40% of the CMN stages. Thus, even when pruned radically, CMNs retain a considerable amount of detailed structural information in both their p-arms and q-arms.

Classification performance varies less than 1% with removal of stages up to pruning level 0.3 when not using the length test. Again, this allows for a relative reduction of the network size on the order of 40%.
The effect of the length test is a reduction in error rate which initially is small but becomes larger as the error rate escalates.

Fig. 4. Average CMN pruning results as a function of pruning level: relative network size (+), centromere finding error rate (×), and classification error rate for $P_{alg}$ (o) and $P^{lgt}_{alg}$.

Table 3. Average CMN pruning results. Level = pruning level; Size = relative size of network; Centr. = centromere finding error rate; $P_{alg}$ and $P^{lgt}_{alg}$ = classification error rates

Level   Size (%)    Centr. (%)  $P_{alg}$ (%)   $P^{lgt}_{alg}$ (%)
0.00    100.0       2.5         6.2             5.6
0.05    74.7        2.7         6.0             5.4
0.10    70.5        2.7         6.4             5.8
0.20    66.0        2.7         6.6             6.0
0.30    62.7        3.0         6.7             6.1
0.40    60.3        3.2         8.8             8.0
0.50    57.6        3.4         11.9            10.7
0.60    54.8        4.4         19.3            17.0
0.70    51.5        6.5         35.1            28.5
0.80    47.7        10.0        58.6            46.1
0.90    43.0        21.6        83.3            63.9

7. CONCLUSION

The experimental work indicates that CMNs are superior to UMNs as band pattern models when used for string alignments by probability-maximizing dynamic programming. Performance in centromere finding is about the same for the two. In classification of isolated autosomes by type, neither the alignment probability normalization nor the simple length test has significant impact on CMN error rates, but both are crucial for good UMN performance. Computation with CMNs is faster than with UMNs. In addition, a set of CMNs can be pruned substantially without seriously deteriorating centromere finding or classification results to produce even simpler models that maintain high performance.

8. SUMMARY

Given chromosome band patterns represented as strings of symbols, a Markov network can be inferred automatically for each chromosome type from a set of learning strings. The resulting model provides a probabilistic description of intra- and interband relations between successive symbols in the learning strings. This paper presents a new model of chromosome band patterns, the constrained Markov network (CMN), which is a special case of its unconstrained predecessor (UMN). The CMN topology is a simple concatenation of stages, each of which gives a choice of a symbol from the string alphabet or an unobservable symbol. As with UMNs, transition probabilities are estimated by relative frequencies and used in a dynamic programming algorithm for maximizing a string’s alignment probability.

A series of three experiments with the 22 types of autosomes is reported. (i) Centromere finding by forced landmarking yields 96.5% and 97.5% correct results on average for UMNs and CMNs, respectively. (ii) Classification performance is evaluated for alignment probabilities and also with normalization and selection of candidate classes by length test. UMN average results vary from about 60% to 92% depending on whether the heuristics are applied, while CMN average results are about 94% regardless. (iii) CMN pruning, which entails deleting from a network all stages for which the probability of choosing a symbol from the string alphabet is at or below a specific level, indicates that a set of CMNs can be reduced by about 40% without affecting centromere finding and classification performance.
Acknowledgements: The authors wish to thank Jim Piper, Medical Research Council, Edinburgh, for providing shape profile centromere finding results. Prof. Erik Granum, University of Aalborg, Denmark, suggested pruning Markov networks as an area of investigation. Computation on a CRAY Y-MP4/464 was supported by the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.

REFERENCES

1. ISCN, An International System for Human Cytogenetic Nomenclature, Report of the Standing Committee on Human Cytogenetic Nomenclature. Karger, New York (1985).
2. E. Granum and M. G. Thomason, Automatically inferred Markov network models for classification of chromosomal band pattern structures, Cytometry 11, 26-39 (1990).
3. E. Granum, M. G. Thomason and J. Gregor, On the use of automatically inferred Markov networks for chromosome analysis, Automation of Cytogenetics, C. Lundsteen and J. Piper, Eds, pp. 233-251. Springer, Berlin (1989).
4. J. Gregor and M. G. Thomason, Hybrid pattern recognition using Markov networks, IEEE Trans. Pattern Anal. Mach. Intell. (forthcoming).
5. J. Gregor and E. Granum, Finding chromosome centromeres using band pattern information, Comput. Biol. Med. 21, 55-67 (1991).
6. J. Gregor, Inference of Markov networks with forced landmarks, Master’s thesis, Institute of Electronic Systems, University of Aalborg, Denmark (1988).
7. J. Gregor, Aspects of data-driven inference and dynamic programming analysis of pattern structure in strings, Ph.D. thesis, Laboratory of Image Analysis, University of Aalborg, Denmark (1991).
8. M. G. Thomason and E. Granum, Dynamic programming inference of Markov networks from finite sets of sample strings, IEEE Trans. Pattern Anal. Mach. Intell. 8, 491-501 (1986).
9. M. G. Thomason and C. E. Guthrie, Inference of constrained Markov networks, Technical Report CS 92150, Department of Computer Science, University of Tennessee, February (1992).
10. C. Lundsteen, J. Phillip and E. Granum, Quantitative analysis of 6985 digitized trypsin G-banded human metaphase chromosomes, Clin. Genet. 18, 355-370 (1980).
11. E. Granum, Pattern recognition aspects of chromosome analysis: Computerized and visual interpretation of banded human chromosomes, Ph.D. thesis, Laboratory of Electronics, Technical University of Denmark, Lyngby (1980).
12. J. Gregor and E. Granum, String segmentation and classification by forced landmark Markov networks, Int. J. Pattern Recognition Artif. Intell. 5, 413-423 (1991).
13. J. Piper and E. Granum, On fully automatic measurements for banded chromosome classification, Cytometry 10, 242-255 (1989).
14. J. Piper, Shape profile centromere finding results, Personal communication (1990).

About the Author-CHARLES E. GUTHRIE received the B.S. degree from The Ohio State University in 1974 and his M.S. degree from the University of Tennessee, Knoxville, in 1990, where he is pursuing his Ph.D. He has worked for Blue Cross and for Ohio State Life Insurance Co. He is a member of SIAM and is Assistant Professor of Computer Science at Tennessee Technological University.

About the Author-JENS GREGOR was born in Copenhagen, Denmark, on 22 September 1963. He received his M.S. degree in Electrical Engineering in 1988 and his Ph.D. in Technical Science in 1991, both from Aalborg University, Denmark. He has been with the University of Tennessee, Knoxville, as an Assistant Professor of Computer Science since 1991. He is a member of the Danish Pattern Recognition Society and IAPR.

About the Author-MICHAEL G. THOMASON received the B.S. degree from Clemson University, South Carolina, in 1965, his M.S. degree from the Johns Hopkins University, Baltimore, in 1970, and his Ph.D. from Duke University, Durham, in 1973. He has worked for Westinghouse (Baltimore), Perceptics (Knoxville) and research institutes as a consultant. Currently, he is Professor of Computer Science at the University of Tennessee, Knoxville. His research interests include syntactic and structural pattern analysis and application of stochastic processes in various areas of computer science. He is a senior member of IEEE.