Computer Methods and Programs in Biomedicine, 32 (1990) 37-44
Elsevier COMMET 01069
Probabilistic belief networks for genetic counseling
Nomi L. Harris
Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
This paper describes a program, GENINFER, which uses belief networks to calculate risks of inheriting genetic disorders. GENINFER is based on Pearl's (J. Pearl, Artif. Intell. 29 (1986) 241-288) algorithm for fusion and propagation in probabilistic belief networks. It is written in Common Lisp. GENINFER can calculate genotype probabilities for any family affected with any single-gene inherited disorder. Besides considering both negative and positive information in the pedigree, GENINFER takes into account additional information about the specific disorder as well as supplementary information for family members. The output consists of genotype probabilities for all family members and estimated genetic risks for prospective children of the consultands. Belief networks provide a way to calculate probabilities for systems of conditionally dependent variables. The impacts of various pieces of information are propagated and fused in such a way that, when equilibrium is reached, each proposition can be assigned a degree of belief consistent with the axioms of probability theory. In Pearl's algorithm, information is communicated through the network by messages sent between nodes. Pearl's basic algorithm cannot directly handle multiply-connected networks, which arise in the genetic counseling domain whenever a family pedigree includes consanguinity or more than one child per couple. GENINFER makes use of two cycle-breaking methods, clustering and conditioning, to handle these situations.
Keywords: Probabilistic reasoning; Belief networks; Genetic counseling
1. Introduction
Genetic diseases account for a large proportion of birth defects. People with a family history of a genetic disorder may be concerned about the risk that future children will suffer from the disorder. The role of a genetic counselor is to assess a consultand's risk of passing on a genetic disorder and offer advice on the best course of action. The calculations involved can be very complicated, especially for large families. A program that could calculate genetic risks correctly and efficiently by combining different types of data would therefore be quite useful to genetic counselors. This paper describes such a program, GENINFER, which uses belief networks to represent families and propagate information. GENINFER makes use of Pearl's [8] algorithm for fusion and propagation in probabilistic belief networks. A description of any family with a single-gene inherited defect (which may be recessive, dominant, or X-linked) can serve as input to GENINFER. Additional data pertaining to the specific disorder and the possible phenotypes of family members may also be entered; all data are combined in a manner consistent with probability theory. The output of GENINFER is an assessment of the probabilities of each possible genotype for each person in the family, and a risk estimate for future offspring of the consultand (if a consultand is specified).
Correspondence: Nomi L. Harris, Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, Room 415, Cambridge, MA 02139, U.S.A.
0169-2607/90/$03.50 © 1990 Elsevier Science Publishers B.V. (Biomedical Division)
Fig. 1. Pedigree for Betty's family. (Individuals labeled in the figure include Arthur, Anne, Benjamin, Bill and Claude.)
2. Pedigrees
The most important source of information for a genetic counselor is the family history of the consultand, which can be represented by a pedigree. Pedigrees are family tree diagrams showing the incidence of a particular genetic disorder in a family. In pedigree diagrams, men are represented by squares and women by circles. The offspring of a couple are shown hanging from a line drawn between the two members of the couple. Information about phenotypes is shown by coloring in the circles or squares that represent affected individuals. For example, Fig. 1 shows a pedigree for a family affected with hemophilia, an X-linked disorder.
3. Previous programs dealing with genetic risk
Several researchers have investigated computer approaches to genetic counseling, although none has implemented a program using belief networks. PEDIG, written in the 1970s by Heuch and Li [3], handles a subset of the cases that GENINFER can handle. It cannot incorporate multiple sources of evidence, as GENINFER can, nor can it handle families with consanguinity. Spiegelhalter [11] has explored, from a theoretical standpoint, the application of Lauritzen and Spiegelhalter's method [5] to the problem of genetic inheritance, but has not yet implemented this approach. The Lauritzen and Spiegelhalter method should be applicable to the same problems that GENINFER handles. It is unclear, however, whether it would be as efficient, because the algorithm requires overhead time to normalize and triangulate the belief network before information can be propagated. For applications in which a single large belief network is used repeatedly, this overhead time is not significant, but in the genetic counseling domain, a new belief network is constructed for each case. Also, although Spiegelhalter's algorithm has been shown to be efficient for networks with many small cycles [1], it has more trouble with large cycles such as those caused by consanguinity.
4. Propagation and fusion in probabilistic belief networks
Belief networks (also called Bayesian networks, inference nets, or causal nets) provide a way to represent systems of conditionally dependent variables. A belief network consists of a set of nodes, which represent the variables or propositions, connected by directed links, which represent direct relationships between the variables. Belief networks allow the impacts of various pieces of information to be propagated and fused in such a way that, when equilibrium is reached, each proposition can be assigned a degree of belief consistent with the axioms of probability theory. In Pearl's [8] algorithm, the propagation of information through the belief network is accomplished by messages sent between nodes, which carry two parameters, $\pi$ and $\lambda$. $\pi$ represents causal support from a node's ancestors, while $\lambda$ represents diagnostic support from a node's descendants. Pearl's algorithm extends the idea of Bayesian revision to a network of variables; the $\pi$ messages can be regarded as analogous to the priors in Bayes' formula, while the $\lambda$ messages are analogous to likelihoods. In addition to the dynamic information transmitted by the $\pi$ and $\lambda$ messages, each node contains a conditional probability matrix, which characterizes the relationship between the node and its parents. In the genetic counseling domain, the conditional probability matrix is the way the mode of inheritance (recessive, dominant, or X-linked) is encoded.
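To illustrate what such a conditional probability matrix might look like, the following Python sketch builds the table P(child genotype | mother genotype, father genotype) for an autosomal single-gene locus from Mendel's law of segregation. The genotype encoding, the function names and the dictionary layout are assumptions made for illustration only; they are not taken from GENINFER, which is written in Common Lisp. An X-linked locus would need separate tables for sons and daughters, which this sketch does not cover.

```python
from itertools import product

# Assumed encoding: a genotype is the number of defective alleles carried.
# 2 = homozygous affected, 1 = heterozygous, 0 = homozygous normal.
GENOTYPES = (2, 1, 0)


def transmit_prob(genotype):
    """Probability that a parent with this genotype passes on the defective allele."""
    return genotype / 2.0


def autosomal_cpt():
    """Return {(mother, father): {child: P(child | mother, father)}}."""
    cpt = {}
    for m, f in product(GENOTYPES, repeat=2):
        pm, pf = transmit_prob(m), transmit_prob(f)
        cpt[(m, f)] = {
            2: pm * pf,                          # defective allele from both parents
            1: pm * (1 - pf) + (1 - pm) * pf,    # defective allele from exactly one parent
            0: (1 - pm) * (1 - pf),              # normal allele from both parents
        }
    return cpt


CPT = autosomal_cpt()
print(CPT[(1, 1)])  # two heterozygous parents: {2: 0.25, 1: 0.5, 0: 0.25}
```

Each row of the table sums to one, so it can be used directly as the conditional distribution of a child node given its two parent nodes.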
4.1. Converting pedigrees to belief networks
In order to use Pearl's algorithm for pedigree analysis, the pedigree is converted to a belief network in which the nodes represent people in the family and the links between nodes represent parent-child relationships. Each of the three possible genotypes (homozygous affected, heterozygous, and homozygous normal) is considered as a 'hypothesis' for the genotype of a node.
Fig. 2. Belief network for Betty's family.
4.2. Initializing the parameters
Before information can be propagated through the network, the network must be initialized to reflect the evidence that is initially available. Evidence pertaining to individuals' genotypes is provided by their phenotypes. This evidence is represented by attaching a dummy leaf to each person node. For example, Fig. 2 shows the network that would be constructed for Betty's family (Fig. 1), including the dummy leaves. In effect, a dummy leaf represents the phenotype of its parent, while the parent itself represents the genotype. The dummy leaf sends a $\lambda$ message to its parent to communicate what is known about the genotype based on the phenotype. The meaning of each value $\lambda_i$ in the initial $\lambda$ message is $P(\text{phenotype} \mid \text{genotype}_i)$, i.e., the probability that we would see the observed phenotype if the genotype of the person were $i$. If the pedigree contains members whose phenotypes are not known, GENINFER permits their phenotypes to be specified as 'unknown'. If an individual of unknown phenotype is a root node, it sends an initial $\pi$ message that reflects the background level of the disease in the population. I assumed that the genotype distribution of the population follows the Hardy-Weinberg equilibrium, i.e., if the frequency of the defective allele is $p$, and the frequency of the normal allele is $q$ (where $q = 1 - p$), then $p^2$ of the population is homozygous affected, $2pq$ is heterozygous, and $q^2$ is homozygous normal. The value of $p$ differs for different diseases and different populations. GENINFER allows the user to specify $p$ for each case.
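The initialization just described can be sketched in a few lines of Python. This is an illustrative sketch rather than GENINFER's code: the function names, the tuple ordering (homozygous affected, heterozygous, homozygous normal), the restriction to autosomal disorders, and the assumption of full penetrance are all choices made here for brevity.

```python
def hardy_weinberg_prior(p):
    """Genotype distribution (hom. affected, heterozygous, hom. normal)
    for a defective-allele frequency p, under Hardy-Weinberg equilibrium."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def initial_lambda(phenotype, mode="recessive"):
    """lambda_i = P(observed phenotype | genotype i), assuming 100% penetrance.

    `mode` is 'recessive' or 'dominant' (autosomal only in this sketch).
    """
    if phenotype == "unknown":
        return (1.0, 1.0, 1.0)  # carries no information about the genotype
    # P(affected | genotype) for each of the three genotypes
    shows = (1.0, 0.0, 0.0) if mode == "recessive" else (1.0, 1.0, 0.0)
    if phenotype == "affected":
        return shows
    if phenotype == "unaffected":
        return tuple(1.0 - s for s in shows)
    raise ValueError("phenotype must be 'affected', 'unaffected' or 'unknown'")


prior = hardy_weinberg_prior(0.01)       # roughly (0.0001, 0.0198, 0.9801)
lam = initial_lambda("unaffected")       # (0.0, 1.0, 1.0) for a recessive disorder
```

An 'unknown' phenotype yields a vector of ones, so multiplying it into a belief leaves the belief unchanged after normalization.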
4.3. Propagation and fusion
Once initial values for $\pi$ and $\lambda$ have been assigned, the information represented by these vectors can be propagated throughout the network (see Fig. 3). When a node receives a new $\pi$ message, it sends $\pi$ messages to its children and a $\lambda$ message to its spouse. When a node receives a new $\lambda$ message, it sends $\pi$ messages to its siblings and $\lambda$ messages to its parents. Note that Pearl's algorithm is distributed: messages are passed between nodes, not through any central control. If a node receives a message containing information that it has already seen, it does not send messages to its neighbors. In this way, the network eventually reaches equilibrium (in time linearly proportional to the diameter of the network). The final messages can then be used to calculate genotype probabilities for each person. The belief that a person has genotype $k$ is the product of $\lambda_k$ and $\pi_k$ on the link from that person's dummy leaf, multiplied by a scale factor, $\alpha$, which normalizes the probabilities.
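The final belief computation, BEL(k) = α λ_k π_k, amounts to an element-wise product followed by normalization. A minimal Python sketch, with a hypothetical function name and made-up example numbers, might look like this:

```python
def belief(pi, lam):
    """Return the normalized element-wise product of a pi and a lambda vector.

    The scale factor alpha is 1 / sum_k(lambda_k * pi_k).
    """
    products = [p * l for p, l in zip(pi, lam)]
    total = sum(products)
    if total == 0.0:
        raise ValueError("evidence is inconsistent with the causal support")
    return [x / total for x in products]


# Illustrative numbers only: Hardy-Weinberg prior for p = 0.1, and the lambda
# vector of an unaffected person under a fully penetrant recessive disorder.
pi = [0.01, 0.18, 0.81]   # (hom. affected, heterozygous, hom. normal)
lam = [0.0, 1.0, 1.0]
bel = belief(pi, lam)     # approximately [0.0, 0.1818, 0.8182]
```

The case in which every product is zero is raised as an error here; it corresponds to evidence that contradicts the rest of the network.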
Fig. 3. Propagation of messages in belief network.
4.4. Advantages of using belief networks for genetic counseling
There are several reasons why belief networks are a good approach for a genetic counseling program. They provide a method that works for any family with any single-gene inherited defect. All available information, both positive and negative, is taken into account. Information outside of the pedigree itself, such as the results of enzyme tests, can be incorporated orthogonally, without disrupting the structure of the underlying family network (see Section 6.3). This supplementary information is automatically fused with the pedigree data to yield correct combined probabilities. Missing information, such as unknown phenotypes, can be handled consistently. The background risk of a disorder can be specified as input to GENINFER, which allows it to take advantage of increased knowledge about the prevalence of the disease in the population of interest. GENINFER can also handle disorders with incomplete penetrance or age-dependent presentation (see Sections 6.1 and 6.2). A key limitation of many programs or procedures for calculating genetic risk is that they cannot be used on families with consanguinity. I have extended Pearl's basic algorithm to handle such families. The methods I used to handle these multiply-connected family networks are described in the next section.
5. Dealing with cycles in belief networks
The use of Pearl's propagation method is limited to singly connected graphs, i.e., graphs with at most one path between any two nodes. Because propagation of information is not under central control, information could cycle indefinitely if there were cycles in the network. Belief networks for families are not always singly connected; they may have two types of cycles. Certain families have cycles caused by consanguinity (for example, if two cousins marry). Another type of cycle is more ubiquitous: it appears every time two parents have two or more children in common. These cycles are an artifact of a representation that connects each child with both of its parents. If there are two children, this will lead to an undirected figure-eight cycle (see Fig. 4). GENINFER uses two loop-breaking techniques, clustering and conditioning, to deal with the two types of cycles that can appear in family networks. Although conditioning can be used to break any cycle, its exponential time complexity makes it computationally undesirable. Clustering is efficient only for cycles containing a small number of nodes. GENINFER therefore uses a combination of both approaches.
Fig. 4. Types of cycles in family networks (panels: consanguinity; artifact).
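Whether a family network is singly connected can be checked by looking for cycles in the undirected version of the graph. The Python sketch below uses a union-find structure for this; the function name and the node labels are hypothetical, and the edge lists are assumed to contain no duplicate edges. It shows that two parents with one child form no cycle, that a second child creates the artifact cycle, and that routing both children through a single intermediate node joining the parents (the idea behind clustering) removes it.

```python
def has_undirected_cycle(edges):
    """Return True if the undirected graph given by `edges` contains a cycle."""
    parent = {}

    def find(x):
        # Find the representative of x's component (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # a and b were already connected: this edge closes a loop
        parent[root_a] = root_b
    return False


one_child = [("Mother", "Child1"), ("Father", "Child1")]
two_children = one_child + [("Mother", "Child2"), ("Father", "Child2")]
clustered = [("Mother", "Unit"), ("Father", "Unit"),
             ("Unit", "Child1"), ("Unit", "Child2")]

print(has_undirected_cycle(one_child))     # False
print(has_undirected_cycle(two_children))  # True
print(has_undirected_cycle(clustered))     # False
```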
5.1. Clustering
In clustering, instead of connecting each child directly to its parents, an intermediate node is introduced. I call this node a parental unit; Spiegelhalter [11] refers to it as a marriage node. The parental unit contains no new information, but rather combines the information provided by the parents and passes it on to the children. The parental unit structure is flexible enough to accommodate families with remarriages and half-siblings, because each person can be connected to more than one parental unit. As Fig. 5 illustrates, the addition of a parental unit breaks up the figure-eight cycle. Note that each person node must still be assigned a dummy leaf, which is connected directly to it. Because they contain no phenotypic information, parental units are not assigned dummy leaves.
Fig. 5. Clustered family network (parental unit added).
The use of clustering eliminates looping due to artifactual cycles. However, as all possible combinations of propositions from the individual nodes in a cluster must be represented in the 'supernode' that comprises the cluster, this method is not practical for large cycles such as those that result from matings between related individuals. These cycles can be broken by conditioning the network.
5.2. Conditioning
A multiply-connected belief network is conditioned by selecting a loop-cutset and considering all possible combinations of values that nodes in the loop-cutset can take on [13]. Conditioning is sometimes referred to as reasoning by assumptions, because for each configuration of the loop-cutset, we are assuming that the nodes in the loop-cutset have those values, and reasoning about the rest of the network based on those assumptions. By considering each possible case separately, conditioning prevents infinite looping without loss of information. Because conditioning breaks the cycles in a multiply-connected network, evidence can be propagated in the conditioned network in the normal manner. The resulting beliefs are then weighted by the joint probability of the instantiated nodes in the loop-cutset. Given a piece of evidence $E$ and a loop-cutset consisting of nodes $C_1, \ldots, C_n$, then for any node $A$,

$$P(A \mid E) = \sum_{v_1, \ldots, v_n} P(A \mid E, C_1 = v_1, \ldots, C_n = v_n) \times P(C_1 = v_1, \ldots, C_n = v_n \mid E)$$

[13], where $v_1, \ldots, v_n$ are the possible values that the loop-cutset nodes can take on. $P(A \mid E, C_1 = v_1, \ldots, C_n = v_n)$ can be calculated by running Pearl's algorithm on the conditioned network. The calculation of the joint probability of the loop-cutset given evidence $E$, $P(C_1 = v_1, \ldots, C_n = v_n \mid E)$, will be discussed shortly.
5.2.1. Choosing a loop-cutset
A loop-cutset must contain at least one node from every cycle in the network, with the additional constraint that a loop-cutset node may not have more than one parent in the same cycle. (If a loop-cutset node is the child of more than one other node in the loop, it will receive top-down information more than once, leading to incorrect updating.)
5.2.2. How to condition
Once the loop-cutset has been selected, the network must be physically disconnected at the nodes in the loop-cutset in order to break the cycles. The next step is to instantiate all the nodes in the loop-cutset and run the propagation algorithm once for each such instantiation, of which there will be an exponential number: $|\text{genotypes}|^{|\text{cutset}|}$. (In this domain, the loop-cutset very seldom contains more than one or two nodes, so this exponential complexity is not a major problem.) In order to find the conditioned beliefs for each node, we sum the products of the values found for each loop-cutset instantiation and the weights of the loop-cutset instantiations:

$$\mathrm{BEL}(A_i) = \sum_{k=1}^{|G|^{|C|}} \mathrm{BEL}_k(A_i)\, P(C_k)$$

where $\mathrm{BEL}_k(A_i)$ is the belief found under $C_k$, the $k$th instantiation of the loop-cutset, and

$$P(C_k) = \prod_{j=1}^{|C|} P(C_j = v_j),$$

where $v_j$ is the value assigned to loop-cutset node $C_j$ in the $k$th instantiation.
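The weighted sum over loop-cutset instantiations can be sketched in Python as follows. This is an illustration under stated assumptions, not GENINFER's implementation: `propagate` is a hypothetical callback standing in for a run of Pearl's algorithm on the cut network, the per-node cutset probabilities are assumed to be available already, and the weight of each instantiation is taken to be the product of those per-node probabilities, as in the product formula of Section 5.2.2.

```python
from itertools import product


def condition(cutset_priors, propagate):
    """Combine beliefs over all loop-cutset instantiations.

    cutset_priors: one dict per cutset node, mapping value -> P(C_j = value).
    propagate:     hypothetical callback; given a tuple of cutset values, it
                   returns the belief vector of the node of interest in the
                   network conditioned on those values.
    """
    combined = None
    for values in product(*(prior.keys() for prior in cutset_priors)):
        weight = 1.0
        for prior, value in zip(cutset_priors, values):
            weight *= prior[value]          # P(C_k) = prod_j P(C_j = v_j)
        if weight == 0.0:
            continue                        # impossible instantiation: skip the run
        beliefs = propagate(values)
        if combined is None:
            combined = [0.0] * len(beliefs)
        for i, b in enumerate(beliefs):
            combined[i] += weight * b
    return combined


# Toy example with one cutset node and made-up numbers.
cutset_priors = [{"affected": 0.0, "heterozygous": 0.5, "normal": 0.5}]
toy_results = {
    ("heterozygous",): [0.0, 0.5, 0.5],
    ("normal",): [0.0, 0.0, 1.0],
}


def toy_propagate(values):
    return toy_results[values]


result = condition(cutset_priors, toy_propagate)
print(result)  # [0.0, 0.25, 0.75]
```

With three genotypes per node, the loop over `product(...)` runs 3 to the power of the cutset size, which is the exponential cost mentioned above.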
5.2.3. Calculating joint probabilities of loop-cutset instantiations
There is a problem with the formula above: how do we know what $P(C_j = v_j)$ is when we cannot run the propagation algorithm on the intact network? Suermondt [12] has derived a method for calculating joint probabilities for loop-cutset instantiations. First, the nodes in the network are ordered according to the 'is-a-predecessor-of' relationship; this can be accomplished by a topological sort. Then the initial beliefs, or priors, for each node are calculated as follows. If a node has no predecessors, its prior is simply the normalized product of the $\pi$ and $\lambda$ vectors on the link to its dummy leaf. If a node has predecessors, we will already have calculated their priors because of the order in which we are processing the nodes. The prior for node $A$ then becomes:

$$\mathrm{Prior}(A_i) = \sum_{j,k} P(A_i \mid \mathrm{Mother}_j, \mathrm{Father}_k) \times \mathrm{BEL}(\mathrm{Mother}_j)\, \mathrm{BEL}(\mathrm{Father}_k)$$

These priors are used when calculating the joint probabilities of loop-cutset instantiations.
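The ordering-then-summing procedure can be sketched in Python using the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later). The sketch makes several simplifying assumptions that are not in the source: an autosomal locus, a single prior shared by all founders (instead of the normalized product of each founder's π and λ vectors), and hypothetical names for the functions and family members.

```python
from graphlib import TopologicalSorter
from itertools import product

GENOTYPES = (2, 1, 0)  # assumed encoding: number of defective alleles


def child_distribution(mother, father):
    """P(child genotype | parental genotypes) for an autosomal locus."""
    pm, pf = mother / 2.0, father / 2.0
    return {2: pm * pf,
            1: pm * (1 - pf) + (1 - pm) * pf,
            0: (1 - pm) * (1 - pf)}


def compute_priors(parents, founder_prior):
    """parents: {person: (mother, father)} for every non-founder.
    founder_prior: {genotype: probability}, used for everyone without parents."""
    prior = {}
    # static_order() yields each person only after both of their parents.
    for person in TopologicalSorter(parents).static_order():
        if person not in parents:
            prior[person] = dict(founder_prior)
            continue
        mother, father = parents[person]
        dist = {g: 0.0 for g in GENOTYPES}
        for j, k in product(GENOTYPES, repeat=2):
            weight = prior[mother][j] * prior[father][k]
            for g, prob in child_distribution(j, k).items():
                dist[g] += prob * weight
        prior[person] = dist
    return prior


family = {"Child": ("Mother", "Father"), "Grandchild": ("Child", "Spouse")}
priors = compute_priors(family, {2: 0.25, 1: 0.5, 0: 0.25})
print(priors["Grandchild"])
```

As in the formula, the two parents are treated as independent when their beliefs are multiplied.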
6. Incorporating additional information
The facilities for calculating genetic risk that I have described thus far rely only on simple phenotypic evidence in the pedigree (i.e., affected vs. unaffected) and on the background risk of the disorder. GENINFER is capable of incorporating other sources of information pertaining to the disorder and to individual family members.
6.1. Penetrance
In some genetic disorders, there may be individuals who have affected genotypes, yet appear normal. These people can pass on the defective allele to their children. The probability that a person with a defective gene will exhibit the defect is called the penetrance of the gene. GENINFER allows each disease to be assigned a penetrance probability between 0 and 100%. This probability is then used when assigning the initial $\pi$ and $\lambda$ values. If the penetrance probability is not specified by the user, the program assumes 100% penetrance.
6.2. Age-dependent expressivity
Some genetic disorders do not reveal their presence until the affected individual reaches adulthood. The most familiar example of this kind of late-onset genetic defect is Huntington's disease, which is caused by a dominant gene. People with the Huntington's gene appear normal until some time in middle age, when the devastating symptoms begin to appear. Because the presentation of symptoms of Huntington's disease is age-dependent, the ages of the people in the family are relevant in the consultation. GENINFER can take this information into account if it is supplied with data about the percentage of people who express the disorder at each age range. The age-dependent probabilities of presentation are handled in a manner similar to penetrance probabilities, since the probability that the disorder is expressed at a given age is equivalent to its penetrance at that age.
6.3. Supplementary information
Phenotypic information can take more than one form. In the simplest cases, it may be clear from simple observation whether an individual has the disorder of interest. Sometimes, however, there may be other sources of information, such as enzyme levels, that indicate the presence of a defective allele. Supplementary information is included in the network by allowing each person node to have more than one dummy leaf. Each dummy leaf represents some knowledge we have about the person. The information is entered by the user in the form $P(\text{finding} \mid \text{genotype}_i)$. It is not necessary for the user to perform a Bayesian revision on the data, because this is done automatically by Pearl's algorithm. Genotype probabilities are then calculated by multiplying together the $\pi$ and $\lambda$ vectors on each of the dummy leaves attached to a person node.
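As a rough illustration of how penetrance and supplementary findings might combine, the Python sketch below multiplies a prior by one likelihood vector per dummy leaf and normalizes the result. The function names are hypothetical, the disorder is assumed to be autosomal dominant, and the penetrance value and enzyme-test likelihoods are invented numbers chosen only to make the arithmetic easy to follow.

```python
def phenotype_lambda(affected, penetrance=1.0):
    """P(phenotype | genotype) for a dominant disorder with the given penetrance.
    Order: (hom. affected, heterozygous, hom. normal)."""
    expressed = (penetrance, penetrance, 0.0)   # P(affected | genotype)
    if affected:
        return expressed
    return tuple(1.0 - e for e in expressed)


def fuse(prior, *lambdas):
    """Multiply the prior by every dummy-leaf likelihood vector, then normalize."""
    post = list(prior)
    for lam in lambdas:
        post = [p * l for p, l in zip(post, lam)]
    total = sum(post)
    if total == 0.0:
        raise ValueError("the findings are mutually inconsistent")
    return [x / total for x in post]


# Child of one heterozygous and one homozygous normal parent.
prior = (0.0, 0.5, 0.5)
looks_normal = phenotype_lambda(affected=False, penetrance=0.8)  # about (0.2, 0.2, 1.0)
enzyme_test = (0.9, 0.9, 0.3)   # made-up P(test result | genotype)

posterior = fuse(prior, looks_normal, enzyme_test)  # approximately [0.0, 0.375, 0.625]
```

Because the leaves enter only as multiplicative factors, adding another finding means passing one more vector to `fuse`, without changing anything else in the family network.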
7. Input and output of GENINFER
GENINFER's user interface prompts users to enter information about the genetic disorder being investigated, and then lets them enter data for individuals in the pedigree. The user is asked to enter the family name, disorder, inheritance type, background risk, penetrance, etc. For each family member, the user is asked to enter the individual's gender, phenotype (which may be 'unknown'), parents, and any additional evidence that is available, such as the results of enzyme tests. The user can also specify a particular consultand and, optionally, the consultand's spouse or partner. The output of GENINFER is a list of genotype probabilities for each family member. If a consultand has been specified, GENINFER calculates the consultand's risk of bearing an affected child. (For X-linked disorders, separate risks are calculated for male and female offspring.) For example, the table of genotype probabilities that GENINFER outputs for Betty's family (Fig. 1) is shown in Fig. 6.
    Genotype probabilities for BETTY-FAMILY family:
    PERSON          HOMOZYGOUS AFFECTED   HETEROZYGOUS   HOMOZYGOUS NORMAL
    hypoth-male     0.16667               0.00000        0.83333
    hypoth-female   0.00000               0.16667        0.83333
    ARTHUR          0.00000               0.00000        1.00000
    ANNE            0.00000               1.00000        0.00000
    BENJAMIN        1.00000               0.00000        0.00000
    BILL            1.00000               0.00000        0.00000
    BETTY           0.00000               0.33333        0.66667
    BOB             0.00000               0.00000        1.00000
    CLAUDE          0.00000               0.00000        1.00000
    Consultands BETTY and BOB are concerned about the risk of passing on HEMOPHILIA, an X-LINKED disorder, to future offspring.
    After analyzing all available information, I have assessed the risks as follows:
    Female offspring have a 0% chance of being affected with HEMOPHILIA and a 17% chance of being carriers.
    Male offspring have a 17% chance of being affected and a 83% chance of being normal.
Fig. 6. Output of GENINFER on Betty's family (see Fig. 1).
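The offspring risks reported in Fig. 6 can be reproduced by hand from Betty's genotype probabilities. The following Python sketch does this for an X-linked recessive disorder; the function name and argument layout are hypothetical, and the calculation treats the father as either certainly affected or certainly unaffected, which matches Bob's row in Fig. 6 but is a simplification in general.

```python
def x_linked_offspring_risk(mother, father_affected=False):
    """Risks for future children under X-linked recessive inheritance.

    mother: (P(hom. affected), P(heterozygous), P(hom. normal)).
    Returns (sons, daughters) as dicts of outcome probabilities.
    """
    hom_affected, heterozygous, _ = mother
    # Probability that the mother passes on a defective X chromosome.
    t = hom_affected + 0.5 * heterozygous
    # Sons receive their only X from the mother.
    sons = {"affected": t, "normal": 1.0 - t}
    # Daughters receive one X from each parent.
    if father_affected:
        daughters = {"affected": t, "carrier": 1.0 - t, "normal": 0.0}
    else:
        daughters = {"affected": 0.0, "carrier": t, "normal": 1.0 - t}
    return sons, daughters


# Betty's row in Fig. 6: heterozygous with probability 1/3; Bob is unaffected.
sons, daughters = x_linked_offspring_risk((0.0, 1 / 3, 2 / 3))
print(round(sons["affected"], 5), round(daughters["carrier"], 5))  # 0.16667 0.16667
```

The value 1/6 agrees with the hypothetical male and female rows of Fig. 6 and with the 17% figures in the summary text.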
7.1. Explaining anomalies
Sometimes the information provided to GENINFER by a user contains apparent inconsistencies. For example, the child of two unaffected parents may be identified as exhibiting a dominant disorder (one with 100% penetrance, let us assume). Situations of this type cause all of the beliefs calculated by GENINFER to come out to zero for one or more individuals. If this occurs, the location in the pedigree of the unexpected event is pinpointed, and possible explanations for the apparent anomaly are proposed. In the situation just described, the following explanations would be proposed:
- The penetrance of the disease is not really 100%; a parent may have the gene and yet not express it.
- The putative parents of the affected child are not the actual biological parents.
- The mutation rate of the disorder is non-zero; a spontaneous mutation occurred in the affected child.
- The user made one or more errors when entering the data.
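The zero-belief condition that triggers this explanation step is easy to detect before normalization. A minimal Python sketch, with hypothetical names and with the vectors for the example above written out by hand, could be:

```python
def find_anomalies(unnormalized_beliefs):
    """Return the people whose unnormalized belief vectors are entirely zero."""
    return [name for name, vector in unnormalized_beliefs.items()
            if not any(vector)]


# Dominant disorder, 100% penetrance, order (hom. affected, het., hom. normal).
# Unaffected parents must be homozygous normal, so the child's pi is (0, 0, 1),
# while the affected child's lambda is (1, 1, 0): every product is zero.
child_pi = (0.0, 0.0, 1.0)
child_lambda = (1.0, 1.0, 0.0)
beliefs = {
    "Parent1": (0.0, 0.0, 1.0),
    "Parent2": (0.0, 0.0, 1.0),
    "Child": tuple(p * l for p, l in zip(child_pi, child_lambda)),
}
print(find_anomalies(beliefs))  # ['Child']
```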
8. Conclusions
The Bayesian calculations that must be performed in order to advise consultands about their probable risk can be very complex. However tempting it may be to the genetic counselor to neglect these calculations, it is essential to perform them correctly and completely in order to give consultands an accurate assessment. As Edmond Murphy, a proponent of Bayesian methods in genetic counseling, phrased it,
"There can be no doubt but that an exhaustive analysis of a pedigree, even when the mode of inheritance is simple, may itself be complicated. In the practical situation, the ideal method may not be applied because the counselor either becomes lost in the logic or finds the method tedious ... I suggest that if they cannot find the time to do the calculations themselves, they should delegate the job to someone else." ([6], p. 395)
Murphy may not have had a computer in mind when he suggested delegating the arduous calculations required by a consultation to 'someone else', but in many respects a computer is the ideal entity for such tasks. If risk calculations are taken care of by a program such as GENINFER, genetic counselors will be able to devote more time and energy to the human side of genetic counseling.
Acknowledgements
I am grateful to Stephen Pauker for suggesting the project, Peter Szolovits for advising me and offering helpful comments, Susan Pauker for domain expertise, Michael Wellman for invaluable help with Pearl's algorithm and conditioning, Jaap Suermondt for providing me with a pre-publication draft of his paper about conditioning, and Judith Harris for suggesting the name GENINFER.
References
[1] I.A. Beinlich, H.J. Suermondt, R.M. Chavez and G.F. Cooper, The alarm monitoring system: a case study with two probabilistic inference techniques for belief networks, in: Lecture Notes in Medical Informatics: AIME 89, eds. J. Hunter, J. Cookson and J. Wyatt, pp. 247-256 (Springer-Verlag, New York, 1989).
[2] G.F. Cooper, Expert Systems Based on Belief Networks - Current Research Directions. Memo KSL 87-51 (Knowledge Systems Laboratory, Stanford University, Stanford, CA, 1987).
[3] I. Heuch and F.H.F. Li, PEDIG - a computer program for calculation of genotype probabilities using phenotype information, Clin. Genet. 3 (1972) 501-504.
[4] J. Hilden, Computerized derivations of Mendelian probability formulae: the GENEX processor, in: Nordic Symposium in Applied Statistics and Data Processing, eds. A. Hoskuldsson et al., pp. 395-410 (NEUCC-Technical University of Denmark, Lyngby, 1982).
[5] S.L. Lauritzen and D.J. Spiegelhalter, Local computations with probabilities on graphical structures and their application to expert systems, J. R. Stat. Soc. B 50 (1988).
[6] E.A. Murphy, How much difference does the use of Bayesian probability make? in: Genetic Counseling, eds. H.A. Lubs and F. de la Cruz (Raven Press, New York, 1977).
[7] E.A. Murphy and G.A. Chase, Principles of Genetic Counseling (Year Book Medical Publishers, Chicago, IL, 1975).
[8] J. Pearl, Fusion, propagation, and structuring in belief networks, Artif. Intell. 29 (1986) 241-288.
[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, New York, 1988).
[10] H.U. Prokosch, S.A. Seuchter, E.A. Thompson and M.H. Skolnick, Applying expert system techniques to human genetics, Comput. Biomed. Res. (1988) (submitted).
[11] D.J. Spiegelhalter, Fast algorithms for probabilistic reasoning in influence diagrams, with applications in genetics and expert systems, in: Conference on Influence Diagrams (University of California, Berkeley, CA, 1988).
[12] H.J. Suermondt and G.F. Cooper, Initialization for the Method of Conditioning. Memo KSL 89-29 (Knowledge Systems Laboratory, Stanford University, Stanford, CA, 1989).
[13] H.J. Suermondt and G.F. Cooper, Updating probabilities in multiply-connected belief networks, in: The Fourth Workshop on Uncertainty in Artificial Intelligence, pp. 335-343 (University of Minnesota, St. Paul, MN, 1988).