Automated Detection of Hereditary Syndromes Using Data Mining



COMPUTERS AND BIOMEDICAL RESEARCH 30, 337–348 (1997)
Article No. CO971454

Steven Evans,*,†,1 Stephen J. Lemon,*,‡ Carolyn A. Deters,† Ramon M. Fusaro,*,‡ and Henry T. Lynch*,‡

*Hereditary Cancer Institute and ‡Department of Preventive Medicine and Public Health, Creighton University School of Medicine, Omaha, Nebraska; and †Oncormed, Inc., 2027 Dodge Street, Suite 402, Omaha, Nebraska 68102

Received January 27, 1997

Computer-based data mining methodology applied to family history clinical data can algorithmically create highly accurate, clinically oriented hereditary disease pattern recognizers. For the example of hereditary colon cancer, the data mining's selection of relevant factors to assess for hereditary colon cancer was statistically significant (P < 0.05). All final recognizer-formulated patterns of hereditary colon cancer were independently confirmed by a clinical expert. Applied to previously analyzed family histories, the recognizer identified the definitive hereditary histories, correctly responded negatively to the putative hereditary histories, and correctly responded negatively to empirically elevated colon cancer risk situations. This capability facilitates patient selection for DNA studies in search of gene mutations. When genetic mutations are included as parameters in a patient database for a genetic disease, the process yields an expert system which characterizes variations in clinical disease presentations in terms of genetic mutations. Such information can greatly improve the efficiency of gene testing. © 1997 Academic Press

1 To whom correspondence should be addressed at Oncormed, Inc., 2027 Dodge Street, Suite 402, Omaha, NE 68102. Fax: (402) 341-9704.

0010-4809/97 $25.00. Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.

INTRODUCTION

The purpose of this paper is to introduce and describe the performance of an innovative, informatics-guided process that permits the automatic construction of an expert rule-based system which can identify hereditary patterns for inherited disorders. To demonstrate the process, we focus on inherited cancer in general and colorectal carcinoma in particular, giving some results as a case example.

Identification of hereditary cancer patterns is predicated upon a well-orchestrated family history (i.e., the encountered cancer patients with descriptions of the cancer occurrences in their family trees) in order to reveal the role genetics has played in a cancer family history (1). Particularly in oncology, recognition of the role of genetics in hereditary cancer and subsequent prospects for test-based mutation confirmation have increased steadily over time (2–8). As this recognition has increased, and concomitant costly cancer genetic testing has become more widely available, a reliable and neutral analysis of hereditary cancer patterns has become of paramount importance if we are to ascertain effectively who is or is not a potential candidate for genetic testing. Without such a neutral analysis, in concert with appropriate counseling both prior to and following any utilization of genetic testing, inappropriate testing will inevitably arise. The dangers of unwarranted genetic testing include not only unnecessary increased health costs but also insurance and employment discrimination as well as needless patient fear and anxieties.

Yet like many subtle pattern recognition challenges, analysis of cancer family pedigrees has been open to wide interpretation and substantial differences of opinion. The approach we have taken permits classification of hereditary patterns with high sensitivity and specificity. This process, based on the automatic creation of classification algorithms (recognizers), is data dependent rather than observer dependent; observer-dependent classification can be quite subjective. The process used is equally applicable to all hereditary cancer patterns and to hereditary patterns in other systems (e.g., cardiovascular, neurologic, and metabolic disorders).

METHOD DEFINITION

The fundamental methodology employed is an adaptation of data mining. This methodology is somewhat similar to neural networks in that positive examples of the phenomenon of interest are presented to a data mining software module in conjunction with negative cases. The process extrapolates rules that characterize the positive examples of the data set, distinguishing them from the negative examples.
The strong advantage this methodology has over neural networks is that (a) a set of explicit logical rules is derived and (b) these rules are formulated in terms of the usual clinical parameters that describe the phenomenon. Such rules may be evaluated by outside experts for their accuracy and appropriateness, whereas neural networks provide a response whose intellectual basis typically cannot be known or communicated effectively to practitioners (in the example we shall use, clinical oncologists). The data mining derived rules taken together define membership in the (positive) set.

As a trivial example of the identification of rules that characterize a database of information, we may attempt to characterize a given set of (a) sweet and nonsweet, (b) round and nonround, and (c) red and nonred fruits and vegetables (comprising beets, eggplants, onions, and apples). With apples identified as the “positive” element we wish to characterize, the rule “round and red” characterizes or predicts an apple among these fruits and vegetables. Data mining algorithms can deduce precisely such rules, if they exist, to characterize a set of entries with assigned attributes, thus creating in effect automated expert systems derived from just a database of descriptive elements.

In our application, the entries are individual family histories, each with attributes describing the family history. The “positive examples” would ideally be the family histories that demonstrate hereditary patterns (e.g., a hereditary pattern of autosomal dominant transmission of colon cancer). However, unlike the traditional data mining application, we do not have prior knowledge of what constitutes a positive entry (i.e., a hereditary cancer family pattern). One could label the entries hereditary or not, arbitrarily, based on one physician's opinion or that of a composite group of physicians (if consensus could be reached), but then the derivation becomes essentially dependent upon the subjective opinion of a body of experts rather than the data themselves. This fundamental determination of what constitutes a suitable positive example is the key challenge and the heart of the innovation in the process we have devised.

We solve the positive examples problem by writing a “metatheoretic” expert system which can estimate for a family history a quantifiable measure of its representation as a positive example. Thus, rather than identification by opinion, our top-level strategy is to create an expert system which uses accepted genetic principles that characterize autosomal dominant Mendelian inheritance. In its first stage, the process (9) defines a pool of likely hereditary cancer family history candidates, within a large database of cancer family histories, using five broad-level clinical and genetic principles that may apply in any hereditary disease setting: (a) an inherited disease may show up more than once in the same generation (horizontal inheritance), (b) an inherited disease may show up from the prior to a subsequent generation (vertical inheritance), (c) there may be instances of different manifestations of the disease over multiple generations (overall cancer intensity), (d) there may be instances of the same manifestation of the disease over multiple generations (specific cancer intensity), and (e) the disease may occur at an earlier age of onset than its sporadic counterpart.
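The fruit-and-vegetable illustration given earlier in this section can be sketched in a few lines of Python. The attribute values assigned to each item are assumptions made for this sketch (the text does not list them), chosen so that “round and red” is the only minimal rule. The exhaustive search over attribute subsets is a stand-in for illustration, not the data mining software used in the study.

```python
from itertools import combinations

ATTRIBUTES = ("sweet", "round", "red")

# Hypothetical attribute values; the paper names the items but not their values.
ITEMS = {
    "apple":    {"sweet": True,  "round": True,  "red": True},
    "beet":     {"sweet": True,  "round": False, "red": True},
    "onion":    {"sweet": True,  "round": True,  "red": False},
    "eggplant": {"sweet": False, "round": False, "red": False},
}


def minimal_rules(items, positives):
    """Return the smallest attribute conjunctions that hold for every
    positive item and for no negative item."""
    negatives = [name for name in items if name not in positives]
    for size in range(1, len(ATTRIBUTES) + 1):
        found = []
        for combo in combinations(ATTRIBUTES, size):
            covers = all(all(items[p][a] for a in combo) for p in positives)
            leaks = any(all(items[n][a] for a in combo) for n in negatives)
            if covers and not leaks:
                found.append(combo)
        if found:  # stop at the smallest size that yields a valid rule
            return found
    return []


rules = minimal_rules(ITEMS, {"apple"})
print(rules)  # [('round', 'red')]
```

With these assumed values no single attribute separates the apple from the other items, so the search returns the two-attribute conjunction. The same idea of preferring the smallest sufficient set of characteristics is what allows a rule set to omit attributes the data do not require.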
We defined a quantified assignment of values for the family histories in the database by applying each principle to each family history record and assigning points depending on the data contained in each record under review. The more total points accrued, the more closely a family history corresponded to a phenotypical characterization of a hereditary cancer presentation with heterogeneous cancer manifestations, and thus the more it reflected an example of a hereditary etiology. The set of family histories whose point totals were above an experimentally derived threshold was defined to constitute the set of positive examples. In the special situation of hereditary syndromes such as breast and colon cancer, the data set of candidates may consist of (or at least include) family histories confirmed as hereditary by test results for known germ-line mutations such as BRCA1, BRCA2, APC, hMSH2, and hMLH1. However, for the general situation, data sets containing only unambiguous test-positive (and test-negative) individuals await further gene discovery, and thus our process provides a valuable alternative strategy.

Once such a pool of likely positive hereditary cancer candidate family histories is established, a second stage using the methodology of data mining (10, 11) extrapolates specific hereditary cancer patterns from these family histories. As noted above, the data mining methodology itself identifies and combines a logically minimal set of characteristics of the histories into rules so that the aggregate of rules characterizes the candidates and excludes the noncandidates. The characteristics selected come from those that describe the histories in terms of the five genetic principles exhibited, the heterogeneous cancers in the family history, their ages of onset, etc. Data mining methodology identifies individual characteristics from the range of exhibited parameters contained in the data set of cancer family histories (e.g., associated coincident cancers) which may be minimally required to create rules that characterize the given data set. As we shall see, however, the method does not necessarily select every pertinent characteristic if differentiation within the data set does not mandate it.

In the third stage, the most useful of these coincident cancers as identified through data mining are combined with the clinical and genetic principles to create a complete data-generated recognizer comprising a collection of rules which defines membership in the set of hereditary cancer positive candidates. All parameters used by the recognizer are assembled from data-defining elements, and thus each rule of a recognizer is an operationally verifiable pattern whose confirmation implies the presence of a hereditary risk (and the consideration of the case for concomitant genetic testing eligibility if such testing capacity exists). To clarify what such rules look like, a representative sample is provided in Table 1. These rules arose from a data set (every record stripped of each individual's personal identification) covering 315 colon cancer families involving a total of 980 colon-affected individuals, 101 of whom were assigned hereditary cancer candidate status in the first stage of the data mining process described above.
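The first-stage scoring described above (points per principle, then a threshold) might look roughly like the following sketch. The paper does not give the record layout, the point values, or the threshold, so every field name, the one-point-per-principle weighting, the age cutoff of 50, and `THRESHOLD` are hypothetical placeholders chosen only to make the mechanism concrete.

```python
from dataclasses import dataclass


@dataclass
class FamilyHistory:
    # All fields are hypothetical summaries of one family history record.
    family_id: str
    max_cases_in_one_generation: int  # principle (a), horizontal inheritance
    generations_affected: int         # principle (b), vertical inheritance
    distinct_cancer_sites: int        # principle (c), overall cancer intensity
    max_cases_of_one_site: int        # principle (d), specific cancer intensity
    youngest_onset_age: int           # principle (e), early age of onset


def score(history: FamilyHistory) -> int:
    """Award one illustrative point for each principle the record exhibits."""
    points = 0
    points += 1 if history.max_cases_in_one_generation >= 2 else 0
    points += 1 if history.generations_affected >= 2 else 0
    points += 1 if history.distinct_cancer_sites >= 2 else 0
    points += 1 if history.max_cases_of_one_site >= 2 else 0
    points += 1 if history.youngest_onset_age <= 50 else 0
    return points


THRESHOLD = 3  # hypothetical; the paper's threshold was derived experimentally


def positive_examples(histories, threshold=THRESHOLD):
    """Return the ids of histories whose point total is above the threshold."""
    return [h.family_id for h in histories if score(h) > threshold]


histories = [
    FamilyHistory("A", 2, 3, 2, 3, 41),
    FamilyHistory("B", 1, 1, 1, 1, 67),
]
print(positive_examples(histories))  # ['A']
```

The histories selected this way would then serve as the positive examples for the second-stage rule induction, with the remaining histories as negative examples.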

EVALUATION OF METHOD

Three studies were undertaken to evaluate the method devised. The first study compared the selection of potentially pertinent factors by a recognizer for hereditary colorectal cancer (HCRC) against the expert opinion of one of the authors (H.T.L.), a medical/genetics oncology clinician with over 30 years of experience. The factors considered first were a list of pertinent organ systems/sites in which primary cancers can occur, the same sites used to encode the database the recognizer used. The expert (H.T.L.) was asked separately to rank the relevancy of all the cancer sites and genetic principles on the lists. The choices of the process were then compared to the clinical expert's opinion.

The purpose of the second study was to ascertain how well the recognizer assembled data elements into valid patterns of HCRC, which it uses to evaluate patients and their cancer family trees. The patterns were created by considering all the theoretical combinations of all the recognizer-ranked pertinent HCRC-related cancer sites (enumerated in Table 2) in conjunction with all the correlated clinical and genetic principles, combined in all possible groupings, as each combination characterized the data. Using data mining methodology to eliminate the combinatorial explosion of billions of possible combinations, this method yielded 14 patterns used to characterize HCRC. Our clinical expert (H.T.L.) was asked to independently indicate agreement or disagreement with the 14 cancer clinical patterns to the extent each pattern would permit the designation of HCRC for an individual presenting with such a pattern.

TABLE 1
EXAMPLES OF ASSEMBLED RULES CONSTRUCTED BY DATA MINING METHODS FOR HEREDITARY COLON CANCER

The following patterns describe a hereditary colon cancer family history. In each pattern, total early-onset points are calculated by adding 3 points for each cancer by age 35, 2 points for each cancer between ages 36 and 45, and 1 point for each cancer between ages 46 and 50.

1. All on the same side of the family.
   a. The proband has colon cancer and
   b. there are two or more identical cancers (colon, endometrial, or kidney/ureter/renal pelvis) and
   c. there are identical cancers (colon, endometrial, or kidney/ureter/renal pelvis) in two or more generations and
   d. cancer(s) (colon, endometrial, or kidney/ureter/renal pelvis) with early onset total a minimum of 6 points.
2. All on the same side of the family.
   a. The proband has colon cancer and
   b. there are more than three of the same cancer (colon, endometrial, or kidney/ureter/renal pelvis) and
   c. cancer(s) (colon, endometrial, or kidney/ureter/renal pelvis) with early onset total a minimum of 6 points.
3. All on the same side of the family.
   a. The proband has colon cancer and
   b. there are identical cancers (colon, endometrial, or kidney/ureter/renal pelvis) in three or more generations and
   c. cancer(s) (colon, endometrial, or kidney/ureter/renal pelvis) with early onset total a minimum of 3 points.
4. All on the same side of the family.
   a. The proband has colon cancer and
   b. there are one or more second-degree endometrial cancers and
   c. there are more than two identical cancers (colon, endometrial, or kidney/ureter/renal pelvis) and
   d. cancer(s) (colon, endometrial, or kidney/ureter/renal pelvis) with early onset total a minimum of 3 points.
5. All on the same side of the family.
   a. The proband has colon cancer and
   b. there are two or more first-degree endometrial cancers and
   c. cancer(s) (colon, endometrial, or kidney/ureter/renal pelvis) with early onset total a minimum of 3 points.
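To make the Table 1 patterns concrete, here is a minimal Python sketch of the early-onset point calculation and of pattern 3. The point values and age bands come from Table 1. The `Cancer` record, the generation numbering, the assumption that all listed cancers are already on one side of the family, and the reading of “identical cancers in three or more generations” as one site appearing in at least three distinct generations are assumptions of this sketch, not the recognizer's actual implementation.

```python
from dataclasses import dataclass

# Cancer sites named in the Table 1 patterns.
HCRC_SITES = {"colon", "endometrial", "kidney/ureter/renal pelvis"}


@dataclass
class Cancer:
    site: str        # cancer site, e.g. "colon"
    generation: int  # hypothetical generation index within the family tree
    onset_age: int   # age at onset in years


def early_onset_points(cancers):
    """3 points by age 35, 2 points for ages 36-45, 1 point for ages 46-50."""
    total = 0
    for cancer in cancers:
        if cancer.site not in HCRC_SITES:
            continue
        if cancer.onset_age <= 35:
            total += 3
        elif cancer.onset_age <= 45:
            total += 2
        elif cancer.onset_age <= 50:
            total += 1
    return total


def matches_pattern_3(proband_site, cancers):
    """Pattern 3: proband has colon cancer, one listed site recurs in three
    or more generations, and early-onset points total at least 3."""
    if proband_site != "colon":
        return False
    generations_by_site = {}
    for cancer in cancers:
        if cancer.site in HCRC_SITES:
            generations_by_site.setdefault(cancer.site, set()).add(cancer.generation)
    spans_three = any(len(gens) >= 3 for gens in generations_by_site.values())
    return spans_three and early_onset_points(cancers) >= 3


family = [Cancer("colon", 1, 62), Cancer("colon", 2, 44), Cancer("colon", 3, 48)]
print(early_onset_points(family), matches_pattern_3("colon", family))  # 3 True
```

A full recognizer would evaluate a family history against every pattern and report all that match, which is how the 14 patterns are applied in the third study.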

TABLE 2
POTENTIAL RELIANCE ON EXTRACOLONIC CANCER SITES ASSOCIATED WITH HCRC

| Cancer site | Considered for use by recognizer (a) | H.T.L. evaluation whether site is clinically relevant (b) |
|---|---|---|
| Endometrial/uterine | Yes | Yes |
| Kidney/ureter/renal pelvis | Yes | Yes |
| Stomach | Yes | Yes |
| Pancreas | Yes | Yes |
| Ovarian | Yes | Yes |
| Breast | Yes | No |
| Lip | Yes | No |
| Liver/intrahepatic | No | Yes |
| Small intestine | No | Yes |
| Brain | No | Yes |
| Testes | No | Yes |
| Remaining 31 cancer sites | No | No |

Note. Yes indicates definite or probable clinically relevant use as ranked by recognizer and H.T.L. as of 9/95.
(a) Discriminant parameters implied from database.
(b) Parameters selected based on clinical experience.

The third study focused on the accuracy of the recognizer in identifying definitive hereditary cancer family histories, using a clinical patient database of prior cases (with each patient's anonymity completely preserved). The definitive hereditary colorectal cancer family histories were those that an expert clinician (H.T.L.) had diagnosed as unequivocally consistent with a cancer family history presentation of HCRC, while a putative hereditary family history was one the expert (H.T.L.) had indicated should be treated clinically as HCRC although the cancer family history picture did not demonstrate sufficient hereditary evidence for an expert clinician to affirm it without reservation.

We collected into a databank all family histories involving colon cancer (N = 49) occurring at two collaborative cancer centers which regularly obtain consultation from the Hereditary Cancer Institute through a consultation service supported by the Institute. One center was located in California, the other in Texas. The former obtained cohorts overwhelmingly by patient self-referral and self-motivated participation in community cancer screening programs provided by that cancer center. The second center obtained cohorts overwhelmingly by physician referral, based on physicians' concerns regarding patients' family histories. All individuals provided a self-reported family history covering first- and second-degree relatives, the occurrences of cancer, and ages of onset if known. The information available to the recognizer was identical to the information available to the expert (H.T.L.) with regard to elements of evaluation.

In the past 5 years, through its consultation service, the Institute received three family histories representing definite hereditary colorectal cancer patterns, four with putative hereditary colorectal cancer patterns, and 42 histories of elevated empirical (but not putative or hereditary) risk of colon cancer, as classified previously by H.T.L.'s analysis of these family histories. The goal was to compare the family histories already analyzed by H.T.L. against the 14 defining patterns created by the recognizer, to evaluate how its responses compared to those of H.T.L. Each family history was matched against all patterns to determine whether it would meet any pattern set. All pattern sets matched by each family history were identified. If the history fit none of the recognizer-created patterns, the minimum additional requirements to match at least one pattern were noted.

EVALUATION RESULTS

In Study 1, the recognizer's selections of HCRC-associated cancer sites and the expert's (H.T.L.) rankings are given in Table 2. Of the nine cancers deemed relevant by H.T.L., the recognizer required only five for rule development for the given data set being analyzed (endometrial/uterine, kidney/ureter/renal pelvis, stomach, pancreas, and ovarian), while four cancers identified by the expert were not flagged as needed parameters by the recognizer (liver/intrahepatic, small intestine, brain, and testes). Two cancers the recognizer considered were not identified by the expert (breast and lip). The system considered the 31 other cancer sites irrelevant, and H.T.L. concurred on all of them. Thus even with a moderate-sized data set used by data mining, there was concurrence in 36 of 42 cancer sites. Calculating the agreement between the recognizer and H.T.L. beyond that expected by chance alone, as indexed by κ, we obtained the value κ = 0.54.
With a standard error of 0.15, the recognizer's performance was statistically significant (P < 0.05) in successfully assessing the relevance/irrelevance of potentially heterogeneously associated cancer sites. The success of this approach lies in the fact that a process of neutral algorithmic data analysis of cancer family history studies correctly uncovers many of the predominant heterogeneous manifestations of the index cancer of interest and disregards most of those considered irrelevant. Since this part of the total process was directed toward finding appropriate elements to assemble into rule patterns for the given data set, it was not mandatory to find every actual heterogeneous manifestation, for it is not logically necessary that certain heterogeneous manifestations be considered confirmatory evidence for a hereditary pattern for a particular data set. What is noteworthy is that this process neutrally selects for consideration many correlated cancers that a clinical expert would also identify. Since definitely known associated tumor sites not detected by the recognizer arise from a deficiency of the particular data set employed, more subtle and enhanced recognition will automatically accrue as the data set grows in extent and diversity. In summary, as the diversity of the data set is expanded, this data mining process will postulate additional pertinent heterogeneous cancers to characterize the data, approaching 100% fidelity with the experts as the data warrant.
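The agreement statistic reported for Study 1 can be reproduced from the Table 2 counts. The sketch below computes Cohen's kappa for the 2 × 2 agreement table; the variable names are ours, and the closing significance check is a simple normal-approximation comparison using the standard error quoted in the text, which may differ from the authors' exact test.

```python
# Counts read from Table 2 (42 cancer sites in total).
both_yes = 5         # recognizer Yes, expert Yes
recognizer_only = 2  # recognizer Yes, expert No (breast, lip)
expert_only = 4      # recognizer No, expert Yes
both_no = 31         # recognizer No, expert No (remaining 31 sites)


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table: a = yes/yes, b = yes/no,
    c = no/yes, d = no/no (first rater listed first)."""
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the two raters' marginal totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(both_yes, recognizer_only, expert_only, both_no)
z = kappa / 0.15  # standard error of 0.15 as quoted in the text

print(round(kappa, 2))  # 0.54
print(z > 1.96)         # True
```

Observed agreement is 36/42 and chance agreement is 1218/1764, which gives κ = 7/13, or about 0.54, matching the reported value.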

TABLE 3
COMPARISON OF RECOGNIZER'S DETECTION OF HCRC TO THAT OF CLINICAL EXPERT

| Colon cancer case | Hereditary determination by H.T.L. | Recognizer signals a definitive hereditary pattern | Additional clinical findings needed in family history to obtain match to at least one of recognizer's definitive hereditary patterns |
|---|---|---|---|
| 1 | Definitive hereditary | Yes | N/A |
| 2 | Definitive hereditary | Yes | N/A |
| 3 | Definitive hereditary | Yes | N/A |
| 4 | Putative hereditary | No | Additional colon cancer + an early onset < age 35. |
| 5 | Putative hereditary | No | Change ovarian cancer diagnosis to colon with age of onset of 2 years earlier. |
| 6 | Putative hereditary | No | Change of stomach diagnoses to endometrial with any age of onset < 35. |
| 7 | Putative hereditary | No | Age of onset of one of the colon cancers prior to age 35. |
| 8–49 | Empirically elevated risk | No | N/A |
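The counts in Table 3 can be tallied into sensitivity and specificity with respect to the expert's definitive hereditary designation. This is a small illustrative calculation; the case list is rebuilt from the table's row counts and the label strings are ours.

```python
# (expert label, recognizer signalled a definitive pattern) per Table 3.
CASES = (
    [("definitive", True)] * 3      # cases 1-3
    + [("putative", False)] * 4     # cases 4-7
    + [("elevated", False)] * 42    # cases 8-49
)


def sensitivity_specificity(cases):
    """Treat 'definitive' as the positive class and everything else as negative."""
    tp = sum(1 for label, hit in cases if label == "definitive" and hit)
    fn = sum(1 for label, hit in cases if label == "definitive" and not hit)
    tn = sum(1 for label, hit in cases if label != "definitive" and not hit)
    fp = sum(1 for label, hit in cases if label != "definitive" and hit)
    return tp / (tp + fn), tn / (tn + fp)


sensitivity, specificity = sensitivity_specificity(CASES)
print(len(CASES), sensitivity, specificity)  # 49 1.0 1.0
```

Both figures rest on only three definitive cases, so they describe this sample rather than the recognizer's performance in general.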

For any interim version of a recognizer, those already well-known associated sites of malignancy which may not yet have been derived from the data set can be easily incorporated. Although not the focus of this paper, the authors have developed other recognizers which hypothesize strongly correlated cancers that are not yet identified in the medical literature or that have not been unambiguously confirmed in the medical literature until quite recently [e.g., pancreatic cancers correlated with melanoma (12)]. Thus it is possible that such correlated cancers may help guide the selection of cohorts for future linkage studies, improving the likelihood that they would be more informative. It is important to emphasize that no prior programming incorporates specific parameter choices; the data set itself drives the process, independent of preconceived concepts of cancer correlations. This makes the approach more attractive, since no normative stand is taken by developers regarding discrepancies; increasingly larger and more diverse data sets simply provide a neutral way to address the issues. It is also now apparent what the primary limitation of the recognizer methodology is: although the data set drives the method without developer bias, data bias can diminish the effectiveness of the recognizer derived, whether the data are limited by overly homogeneous, incorrect, or otherwise skewed cancer family histories.

In the second study, there was complete agreement between H.T.L. and the recognizer on the appropriateness of the 14 patterns to permit a definitive HCRC diagnosis for any cancer family history meeting the criteria for any pattern.

For Study 3, the results of the recognizer's detection of hereditary cancer are summarized in Table 3. All the definitive hereditary patterns of colon cancer (N = 3) as determined by H.T.L. matched 1 or more of the 14 recognizer patterns. None of the H.T.L. putative HCRC family histories (N = 4) matched any of the 14 recognizer patterns. However, the missing data elements required for a putative hereditary family history to match at least 1 of the recognizer's definitive hereditary patterns were minimal, indicating the subtle differences between putative hereditary and definitive hereditary cancer patterns. None of the H.T.L. empirically elevated (i.e., nonhereditary, nonputative) family histories (N = 42) matched any of the recognizer patterns. Thus the recognizer, as desired, gives a positive response for all definitive hereditary colon cases and a negative response for all other cases.

Study 1 indicated that the recognizer assembled its rule elements from the genetic principles together with only selected (heterogeneous) correlated cancers; even so, this pattern set was sufficient to characterize the independent cancer family histories available. It is logically possible that one could encounter other family histories whose determination of a hereditary pattern depended only upon recognition of precisely those correlated cancers for which the recognizer was deficient. On the one hand, the recognizer would by definition fail in such instances; on the other hand, when all such test (or real) family histories were added to the data set for induction by the recognizer, the new version of the recognizer would then encompass this additional experience and improve commensurately in its capability.

DISCUSSION

Studies 1 and 2 show that both the recognizer and the expert concur that there is heterogeneity in the presentation of hereditary colorectal cancer whose various signature patterns can be rigorously derived and defended by this approach.
Current medical literature provides an assessment of the colon recognizer's derivation of correlated cancers (extracolonic tumors). A recent review (13) itemized extracolonic cancers including those of the endometrium (second most common tumor in HCRC), ovary, stomach (most common in older generations), ureter and (renal) pelvis, and pancreas (for which there is a documented trend for increased incidence). All five were correctly identified by both expert and recognizer. Also itemized in the same review article were cancers of the small bowel (a rare tumor), hepatobiliary tract, and certain brain tumors, all of which were identified by the expert but were too rare to be detected in the data set which was available for recognizer development. There is no current medical literature support for the selection of lip cancer by the recognizer (and it was appropriately not selected by the expert). On the other hand, the recognizer identified breast cancer while the expert did not at the time of inquiry, yet in the most recent medical literature there has been molecular genetic evidence of the occurrence of breast cancer as an integral tumor in patients with HCRC (14). Thus there may be clinical diagnostic value in the process's independent discovery of associated primary cancers which help confirm a hereditary pattern.

Study 3 shows that the patterns derived by this process can be used to accurately identify presenting family histories for possible hereditary patterns. Supporting the prospects for a useful supplemental role in clinical decision-making, initial results with the recognizer demonstrated its high sensitivity (100%) and specificity (100%) with regard to definitive HCRC family histories, although future evaluations should compare pattern recognizer predictions against a gold standard of DNA-confirmed unambiguous test results. Of note is the recognizer's differentiation of those family histories labeled clearly hereditary by H.T.L. from those labeled putative by H.T.L.; one or more of the patterns are met by hereditary family histories, yet no rule is met by putative family histories. The small amount of supplementary clinical data needed to meet the recognizer's criteria supports the fine line of distinction between designations of putative hereditary and hereditary patterns.

We have also extended our data set descriptions to include the specific genetic mutations that have arisen for positive-testing individuals. This permits us to use the exact same methodological approach (devoid of any new programming) to characterize candidate family histories with rule patterns which also incorporate genotypic data (or alternatively, the exclusive use of just genotypic data) for the elements of classification and pattern definition.

In the development of computer-based expert systems to mimic clinical experts, clinicians may be queried and observations made as to what rules they are applying for specific cases (15); such processes, dependent upon the reporting of the expert, are observer dependent, not data dependent. The process we have presented is a (1) automatic and (2) data-dependent (thus objective as opposed to subjective) method to create a recognizer to identify hereditary disease.
This approach demonstrates that the clinical decision-making process involving pattern recognition induction from family histories can be grounded in a quantifiable methodology which yields results comparable to those of experts. As noted, the primary limitation of the recognizer methodology is data bias, which can diminish the effectiveness of the recognizer derived, although enhanced recognition automatically accrues as the data set grows in extent and diversity.

This process provides clinicians with a data-driven neutral analysis to quickly detect pattern-positive individuals (for whatever index cancer is chosen) who would thus have their eligibility for genetic testing significantly but neutrally advanced. It should be strongly emphasized, however, that such recognizer-based identification of hereditary patterns is only a prelude to the still-required confirmation of the significant family history data before any medical responses are initiated. Moreover, even with such confirmation in hand, it cannot be stated too strongly that the need for genetic counseling (preferably the expertise of a certified genetic counselor) is in no way reduced or mitigated.

This same method has been adapted to numerous other hereditary cancers of various anatomic sites and can be applied to a wide range of other hereditary pattern recognition problems (e.g., hereditary cardiovascular, neurologic, and metabolic disorders). Applying the recognizers to clinical patient databases, researchers can facilitate discovery of candidates for DNA studies in their search for gene mutations. Phenotype–genotype relationships can be derived by correlating the recognizer's derivation of factors constituting a specific hereditary pattern with associated genomic mutations for that hereditary disease. When genetic mutation testing results are included as parameters in a patient database for a specific genetic disease (e.g., hereditary breast cancer), the process yields an expert system which characterizes variations in phenotypical presentations of the disease in terms of specific genetic mutations. Such information can greatly improve the accuracy and efficiency of gene testing.

Clinicians' use of recognizers may be of significant supplemental benefit in the recognition of hereditary patterns, permitting earlier and more accurate diagnosis and more accurate and efficient gene testing, with concomitant earlier and appropriate disease management. The availability of molecular genetic testing for hereditary cancer confirmation, with its significantly high costs and increasing diversity of available tests, mandates a neutral gatekeeper to quickly, easily, inexpensively, and accurately identify candidates who may and who may not benefit from various gene mutation tests. Although research centers may aggregate large data sets for recognizer development, the actual recognizer itself is a minuscule program that can conveniently reside on any desktop computer. Coupled with this has been our development of a paper-based form to collect a family history that can be scanned into a computer and then assessed by a recognizer. Thus hereditary cancer recognizers could provide an extremely attractive, cost-beneficial addition to the physician's armamentarium to address this gatekeeper need. However, as noted, any patient's candidacy derived through the use of this process must then be further evaluated with the help of genetic counseling (involving history confirmation, patient emotional status, informed consent, etc.) before the patient actually undergoes further medical evaluation.
ACKNOWLEDGMENTS

We are indebted to Rod Hoden and Chris Connolly for programming support; to Theresa Conway, Jennifer Cavalieri, and Lavonne Fusaro for database collection; to Tami Richardson-Nelson and Carole Rhedin for database organization; and to Dr. Patrice Watson for her helpful assistance on statistical analysis.

REFERENCES

1. Lynch, H. T. Cancer and the family history trail. N. Y. State J. Med. April, 145–147 (1985).
2. Lynch, H. T., and Hirayama, T. “Genetic Epidemiology of Cancer.” CRC Press, Boca Raton, FL, 1989.
3. Lynch, H. T., Fitzgibbons, R. J., Jr., and Lynch, J. F. Heterogeneity and natural history of breast cancer. Surg. Clin. North Am. 70, 753–774 (1990).
4. Hall, J. M., Lee, M. K., Newman, B., Morrow, J. E., Anderson, L. A., Hury, B., and King, M. C. Linkage of early onset familial breast cancer to chromosome 17q21. Science 250, 1684–1689 (1990).
5. Wooster, R., Neuhausen, S. L., Mangion, J., Quirk, Y., Ford, D., Collins, N., Nguyen, K., Seal, S., Tran, T., Averill, D., Fields, P., Marshall, G., Narod, S., Lenoir, G. M., Lynch, H. T., Feunteun, J., Devilee, P., Cornelisse, C. J., Menko, F. H., Daly, P. A., Ormiston, W., McManus, R., Pye, C., Lewis, C. M., Cannon-Albright, L. A., Peto, J., Ponder, B. A. J., Skolnick, M. J., Easton, D. F., Goldgar, D. E., and Stratton, M. R. Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12–13. Science 265, 2088–2090 (1994).


6. Fishel, R., Lescoe, M. K., Rao, M. R. S., Copeland, N. G., Jenkins, N. A., Garber, J., Kane, M., and Kolodner, R. The human mutator gene homolog MSH2 and its association with hereditary nonpolyposis colon cancer. Cell 75, 1027–1038 (1993).
7. Bronner, C. E., Baker, S. M., Morrison, P. T., Warren, G., Smith, L. G., Lescoe, M. K., Kane, M., Earabino, C., Lipford, J., Lindblom, A., Tannergard, P., Bollag, R. J., Godwin, A. R., Ward, D. C., Nordenskjold, M., Fishel, R., Kolodner, R., and Liskay, R. M. Mutation in the DNA mismatch repair gene homologue hMLH1 is associated with hereditary non-polyposis colon cancer. Nature 368, 258–261 (1994).
8. Kinzler, K. W., Nilbert, M. C., Su, L. K., Vogelstein, B., Byran, T. M., Levy, D. B., Smith, K. J., Preisinger, A. C., Hedge, P., McKechnie, D., Finniear, R., Markham, A., Groffen, J., Boguski, M. S., Altschul, S. F., Hori, A., Ando, H., Mioshi, Y., Miki, Y., Nishisho, I., and Nakamura, Y. Identification of FAP locus genes from chromosome 5q21. Science 253, 661–665 (1991).
9. Evans, S. Methods for identifying human hereditary disease patterns. U.S. patent. July 1, 1997.
10. Pawlak, Z. “Rough Sets: Theoretical Aspects of Reasoning about Data.” Kluwer Academic, Dordrecht, Netherlands, 1991.
11. Piatetsky-Shapiro, G., and Frawley, W. J., Eds. “Knowledge Discovery in Databases.” MIT Press, Cambridge, MA, 1991.
12. Goldstein, A. M., Fraser, M. C., Struewing, J. P., Hussussian, C. J., Ranade, K., Zametkin, D. P., Fontaine, L. S., Organic, S. M., Dracopoli, N. C., Clark, W. H., Jr., and Tucker, M. A. Increased risk of pancreatic cancer in melanoma-prone kindreds with p16INK4 mutations. N. Engl. J. Med. 333, 970–974 (1995).
13. Lynch, H. T., and Smyrk, T. Hereditary nonpolyposis colorectal cancer (Lynch syndrome). Cancer 78, 1149–1167 (1996).
14. Risinger, J. I., Barrett, J. C., Watson, P., Lynch, H. T., and Boyd, J. Molecular genetic evidence of the occurrence of breast cancer as an integral tumor in patients with the hereditary nonpolyposis colorectal carcinoma syndrome. Cancer 77, 1836–1843 (1996).
15. Liebowitz, J. “Introduction to Expert Systems.” Mitchell Pub., Santa Cruz, CA, 1988.
