Functional proteomic pattern identification under low dose ionizing radiation


Artificial Intelligence in Medicine 49 (2010) 177–185

Contents lists available at ScienceDirect

Artificial Intelligence in Medicine journal homepage: www.elsevier.com/locate/aiim

Young Bun Kim a, Chin-Rang Yang b, Jean Gao c,*

a The Department of Pathology, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
b Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
c The Department of Computer Science and Engineering, University of Texas, Arlington, TX 76019, United States

ARTICLE INFO

Article history:
Received 15 March 2009
Received in revised form 18 March 2010
Accepted 23 March 2010

Keywords:
Proteomic signaling patterns
Jumping emerging identification
Low dose radiation
Feature selection

ABSTRACT

Objective: High dose radiation is well known to increase the risk of carcinogenesis. However, the understanding of the biological effects of low dose radiation is limited. Low dose radiation is reported to affect several signaling pathways including deoxyribonucleic acid repair, survival, cell cycle, cell growth, and cell death. The goal of this study is to reveal the proteomic patterns influencing these pathways.

Methods and materials: To detect possible regulatory proteins/kinases, an emerging reverse-phase protein microarray (RPPM) in conjunction with quantum dot nano-crystal technology is used as a quantitative detection system. The dynamic responses are observed at different time points and radiation doses. To quantitatively determine the responsive proteins/kinases and to discover the network motifs, we present a discriminative feature pattern identification system (DFPIS). Instead of simply identifying proteins contributing to the pathways, our methodology takes protein dependencies into consideration; these are represented as strong jumping emerging patterns (SJEPs). Furthermore, infrequent patterns, although they occur, are considered irrelevant.

Results: Computational results using DFPIS to analyze ataxia-telangiectasia mutated (ATM) cells treated under six different ionizing radiation doses (0 cGy, 4 cGy, 10 cGy, 50 cGy, 1 Gy, and 5 Gy) are presented. For each dose, the dynamic response was observed at different time points (1, 6, 24, 48, and 72 h). The sets of responsive proteins/kinases at different doses are reported. For each dose, the SJEPs for ATM-proficient and ATM-deficient cells are shown and compared.

Conclusion: By using the new RPPM technology and the DFPIS algorithm, we can observe the change of signaling patterns even at a very low radiation dosage where conventional technologies tend to fail.

© 2010 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +1 817 272 3628; fax: +1 817 272 3784. E-mail address: [email protected] (J. Gao).
0933-3657/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2010.04.001

1. Introduction

Exposure to low dose (10 cGy or lower) ionizing radiation (IR), as experienced by nuclear plant workers, astronauts, and X-ray operators, affects several signaling pathways including deoxyribonucleic acid (DNA) damage, DNA repair, cell cycle checkpoints, and cell apoptosis [1–6]. To understand the molecular signaling pathways affected, we study the dynamic responses of the networks under different patterns, considering both time and dosage changes. An emerging protein microarray called the reverse-phase protein microarray (RPPM), in conjunction with quantum dot (Qdot) nano-technology, is used as the detection system. This technology (RPPM-Qdot) lets us monitor the time series and dosage responses of cells exposed to low dose radiation.

Unlike the mature gene microarray technology, the protein microarray is a new technology. RPPM is a quantitative assay much like a miniature "enzyme-linked immunosorbent assay-on-a-chip" platform. In contrast to other protein arrays that immobilize probes, RPPM immobilizes the whole repertoire of sample proteins. It allows numerous samples to be analyzed in parallel using only minute (nanoliter) amounts of sample, yielding quantitative measurements that profile changes in the activity of different candidate signaling molecules in cell lines [7]. The RPPM technology was designed specifically for profiling changes in protein activity (e.g. phosphorylation, cleavage activation, etc.) rather than just protein expression levels. Combining RPPM with Qdot nano-technology, which offers a high yield of bright fluorescence and resistance to bleaching, gives us an innovative detection technique. With RPPM-Qdot, we are therefore able to elucidate ongoing kinase activities and post-translational modifications and to generate a dynamic view for functional proteomic analysis.

Isogenic human Ataxia Telangiectasia (A-T) cells are employed to study the central role of ataxia-telangiectasia mutated (ATM) in the cellular response to ionizing radiation. The cellular phenotype of A-T cells shows defects in ATM signal transduction and hypersensitivity to ionizing radiation [8,9]. ATM is a DNA double strand break sensor and can be activated by a change of chromatin structure. It plays a pivotal role in both cell cycle arrest and DNA repair. A-T cells therefore provide a good model for the study of DNA damage responses induced by low dose IR.

For the data produced by the RPPM-Qdot technology under different dosages and at different time points, visual inspection is not always sufficient or accurate for quantitatively determining the responsive proteins/kinases and discovering the pathway motifs they form. Sophisticated computational algorithms have to be explored to robustly discover and identify these complex molecular expression patterns and their interactions. While much research deals with classification methods applied to gene microarray data, only a few of these methods are explicitly designed to consider the dependency relationships among the investigated features (proteins). Hence, to capture the global picture of the signaling pathway, the dependency among proteins/kinases needs to be taken into account [10]. Feature pattern (combination of features) identification techniques should be used to provide more underlying semantics than single features. Nevertheless, it is very difficult to find meaningful patterns in large datasets like microarray data because of the huge search space. The difficulty also comes from infrequent network patterns, which exist but are often irrelevant or do not improve the accuracy of network finding [11,12].

To identify the different proteins/kinases involved in the signaling pathways for low vs. high dose ionizing radiation of ATM cells, we developed a discriminative feature pattern identification system (DFPIS). Instead of simply identifying proteins contributing to possible pathways, this methodology takes into consideration protein interactions and dependencies, which are represented as strong jumping emerging patterns (SJEPs).
The whole framework consists of three steps: feature (proteins, kinases)¹ selection, network pattern identification, and network pattern annotation. For feature selection, the responsive proteins/kinases that contribute most to distinguishing dosage and temporal differences are identified. The network motifs of the selected proteins are then discovered by SJEP pattern mining using a contrast pattern-tree (CP-tree). The last step of feature annotation provides a complete protein pattern characterization such as individual protein significance, protein dependency measurement, and network motif significance under IR. In the following sections, we describe the system in detail.

1.1. Problem formulation

For each numerical attribute from the RPPM data output, the value range is discretized into two or more intervals. Each (attribute, continuous-interval) pair is called an item. As an example, shown in Table 1, $\mathrm{gene}_{M63391}$ is discretized into two intervals, and $(\mathrm{gene}_{M63391}, [1700, +\infty))$ is an example of an item. Let $I$ be the set of all items. A set $X$ of items is called an item set and is a subset of $I$. $X(f_i)$ is defined as the item set of the feature $f_i$, which contains all continuous-interval items of the attribute $f_i$. For example, if the discretization method partitions each gene into two disjoint intervals, then

$$X(M63391) = \{(\mathrm{gene}_{M63391}, (-\infty, 1700)),\ (\mathrm{gene}_{M63391}, [1700, +\infty))\}.$$

The support of an item set $X$ in a data set $D$ is $sp_D(X) = count_D(X)/|D|$, where $count_D(X)$ is the number of samples in $D$ containing $X$. Suppose $D$ contains two different classes, $D_1$ and $D_2$. For an item $i \in I$, there is a single-item set $\{i\} \subset I$. If the significance of an item satisfies $S(\{i\}) = sp_{D_1}(\{i\})$ or $S(\{i\}) = sp_{D_2}(\{i\})$, we call the item $\{i\}$ an SJEP, which is the shortest jumping emerging pattern satisfying the support constraint. A feature significance, $S(f)$, is the averaged summation of pattern significance from all items.

¹ The terms "feature" and "proteins/probes/kinases" are used interchangeably.
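The definitions above can be illustrated with a minimal Python sketch. It discretizes numeric values into (attribute, interval) items, computes the support of an item set, and enumerates minimal item sets that occur at least a given number of times in one class and never in the other. Several points are assumptions made for illustration only: the enumeration is a brute-force search and not the CP-tree algorithm of [15], the raw intensity values are hypothetical, the function names are invented here, and the minimum support is expressed as a sample count. Intervals are stored as (low, high) bounds and are closed on the left.

```python
from bisect import bisect_right
from itertools import combinations


def discretize(attribute, value, cut_points):
    """Map a numeric value to an (attribute, interval) item.

    With sorted cut points c1 < ... < ck the intervals are
    (-inf, c1), [c1, c2), ..., [ck, +inf), stored as (low, high) bounds.
    """
    bounds = [float("-inf"), *cut_points, float("inf")]
    k = bisect_right(cut_points, value)  # a value equal to a cut point goes right
    return (attribute, (bounds[k], bounds[k + 1]))


def count(itemset, dataset):
    """count_D(X): number of samples in D that contain every item of X."""
    return sum(set(itemset) <= sample for sample in dataset)


def support(itemset, dataset):
    """sp_D(X) = count_D(X) / |D|."""
    return count(itemset, dataset) / len(dataset)


def strong_jeps(home, other, min_count):
    """Brute force: minimal item sets with count >= min_count in `home`
    and count 0 in `other` (illustration only, not the CP-tree method)."""
    items = sorted(set().union(*home))
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if any(pattern <= candidate for pattern in found):
                continue  # contains a smaller pattern already found: not minimal
            if count(candidate, home) >= min_count and count(candidate, other) == 0:
                found.append(candidate)
    return found


# Hypothetical raw intensities (NOT from the paper); cut points 1700 and 500
# follow the interval boundaries used in the examples of this section.
CUTS = {"gene_M63391": [1700], "gene_M76378": [500]}


def to_items(sample):
    """Turn one raw sample (attribute -> value) into a set of items."""
    return {discretize(name, value, CUTS[name]) for name, value in sample.items()}


cancer = [to_items(s) for s in (
    {"gene_M63391": 1200, "gene_M76378": 300},
    {"gene_M63391": 1500, "gene_M76378": 450},
    {"gene_M63391": 900, "gene_M76378": 620},
)]
normal = [to_items(s) for s in (
    {"gene_M63391": 2100, "gene_M76378": 350},
    {"gene_M63391": 1900, "gene_M76378": 700},
    {"gene_M63391": 1750, "gene_M76378": 800},
)]

low_m63391 = discretize("gene_M63391", 1200, CUTS["gene_M63391"])
cancer_sjeps = strong_jeps(cancer, normal, min_count=2)
```

In this toy data the only pattern returned for the first class is the single item `low_m63391`, because it occurs in every sample of that class and in none of the other. Larger candidates that contain it are skipped as non-minimal.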

Table 1
An example data set with two classes.

Features        Class
                Cancer          Normal
Gene_M26383     1   1   2       2   3   3
Gene_M63391     4   4   4       5   5   5
Gene_M76378     6   6   6       6   6   7
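The Table 1 data can be used to check the significance measures of this section numerically. The Python sketch below computes item significance, feature significance, and the relative significance of Eq. (1), and it reproduces the values reported in Example 2. It rests on assumptions that are inferred from those reported values rather than stated in the text: item significance is taken as the absolute difference of the class supports, a support whose sample count does not exceed the threshold of 1 is treated as zero, feature significance averages only items with non-zero significance, the distance between two items is the Jaccard distance of the sample sets containing them, and redundancy is (1 − distance) times the smaller of the two significances, in the spirit of Xin et al. [11,12]. All function names are hypothetical.

```python
from fractions import Fraction

# Table 1, one set of item indices per sample (columns of the table).
CANCER = [{1, 4, 6}, {1, 4, 6}, {2, 4, 6}]
NORMAL = [{2, 5, 6}, {3, 5, 6}, {3, 5, 7}]
ALL = CANCER + NORMAL
FEATURES = {
    "gene_M26383": [1, 2, 3],
    "gene_M63391": [4, 5],
    "gene_M76378": [6, 7],
}
MIN_COUNT = 1  # assumption: counts not exceeding this are treated as zero support


def support(item, data):
    """Thresholded support of a single item in one class."""
    n = sum(item in sample for sample in data)
    return Fraction(n, len(data)) if n > MIN_COUNT else Fraction(0)


def significance(item):
    """Assumed item significance: absolute difference of class supports."""
    return abs(support(item, CANCER) - support(item, NORMAL))


def feature_significance(feature):
    """Average significance over the items with non-zero significance."""
    scores = [significance(i) for i in FEATURES[feature] if significance(i) > 0]
    return sum(scores, Fraction(0)) / len(scores) if scores else Fraction(0)


def covered(item):
    """Indices of all samples (both classes) that contain the item."""
    return {idx for idx, sample in enumerate(ALL) if item in sample}


def redundancy(a, b):
    """Assumed redundancy: (1 - Jaccard distance) * min(S(a), S(b))."""
    ta, tb = covered(a), covered(b)
    distance = 1 - Fraction(len(ta & tb), len(ta | tb))
    return (1 - distance) * min(significance(a), significance(b))


def relative_significance(target, given):
    """Eq. (1): mean of S(k) - R(j, k) over item pairs with S > 0."""
    ks = [k for k in FEATURES[target] if significance(k) > 0]
    js = [j for j in FEATURES[given] if significance(j) > 0]
    if not ks or not js:
        return Fraction(0)
    total = sum((significance(k) - redundancy(j, k) for j in js for k in ks),
                Fraction(0))
    return total / (len(ks) * len(js))


s_f2_given_f1 = relative_significance("gene_M63391", "gene_M26383")
```

Exact fractions are used so that the results can be compared directly with the hand calculation, for example `s_f2_given_f1` against the value 7/9.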

Example 1. $S(\mathrm{gene}_{M63391}) = \big[S((\mathrm{gene}_{M63391}, (-\infty, 1700))) + S((\mathrm{gene}_{M63391}, [1700, +\infty)))\big]/2$.

Given significance measures, we can define the relative significance between two features. Let $J = \{j_1, j_2, \ldots, j_p\}$ be the set of all items appearing in $X(f_i)$ and $K = \{k_1, k_2, \ldots, k_q\}$ be the set of all items appearing in $X(f_j)$. The relative significance $S(f_j \mid f_i)$ is defined as

$$S(f_j \mid f_i) = \frac{\sum_{i=1}^{p} \sum_{j=1}^{q} S(K(j) \mid J(i))}{|K| \times |J|} = \frac{\sum_{i=1}^{p} \sum_{j=1}^{q} \left[ S(K(j)) - R(J(i), K(j)) \right]}{|K| \times |J|}, \tag{1}$$

where $S(J(i)), S(K(j)) > 0$ and $R(J(i), K(j))$ denotes the redundancy between two patterns $J(i)$ and $K(j)$. The ideal redundancy measure $R(J(i), K(j))$ is hard to obtain. In this paper, we use an approximate redundancy based on the distance between patterns, as proposed by Xin et al. [11,12].

Example 2. Let $f_1$, $f_2$, and $f_3$ be three features with

$X(f_1) = \{(\mathrm{gene}_{M26383}, (-\infty, 38.7)),\ (\mathrm{gene}_{M26383}, [38.7, 59.8)),\ (\mathrm{gene}_{M26383}, [59.8, +\infty))\}$,
$X(f_2) = \{(\mathrm{gene}_{M63391}, (-\infty, 1700)),\ (\mathrm{gene}_{M63391}, [1700, +\infty))\}$,
$X(f_3) = \{(\mathrm{gene}_{M76378}, (-\infty, 500)),\ (\mathrm{gene}_{M76378}, [500, +\infty))\}$.

For convenience, we index the items as the 1st, 2nd, and so on, so that $X(f_1) = \{1, 2, 3\}$, $X(f_2) = \{4, 5\}$, and $X(f_3) = \{6, 7\}$. The set of all items is $I = \{1, 2, 3, 4, 5, 6, 7\}$. The significance of items with a minimum support threshold $\xi = 1$ is calculated as $S(\{1\}) = 2/3$, $S(\{2\}) = 0$, $S(\{3\}) = 2/3$, $S(\{4\}) = 1$, $S(\{5\}) = 1$, $S(\{6\}) = |1 - 2/3| = 1/3$, and $S(\{7\}) = 0$. Feature significance is then calculated as the average of the items' significance: $S(\mathrm{gene}_{M26383}) = (S(\{1\}) + S(\{3\}))/2 = (2/3 + 2/3)/2 = 2/3$, $S(\mathrm{gene}_{M63391}) = 1$, and $S(\mathrm{gene}_{M76378}) = 1/3$.

Relative significance between two features is obtained from the relative pattern significance: $S(f_2 \mid f_1) = [S(\{4\} \mid \{1\}) + S(\{5\} \mid \{1\}) + S(\{4\} \mid \{3\}) + S(\{5\} \mid \{3\})]/4$. Note that item $\{2\}$ is not considered because $S(\{2\}) = 0$. For $S(\{4\} \mid \{1\})$, the distance between the two items is calculated as $D(\{1\}, \{4\}) = 1 - 2/3 = 1/3$. From $R(\{1\}, \{4\}) = (1 - 1/3)(2/3) = 4/9$, we obtain $S(\{4\} \mid \{1\}) = 1 - 4/9 = 5/9$. In the same way, the relative significance between the other items is obtained as $S(\{5\} \mid \{1\}) = 1$, $S(\{4\} \mid \{3\}) = 1$, and $S(\{5\} \mid \{3\}) = 5/9$. Finally, $S(f_2 \mid f_1) = (5/9 + 1 + 1 + 5/9)/4 = 7/9$. Note that when $\mathrm{gene}_{M26383}$ is given, the significance of $\mathrm{gene}_{M63391}$ is reduced from 1 to 7/9 because of the redundancy between $\mathrm{gene}_{M26383}$ and $\mathrm{gene}_{M63391}$.

1.2. Support vector machines (SVMs)

SVMs are well-known machine learning algorithms which have been successfully applied to numerous classification and pattern recognition problems such as text categorization, image recognition, and bioinformatics [13]. Suppose that there are $N$ training samples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ from two classes, where $\mathbf{x}_i \in \mathbb{R}^n$ is an $n$-dimensional feature vector representing the $i$th training sample, labeled by $y_i \in \{+1, -1\}$ for $i = 1, \ldots, N$. SVMs search for an optimal hyperplane which maximizes the margin between the two classes. The hyperplane classifying an input pattern $\mathbf{x}$ can be described by the following function:

$$f(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x}) \rangle + b, \tag{2}$$

where $\mathbf{w}$ is a weight vector and $b$ is a scalar. We can compute the weight vector by solving a quadratic programming problem formulated to find the optimal hyperplane. For a linear SVM, the weight vector is calculated as follows,

$$\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i, \tag{3}$$

where $\alpha_i$, $i = 1, \ldots, l$, are Lagrange multipliers and $l$ is the number of support vectors.

1.3. A discriminative feature pattern identification system (DFPIS)

To discover discriminant feature patterns in a possibly prohibitive search space, we design a discriminative feature pattern identification system named DFPIS. The framework starts with a feature selection step that builds a connection between pattern frequency (pattern support value) and discriminative measures. For each given feature, this method finds a relevant feature subset consisting of the $d$ features least correlated with it, based on the relative feature significance measure. With this low-correlation feature subset, we run the linear SVM algorithm, where two-thirds of the samples are used for training and the remaining one-third for testing. Then, we compute the weight for each feature based on the idea proposed in [14],

$$Z_k = \begin{cases} \dfrac{|w_k| S(f_k)}{\sum_{j=1}^{d+1} |w_j| S(f_j)} \times \beta \times \delta, & \text{for } \gamma \le \beta, \\[2ex] \left(1 - \dfrac{|w_k| S(f_k)}{\sum_{j=1}^{d+1} |w_j| S(f_j)}\right) \times (\gamma - \beta) \times \delta, & \text{for } \gamma > \beta, \end{cases} \tag{4}$$

where

$$\delta = \begin{cases} 1, & \text{for } \gamma \le \beta, \\ -1, & \text{for } \gamma > \beta, \end{cases} \tag{5}$$

and $\beta$ is the accuracy using testing samples, $\gamma$ is a predefined threshold, and $|w_k|$ is the absolute SVM weight obtained using Eq. (3). Each $|w_k| S(f_k)$ is normalized by dividing it by the summed $|w_j| S(f_j)$ value of all the features in the subset. $S(f_k)$ is the feature significance. To prevent the feature weight from being multiplied by zero, a very small value is added to $|w_k|$ and $S(f_k)$. In our approach, $S(f_k)$ is added to the equation from Oh et al. [14] since feature significance is an important measure of whether the feature is globally discriminant, not just locally within the feature subset. Finally, backward selection (elimination) starts with a certain number of features ranked by feature weights. The process stops when decreasing the size of the current best subset leads to a lower prediction rate. This algorithm is summarized in Fig. 1.

Fig. 1. The diagram of the proposed feature selection algorithm.

Once redundant features are removed, the feature pattern identification algorithm is performed. To efficiently mine SJEPs, we employ the SJEP mining algorithm based on the CP-tree [15]. The CP-tree is constructed using a new ordering of each transaction based on the feature weight from Eq. (4), whereas the original CP-tree reorders transactions based on the feature support value. The order of the CP-tree is very important for extracting SJEPs. However, there are some critical issues when only the feature support value is used for reordering. First, there are many cases in which the support values of features are equal. Second, the feature support value alone is not enough to rank features. Therefore, reordering based on the feature weight has a strong advantage for efficiently extracting SJEPs. Because every training instance is sorted by its weight when it is inserted into the CP-tree, items with high weight, which are more likely to appear in an SJEP, are closer to the root. Using the predefined order, we can produce the complete set of paths (item sets) systematically through

depth-first searches of the CP-tree. We start from the root and search the CP-tree depth-first for SJEPs. The item set, which is initially empty, grows one item at a time. After completing the search of the CP-tree, we select only the minimal patterns by filtering out those that are supersets of others. The remaining minimal ones are SJEPs since they satisfy the minimum support threshold. Fig. 2 shows an example of finding SJEPs using the CP-tree. In this figure, three selected probes are given: pEGFR, Becklin, and Ku70. As can be seen from the upper left table in Fig. 2, antibody pEGFR has three items numbered 92, 93, and 94. Inside each node of the CP-tree, the top number indicates the item number, the lower left number shows the support value for ATM+, and the lower right number shows the support value for ATM− at the current tree level. The final selected protein motif patterns are listed as SJEPs at the bottom of the figure. As an example, one SJEP is composed of items 94 → 115 → 84 with a support value of 0 for ATM+ and 2 for ATM−.

Fig. 2. Finding SJEPs using the Contrast Pattern Tree (CP-tree).

The final step is to provide feature pattern annotation. Feature pattern annotation is important for assigning a set of characteristics to feature patterns and thus obtaining relevant information for the interpretation of experimental results. Our goal is to generate annotations that give domain experts a complete and homogeneous feature pattern characterization, such as feature significance, relative feature significance, feature prediction ability (classification accuracy), feature pattern significance, and so on.

2. Experimental results

We applied the quantum dot reverse-phase protein microarray [16] to profile the dynamic responses of several cellular signaling pathways, including DNA damage sensors/repair, cell growth/proliferation, cell cycle checkpoints/regulation, tumor suppressor p53, anti-apoptotic NF-κB, and apoptosis, to low dose IR [8,9]. ATM

deficient (ATM−) human fibroblasts were isolated from a patient with the A-T phenotype, and ATM proficient (ATM+, clone YZ5) cells were those cells complemented with the wild-type ATM gene [18]. The isogenic pair of ATM cells was treated with a series of IR doses (0 cGy, 4 cGy, 10 cGy, 50 cGy, 1 Gy, and 5 Gy); cell lysates were collected at different time points (1, 6, 24, 48, and 72 h), serially diluted, and spotted on protein arrays in triplicate. To evaluate the dynamic responses of different signaling pathways within the biological network, commercially available antibody sampler kits (Cell Signaling Technology, Inc.) directed against proteins/kinases within specific pathways were incubated with the protein arrays. A total of 55 antibodies (listed under Fig. 5) were chosen. The signals were amplified by biotinylated secondary antibodies and streptavidin-conjugated Quantum Dot 655 (Invitrogen, Inc.). Signal readouts were the EC50 intensities of each dilution curve, corrected for protein loading with the intensities of the total protein stain (SyproRuby, Invitrogen, Inc.) and then normalized to values between zero and one.

To test the performance of the proposed DFPIS algorithm, classification was carried out by the linear SVM (soft margin C = 1). Leave-one-out cross validation (LOOCV) evaluation was employed because of the small number of samples. Table 2 shows the data sets used in this experiment. These data sets treat as samples the intensities of a certain dose at five different time points, the intensities of all dose levels at a certain time, or the intensities of all dose levels at all time points, and they have 55 antibodies as features. The classes of these data sets are labeled as either ATM+ or ATM−.

2.1. Computational analysis: feature selection

The responsive probe sets discovered for different dosages and at different time points are given in Table 3. In this

table, the minimum feature set indicates the list of features selected by DFPIS feature selection, and the maximum feature set indicates all the other relevant probes with respect to each selected probe in the minimum set. This table shows that A-T cells are significantly affected by low dose IR as well as by high dose IR. However, we note that only 5–15 features were selected in Data1 under the 4 cGy dose. This suggests that many of the features significantly affected by high dose IR are affected less strongly by low dose IR. We also observe different effects of low dose IR and high dose IR in Fig. 3. To evaluate the performance of our algorithm, we carried out comparison experiments with support vector machine-recursive feature elimination (SVM-RFE) feature selection. As seen from Table 3, DFPIS-FS reaches 100% accuracy on all eleven data sets, whereas the accuracy of SVM-RFE falls below 100% on Data1, Data3, Data4, Data8, and Data9.

Table 2
Data description.

Dataset   # of classes   # of samples   # of features   Description
Data1     2              10 (5/5)       55              4 cGy dose, 5 time points
Data2     2              10 (5/5)       55              10 cGy dose, 5 time points
Data3     2              10 (5/5)       55              50 cGy dose, 5 time points
Data4     2              10 (5/5)       55              1 Gy dose, 5 time points
Data5     2              10 (5/5)       55              5 Gy dose, 5 time points
Data6     2              10 (5/5)       55              1 h, 5 doses
Data7     2              10 (5/5)       55              6 h, 5 doses
Data8     2              10 (5/5)       55              24 h, 5 doses
Data9     2              10 (5/5)       55              48 h, 5 doses
Data10    2              10 (5/5)       55              72 h, 5 doses
Data11    2              50 (25/25)     55              All times, all doses

Table 3
The number of minimum and maximum responsive protein sets under different doses and at different time points.

          DFPIS-Feature selection                                 SVM-RFE
          # of features                                           # of features
Dataset   Min   Max   Accuracy (%)   Sensitivity (%)   Specificity (%)   Min   Max   Accuracy (%)   Sensitivity (%)   Specificity (%)
Data1     5     15    100            100               100               18    26    84             76                91
Data2     5     55    100            100               100               5     55    100            100               100
Data3     4     31    100            100               100               7     18    80             80                80
Data4     3     55    100            100               100               15    55    90             80                100
Data5     4     55    100            100               100               7     55    100            100               100
Data6     5     55    100            100               100               8     55    100            100               100
Data7     6     55    100            100               100               14    55    100            100               100
Data8     3     55    100            100               100               12    55    80             80                80
Data9     7     21    100            100               100               39    55    90             80                100
Data10    3     55    100            100               100               7     55    100            100               100
Data11    7     55    100            100               100               10    55    100            100               100
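The accuracy, sensitivity, and specificity columns of Table 3 come from leave-one-out cross validation. The following Python sketch shows how such figures can be computed in general. It is an illustration under stated assumptions, not the evaluation code used in the study: a nearest-centroid classifier stands in for the linear SVM so that the example needs no external library, ATM+ is arbitrarily taken as the positive class (the text does not say which class is positive), and the two-feature intensity values are hypothetical.

```python
def loocv_predictions(samples, labels, fit, predict):
    """Leave-one-out: train on all samples but one, predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def classification_metrics(labels, predictions, positive):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    pairs = list(zip(labels, predictions))
    tp = sum(y == positive and p == positive for y, p in pairs)
    tn = sum(y != positive and p != positive for y, p in pairs)
    fp = sum(y != positive and p == positive for y, p in pairs)
    fn = sum(y == positive and p != positive for y, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def fit_centroids(xs, ys):
    """Stand-in classifier (NOT the linear SVM of the paper): class centroids."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical normalized intensities for two features (not from the paper).
samples = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],
           [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
labels = ["ATM+", "ATM+", "ATM+", "ATM-", "ATM-", "ATM-"]

predictions = loocv_predictions(samples, labels, fit_centroids, predict_centroid)
metrics = classification_metrics(labels, predictions, positive="ATM+")
```

Because `fit` and `predict` are passed in as functions, the same loop could wrap any classifier, including a linear SVM from a library of the reader's choice.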

2.2. Computational analysis: feature pattern identification

To analyze the dynamic network responses induced by different IR levels, we give two example feature interaction diagrams from feature pattern annotation: Data1 (4 cGy dose) and Data5 (5 Gy dose). We found six SJEPs for both ATM+ and ATM− on Data1. From these patterns, seven relationships among five representative features were found. As shown in Fig. 4, the first and sixth feature relationships were found in both classes. However, note that the fluorescence intensities of features are expressed differently during these interactions. For instance, the dependency of feature f0 (pATM) causes the intensity of f20 (p21) to go up in the ATM+ class but brings it down in the ATM− class. The seventh relationship disappeared in ATM−. According to the support ratio and the

relative feature significance assigned to each relationship, the first, second, fifth, and sixth relationships are slightly stronger than the third, fourth, and seventh relationships.

Fig. 3. Performance of feature selection: (a) box plot of accuracy at different dose levels; (b) accuracy of DFPIS when different top-ranked features are selected.

Fig. 4. Interaction diagram of five representative probes on Data1 using 4 cGy dose.

We found three SJEPs for ATM+ and five SJEPs for ATM− on Data5. From these patterns, five relationships among four representative features were found. As shown in Fig. 5, all five feature relationships were found in both classes. However, note that the fluorescence intensities of features are expressed differently during these interactions, except for the fifth relationship. According to the support ratio assigned to each relationship, the strength of all relationships in ATM+ is slightly reduced in ATM−. The expression levels of the top five selected probes in Data1 are shown in Fig. 6. Fig. 7 shows the expression levels of the top four selected probes for Data5.

Fig. 5. Interaction diagram of four representative probes on Data5 using 5 Gy dose.

Fig. 6. (a–e) Expression levels of the top five most responsive probes in 4 cGy dose data.

Fig. 7. (a–d) Expression levels of the top four most responsive probes in 5 Gy dose data.

2.3. Biological observations

Table 4
Comparison of interactions for 4 cGy and 5 Gy dose.

No   4 cGy dose   Mapping to 5 Gy dose network        5 Gy dose
1    33 → 0       44 (33's rep) → 44 (0's rep)        44 in rep set
2    0 → 20       44 (0's rep) → 46 (20's rep)        46 → 44
3    20 → 25      46 (20's rep) → 54 (25's rep)       46 → 54
4    25 → 5       54 (25's rep) → 54 (5's rep)        54 in rep set
5    0 → 25       44 (0's rep) → 54 (25's rep)        55 → 44
6    33 → 20      44 (33's rep) → 46 (20's rep)       46 → 44
7    20 → 5       46 (20's rep) → 54 (5's rep)        46 → 54

No   5 Gy dose    Mapping to 4 cGy dose network       4 cGy dose
1    46 → 54      25 (46's rep) → 25 (54's rep)       25 in rep set
2    54 → 44      25 (54's rep) → 0 (44's rep)        0 → 25
3    44 → 1       0 (44's rep) → 5 (1's rep)          0 → 20 → 25 → 5, 0 → 20 → 5, 0 → 25 → 5
4    46 → 44      25 (46's rep) → 0 (44's rep)        0 → 25
5    54 → 1       25 (54's rep) → 5 (1's rep)         25 → 5

rep: representative.

As shown in Table 4, we investigated whether the interactions of selected features at different IR dose levels are related to each other. First, all four representative probes on Data5 (5 Gy dose), namely pSmad3, Becklin, pEGFR, and pBRCA1, were found in the maximum feature set on Data1 (4 cGy dose). This shows that these antibodies still play an important role at the low dose IR level. Second, all of the relationships in Data5 are related to

those of Data1. The comparison and relationship of the top selected probes in the two data sets are shown in Fig. 8. Green nodes represent the five features selected for the 4 cGy dose data. Blue nodes represent the four features selected for the 5 Gy dose data. The red lines represent interactions inferred from the SJEPs of the selected features. Features whose correlation coefficient with a blue feature exceeds 0.7 are connected to it by blue lines in the 5 Gy dose data. Features whose correlation coefficient with a green feature exceeds 0.7 are connected to it by green lines in the 4 cGy dose data. In DFPIS feature pattern identification, we assume that a family member has the same or similar relationships as its representative feature. Thus the five relationships in Data5 were matched with five similar relationships in Data1 in Table 4. For instance, the fifth relationship in Data5 was assigned to the fourth relationship in Data1 since feature f25 (PUMA) was a representative of f54 (Becklin), which has a 0.90 correlation coefficient with f25; f5 (pDNAPK) was a representative of f1 (pBRCA1), which has a 0.88 correlation coefficient with f5; and the fourth relationship in Data1 holds between f25 and f5. Finally, we observe some reverse relationships. As an example, the second relationship in Data5 corresponds to the reverse of the seventh relationship in Data1. In our research, the direction of

dependence was determined by a new feature rank identified in DFPIS feature selection. Thus this reverse relationship can be identified if the major features change as the IR dose level changes. However, to provide more information about the directions of relationships, further study considering all possible directions is needed.

Fig. 8. Relationships among top selected features and highly-correlated features for 4 cGy and 5 Gy data sets.

3. Discussion and conclusion

This paper presented exploratory work on identifying signaling molecules under low dose ionizing radiation by using RPPM in conjunction with quantum dot nano-technology. A computational framework, DFPIS, was developed to recognize the contributing probes in different pathways and to take protein dependence into consideration. For feature selection, the most responsive proteins at different time points are identified. The interaction patterns of the selected probes are discovered by SJEP pattern mining based on a CP-tree. The last step, feature pattern annotation, provides a complete pattern characterization such as single probe significance, relative pairwise probe dependence, and pattern significance. The pilot study does reveal the quantitative change of different protein/kinase expression levels in different patterns. For future work, we plan to increase the sample size and the number of probes. In addition, we will investigate and biologically validate the individual signaling pathways affected under different doses and over time.

Acknowledgments

This research was supported by the Office of Science (BER), U.S. Department of Energy under Grant no. DE-FG02-07ER64335. We also thank Dr. Y. Dong and Dr. D. Boothman for their support with the biological experiments.

References

[1] Fachin AL, Mello SS, Sandrin-Garcia P, Junta CM, Ghilardi-Netto T, Donadi EA, et al. Gene expression profiles in radiation workers occupationally exposed to ionizing radiation. J Radiat Res 2009;50(1):61–71.
[2] Fakir H, Hofmann W, Tan WY, Sachs RK. Triggering-response model for radiation-induced bystander effects. Radiat Res 2009;171(3):320–31.
[3] Grillo CA, Dulout FN, Guerci AM. Evaluation of radioadaptive response induced in CHO-K1 cells in a non-traditional model. Int J Radiat Biol 2009;85(2):159–66.
[4] Tseng CW, Trimble C, Zeng Q, Monie A, Alvarez RD, Huh WK, et al. Low-dose radiation enhances therapeutic HPV DNA vaccination in tumor-bearing hosts. Cancer Immunol Immunother 2009;58(5):737–48.
[5] Tsukimoto M, Homma T, Mutou Y, Kojima S. 0.5 Gy gamma radiation suppresses production of TNF-alpha through up-regulation of MKP-1 in mouse macrophage RAW264.7 cells. Radiat Res 2009;171(2):219–24.
[6] Zhuang HQ, Wang JJ, Liao AY, Wang JD, Zhao Y. The biological effect of 125I seed continuous low dose rate irradiation in CL187 cells. J Exp Clin Cancer Res 2009;28:12.
[7] Geho D, Lahar N, Gurnani P, Huebschman M, Herrmann P, Espina V, et al. Pegylated, steptavidin-conjugated quantum dots are effective detection elements for reverse-phase protein microarrays. Bioconjug Chem 2005;16(3):559–66.
[8] Marchetti F, Coleman M, Jones I, Wyrobek A. Candidate protein biodosimeters of human exposure to ionizing radiation. Int J Radiat Biol 2006;82(9):605–39.
[9] Ziv Y, Bar-Shira A, Pecker I, Russel P, Jorgensen T, Tsarfati I, et al. Recombinant ATM protein complements the cellular A-T phenotype. Oncogene 1997;15:159–67.
[10] Cheng H, Yan X, Han J, Hsu CW. Discriminative frequent pattern analysis for effective classification. In: Proceedings of the 2007 IEEE international conference on data engineering (ICDE 07); 2007. p. 716–25.
[11] Xin D, Cheng H, Yan X, Han J. Extracting redundancy-aware top-k patterns. In: Ungar L, editor. Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining; 2006. p. 20–3.
[12] Xin D, Han J, Yan X, Cheng H. On compressing frequent patterns. Data Knowledge Eng 2007;60:5–29.
[13] Burges CJC. A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discov 1998;2:121–67.
[14] Oh JH, Gurnani P, Schorge J, Rosenblatt KP, Gao J. An extended Markov blanket approach to proteomic biomarker detection from high-resolution mass spectrometry data. IEEE Trans Inform Technol Biomed 2009;13(2):195–206.
[15] Fan H, Ramamohanarao K. Fast discovery and the generalization of strong jumping emerging patterns for building compact and accurate classifiers. IEEE Trans Knowledge Data Eng 2006;18(6):721–37.
[16] Shingyoji M, Gerion D, Pinkel D, Gray JW, Chen F. Quantum dots-based reverse phase protein microarray. Talanta 2005;67(3):472–8.