Neurocomputing 250 (2017) 37–44
Gene selection in autism – Comparative study

Tomasz Latkowski a, Stanislaw Osowski a,b,∗

a Faculty of Electronics, Military University of Technology, 01-476 Warsaw, Kaliskiego 2, Poland
b Faculty of Electrical Engineering, Warsaw University of Technology, 00-661 Warsaw, Koszykowa 75, Poland

Article history: Received 24 February 2016; Revised 21 June 2016; Accepted 16 August 2016; Available online 7 February 2017

Keywords: Autism; Gene expression microarrays; Feature selection; Clustering; Genetic algorithm; Random forest

Abstract: The paper investigates the application of several feature selection methods to the identification of the most important genes in autism disorder. The study is based on gene expression microarrays. The applied methods analyze the importance of genes on the basis of different principles of selection. The most important step is to fuse the results of these selections into a common set of genes that are best associated with autism. These genes may be treated as biomarkers of this disorder and used in its early prediction. The paper proposes and compares three different methods of such fusion: purity of the clusterization space, application of a genetic algorithm, and a random forest in the role of integrator. The numerical experiments are concerned with the identification of the most important biomarkers and their application in autism recognition. They show that the applied strategy of fusing many independent selection methods leads to a significant improvement of the autism recognition rate. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

Autism disorder belongs to the pervasive neurodevelopmental disorders, affecting a broad spectrum of human functions [1,2]. An important problem is the early recognition of this disorder, enabling proper treatment of autistic individuals. Nowadays, microarray gene expression data are studied to find the genes or sequences of genes that are best associated with autism and might be treated as biomarkers. The difficulties in identifying these genes are the many outliers, the high variance of the data, and the bad conditioning of the problem [1,3], manifested by the small number of available observations (usually measured in hundreds) in comparison to the very large number of genes (dozens of thousands). These complexities raise the challenge of how to identify the genes that are most informative for this disorder and that can be used to distinguish the class of autistic individuals from the others.

Many methods developed in feature selection have been used to solve the task of gene selection in different problems. They include clustering methods [4], application of neural networks and support vector machines [5–7], statistical tests [8], linear regression methods applying forward and backward selection [9], fuzzy expert system based algorithms [10,11], rough set theory [12], global optimization methods, including genetic algorithms, chaotic binary particle swarm optimization and artificial bee colony (ABC) in connection with the kNN classifier [13–15], application of the ReliefF method combined with different classifiers [7], various statistical methods [16,17], as well as fusion of many selection methods [6,18]. Although most of these methods have been applied in cancer research, they may also be adapted to autism. Many solutions have used specialized methods, from which the best one was chosen as the most appropriate for the particular problem. However, it should be mentioned that each selection method uses a specialized procedure for assessing the class discriminative features. The results depend on the applied mechanism of selection, which may work well in some data mining problems and be inefficient in others.

An additional difficulty in autism recognition is the very high variance of gene expression among individuals belonging to the same group. For example, in the NCBI database of autism [19], containing 146 observations and 54,613 genes, the variance of gene expression values of different individuals varies from 0.099 to 24.19 × 10^6, with mean 2.38 × 10^4, median 38.84, and 13,245 genes of expression variance higher than 1000. Fig. 1 presents the mean value and variance of gene expressions in the analyzed data. It confirms the very high variability of the gene expressions and the existence of many outliers. This high variance of the data means that a particular choice of the set of observations for the selection procedure may lead to completely different results. This makes the application of a single method inefficient for autistic data and requires the elaboration of a special procedure, based on the application of many selection methods in multiple runs. The important task in such an approach is to fuse the selection results into the final solution.


Fig. 1. The change of mean value (upper figure) and variance (bottom figure) of gene expressions of autistic data.

The primary aim of the paper is to find a small population of the most informative genes strongly associated with autism. These genes might be useful as biomarkers of this neurodevelopmental disorder and at the same time serve as input attributes to an automatic system for autism prediction. The application of many different feature selection methods cooperating in an ensemble will be proposed as the best tool to solve this task. The most important requirement is to use methods that are based on different principles of operation, guaranteeing independent performance. Their number is not strictly defined. In this solution we have used eight methods, which in our opinion are satisfactory from the point of view of diversity of operation. Such an approach to autistic data has not been applied by other authors. There are some works showing an ensemble of methods for microarray data regarding cancer problems [6,18]; however, the strategy was different. The authors of [18] have proposed a multicriterion fusion-based approach, in which the integration is done on the level of features. In our approach the fusion is performed on two levels: the level of individual methods and the level of many classifiers, forming an ensemble creating the final decision. Moreover, we propose different strategies of selecting the size of the optimal gene set.

The applied methods rely on different principles and therefore assess the discrimination ability of a gene in an independent way. The important point is to fuse their results into one final group of genes that might be treated as the biomarkers of autism. In this paper we will present three different approaches to gene fusion: the purity of the clusterization space, the genetic algorithm and the random forest. The limited set of genes may also be used as input attributes in the classification system responsible for early identification of autistic individuals. This system is composed of many classifiers arranged in an ensemble integrated by the random forest. The results of numerical experiments performed on the NCBI database [19] will be presented and discussed.

2. Materials

The basic numerical experiments of gene selection have been performed on the NCBI dataset related to autism. The database is publicly available and was downloaded from the GEO (NCBI) repository [19]. The number of observations in this dataset equals 146 and the number of genes is 54,613. The database consists of two classes: the first one is related to children with autism (82 observations) and the second to the control group of healthy children (64 observations). All subjects in the base are male. Probands and controls were all recruited from the Phoenix area. Blood draws for all subjects were done between the spring and summer of 2004. Total RNA was extracted for microarray experiments with Affymetrix Human U133 Plus 2.0 3′ Expression Arrays. Children with autism were diagnosed by a medical professional (developmental pediatrician, psychologist or child psychiatrist) according to the DSM-IV criteria, and the diagnosis was confirmed on the basis of the ADOS and ADI-R criteria [1,17]. The non-classic forms of autism, i.e., autism with regression and Asperger's syndrome (a higher-functioning form of autism, in which individuals have language skills within the normal range), were excluded from the base. In addition, each subject had a normal high-resolution chromosome analysis and a negative Fragile X DNA test.

The main task of this work is to find a small subset of genes which are best associated with autism. This problem is resolved by using several gene selection methods combined into one final gene recognition system.

3. Methods

The gene expression array of autism considered in this work contains more than 50,000 genes. It is natural that most of them have no class discrimination ability. Therefore, a first filtration of genes should be done in the introductory stage to reduce this number in a significant way. We have applied a strategy in which the genes with similar expression levels for the autistic and reference (control) classes within all observations are eliminated first, as not discriminative in class recognition.

3.1. Initial filtering

The difficult problem in this stage is the existence of a huge number of outliers in the gene expressions observed in the database. These outliers distort the comparison of the discrimination ability of genes in both classes. To avoid this problem we have applied an introductory filtration based on the median rather than the mean value. Fig. 2a and b present the comparison of the two genes of the highest distance between the means (Fig. 2a) and medians (Fig. 2b) for the two classes of data. The outliers existing in the data have significantly distorted the calculation of the mean value, especially for autistic individuals, for which the mean value was 140.57 at a standard deviation of 569.69. Hence the results of such a gene selection are not reliable. In the case of the median principle (Fig. 2b) the results are much better: we can see only the natural variability of expression values in both classes. Therefore, in further experiments we have applied the median filtration in the introductory step of reduction.

The investigations presented in [20] have shown that most genes depict a very similar ratio of the medians in both classes. There were no genes of this ratio below the value of 0.7, and the active range of reduction is from 0.8 to 1.0. To reduce the number of less significant genes to a reasonable level, the threshold value of this ratio was chosen heuristically and set to 0.96. This resulted in the introductory elimination of 35,762 genes. Therefore, in further analysis we used only the 18,851 genes for which the ratio of medians in both classes was below 0.96. This is a considerable reduction; however, it eliminates only the genes of very similar expression in both classes, which are not very useful in the recognition process.
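For concreteness, a minimal sketch of this median-ratio filtering is given below, assuming an expression matrix X of shape (observations × genes) and binary class labels y (1 – autistic, 0 – control); the variable names and the exact ratio handling are illustrative, not taken from the authors' code.

```python
# Sketch of the median-based initial filtering (illustrative names).
import numpy as np

def median_ratio_filter(X, y, threshold=0.96):
    """Keep genes whose ratio of class medians falls below `threshold`."""
    med_aut = np.median(X[y == 1], axis=0)  # per-gene median, autistic class
    med_ctl = np.median(X[y == 0], axis=0)  # per-gene median, control class
    # Ratio of the smaller to the larger median, so it always lies in (0, 1];
    # a ratio close to 1 means nearly identical medians in both classes.
    eps = 1e-12                              # guards against division by zero
    ratio = np.minimum(med_aut, med_ctl) / (np.maximum(med_aut, med_ctl) + eps)
    keep = ratio < threshold                 # only dissimilar genes survive
    return X[:, keep], keep

# Usage (hypothetical): X_reduced, mask = median_ratio_filter(X, y)
# On the NCBI data this kind of thresholding leaves 18,851 of 54,613 genes.
```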
Moreover, we have observed a significant difference between the genes selected on the basis of the median and the mean value principles.


Fig. 2. The expression values of the best class discriminating gene at application of (a) the mean value and (b) the median value.

Table 1. The repeatability of the best genes selected commonly after filtration based on the median and mean principles.

Population of the best genes:          100    1000    5000   10,000   20,000   30,000   40,000   50,000
Number of commonly selected genes:      30     416    2955    6489   14,257   22,939   32,912   46,255
Repeatability of the common genes:     30%   41.6%   59.1%   64.9%   71.3%    76.5%    82.3%    92.5%
In the group of the ten best class discriminative genes there was no common one. In the group of the 100 best genes there were only 30 common genes. Table 1 presents the number of commonly selected genes at application of the median and mean principles for different populations of the best genes.

3.2. Individual gene selection methods

The reduced set of genes is subject to a further selection process. This step was organized in two stages. In the first one we apply 8 selection methods, differing in the principle of operation. The following methods have been applied: Fisher discriminant analysis (FD), the ReliefF algorithm (RF), the two-sample t-test (TT), the Kolmogorov–Smirnov test (KS), the Kruskal–Wallis test (KW), the stepwise regression method (SW), feature correlation with a class (FC) and support vector machine recursive feature elimination (SVM-RE). A short description of them was provided in [20]. The operation of these methods relies on different foundations, which allows analyzing the selection problem from different points of view. The Fisher method compares the mean values and standard deviations in both classes for each feature separately [17]. The ReliefF algorithm ranks the features according to the highest correlation with the observed class while taking into account the distances between opposite classes [21]. The TT, KS and KW methods apply different statistical hypothesis tests, on the basis of which the feature class discrimination ability is assessed [22]. Stepwise regression systematically adds and removes features to and from the set of input attributes based on their statistical significance in a regression [24]. The FC method assesses a feature on the basis of its correlation with the class while eliminating the features well correlated with each other [9,23,24]. The SVM-RE method uses the linear SVM as a classifier supplied by all features at the same time [5]. The ranking criterion of a particular feature is based on its weight magnitude. The SVM is re-trained many times after eliminating some of the least significant features (usually from 10% to 20%). As a result, more and more compact gene subsets are created in the succeeding iterations.

The application of these methods leads to 8 different sets of genes, ordered according to their rank, which reflects the class discrimination ability (from the most to the least discriminative).

Table 2. The population of commonly selected genes in different methods.

Size of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:    0    3   23   21   26   31   35  269
Because of the various principles of gene assessment, the results of such selections are different. Among the 100 most important genes in our database the repeatability rate ranged from 5% to 63%, even for the same method at different contents of the observation samples used in selection. This is due to the large variety of expression levels of the different individuals forming the non-uniform autistic group. The particular composition of samples taking part in selection has a high impact on these results. Therefore, a single run of the selection process performed on the chosen samples of observations is not a good solution. Limiting further considerations to only the 100 genes appearing most often in 10 runs of each method (treated further as the best), the total selected population of different genes contained 408 items (still too large to treat them as biomarkers). However, there was not a single gene which appeared commonly in all 8 sets. Table 2 presents the number of genes commonly selected by the eight applied methods in the group of the 100 best. Seven selection methods have selected simultaneously only 3 identical genes: HIST1H2BG, TRPV6 and CAPS2. They might be treated as the most important biomarkers for autism. It should be noted that no method was able to indicate a reliable optimal size of the gene set, enabling acceptable recognition of the autistic samples from the reference group. Therefore, an additional step is needed. We have applied and compared three different approaches to this problem: the purity of the clusterization space, the genetic algorithm and the random forest.
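As an illustration of the statistical filters in this group, the sketch below ranks genes with the TT, KS and KW tests using SciPy; X and y are assumed as before, and repeating the ranking over random subsets of observations mimics the 10-run scheme described above. The names are illustrative.

```python
# Sketch of three statistical filter selections (TT, KS, KW); illustrative only.
import numpy as np
from scipy import stats

def rank_genes(X, y, test="tt", top=100):
    """Return indices of the `top` genes with the smallest test p-values."""
    a, b = X[y == 1], X[y == 0]          # autistic and control sub-matrices
    pvals = np.empty(X.shape[1])
    for g in range(X.shape[1]):
        if test == "tt":                  # two-sample t-test
            _, pvals[g] = stats.ttest_ind(a[:, g], b[:, g], equal_var=False)
        elif test == "ks":                # Kolmogorov–Smirnov test
            _, pvals[g] = stats.ks_2samp(a[:, g], b[:, g])
        elif test == "kw":                # Kruskal–Wallis test
            _, pvals[g] = stats.kruskal(a[:, g], b[:, g])
    return np.argsort(pvals)[:top]        # smallest p-value = highest rank

# Running rank_genes on several random subsets of the observations and keeping
# the 100 most frequently returned genes approximates the multi-run selection.
```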



Fig. 3. The change of the overall purity of clustering as a function of population of the best genes in FC method.

Table 3. The maximal values of purity and the minimum population of genes for the eight methods of gene selection in the clustering approach.

Selection method:           FD     RF     TT     KS     KW     SW     FC     SVM-RE
Overall purity measure:    0.84   0.79   0.83   0.77   0.82   0.66   0.83   0.67
Population of genes:         52     21     30     53     25     60     30     86

Table 4. The population of commonly selected genes in different methods after the clusterization approach.

Number of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:      0    2    7    7    5    9   22  175

3.3. Purity of clusterization space

In this approach a self-organizing clusterization of the data space using the K-means algorithm [25] was done first. Two clusters have been formed in this way, each grouping the data of mixed classes (autistic and reference). The purity of the ith cluster, $p_i = \max_j p_{ij}$, defined for $i, j = 1, 2$, where $p_{ij} = n_{ij}/n_i$ ($n_{ij}$ – the number of jth class objects in the ith cluster, $n_i$ – the number of objects in the ith cluster, $n$ – the total number of observations), allows determining the overall purity $p$ of a clustering as

$$p = \frac{n_1}{n}\,p_1 + \frac{n_2}{n}\,p_2 \qquad (1)$$

This definition gives a measure of the extent to which the clusters contain objects of a single class. To avoid the trap of a local minimum, 10 repetitions of the K-means algorithm have been made. This K-means procedure, followed by the purity estimation, was performed for all eight subsets of the best genes extracted by each selection method. The population of genes taking part in the K-means clustering was changed in the experiments from 1 to 100. Fig. 3 presents the dependence of the overall purity index p on the number of genes forming the subset. The results depicted in the figure correspond to the FC method. The highest purity measure, 0.83, corresponds to the minimum population of the 30 best genes. Table 3 shows the numerical results of this approach for all applied methods of gene selection, presenting the maximum overall purity and the minimum population of genes. The highest purity index, equal to 0.84, was obtained at the 52 best genes and corresponded to the application of the Fisher discriminant method (FD). Only two genes, HIST1H2BG and TRPV6, have been selected commonly by 7 methods of selection. Table 4 presents the size of the gene sets commonly selected by different methods among these optimized subsets. After this step the number of commonly selected genes has significantly decreased: only 2 genes were common to 7 selection methods, 7 genes were common to 6 methods of selection, etc.
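A minimal sketch of this purity criterion is given below, assuming X_ranked holds the genes of one selection method ordered by rank; the K-means call follows scikit-learn's API, while the variable names are illustrative.

```python
# Sketch of the overall purity of Eq. (1) for a 2-cluster K-means split.
import numpy as np
from sklearn.cluster import KMeans

def overall_purity(X_sub, y, n_repeats=10):
    """p = (n1/n)*p1 + (n2/n)*p2, maximized over K-means restarts."""
    best = 0.0
    for seed in range(n_repeats):          # 10 repetitions to avoid local minima
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X_sub)
        p = 0.0
        for i in (0, 1):
            members = y[labels == i]
            if members.size:                # p_i = max_j n_ij / n_i
                p_i = max(np.mean(members == c) for c in (0, 1))
                p += (members.size / y.size) * p_i
        best = max(best, p)
    return best

# Sweeping the population of best genes from 1 to 100, as in Fig. 3:
# purities = [overall_purity(X_ranked[:, :k], y) for k in range(1, 101)]
```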

3.4. Random forest application

Random forest (RF) is a very powerful classification method of Breiman, which can also be used to assess the class discrimination ability of a particular input attribute [26]. RF constructs many decision trees in the learning phase and outputs the class indicated by the majority of the individual trees. The learning data for each tree are selected randomly and used to train the trees with some limited number of likewise randomly selected input attributes. The input attributes that provide the best split, according to the objective function, are used to perform a binary split on each node. The importance of a particular gene is measured by taking into account how the inclusion of this input attribute (the gene) changes the accuracy of class recognition. It is defined on the basis of the relative change of the prediction error rate for validation data when the values of this gene are permuted among the testing samples. The out-of-bag prediction error computed on this perturbed data set and compared to the error before perturbation is estimated for every tree, then averaged over the entire ensemble and divided by the standard deviation over the ensemble. The genes are ordered according to their statistical impact on the change of the classification accuracy after permuting their values. A positive value of this measure means a positive influence of the gene on class recognition, while a negative value means a reduction of the class discrimination ability. The optimal number of genes has been obtained by trying different numbers of the successive best genes in the classification procedure and choosing the population size which provides the best class recognition results on the validation data.

The random forest method was combined with recursive elimination of genes in order to find the optimal gene set size. The set of the 100 best genes selected by the particular method was delivered in the first step to the input of the SVM, responsible for recognition of the autistic and reference classes. The 10-fold cross-validation approach was used in the learning process. The genes have been ordered according to their importance in class recognition. In the succeeding steps the population of the gene set was reduced by eliminating 10% of the least important genes. The process of gene ordering and elimination was repeated until the classification error on the validation data reached its minimum value. The genes which survived this process represent the final optimal population. These experiments have been performed using 50 trees. The number of variables used in the binary split of nodes was the square root of the actual size of the input vector. Fig. 4 presents the changing importance of the genes (from the most to the least significant) in the succeeding steps of recursive elimination. The vertical axis represents the relative measure of importance and the horizontal axis the succeeding genes, ordered according to their decreasing importance. The recursive elimination process was continued until the classification error on the validation data achieved its minimum. The optimal number of genes corresponds to this minimum. The same procedure has been repeated for all 8 selection methods. Table 5 presents the optimal number of genes and the corresponding accuracy of class recognition. As can be seen, each selection method leads to a different size of the optimal gene set and also a different final accuracy of class recognition.
This time only one gene, TRPV6, has been chosen commonly by 7 methods of selection, three genes by 6 methods, etc. Table 6 presents the number of genes commonly selected by different methods in these optimized subsets.
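The sketch below illustrates the recursive elimination stage with a random forest of 50 trees. For compactness, scikit-learn's impurity-based feature_importances_ stands in for the out-of-bag permutation measure described above (scikit-learn does not expose the latter directly), and the same forest is used for both ranking and error estimation, whereas the paper pairs the ranking with an SVM classifier; all names are illustrative.

```python
# Sketch of random-forest ranking with recursive 10% elimination (illustrative).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_recursive_elimination(X, y, min_genes=10):
    """Drop the 10% least important genes per cycle; keep the best subset."""
    idx = np.arange(X.shape[1])            # start from the 100 best genes
    best_err, best_idx = 1.0, idx
    while idx.size > min_genes:
        rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                    random_state=0)
        rf.fit(X[:, idx], y)
        err = 1.0 - cross_val_score(rf, X[:, idx], y, cv=10).mean()
        if err < best_err:                 # remember the best-performing subset
            best_err, best_idx = err, idx.copy()
        order = np.argsort(rf.feature_importances_)     # least important first
        idx = idx[order[max(1, int(0.1 * idx.size)):]]  # drop the bottom 10%
    return best_idx, best_err
```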


Fig. 4. The diagrams presenting the changing importance of the genes selected by the FC method in the process of recursive elimination by random forest.

Table 5. The best accuracy of class recognition and the minimum population of genes for the eight methods of gene selection in the random forest approach.

Selection method:                 FD       RF       TT       KS       KW       SW       FC       SVM-RE
Accuracy of class recognition:  84.25%   80.14%   82.88%   82.88%   82.88%   83.56%   81.51%   88.36%
Population of genes:              39       73       39       22       22       22       22       81

Table 6. The population of commonly selected genes in different methods after application of the random forest.

Number of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:      0    1    3    3    7   14   29  152

3.5. Application of genetic algorithm

The last fusing approach analyzed in the paper is based on the application of the genetic algorithm (GA) [27,28]. The limited number of the highest ranked genes extracted by each method (here 100 genes) creates the input attributes (chromosome vectors) to the SVM classifier [29]. The genes in a chromosome are coded in a binary way: a value of one means acceptance of the particular gene as an input signal to the classifier, and zero means elimination of this gene from the chromosome. The GA consists of selecting parents for reproduction, their crossover, and mutation of some bits representing the children. The initially generated population of chromosomes contains elements equal to zero or one, chosen randomly. The fitness function is the inverse of the classification error on the validation data set. In each generation the fitness of every chromosome in the population is evaluated, the chromosome parents are stochastically selected from the current population (based on their fitness function values, using roulette wheel selection), and then modified (recombined and possibly mutated) to form a new population. This new population is then used in the next iteration of the algorithm. The size of the genetic population applied in the experiments was 100 chromosomes, with a crossover rate of 80% and a mutation rate of 1%.

Each binary chromosome is associated with the input vector x applied to the SVM classifier of Gaussian kernel (the value one means inclusion of the gene, zero its exclusion). The classifier is trained on the learning data set (60% of the available data) and then tested on the validation data (the remaining 40%). The genetic operations performed on the succeeding generations finally lead to the minimum of the objective function. Ten repetitions of the GA with randomly selected learning and testing data have been performed and the results averaged. In this way we obtain the optimal set of genes corresponding to the minimum of the validation error. Table 7 presents the optimal populations of genes corresponding to the maximum accuracy of class recognition in the validation mode for all investigated selection methods. They represent the average of 10 independent runs, each associated with a random choice of the learning and validation subsets. The applied selection methods have resulted in different sizes of the optimal gene sets. There were two genes, HIST1H2BG and CAPS2, which have been selected commonly by 7 methods. Table 8 presents the number of common genes which appear in the sets created by the applied methods of selection.
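A compact, illustrative version of this genetic search is sketched below: binary chromosomes mask the 100 candidate genes, the fitness is the validation accuracy of a Gaussian-kernel SVM, and roulette wheel selection, one-point crossover (rate 0.8) and bit-flip mutation (rate 0.01) follow the settings quoted above. It is a sketch under these assumptions, not the authors' implementation.

```python
# Sketch of GA-based gene selection with an SVM fitness (illustrative names).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fitness(mask, X_tr, y_tr, X_va, y_va):
    """Validation accuracy of an RBF SVM restricted to the masked genes."""
    if not mask.any():
        return 0.0
    clf = SVC(kernel="rbf").fit(X_tr[:, mask], y_tr)
    return clf.score(X_va[:, mask], y_va)

def ga_select(X, y, n_genes=100, pop=100, gens=50, pc=0.8, pm=0.01):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.6, random_state=0)
    P = rng.random((pop, n_genes)) < 0.5          # random binary chromosomes
    for _ in range(gens):
        f = np.array([fitness(c, X_tr, y_tr, X_va, y_va) for c in P]) + 1e-9
        parents = P[rng.choice(pop, size=pop, p=f / f.sum())]  # roulette wheel
        children = parents.copy()
        for i in range(0, pop - 1, 2):            # one-point crossover
            if rng.random() < pc:
                cut = rng.integers(1, n_genes)
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        children ^= rng.random(children.shape) < pm   # bit-flip mutation
        P = children
    f = np.array([fitness(c, X_tr, y_tr, X_va, y_va) for c in P])
    return P[np.argmax(f)]                        # best gene mask found
```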

4. Comparative analysis of selection results

The three different methods applied in the second step of gene selection have resulted in different contents of the most important genes. Among the 24 sets, corresponding to the eight methods of the first stage and the three approaches to the final stage of selection, there were only 13 sets containing 10 commonly selected genes. They include: HIST1H2BG, TRPV6, CAPS2, ZSCAN18, SNHG7, CFC1B, RHPN1, Clone FP18821 unknown mRNA, EVPLL and PSENEN.

Table 7. The best accuracy of class recognition and the minimum population of genes for the eight methods of gene selection in the genetic algorithm approach.

Selection method:                 FD       RF       TT       KS       KW       SW       FC       SVM-RE
Accuracy of class recognition:  91.78%   89.04%   91.78%   84.25%   91.78%   95.89%   92.47%   96.7%
Population of genes:              63       47       63       60       52       66       56       65

Table 8. The population of commonly selected genes in different methods after application of the genetic algorithm.

Number of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:      0    2    6    8   17   15   38  193

These genes can be treated as the extended set of the most representative biomarkers of autism. Their class discrimination ability is well illustrated by comparing the distance between the centers of samples belonging to the autistic and reference classes for different choices of genes, as well as the mean distances between the data samples and their representative center in both classes. Three different cases have been compared:

• the data represented by the above-mentioned set of 10 genes,
• 10 genes selected randomly from the set of 18,851 genes (the mean value of 50 random trials), and
• the 10 least significant genes among the 18,851, selected by the single Fisher method.

Table 9 shows the results of this comparison. As we can see, the set of the 10 best genes has provided the smallest dispersion within the classes and the largest distance between the centers of both classes. The 10 least significant genes have shown the smallest distance between the centers of both classes, confirming their uselessness in class discrimination.

To illustrate the space distribution of observations belonging to both classes, the multidimensional data samples have been mapped into a 2-dimensional coordinate system using principal component analysis (PCA) [25]. Fig. 5 shows the distribution of points belonging to the autistic (the x symbol) and reference (circle) classes. They are presented on the plane formed by the two most important principal components (PC1 and PC2) and refer to the representation of the data by the 10 best genes, 10 randomly selected genes and all 18,851 genes. The enlarged symbols (circle and x) represent the centers of both classes. The graphical results confirm the superiority of the genes selected by the presented selection system: the classes are the least intermixed and the distances between the centers of both classes are the largest. These conclusions are also confirmed by the numerical values presented in Table 10.
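The sketch below reproduces the kind of comparison behind Tables 9 and 10: project the samples onto the first two principal components and compute the class-center distance and within-class dispersions for a chosen gene subset. The names are illustrative.

```python
# Sketch of the PCA mapping and class-center statistics (illustrative names).
import numpy as np
from sklearn.decomposition import PCA

def center_statistics(X_sub, y, to_2d=True):
    """Distance between class centers and mean/std within-class dispersions."""
    Z = PCA(n_components=2).fit_transform(X_sub) if to_2d else X_sub
    c1, c0 = Z[y == 1].mean(axis=0), Z[y == 0].mean(axis=0)  # class centers
    d_centers = np.linalg.norm(c1 - c0)
    disp1 = np.linalg.norm(Z[y == 1] - c1, axis=1)           # autistic class
    disp0 = np.linalg.norm(Z[y == 0] - c0, axis=1)           # reference class
    return d_centers, (disp1.mean(), disp1.std()), (disp0.mean(), disp0.std())

# Example: compare the selected 10 best genes with 10 random ones.
# print(center_statistics(X[:, best10], y))
# print(center_statistics(X[:, rng.choice(X.shape[1], 10, replace=False)], y))
```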

Fig. 5. The distribution of points belonging to the autistic and reference classes on the plane formed by PC1 and PC2 at representation of the data by (a) the 10 best genes, (b) 10 randomly selected genes and (c) all 18,851 genes.


Table 9. The distance between the centers of the autistic and reference classes and the mean distances ± standard deviations between the data points and their centers for different choices of genes.

Gene set                          Distance between class centers   Mean distance to center (autistic)   Mean distance to center (reference)
The 10 best genes                            0.446                         0.606 ± 0.290                        0.477 ± 0.182
10 random genes                              0.210                         0.700 ± 0.296                        0.788 ± 0.338
The 10 least significant genes               0.0003                        0.652 ± 0.222                        0.724 ± 0.300

Table 10. The distance between the centers of the autistic and reference classes and the mean distance ± standard deviation between the data points and their centers for different choices of genes in their 2-dimensional PCA mapping.

Gene set             Distance between class centers   Mean distance to center (autistic)   Mean distance to center (reference)
The 10 best genes               0.323                         0.064 ± 0.037                        0.063 ± 0.038
10 random genes                 0.018                         0.097 ± 0.049                        0.084 ± 0.071
All 18,851 genes                0.052                         1.214 ± 0.667                        1.002 ± 0.618

Table 11. The results of 10-fold cross-validation (mean value ± std) in autism recognition at application of different methods of gene fusion (only testing data not taking part in learning).

Fusion method              Accuracy [%]     Sensitivity [%]   Specificity [%]
Clustering (8 methods)     78.43 ± 2.66     82.03 ± 2.57      77.11 ± 2.75
GA (8 methods)             79.32 ± 3.27     84.81 ± 2.78      78.32 ± 2.56
RF (8 methods)             85.82 ± 1.62     88.12 ± 1.75      83.34 ± 1.84
RF (the best 6 methods)    86.92 ± 1.18     89.03 ± 1.15      84.08 ± 1.13

5. Classification results

The final experiments have been directed at comparing the class recognition ability of the selected sets of genes. This time the available data set was split into two independent parts: 40% of the samples have been used only for the selection of the best genes and the remaining 60% only for class recognition. This process of random splitting was repeated 10 times and the results averaged. The classification stage was performed using the genes selected on the basis of the first subset. Thanks to such an organization of the experiments, both phases (selection and classification) were independent of each other; hence the results are expected to be the most objective. The genes chosen by the different selection methods formed the input attributes to the SVM of Gaussian kernel and the RF classifiers in the recognition phase.

The genes selected by the different methods were treated independently and served as the input signals to the classifiers. We have applied the strategy that the results of selection at application of the random forest formed the input attributes to the SVM classifier and vice versa. Thanks to this, the maximum independence of the ensemble members has been obtained. The final fusion of results is done by the RF, serving as an integrator. The decisions of the individual classifiers serve as input attributes to the RF, and this system of decision trees is responsible for creating the final decision of autism recognition. The succeeding steps (classification and integration) were performed in the 10-fold cross-validation mode, followed by integration of their results into the final score. Then the average error over the testing parts across all 10 trials was computed. The advantage of this cross-validation method is that it matters less how the observation data sets are split: every observation gets to be in a test set exactly once, and in a training set 9 times. In the general case the ensemble was composed of 24 units (3 approaches to fusing the results of 8 selection methods). However, the experimental results have shown that application of all ensemble members is not optimal. Different configurations of the individually best gene sets have been tried. In the final stage only the 6 best selection methods (FD, RF, KS, KW, FC and SVM-RE) have been applied. The statistical results of these experiments, in the form of mean and standard deviation values, are shown in Table 11. They present the total accuracy, sensitivity and specificity of class recognition for the testing data. The best results with respect to accuracy, sensitivity and specificity correspond to the fusion of the six most efficient selection methods combined with the RF as a classifier.

Additional experiments have been done to compare our approach to the method presented in [18], where the fusion was performed only on the level of features. Recursive feature elimination with 20% of the least important genes eliminated in each cycle was applied, and an SVM of Gaussian kernel was used as the final classification system. However, the results were worse.


The average accuracy was only 82.73%, significantly worse than the values presented in Table 11. Further experiments have also been performed on the autism database [30] using the approach presented here. This base contained 87 autistic samples and 29 non-autistic controls. The obtained highest accuracy of class recognition ranged from 96.33% for the clustering method, through 97.05% for the RF, up to 98.26% for the GA. The sensitivity and specificity ranged from 95% to 98% and from 90% to 96%, respectively. These results compare well to the best result of 82% accuracy, 90% sensitivity and 75% specificity reported in the recent paper [3] for the same database. The paper [14] has presented an approach to autism recognition based on two diagnostic standards: the Autism Diagnostic Observation Schedule (ADOS) and the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [31]. The dataset, created in Ramathibodi hospital in Bangkok, contained the records of 140 patients, and each patient record was categorized into either the autism or the Pervasive Developmental Disorder – Not Otherwise Specified (PDD-NOS) class. The application of ABC-kNN to autism recognition has shown an accuracy of 85%. This is also worse than the results presented by us (of course, on a different data set).
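A schematic of the two-level fusion evaluated in this section is sketched below: one Gaussian-kernel SVM per selection method's gene subset at the first level, and a random forest integrating the member decisions at the second level, all inside a 10-fold cross-validation loop. It is a simplified sketch (the meta-level is trained on in-fold member predictions), not the authors' exact pipeline; names are illustrative.

```python
# Schematic two-level ensemble: SVM members fused by a random forest integrator.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def ensemble_cv(X, y, gene_subsets, n_splits=10):
    """10-fold CV accuracy of the RF-integrated ensemble over gene subsets."""
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X, y):
        # Level 1: one SVM per selection method's gene subset.
        members = [SVC(kernel="rbf").fit(X[np.ix_(tr, s)], y[tr])
                   for s in gene_subsets]
        meta_tr = np.column_stack([m.predict(X[np.ix_(tr, s)])
                                   for m, s in zip(members, gene_subsets)])
        meta_te = np.column_stack([m.predict(X[np.ix_(te, s)])
                                   for m, s in zip(members, gene_subsets)])
        # Level 2: a random forest fuses member decisions into the final class.
        integrator = RandomForestClassifier(n_estimators=50, random_state=0)
        integrator.fit(meta_tr, y[tr])
        accs.append(integrator.score(meta_te, y[te]))
    return float(np.mean(accs)), float(np.std(accs))
```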

6. Conclusions

The paper has presented and compared collective approaches to the selection of the most important genes/transcripts, which are the most informative for autism and can be used as biomarkers to distinguish the two classes of data. It was shown that a multistep collective approach, applying many different, properly integrated feature selection methods, is able to extract a small subset containing the most informative genes. The theoretical results were validated and supported by experiments performed on the publicly available NCBI databases. They confirmed that the genes appearing most often in multiple repeats of the algorithm runs are good candidates for biomarkers of autism. The numerical measures of the quality of the gene representation of the data, defined in the form of the distance between the two class centers and the dispersion of the clusters, have shown good performance of the proposed approach. The best selected genes were used as input attributes to the classifiers arranged in the form of an ensemble, integrated using the random forest. Autism recognition conducted on two available databases has shown good results, outperforming the other achievements reported in recent publications. The proposed fusion of many individual classification results leads to a significant increase of the accuracy, sensitivity and specificity of class recognition.


Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.neucom.2016.08.123.

References

[1] M. Alter, R. Kharkar, K. Ramsey, D. Craig, R. Melmed, T. Grebe, R. Curtis-Bay, S. Ober-Reynolds, J. Kirwan, J. Jones, J. Blake-Turner, R. Hen, D. Stephan, Autism and increased paternal age related changes in global levels of gene expression regulation, PLoS One 6 (2011) 1–10.
[2] M.S. Yang, M. Gill, A review of gene linkage, association and expression studies in autism and an assessment of convergent evidence, Int. J. Dev. Neurosci. 25 (2007) 69–85.
[3] V. Hu, L. Yinglei, Developing a predictive gene classifier for autism spectrum disorders based upon differential gene expression profiles of phenotypic subgroups, North Am. J. Med. Sci. 6 (3) (2013) 107–116.
[4] M. Eisen, P. Spellman, P. Brown, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. U.S.A. 95 (1998) 14863–14868.
[5] I. Guyon, A.J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using SVM, Mach. Learn. 46 (2002) 389–422.
[6] M. Muszyński, S. Osowski, Data mining methods for gene selection on the basis of gene expression arrays, Int. J. Appl. Math. Comput. Sci. 24 (3) (2014) 657–668.
[7] C.J. Alonso-González, Q.I. Moro-Sancho, Microarray gene expression classification with few genes: criteria to combine attribute selection and classification methods, Expert Syst. Appl. 39 (2012) 7270–7280.
[8] P. Baldi, A.D. Long, A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes, Bioinformatics 17 (2001) 509–519.
[9] X. Huang, W. Pan, Linear regression and two-class classification with gene expression data, Bioinformatics 19 (2003) 2072–2078.
[10] P.J. Woolf, Y. Wang, A fuzzy logic approach to analyzing gene expression data, Physiol. Genom. 3 (2000) 9–15.
[11] P.G. Kumar, T.A. Victoire, P. Renukadevi, D. Devaraj, Design of fuzzy expert system for microarray data classification using a novel genetic swarm algorithm, Expert Syst. Appl. 38 (2) (2012) 1811–1821.
[12] X. Wang, O. Gotoh, A robust gene selection method for microarray-based cancer classification, Cancer Inform. 9 (2010) 15–30.
[13] L. Chuang, C. Yang, K. Wu, C. Yang, Gene selection and classification using Taguchi chaotic binary particle swarm optimization, Expert Syst. Appl. 38 (2011) 13367–13377.
[14] T. Prasartvit, A. Banharnsakun, B. Kaewkamnerdpong, T. Achalakul, Reducing bioinformatics data dimension with ABC-kNN, Neurocomputing 116 (2013) 367–381.
[15] E.B. Huerta, B. Duval, J.K. Hao, A hybrid LDA and genetic algorithm for gene selection and classification of microarray data, Neurocomputing 73 (2010) 2375–2383.
[16] H. Mitsubayashi, S. Aso, T. Nagashima, Y. Okada, Accurate and robust gene selection for disease classification using simple statistics, Biomed. Inform. 391 (2008) 68–71.
[17] T. Golub, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.

[18] F. Yang, K.Z. Mao, Robust feature selection for microarray data based on multicriterion fusion, IEEE Trans. Comput. Biol. Bioinform. 8 (4) (2011) 1080–1092.
[19] NCBI database, http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4431 (2009).
[20] T. Latkowski, S. Osowski, Developing gene classifier system for autism recognition, in: Proceedings of the International Work-Conference on Artificial Neural Networks (Advances in Computational Intelligence), Lecture Notes in Computer Science, vol. 9095, Palma de Mallorca, 2015, pp. 3–14.
[21] R. Robnik-Sikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Mach. Learn. 53 (2003) 23–69.
[22] P. Sprent, N.C. Smeeton, Applied Nonparametric Statistical Methods, Chapman & Hall/CRC, Boca Raton, 2007.
[23] Matlab user manual – Statistics Toolbox, MathWorks, Natick, USA, 2014.
[24] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1158–1182.
[25] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Pearson Education Inc., Boston, 2006.
[26] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[27] R. Siroic, S. Osowski, T. Markiewicz, K. Siwek, Application of support vector machine and genetic algorithm for improved blood cell recognition, IEEE Trans. Instrum. Meas. 58 (2) (2009) 2159–2168.
[28] R.M. Luque-Baena, D. Urda, M.G. Claros, L. Franco, J. Jerez, Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords, J. Biomed. Inform. 49 (2014) 32–44.
[29] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[30] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15402 (2009).
[31] The American Psychiatric Association, Autistic Disorder, Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR), 2000.

Tomasz Latkowski was born in Poland in 1987. He received the M.Sc. and Ph.D. degrees from the Military University of Technology, Warsaw, Poland, in 2011 and 2016, respectively, both in electronic engineering. His research interest is in the area of artificial intelligence methods, data mining and their application in biomedical signal processing.

Stanislaw Osowski was born in Poland in 1948. He received the M.Sc., Ph.D., and Dr. Sc. degrees from the Warsaw University of Technology, Warsaw, Poland, in 1972, 1975, and 1981, respectively, all in electrical engineering. Currently he is a professor of electrical engineering at the Institute of the Theory of Electrical Engineering, Measurement and Information Systems, Warsaw University of Technology, and is also employed at the Electronic Faculty of the Military University of Technology, Warsaw, Poland. His research and teaching interests are in the areas of artificial intelligence, neural networks, data mining, and biomedical signal and image processing. He is a Senior Member of the IEEE.