Neurocomputing 250 (2017) 37–44
Gene selection in autism – Comparative study

Tomasz Latkowski a, Stanislaw Osowski a,b,∗

a Faculty of Electronics, Military University of Technology, 01-476 Warsaw, Kaliskiego 2, Poland
b Faculty of Electrical Engineering, Warsaw University of Technology, 00-661 Warsaw, Koszykowa 75, Poland

Article history: Received 24 February 2016; Revised 21 June 2016; Accepted 16 August 2016; Available online 7 February 2017

Keywords: Autism; Gene expression microarrays; Feature selection; Clustering; Genetic algorithm; Random forest

Abstract: The paper investigates the application of several feature selection methods to the identification of the most important genes in autism disorder. The study is based on gene expression microarrays. The applied methods analyze the importance of genes on the basis of different principles of selection. The most important step is to fuse the results of these selections into a common set of genes that are best associated with autism. These genes may be treated as biomarkers of this disorder and used in its early prediction. The paper proposes and compares three different methods of such fusion: purity of the clusterization space, application of a genetic algorithm, and a random forest in the role of integrator. The numerical experiments are concerned with the identification of the most important biomarkers and their application in autism recognition. They show that the applied strategy of fusing many independent selection methods leads to a significant improvement of the autism recognition rate. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

Autism disorder belongs to the pervasive neurodevelopmental disorders, affecting a broad spectrum of human functions [1,2]. An important problem is the early recognition of this disorder, enabling proper treatment of autistic individuals. Nowadays, microarray gene expression data are studied to find the genes or sequences of genes that are best associated with autism and might be treated as biomarkers. The difficulties in identifying these genes are the many outliers, the high variance of the data, and the bad conditioning of the problem [1,3], manifested by the small number of available observations (usually measured in hundreds) in comparison to the very large number of genes (dozens of thousands). These complexities raise the challenge of how to identify the genes that are most informative for this disorder and that can be used to distinguish the class of autistic individuals from the others.

Many methods developed in feature selection have been used to solve the task of gene selection in different problems. They include clustering methods [4], application of neural networks and support vector machines [5–7], statistical tests [8], linear regression methods applying forward and backward selection [9], fuzzy expert system based algorithms [10,11], rough set theory [12], global optimization methods, including genetic algorithms, chaotic binary particle swarm optimization and artificial bee colony (ABC) in connection with the kNN classifier [13–15], application of the ReliefF method combined with different classifiers [7], various statistical methods [16,17], as well as fusion of many selection methods [6,18]. Although most of these methods have been applied in cancer research, they may also be adapted to autism. Many solutions have used specialized methods, from which the best one was chosen as the most appropriate for the particular problem. However, it should be mentioned that each selection method uses a specialized procedure for assessing the class discriminative features. The results depend on the applied mechanism of selection, which may work well in some data mining problems and be inefficient in others.

An additional difficulty in autism recognition is the very high variance of gene expression among individuals belonging to the same group. For example, in the NCBI database of autism [19], containing 146 observations and 54,613 genes, the variance of gene expression values of different individuals varies from 0.099 to 24.19 × 10^6, with mean 2.38 × 10^4, median 38.84, and 13,245 genes of expression variance higher than 1000. Fig. 1 presents the mean value and variance of gene expressions in the analyzed data. It confirms the very high variability of the gene expressions and the existence of many outliers. This high variance of the data means that a particular choice of the set of observations for the selection procedure may lead to completely different results. This makes the application of a single method inefficient for autistic data and requires the elaboration of a special procedure, based on the application of many selection methods in multiple runs. The important task in such an approach is to fuse the selection results into the final solution.


Fig. 1. The change of mean value (upper figure) and variance (bottom figure) of gene expressions of autistic data.

The primary aim of the paper is to find a small population of the most informative genes strongly associated with autism. These genes might be useful as biomarkers of this neurodevelopmental disorder and at the same time serve as input attributes to an automatic system for autism prediction. The application of many different feature selection methods cooperating in an ensemble will be proposed as the best tool to solve this task. The most important requirement is to use methods that are based on different principles of operation, guaranteeing independent performance. Their number is not strictly defined. In this solution we have used eight methods, which in our opinion are satisfactory from the point of view of diversity of operation. Such an approach to autistic data has not been applied by other authors. There are some works showing an ensemble of methods for microarray data regarding cancer problems [6,18]; however, the strategy was different. The authors of [18] have proposed a multicriterion fusion-based approach, in which the integration is done on the level of features. In our approach the fusion is performed on two levels: the level of individual methods and the level of many classifiers, forming an ensemble creating the final decision. Moreover, we propose different strategies of selecting the size of the optimal gene set.

The applied methods rely on different principles and therefore assess the discrimination ability of a gene in an independent way. The important point is to fuse their results into one final group of genes that might be treated as the biomarkers of autism. In this paper we will present three different approaches to gene fusion: the purity of the clusterization space, the genetic algorithm and the random forest. The limited set of genes may also be used as input attributes in the classification system responsible for early identification of autistic individuals. This system is composed of many classifiers arranged in an ensemble integrated by the random forest. The results of numerical experiments performed on the NCBI database [19] will be presented and discussed.

2. Materials

The basic numerical experiments of gene selection have been performed on the NCBI dataset related to autism. The database is publicly available and was downloaded from the GEO (NCBI) repository [19]. The number of observations in this dataset equals 146 and the number of genes is 54,613. The database consists of two classes: the first one is related to children with autism (82 observations) and the second to the control group of healthy children (64 observations). All subjects in the base are male. Probands and controls were all recruited from the Phoenix area. Blood draws for all subjects were done between the spring and summer of 2004. Total RNA was extracted for microarray experiments with Affymetrix Human U133 Plus 2.0 3′ Expression Arrays. Children with autism were diagnosed by a medical professional (developmental pediatrician, psychologist or child psychiatrist) according to the DSM-IV criteria, and the diagnosis was confirmed on the basis of the ADOS and ADI-R criteria [1,17]. The non-classic forms of autism, i.e., autism with regression and Asperger's syndrome (a higher-functioning form of autism, in which individuals have language skills within the normal range), were excluded from the base. In addition, each subject had a normal high-resolution chromosome analysis and a negative Fragile X DNA test.

The main task of this work is to find a small subset of genes which are best associated with autism. This problem is resolved by using several gene selection methods combined into one final gene recognition system.

3. Methods

The gene expression array of autism considered in this work contains more than 50,000 genes. It is natural that most of them have no class discrimination ability. Therefore, a first filtration of genes should be done in the introductory stage to reduce this number in a significant way. We have applied a strategy in which the genes with similar expression levels for the autistic and reference (control) classes within all observations are eliminated first, as not discriminative in class recognition.

3.1. Initial filtering

The difficult problem in this stage is the existence of a huge number of outliers in the gene expressions observed in the database. These outliers distort the comparison of the discrimination ability of genes in both classes. To avoid this problem we have applied an introductory filtration based on the median rather than the mean value. Fig. 2a and b present the comparison of the two genes of the highest distance between the means (Fig. 2a) and medians (Fig. 2b) for the two classes of data. The outliers existing in the data have significantly distorted the calculation of the mean value, especially for autistic individuals, for which the mean value was 140.57 at a standard deviation of 569.69. Hence the results of such a gene selection are not reliable. In the case of the median principle (Fig. 2b) the results are much better: we can see only the natural variability of expression values in both classes. Therefore, in further experiments we have applied the median filtration in the introductory step of reduction.

The investigations presented in [20] have shown that most genes depict a very similar ratio of the medians in both classes. There were no genes of this ratio below the value of 0.7, and the active range of reduction is from 0.8 to 1.0. To reduce the number of less significant genes to a reasonable level, the threshold value of this ratio was chosen heuristically and set to 0.96. This resulted in the introductory elimination of 35,762 genes. Therefore, in further analysis we used only the 18,851 genes for which the ratio of medians in both classes was below 0.96. This is a considerable reduction; however, it eliminates only the genes of very similar expression in both classes, which are not very useful in the recognition process.
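For concreteness, a minimal sketch of this median-ratio filtering is given below, assuming an expression matrix X of shape (observations × genes) and binary class labels y (1 – autistic, 0 – control); the variable names and the exact ratio handling are illustrative, not taken from the authors' code.

```python
# Sketch of the median-based initial filtering (illustrative names).
import numpy as np

def median_ratio_filter(X, y, threshold=0.96):
    """Keep genes whose ratio of class medians falls below `threshold`."""
    med_aut = np.median(X[y == 1], axis=0)  # per-gene median, autistic class
    med_ctl = np.median(X[y == 0], axis=0)  # per-gene median, control class
    # Ratio of the smaller to the larger median, so it always lies in (0, 1];
    # a ratio close to 1 means nearly identical medians in both classes.
    eps = 1e-12                              # guards against division by zero
    ratio = np.minimum(med_aut, med_ctl) / (np.maximum(med_aut, med_ctl) + eps)
    keep = ratio < threshold                 # only dissimilar genes survive
    return X[:, keep], keep

# Usage (hypothetical): X_reduced, mask = median_ratio_filter(X, y)
# On the NCBI data this kind of thresholding leaves 18,851 of 54,613 genes.
```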
Moreover, we have observed a significant difference between the genes selected on the basis of the median and the mean value principles.


Fig. 2. The expression values of the best class discriminating gene at application of (a) the mean value and (b) the median value.

Table 1. The repeatability of the best genes selected commonly after filtration based on the median and mean principles.

Population of the best genes:          100    1000    5000   10,000   20,000   30,000   40,000   50,000
Number of commonly selected genes:      30     416    2955    6489   14,257   22,939   32,912   46,255
Repeatability of the common genes:     30%   41.6%   59.1%   64.9%   71.3%    76.5%    82.3%    92.5%
In the group of the ten best class discriminative genes there was no common one. In the group of the 100 best genes there were only 30 common genes. Table 1 presents the number of commonly selected genes at application of the median and mean principles for different populations of the best genes.

3.2. Individual gene selection methods

The reduced set of genes is subject to a further selection process. This step was organized in two stages. In the first one we apply 8 selection methods, differing in the principle of operation. The following methods have been applied: Fisher discriminant analysis (FD), the ReliefF algorithm (RF), the two-sample t-test (TT), the Kolmogorov–Smirnov test (KS), the Kruskal–Wallis test (KW), the stepwise regression method (SW), feature correlation with a class (FC) and support vector machine recursive feature elimination (SVM-RE). A short description of them was provided in [20]. The operation of these methods relies on different foundations, which allows analyzing the selection problem from different points of view. The Fisher method compares the mean values and standard deviations in both classes for each feature separately [17]. The ReliefF algorithm ranks the features according to the highest correlation with the observed class while taking into account the distances between opposite classes [21]. The TT, KS and KW methods apply different statistical hypothesis tests, on the basis of which the feature class discrimination ability is assessed [22]. Stepwise regression systematically adds and removes features to and from the set of input attributes based on their statistical significance in a regression [24]. The FC method assesses a feature on the basis of its correlation with the class while eliminating the features well correlated with each other [9,23,24]. The SVM-RE method uses the linear SVM as a classifier supplied by all features at the same time [5]. The ranking criterion of a particular feature is based on its weight magnitude. The SVM is re-trained many times after eliminating some of the least significant features (usually from 10% to 20%). As a result, more and more compact gene subsets are created in the succeeding iterations.

The application of these methods leads to 8 different sets of genes, ordered according to their rank, which reflects the class discrimination ability (from the most to the least discriminative).

Table 2. The population of commonly selected genes in different methods.

Size of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:    0    3   23   21   26   31   35  269
Because of the various principles of gene assessment, the results of such selections are different. Among the 100 most important genes in our database the repeatability rate ranged from 5% to 63%, even for the same method at different contents of the observation samples used in selection. This is due to the large variety of expression levels of the different individuals forming the non-uniform autistic group. The particular composition of samples taking part in selection has a high impact on these results. Therefore, a single run of the selection process performed on the chosen samples of observations is not a good solution. Limiting further considerations to only the 100 genes appearing most often in 10 runs of each method (treated further as the best), the total selected population of different genes contained 408 items (still too large to treat them as biomarkers). However, there was not a single gene which appeared commonly in all 8 sets. Table 2 presents the number of genes commonly selected by the eight applied methods in the group of the 100 best. Seven selection methods have selected simultaneously only 3 identical genes: HIST1H2BG, TRPV6 and CAPS2. They might be treated as the most important biomarkers for autism. It should be noted that no method was able to indicate a reliable optimal size of the gene set, enabling acceptable recognition of the autistic samples from the reference group. Therefore, an additional step is needed. We have applied and compared three different approaches to this problem: the purity of the clusterization space, the genetic algorithm and the random forest.
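As an illustration of the statistical filters in this group, the sketch below ranks genes with the TT, KS and KW tests using SciPy; X and y are assumed as before, and repeating the ranking over random subsets of observations mimics the 10-run scheme described above. The names are illustrative.

```python
# Sketch of three statistical filter selections (TT, KS, KW); illustrative only.
import numpy as np
from scipy import stats

def rank_genes(X, y, test="tt", top=100):
    """Return indices of the `top` genes with the smallest test p-values."""
    a, b = X[y == 1], X[y == 0]          # autistic and control sub-matrices
    pvals = np.empty(X.shape[1])
    for g in range(X.shape[1]):
        if test == "tt":                  # two-sample t-test
            _, pvals[g] = stats.ttest_ind(a[:, g], b[:, g], equal_var=False)
        elif test == "ks":                # Kolmogorov–Smirnov test
            _, pvals[g] = stats.ks_2samp(a[:, g], b[:, g])
        elif test == "kw":                # Kruskal–Wallis test
            _, pvals[g] = stats.kruskal(a[:, g], b[:, g])
    return np.argsort(pvals)[:top]        # smallest p-value = highest rank

# Running rank_genes on several random subsets of the observations and keeping
# the 100 most frequently returned genes approximates the multi-run selection.
```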



Fig. 3. The change of the overall purity of clustering as a function of population of the best genes in FC method.

Table 3. The maximal values of purity and the minimum population of genes for the eight methods of gene selection in the clustering approach.

Selection method:           FD     RF     TT     KS     KW     SW     FC     SVM-RE
Overall purity measure:    0.84   0.79   0.83   0.77   0.82   0.66   0.83   0.67
Population of genes:         52     21     30     53     25     60     30     86

Table 4. The population of commonly selected genes in different methods after the clusterization approach.

Number of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:      0    2    7    7    5    9   22  175

3.3. Purity of clusterization space

In this approach a self-organizing clusterization of the data space using the K-means algorithm [25] was done first. Two clusters have been formed in this way, each grouping the data of mixed classes (autistic and reference). The purity of the ith cluster, $p_i = \max_j p_{ij}$, defined for $i, j = 1, 2$, where $p_{ij} = n_{ij}/n_i$ ($n_{ij}$ – the number of jth class objects in the ith cluster, $n_i$ – the number of objects in the ith cluster, $n$ – the total number of observations), allows determining the overall purity $p$ of a clustering as

$$p = \frac{n_1}{n}\,p_1 + \frac{n_2}{n}\,p_2 \qquad (1)$$

This definition gives a measure of the extent to which the clusters contain objects of a single class. To avoid the trap of a local minimum, 10 repetitions of the K-means algorithm have been made. This K-means procedure, followed by the purity estimation, was performed for all eight subsets of the best genes extracted by each selection method. The population of genes taking part in the K-means clustering was changed in the experiments from 1 to 100. Fig. 3 presents the dependence of the overall purity index p on the number of genes forming the subset. The results depicted in the figure correspond to the FC method. The highest purity measure, 0.83, corresponds to the minimum population of the 30 best genes. Table 3 shows the numerical results of this approach for all applied methods of gene selection, presenting the maximum overall purity and the minimum population of genes. The highest purity index, equal to 0.84, was obtained at the 52 best genes and corresponded to the application of the Fisher discriminant method (FD). Only two genes, HIST1H2BG and TRPV6, have been selected commonly by 7 methods of selection. Table 4 presents the size of the gene sets commonly selected by different methods among these optimized subsets. After this step the number of commonly selected genes has significantly decreased: only 2 genes were common to 7 selection methods, 7 genes were common to 6 methods of selection, etc.
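A minimal sketch of this purity criterion is given below, assuming X_ranked holds the genes of one selection method ordered by rank; the K-means call follows scikit-learn's API, while the variable names are illustrative.

```python
# Sketch of the overall purity of Eq. (1) for a 2-cluster K-means split.
import numpy as np
from sklearn.cluster import KMeans

def overall_purity(X_sub, y, n_repeats=10):
    """p = (n1/n)*p1 + (n2/n)*p2, maximized over K-means restarts."""
    best = 0.0
    for seed in range(n_repeats):          # 10 repetitions to avoid local minima
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X_sub)
        p = 0.0
        for i in (0, 1):
            members = y[labels == i]
            if members.size:                # p_i = max_j n_ij / n_i
                p_i = max(np.mean(members == c) for c in (0, 1))
                p += (members.size / y.size) * p_i
        best = max(best, p)
    return best

# Sweeping the population of best genes from 1 to 100, as in Fig. 3:
# purities = [overall_purity(X_ranked[:, :k], y) for k in range(1, 101)]
```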

3.4. Random forest application

Random forest (RF) is a very powerful classification method of Breiman, which can also be used to assess the class discrimination ability of a particular input attribute [26]. RF constructs many decision trees in the learning phase and outputs the class indicated by the majority of the individual trees. The learning data for each tree are selected randomly and used to train the trees with some limited number of likewise randomly selected input attributes. The input attributes that provide the best split, according to the objective function, are used to perform a binary split on each node. The importance of a particular gene is measured by taking into account how the inclusion of this input attribute (the gene) changes the accuracy of class recognition. It is defined on the basis of the relative change of the prediction error rate for validation data when the values of this gene are permuted among the testing samples. The out-of-bag prediction error computed on this perturbed data set and compared to the error before perturbation is estimated for every tree, then averaged over the entire ensemble and divided by the standard deviation over the ensemble. The genes are ordered according to their statistical impact on the change of the classification accuracy after permuting their values. A positive value of this measure means a positive influence of the gene on class recognition, while a negative value means a reduction of the class discrimination ability. The optimal number of genes has been obtained by trying different numbers of the successive best genes in the classification procedure and choosing the population size which provides the best class recognition results on the validation data.

The random forest method was combined with recursive elimination of genes in order to find the optimal gene set size. The set of the 100 best genes selected by the particular method was delivered in the first step to the input of the SVM, responsible for recognition of the autistic and reference classes. The 10-fold cross-validation approach was used in the learning process. The genes have been ordered according to their importance in class recognition. In the succeeding steps the population of the gene set was reduced by eliminating 10% of the least important genes. The process of gene ordering and elimination was repeated until the classification error on the validation data reached its minimum value. The genes which survived this process represent the final optimal population. These experiments have been performed using 50 trees. The number of variables used in the binary split of nodes was the square root of the actual size of the input vector. Fig. 4 presents the changing importance of the genes (from the most to the least significant) in the succeeding steps of recursive elimination. The vertical axis represents the relative measure of importance and the horizontal axis the succeeding genes, ordered according to their decreasing importance. The recursive elimination process was continued until the classification error on the validation data achieved its minimum. The optimal number of genes corresponds to this minimum. The same procedure has been repeated for all 8 selection methods. Table 5 presents the optimal number of genes and the corresponding accuracy of class recognition. As can be seen, each selection method leads to a different size of the optimal gene set and also a different final accuracy of class recognition.
This time only one gene, TRPV6, has been chosen commonly by 7 methods of selection, three genes by 6 methods, etc. Table 6 presents the number of genes commonly selected by different methods in these optimized subsets.
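The sketch below illustrates the recursive elimination stage with a random forest of 50 trees. For compactness, scikit-learn's impurity-based feature_importances_ stands in for the out-of-bag permutation measure described above (scikit-learn does not expose the latter directly), and the same forest is used for both ranking and error estimation, whereas the paper pairs the ranking with an SVM classifier; all names are illustrative.

```python
# Sketch of random-forest ranking with recursive 10% elimination (illustrative).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_recursive_elimination(X, y, min_genes=10):
    """Drop the 10% least important genes per cycle; keep the best subset."""
    idx = np.arange(X.shape[1])            # start from the 100 best genes
    best_err, best_idx = 1.0, idx
    while idx.size > min_genes:
        rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                    random_state=0)
        rf.fit(X[:, idx], y)
        err = 1.0 - cross_val_score(rf, X[:, idx], y, cv=10).mean()
        if err < best_err:                 # remember the best-performing subset
            best_err, best_idx = err, idx.copy()
        order = np.argsort(rf.feature_importances_)     # least important first
        idx = idx[order[max(1, int(0.1 * idx.size)):]]  # drop the bottom 10%
    return best_idx, best_err
```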


Fig. 4. The diagrams presenting the changing importance of the genes selected by the FC method in the process of recursive elimination by random forest.

Table 5. The best accuracy of class recognition and the minimum population of genes for the eight methods of gene selection in the random forest approach.

Selection method:                 FD       RF       TT       KS       KW       SW       FC       SVM-RE
Accuracy of class recognition:  84.25%   80.14%   82.88%   82.88%   82.88%   83.56%   81.51%   88.36%
Population of genes:              39       73       39       22       22       22       22       81

Table 6. The population of commonly selected genes in different methods after application of the random forest.

Number of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:      0    1    3    3    7   14   29  152

3.5. Application of genetic algorithm

The last fusing approach analyzed in the paper is based on the application of the genetic algorithm (GA) [27,28]. The limited number of the highest ranked genes extracted by each method (here 100 genes) creates the input attributes (chromosome vectors) to the SVM classifier [29]. The genes in a chromosome are coded in a binary way: a value of one means acceptance of the particular gene as an input signal to the classifier, and zero means elimination of this gene from the chromosome. The GA consists of selecting parents for reproduction, their crossover, and mutation of some bits representing the children. The initially generated population of chromosomes contains elements equal to zero or one, chosen randomly. The fitness function is the inverse of the classification error on the validation data set. In each generation the fitness of every chromosome in the population is evaluated, the chromosome parents are stochastically selected from the current population (based on their fitness function values, using roulette wheel selection), and then modified (recombined and possibly mutated) to form a new population. This new population is then used in the next iteration of the algorithm. The size of the genetic population applied in the experiments was 100 chromosomes, with a crossover rate of 80% and a mutation rate of 1%.

Each binary chromosome is associated with the input vector x applied to the SVM classifier of Gaussian kernel (the value one means inclusion of the gene, zero its exclusion). The classifier is trained on the learning data set (60% of the available data) and then tested on the validation data (the remaining 40%). The genetic operations performed on the succeeding generations finally lead to the minimum of the objective function. Ten repetitions of the GA with randomly selected learning and testing data have been performed and the results averaged. In this way we obtain the optimal set of genes corresponding to the minimum of the validation error. Table 7 presents the optimal populations of genes corresponding to the maximum accuracy of class recognition in the validation mode for all investigated selection methods. They represent the average of 10 independent runs, each associated with a random choice of the learning and validation subsets. The applied selection methods have resulted in different sizes of the optimal gene sets. There were two genes, HIST1H2BG and CAPS2, which have been selected commonly by 7 methods. Table 8 presents the number of common genes which appear in the sets created by the applied methods of selection.
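A compact, illustrative version of this genetic search is sketched below: binary chromosomes mask the 100 candidate genes, the fitness is the validation accuracy of a Gaussian-kernel SVM, and roulette wheel selection, one-point crossover (rate 0.8) and bit-flip mutation (rate 0.01) follow the settings quoted above. It is a sketch under these assumptions, not the authors' implementation.

```python
# Sketch of GA-based gene selection with an SVM fitness (illustrative names).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fitness(mask, X_tr, y_tr, X_va, y_va):
    """Validation accuracy of an RBF SVM restricted to the masked genes."""
    if not mask.any():
        return 0.0
    clf = SVC(kernel="rbf").fit(X_tr[:, mask], y_tr)
    return clf.score(X_va[:, mask], y_va)

def ga_select(X, y, n_genes=100, pop=100, gens=50, pc=0.8, pm=0.01):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.6, random_state=0)
    P = rng.random((pop, n_genes)) < 0.5          # random binary chromosomes
    for _ in range(gens):
        f = np.array([fitness(c, X_tr, y_tr, X_va, y_va) for c in P]) + 1e-9
        parents = P[rng.choice(pop, size=pop, p=f / f.sum())]  # roulette wheel
        children = parents.copy()
        for i in range(0, pop - 1, 2):            # one-point crossover
            if rng.random() < pc:
                cut = rng.integers(1, n_genes)
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        children ^= rng.random(children.shape) < pm   # bit-flip mutation
        P = children
    f = np.array([fitness(c, X_tr, y_tr, X_va, y_va) for c in P])
    return P[np.argmax(f)]                        # best gene mask found
```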

4. Comparative analysis of selection results

The three different methods applied in the second step of gene selection have resulted in different contents of the most important genes. Among the 24 sets, corresponding to the eight methods of the first stage and the three approaches to the final stage of selection, there were only 13 sets containing 10 commonly selected genes. They include: HIST1H2BG, TRPV6, CAPS2, ZSCAN18, SNHG7, CFC1B, RHPN1, Clone FP18821 unknown mRNA, EVPLL and PSENEN.

Table 7. The best accuracy of class recognition and the minimum population of genes for the eight methods of gene selection in the genetic algorithm approach.

Selection method:                 FD       RF       TT       KS       KW       SW       FC       SVM-RE
Accuracy of class recognition:  91.78%   89.04%   91.78%   84.25%   91.78%   95.89%   92.47%   96.7%
Population of genes:              63       47       63       60       52       66       56       65

Table 8. The population of commonly selected genes in different methods after application of the genetic algorithm.

Number of sets (methods):    8    7    6    5    4    3    2    1
Number of common genes:      0    2    6    8   17   15   38  193

These genes can be treated as the extended set of the most representative biomarkers of autism. Their class discrimination ability is well illustrated by comparing the distance between the centers of samples belonging to the autistic and reference classes for different choices of genes, as well as the mean distances between the data samples and their representative center in both classes. Three different cases have been compared:

• the data represented by the above-mentioned set of 10 genes,
• 10 genes selected randomly from the set of 18,851 genes (the mean value of 50 random trials), and
• the 10 least significant genes among the 18,851, selected by the single Fisher method.

Table 9 shows the results of this comparison. As we can see, the set of the 10 best genes has provided the smallest dispersion within the classes and the largest distance between the centers of both classes. The 10 least significant genes have shown the smallest distance between the centers of both classes, confirming their uselessness in class discrimination.

To illustrate the space distribution of observations belonging to both classes, the multidimensional data samples have been mapped into a 2-dimensional coordinate system using principal component analysis (PCA) [25]. Fig. 5 shows the distribution of points belonging to the autistic (the x symbol) and reference (circle) classes. They are presented on the plane formed by the two most important principal components (PC1 and PC2) and refer to the representation of the data by the 10 best genes, 10 randomly selected genes and all 18,851 genes. The enlarged symbols (circle and x) represent the centers of both classes. The graphical results confirm the superiority of the genes selected by the presented selection system: the classes are the least intermixed and the distances between the centers of both classes are the largest. These conclusions are also confirmed by the numerical values presented in Table 10.
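The sketch below reproduces the kind of comparison behind Tables 9 and 10: project the samples onto the first two principal components and compute the class-center distance and within-class dispersions for a chosen gene subset. The names are illustrative.

```python
# Sketch of the PCA mapping and class-center statistics (illustrative names).
import numpy as np
from sklearn.decomposition import PCA

def center_statistics(X_sub, y, to_2d=True):
    """Distance between class centers and mean/std within-class dispersions."""
    Z = PCA(n_components=2).fit_transform(X_sub) if to_2d else X_sub
    c1, c0 = Z[y == 1].mean(axis=0), Z[y == 0].mean(axis=0)  # class centers
    d_centers = np.linalg.norm(c1 - c0)
    disp1 = np.linalg.norm(Z[y == 1] - c1, axis=1)           # autistic class
    disp0 = np.linalg.norm(Z[y == 0] - c0, axis=1)           # reference class
    return d_centers, (disp1.mean(), disp1.std()), (disp0.mean(), disp0.std())

# Example: compare the selected 10 best genes with 10 random ones.
# print(center_statistics(X[:, best10], y))
# print(center_statistics(X[:, rng.choice(X.shape[1], 10, replace=False)], y))
```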

Fig. 5. The distribution of points belonging to the autistic and reference classes on the plane formed by PC1 and PC2 at representation of the data by (a) the 10 best genes, (b) 10 randomly selected genes and (c) all 18,851 genes.


Table 9. The distance between the centers of the autistic and reference classes and the mean distances ± standard deviations between the data points and their centers for different choices of genes.

Gene set                          Distance between class centers   Mean distance to center (autistic)   Mean distance to center (reference)
The 10 best genes                            0.446                         0.606 ± 0.290                        0.477 ± 0.182
10 random genes                              0.210                         0.700 ± 0.296                        0.788 ± 0.338
The 10 least significant genes               0.0003                        0.652 ± 0.222                        0.724 ± 0.300

Table 10. The distance between the centers of the autistic and reference classes and the mean distance ± standard deviation between the data points and their centers for different choices of genes in their 2-dimensional PCA mapping.

Gene set             Distance between class centers   Mean distance to center (autistic)   Mean distance to center (reference)
The 10 best genes               0.323                         0.064 ± 0.037                        0.063 ± 0.038
10 random genes                 0.018                         0.097 ± 0.049                        0.084 ± 0.071
All 18,851 genes                0.052                         1.214 ± 0.667                        1.002 ± 0.618

Table 11. The results of 10-fold cross-validation (mean value ± std) in autism recognition at application of different methods of gene fusion (only testing data not taking part in learning).

Fusion method              Accuracy [%]     Sensitivity [%]   Specificity [%]
Clustering (8 methods)     78.43 ± 2.66     82.03 ± 2.57      77.11 ± 2.75
GA (8 methods)             79.32 ± 3.27     84.81 ± 2.78      78.32 ± 2.56
RF (8 methods)             85.82 ± 1.62     88.12 ± 1.75      83.34 ± 1.84
RF (the best 6 methods)    86.92 ± 1.18     89.03 ± 1.15      84.08 ± 1.13

5. Classification results

The final experiments have been directed at comparing the class recognition ability of the selected sets of genes. This time the available data set was split into two independent parts: 40% of the samples have been used only for the selection of the best genes and the remaining 60% only for class recognition. This process of random splitting was repeated 10 times and the results averaged. The classification stage was performed using the genes selected on the basis of the first subset. Thanks to such an organization of the experiments, both phases (selection and classification) were independent of each other; hence the results are expected to be the most objective. The genes chosen by the different selection methods formed the input attributes to the SVM of Gaussian kernel and the RF classifiers in the recognition phase.

The genes selected by the different methods were treated independently and served as the input signals to the classifiers. We have applied the strategy that the results of selection at application of the random forest formed the input attributes to the SVM classifier and vice versa. Thanks to this, the maximum independence of the ensemble members has been obtained. The final fusion of results is done by the RF, serving as an integrator. The decisions of the individual classifiers serve as input attributes to the RF, and this system of decision trees is responsible for creating the final decision of autism recognition. The succeeding steps (classification and integration) were performed in the 10-fold cross-validation mode, followed by integration of their results into the final score. Then the average error over the testing parts across all 10 trials was computed. The advantage of this cross-validation method is that it matters less how the observation data sets are split: every observation gets to be in a test set exactly once, and in a training set 9 times. In the general case the ensemble was composed of 24 units (3 approaches to fusing the results of 8 selection methods). However, the experimental results have shown that application of all ensemble members is not optimal. Different configurations of the individually best gene sets have been tried. In the final stage only the 6 best selection methods (FD, RF, KS, KW, FC and SVM-RE) have been applied. The statistical results of these experiments, in the form of mean and standard deviation values, are shown in Table 11. They present the total accuracy, sensitivity and specificity of class recognition for the testing data. The best results with respect to accuracy, sensitivity and specificity correspond to the fusion of the six most efficient selection methods combined with the RF as a classifier.

Additional experiments have been done to compare our approach to the method presented in [18], where the fusion was performed only on the level of features. Recursive feature elimination with 20% of the least important genes eliminated in each cycle was applied, and an SVM of Gaussian kernel was used as the final classification system. However, the results were worse.


The average accuracy was only 82.73%, significantly worse than the values presented in Table 11. Further experiments have also been performed on the autism database [30] using the approach presented here. This base contained 87 autistic samples and 29 non-autistic controls. The obtained highest accuracy of class recognition ranged from 96.33% for the clustering method, through 97.05% for the RF, up to 98.26% for the GA. The sensitivity and specificity ranged from 95% to 98% and from 90% to 96%, respectively. These results compare well to the best result of 82% accuracy, 90% sensitivity and 75% specificity reported in the recent paper [3] for the same database. The paper [14] has presented an approach to autism recognition based on two diagnostic standards: the Autism Diagnostic Observation Schedule (ADOS) and the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [31]. The dataset, created in Ramathibodi hospital in Bangkok, contained the records of 140 patients, and each patient record was categorized into either the autism or the Pervasive Developmental Disorder – Not Otherwise Specified (PDD-NOS) class. The application of ABC-kNN to autism recognition has shown an accuracy of 85%. This is also worse than the results presented by us (of course, on a different data set).
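A schematic of the two-level fusion evaluated in this section is sketched below: one Gaussian-kernel SVM per selection method's gene subset at the first level, and a random forest integrating the member decisions at the second level, all inside a 10-fold cross-validation loop. It is a simplified sketch (the meta-level is trained on in-fold member predictions), not the authors' exact pipeline; names are illustrative.

```python
# Schematic two-level ensemble: SVM members fused by a random forest integrator.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def ensemble_cv(X, y, gene_subsets, n_splits=10):
    """10-fold CV accuracy of the RF-integrated ensemble over gene subsets."""
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X, y):
        # Level 1: one SVM per selection method's gene subset.
        members = [SVC(kernel="rbf").fit(X[np.ix_(tr, s)], y[tr])
                   for s in gene_subsets]
        meta_tr = np.column_stack([m.predict(X[np.ix_(tr, s)])
                                   for m, s in zip(members, gene_subsets)])
        meta_te = np.column_stack([m.predict(X[np.ix_(te, s)])
                                   for m, s in zip(members, gene_subsets)])
        # Level 2: a random forest fuses member decisions into the final class.
        integrator = RandomForestClassifier(n_estimators=50, random_state=0)
        integrator.fit(meta_tr, y[tr])
        accs.append(integrator.score(meta_te, y[te]))
    return float(np.mean(accs)), float(np.std(accs))
```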

6. Conclusions

The paper has presented and compared collective approaches to the selection of the most important genes/transcripts, which are the most informative for autism and can be used as biomarkers to distinguish the two classes of data. It was shown that a multistep collective approach, applying many different, properly integrated feature selection methods, is able to extract a small subset containing the most informative genes. The theoretical results were validated and supported by experiments performed on the publicly available NCBI databases. They confirmed that the genes appearing most often in multiple repeats of the algorithm runs are good candidates for biomarkers of autism. The numerical measures of the quality of the gene representation of the data, defined in the form of the distance between the two class centers and the dispersion of the clusters, have shown good performance of the proposed approach. The best selected genes were used as input attributes to the classifiers arranged in the form of an ensemble, integrated using the random forest. Autism recognition conducted on two available databases has shown good results, outperforming the other achievements reported in recent publications. The proposed fusion of many individual classification results leads to a significant increase of the accuracy, sensitivity and specificity of class recognition.


Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.neucom.2016.08.123.

References

[1] M. Alter, R. Kharkar, K. Ramsey, D. Craig, R. Melmed, T. Grebe, R. Curtis-Bay, S. Ober-Reynolds, J. Kirwan, J. Jones, J. Blake-Turner, R. Hen, D. Stephan, Autism and increased paternal age related changes in global levels of gene expression regulation, PLoS One 6 (2011) 1–10.
[2] M.S. Yang, M. Gill, A review of gene linkage, association and expression studies in autism and an assessment of convergent evidence, Int. J. Dev. Neurosci. 25 (2007) 69–85.
[3] V. Hu, L. Yinglei, Developing a predictive gene classifier for autism spectrum disorders based upon differential gene expression profiles of phenotypic subgroups, North Am. J. Med. Sci. 6 (3) (2013) 107–116.
[4] M. Eisen, P. Spellman, P. Brown, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. U.S.A. 95 (1998) 14863–14868.
[5] I. Guyon, A.J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using SVM, Mach. Learn. 46 (2002) 389–422.
[6] M. Muszyński, S. Osowski, Data mining methods for gene selection on the basis of gene expression arrays, Int. J. Appl. Math. Comput. Sci. 24 (3) (2014) 657–668.
[7] C.J. Alonso-González, Q.I. Moro-Sancho, Microarray gene expression classification with few genes: criteria to combine attribute selection and classification methods, Expert Syst. Appl. 39 (2012) 7270–7280.
[8] P. Baldi, A.D. Long, A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes, Bioinformatics 17 (2001) 509–519.
[9] X. Huang, W. Pan, Linear regression and two-class classification with gene expression data, Bioinformatics 19 (2003) 2072–2078.
[10] P.J. Woolf, Y. Wang, A fuzzy logic approach to analyzing gene expression data, Physiol. Genom. 3 (2000) 9–15.
[11] P.G. Kumar, T.A. Victoire, P. Renukadevi, D. Devaraj, Design of fuzzy expert system for microarray data classification using a novel genetic swarm algorithm, Expert Syst. Appl. 38 (2) (2012) 1811–1821.
[12] X. Wang, O. Gotoh, A robust gene selection method for microarray-based cancer classification, Cancer Inform. 9 (2010) 15–30.
[13] L. Chuang, C. Yang, K. Wu, C. Yang, Gene selection and classification using Taguchi chaotic binary particle swarm optimization, Expert Syst. Appl. 38 (2011) 13367–13377.
[14] T. Prasartvit, A. Banharnsakun, B. Kaewkamnerdpong, T. Achalakul, Reducing bioinformatics data dimension with ABC-kNN, Neurocomputing 116 (2013) 367–381.
[15] E.B. Huerta, B. Duval, J.K. Hao, A hybrid LDA and genetic algorithm for gene selection and classification of microarray data, Neurocomputing 73 (2010) 2375–2383.
[16] H. Mitsubayashi, S. Aso, T. Nagashima, Y. Okada, Accurate and robust gene selection for disease classification using simple statistics, Biomed. Inform. 391 (2008) 68–71.
[17] T. Golub, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.

[18] F. Yang, K.Z. Mao, Robust feature selection for microarray data based on multicriterion fusion, IEEE Trans. Comput. Biol. Bioinform. 8 (4) (2011) 1080–1092.
[19] NCBI database, http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4431 (2009).
[20] T. Latkowski, S. Osowski, Developing gene classifier system for autism recognition, in: Proceedings of the International Work-Conference on Artificial Neural Networks (Advances in Computational Intelligence), Lecture Notes in Computer Science, vol. 9095, Palma de Mallorca, 2015, pp. 3–14.
[21] R. Robnik-Sikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Mach. Learn. 53 (2003) 23–69.
[22] P. Sprent, N.C. Smeeton, Applied Nonparametric Statistical Methods, Chapman & Hall/CRC, Boca Raton, 2007.
[23] Matlab user manual – Statistics Toolbox, MathWorks, Natick, USA, 2014.
[24] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1158–1182.
[25] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Pearson Education Inc., Boston, 2006.
[26] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[27] R. Siroic, S. Osowski, T. Markiewicz, K. Siwek, Application of support vector machine and genetic algorithm for improved blood cell recognition, IEEE Trans. Instrum. Meas. 58 (2) (2009) 2159–2168.
[28] R.M. Luque-Baena, D. Urda, M.G. Claros, L. Franco, J. Jerez, Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords, J. Biomed. Inform. 49 (2014) 32–44.
[29] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[30] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15402 (2009).
[31] The American Psychiatric Association, Autistic Disorder, Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR), 2000.

Tomasz Latkowski was born in Poland in 1987. He received the M.Sc. and Ph.D. degrees from the Military University of Technology, Warsaw, Poland, in 2011 and 2016, respectively, both in electronic engineering. His research interest is in the area of artificial intelligence methods, data mining and their application in biomedical signal processing.

Stanislaw Osowski was born in Poland in 1948. He received the M.Sc., Ph.D., and Dr. Sc. degrees from the Warsaw University of Technology, Warsaw, Poland, in 1972, 1975, and 1981, respectively, all in electrical engineering. Currently he is a professor of electrical engineering at the Institute of the Theory of Electrical Engineering, Measurement and Information Systems, Warsaw University of Technology, and is also employed at the Electronic Faculty of the Military University of Technology, Warsaw, Poland. His research and teaching interests are in the areas of artificial intelligence, neural networks, data mining, and biomedical signal and image processing. He is a Senior Member of the IEEE.