Procedia Computer Science 6 (2011) 189–194
Complex Adaptive Systems, Volume 1. Cihan H. Dagli, Editor in Chief. Conference organized by Missouri University of Science and Technology, 2011, Chicago, IL.
Feature Selection for Multiclass Problems Based on Information Weights

George Georgiev¹, Iren Valova², Natacha Gueorguieva³

¹Computer Science, University of Wisconsin Oshkosh; ²Computer and Information Science, University of Massachusetts Dartmouth; ³Computer Science, City University of New York / College of Staten Island
Abstract

Before a pattern classifier can be properly designed, it is necessary to consider the feature extraction and data reduction problems. It is evident that the number of features needed to successfully perform a given recognition task depends on the discriminatory qualities of the chosen features. We propose a new hybrid approach to feature selection based on information weights, which allows features to be categorized with respect to a specified classification task. The purpose is to efficiently achieve a high degree of dimensionality reduction and to enhance or maintain predictive accuracy with the selected features. The novelty is to combine the competitiveness of the filter approach, which makes it independent of the nature of the pattern classifier, with the ability to embed the algorithm within the pattern classifier structure in order to increase the accuracy of the learning phase, as wrapper algorithms do. The algorithm is generalized for multiclass implementation.

© 2011 Published by Elsevier B.V.

Keywords: pattern recognition; feature informative weight; support set; proximity function; decision rule
Introduction

In principle, as well as in practice, pattern recognition is concerned not only with training and classification but, more generally, also with the estimation of attributes. Feature selection plays a central role in pattern recognition, as it has a direct influence on the accuracy and processing time of pattern recognition applications. In fact, the selection of an appropriate set of features which takes into account the difficulties present in the extraction or selection process, and at the same time results in acceptable performance, is one of the most difficult tasks in the design of pattern recognition systems [1, 2]. The most important objective of feature selection is avoiding over-fitting and improving model performance. In many applications, specifically in bioinformatics, data mining, signal processing, etc., the feature space dimension tends to be very large, making both learning and classification difficult. The so-called "curse of dimensionality" [3] is a phenomenon in which the number of training samples necessary to assure satisfactory classification performance grows exponentially with the feature space dimension. Theoretically, more features should provide more separating power; but given the limited size of the training data, unnecessary features considerably slow down the learning process because the classifier overfits to irrelevant data. Feature selection has been a very active area of research and development in pattern
recognition, spanning biology [4], machine learning [5], data mining [6], and statistics [7]. It has been observed that: a) a large number of features are not informative, because they are either irrelevant or redundant with respect to the class concept; b) learning can be achieved more efficiently and effectively with only the relevant and non-redundant features. However, finding an optimal subset is usually enormously difficult [8], and many problems related to feature selection have been shown to be NP-hard [9]. Based on a review of previous definitions of feature relevance, the authors of [10] classified features into three disjoint categories: strongly relevant, weakly relevant, and irrelevant, where: a) strong relevance indicates that the feature is always necessary for an optimal subset and therefore cannot be removed without affecting the original conditional class distribution; b) weak relevance suggests that the feature is not always necessary but may become necessary for an optimal subset under certain conditions; c) irrelevance indicates that the feature is not necessary at all. Analyzing later research in pattern recognition, the authors of [11] refined these categories into irrelevant features, redundant features, weakly relevant but non-redundant features, and strongly relevant features. According to their studies, an optimal feature subset should include all strongly relevant features and a subset of the weakly relevant but non-redundant ones. In [12] the authors propose methods for selecting relevant but non-redundant features; their experiments showed that using different sets of relevant but non-redundant features improves classification accuracy. In the context of classification, feature selection techniques can be organized into filters and wrappers [11], depending on how they combine the feature selection search with the construction of the classifier. Filter methods select a subset of features as a preprocessing step, independently of the learning algorithm. Algorithms using filter techniques are computationally simple, fast, and scale to very high-dimensional datasets. Because they are independent of the classification algorithm, feature selection needs to be performed only once, after which different classifiers can be evaluated. Wrappers use the classifier performance to evaluate the quality of feature subsets, implementing different criteria based on distance measures [12], dependency measures [11], information measures [13], classification error measures [14], etc. Wrapper methods require more computational time, but in most cases they are more accurate. In [15] the author successively builds tree-structured classifiers, taking into consideration statistical properties such as correlations and empirical probabilities to obtain good discriminant attributes. Using a mutual information approach and recursive selection of relevant features has led to acceptable performance with fewer features in some applications [16]. A different recursive selection method, which optimizes generalization ability with a gradient descent algorithm on the margin of Support Vector classifiers, is presented in [17]. In this paper we present a hybrid algorithm based on filter selection techniques which can be embedded into the pattern classifier to further improve its accuracy and optimize its generalization ability.
Our approach introduces a mathematical model that formalizes the search for optimal features and includes a computational scheme for calculating the information weights of the features, which supports a decision about their importance for the classification task.

Primary Notations

Before a pattern classifier can be properly designed, it is necessary to consider the feature extraction and data reduction problems. It is evident that the number of features needed to successfully perform a given recognition task depends on the discriminatory qualities of the chosen features [1, 2]. Let us suppose that:

a) $o_j = (a_{1j}, a_{2j}, \ldots, a_{nj})$ is designated as a permissible pattern (object) containing elements $a_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, where $a_{ij} \in L_i$ ($L_i$ is a subset of the feature alphabet);

b) the set of training patterns $o_1, o_2, \ldots, o_m$ is presented as a sequence of rows of a table $T_{mn}$, called a permissible table;

c) the training patterns are partitioned into sets associated with the classes $k_1, k_2, \ldots, k_l$ respectively, where $k_i \cap k_j = \emptyset$ if $i \neq j$; that is, the training samples taken from different classes form disjoint sets. We transform the table $T_{mn}$ in accordance with this division of the training set, assuming that the number of patterns included in class $k_u$ is $m_u$, $u = 1, \ldots, l$ ($m_1 + m_2 + \cdots + m_l = m$):
$k_1 = \{o_1, o_2, \ldots, o_{m_1}\}$
$k_2 = \{o_{m_1+1}, o_{m_1+2}, \ldots, o_{m_1+m_2}\}$
$\ldots$
$k_l = \{o_{m_1+\cdots+m_{l-1}+1}, \ldots, o_m\}$

We denote by $T_{mn}^{l}$ the table $T_{mn}$ whose rows are divided into $l$ classes in this way.

d) $\tilde{\alpha}$-part of $T_{mn}$: we consider all boolean vectors $\tilde{\alpha}$ of length $n$ and select the coordinates of $\tilde{\alpha}$ having value 1. If the numbers of these coordinates are $i_1, i_2, \ldots, i_k$, we transform the table $T_{mn}$ into a new table containing the rows $\tilde{\alpha}o_1, \tilde{\alpha}o_2, \ldots, \tilde{\alpha}o_m$, where $\tilde{\alpha}o_q$ retains only the coordinates $i_1, i_2, \ldots, i_k$ of $o_q$. We call this table the $\tilde{\alpha}$-part of $T_{mn}$.

Designed Concepts in Weighting the Feature Space
The algorithm proposed in this paper for solving such voting (estimation) models is based on the following six concepts.

1. Determination of the system of support sets. We denote by $\Omega$ the set of all possible subsets of $\{1, 2, \ldots, n\}$. The first concept is the determination of a system of support subsets $\Omega_A \subseteq \Omega$. It may consist, for example, of all elements of $\Omega$ with equal cardinality, of all tests over the table $T_{mn}^{l}$, or of the whole set $\Omega$, as sketched below.
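As an illustration of this first concept, the short Python sketch below (ours, not part of the paper) builds two common choices of the system $\Omega_A$: all subsets of a fixed cardinality and all non-empty subsets; feature indices are 0-based and the function names are our own.

from itertools import combinations

def support_sets_of_cardinality(n, k):
    """System of support subsets: all k-element subsets of {0, ..., n-1}."""
    return [frozenset(c) for c in combinations(range(n), k)]

def all_nonempty_support_sets(n):
    """System of support subsets: every non-empty subset of {0, ..., n-1};
    this is the choice assumed later in the theorem."""
    return [frozenset(c)
            for k in range(1, n + 1)
            for c in combinations(range(n), k)]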
2. Determination of the proximity function $\rho(\tilde{\alpha}o, \tilde{\alpha}o_q)$ over the $\tilde{\alpha}$-parts of $o$ and $o_q$. Depending on the feature alphabet this function can take one of the following forms:

a) $\rho(\tilde{\alpha}o, \tilde{\alpha}o_q) = \begin{cases} 0, & \text{if } \tilde{\alpha}o = \tilde{\alpha}o_q \\ 1, & \text{if } \tilde{\alpha}o \neq \tilde{\alpha}o_q \end{cases}$   (1)

b) $\rho(\tilde{\alpha}o, \tilde{\alpha}o_q) = \begin{cases} 1, & \text{if } \tilde{\alpha}o = \tilde{\alpha}o_q \\ 0, & \text{otherwise} \end{cases}$   (2)

c) $\rho(\tilde{\alpha}o, \tilde{\alpha}o_q) = \begin{cases} 1, & \text{if } \varepsilon(\tilde{\alpha}o, \tilde{\alpha}o_q) \le \mu \\ 0, & \text{if } \varepsilon(\tilde{\alpha}o, \tilde{\alpha}o_q) > \mu \end{cases}$   (3)

where $\tilde{\alpha}o = (a_1, a_2, \ldots, a_k)$, $\tilde{\alpha}o_q = (b_1, b_2, \ldots, b_k)$, the thresholds $\mu_1, \mu_2, \ldots, \mu_k, \mu$ are positive, and $\varepsilon(\tilde{\alpha}o, \tilde{\alpha}o_q)$ is the number of non-satisfied inequalities of the form $|a_1 - b_1| \le \mu_1$, $|a_2 - b_2| \le \mu_2$, $\ldots$, $|a_k - b_k| \le \mu_k$.
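The three proximity forms can be stated as the following Python sketch (our own reading of (1)-(3); the 0/1 convention in form (1) and the argument names are assumptions).

def rho_mismatch(part_o, part_oq):
    """Form (1): 0 if the alpha-parts coincide, 1 if they differ (our reading)."""
    return 0 if tuple(part_o) == tuple(part_oq) else 1

def rho_match(part_o, part_oq):
    """Form (2): 1 if the alpha-parts coincide, 0 otherwise."""
    return 1 if tuple(part_o) == tuple(part_oq) else 0

def rho_threshold(part_o, part_oq, mu, mu_total):
    """Form (3): 1 if at most mu_total of the coordinate-wise inequalities
    |a_i - b_i| <= mu_i are violated, 0 otherwise."""
    violated = sum(1 for a, b, m in zip(part_o, part_oq, mu) if abs(a - b) > m)
    return 1 if violated <= mu_total else 0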
3. Evaluation of the estimation $\Gamma_{\tilde{\alpha}}(o, o_q)$ over the rows of a fixed support set. It may include different parameters such as the significance of the pattern, its reliability, etc. The evaluation has the form

$\Gamma_{\tilde{\alpha}}(o, o_q) = \varphi[\gamma_1(o_q), \gamma_2(o_q), \ldots, \gamma_k(o_q), \rho(\tilde{\alpha}o, \tilde{\alpha}o_q)]$   (4)

where $\gamma_1(o_q), \gamma_2(o_q), \ldots, \gamma_k(o_q)$ are given parameters.
4. Evaluation of the estimation for a given class over a fixed support set, denoted $\Gamma_u(\tilde{\alpha})$, $u = 1, \ldots, l$. Let us assume that class $k_u$ includes the rows $o_{\tilde{m}+1}, o_{\tilde{m}+2}, \ldots, o_{\tilde{m}+m_u}$, where $\tilde{m} = m_1 + \cdots + m_{u-1}$. If the estimations $\Gamma_{\tilde{\alpha}}(o, o_{\tilde{m}+1}), \Gamma_{\tilde{\alpha}}(o, o_{\tilde{m}+2}), \ldots, \Gamma_{\tilde{\alpha}}(o, o_{\tilde{m}+m_u})$ over the rows of this class have been evaluated, then

$\Gamma_u(\tilde{\alpha}) = \psi[\Gamma_{\tilde{\alpha}}(o, o_{\tilde{m}+1}), \Gamma_{\tilde{\alpha}}(o, o_{\tilde{m}+2}), \ldots, \Gamma_{\tilde{\alpha}}(o, o_{\tilde{m}+m_u})]$   (5)

The latter expression can be specialized in the following ways:

$\Gamma_u(\tilde{\alpha}) = \sum_{o_q \in k_u} \Gamma_{\tilde{\alpha}}(o, o_q)$   (6)

$\Gamma_u(\tilde{\alpha}) = \begin{cases} 1, & \text{if } \sum_{o_q \in k_u} \Gamma_{\tilde{\alpha}}(o, o_q) \ge \theta \\ 0, & \text{otherwise} \end{cases}$   (7)

where $\theta$ is a given parameter.

5. Evaluation of the estimation for a given class $k_u$, $u = 1, \ldots, l$, over the system of support sets. Let us consider the system of support subsets $\Omega_A$. After calculating $\Gamma_u(\tilde{\alpha})$ for every element $\tilde{\alpha} \in \Omega_A$, we determine $\Gamma_u(o)$ by one of the following formulas:

$\Gamma_u(o) = \sum_{\tilde{\alpha} \in \Omega_A} \Gamma_u(\tilde{\alpha})$   (8)

$\Gamma_u(o) = \sum_{\tilde{\alpha} \in \Omega_A} \delta(\tilde{\alpha}) \Gamma_u(\tilde{\alpha})$   (9)

where $\delta(\tilde{\alpha})$ is a parameter representing the importance of the support subset $\tilde{\alpha}$.
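To make (6), (8) and (9) concrete, here is a minimal sketch (ours; the helper names and the dictionary used for the weights $\delta(\tilde{\alpha})$ are assumptions) that computes the class estimate over the rows of one class for a fixed support subset and then accumulates it over a system of support subsets.

def class_estimate_fixed_subset(o, class_rows, subset, rho):
    """Formula (6): sum of rho over the alpha-parts of o and every
    training row of class k_u, for one fixed support subset."""
    idx = sorted(subset)
    part = lambda row: tuple(row[i] for i in idx)
    return sum(rho(part(o), part(oq)) for oq in class_rows)

def class_votes(o, class_rows, support_system, rho, delta=None):
    """Formulas (8)/(9): accumulate the estimates over the system of
    support subsets; delta maps a subset to its importance weight."""
    total = 0.0
    for subset in support_system:
        g = class_estimate_fixed_subset(o, class_rows, subset, rho)
        total += g if delta is None else delta[subset] * g
    return total

With rho_match and all_nonempty_support_sets from the earlier sketches, class_votes computes exactly the quantity that the theorem below evaluates in closed form.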
We refer to the estimations $\Gamma_u(\tilde{\alpha})$ and $\Gamma_u(o)$, $u = 1, \ldots, l$, as the number of votes for the class $k_u$ over a fixed support set and over the system of support sets, respectively.

6. Determination of the decision rule. Let us assume that $\Gamma_1(o), \Gamma_2(o), \ldots, \Gamma_l(o)$ have been calculated. The decision rule is a function of these values, which can be expressed in one of the following forms:

$\Phi[\Gamma_1(o), \Gamma_2(o), \ldots, \Gamma_l(o)] = \begin{cases} u, & \text{if } \Gamma_u - \Gamma_j \ge \delta_1 \text{ for all } j \ne u,\ 1 \le j \le l \\ 0, & \text{otherwise} \end{cases}$   (10)

$\Phi[\Gamma_1(o), \Gamma_2(o), \ldots, \Gamma_l(o)] = \begin{cases} u, & \text{if } \Gamma_u - \Gamma_j \ge \delta_1 \text{ for all } j \ne u \text{ and } 2\Gamma_u - \sum_{j=1}^{l} \Gamma_j \ge \delta_2 \\ 0, & \text{otherwise} \end{cases}$   (11)

where $\delta_1$ and $\delta_2$ are constants.

Theorem. Let us make the following assumptions:

a) the system of support sets consists of all non-empty subsets of $\{1, 2, \ldots, n\}$;

b) the proximity function $\rho(\tilde{\alpha}o, \tilde{\alpha}o_q)$ has the form (2);
c) the estimation over the rows of a fixed support set has the form $\Gamma_{\tilde{\alpha}}(o, o_q) = \rho(\tilde{\alpha}o, \tilde{\alpha}o_q)$;

d) the estimation for a given class over a fixed support set, $\Gamma_u(\tilde{\alpha})$, $u = 1, \ldots, l$, is expressed by (6);

e) the feature alphabets are binary.

Under the above assumptions, the number of votes over a row $o$ for the class $k_u$, $u = 1, \ldots, l$, is equal to

$\Gamma_u(o) = \sum_{o_q \in k_u} \left( 2^{\rho(o, o_q)} - 1 \right)$   (12)

where $\rho(o, o_q)$ is the number of coinciding coordinates of $o$ and $o_q$. In this case

$\rho(o, o_q) = n - h(o, o_q)$
where $h(o, o_q)$ is the Hamming distance between $o$ and $o_q$. The proof of this theorem is based on the binomial theorem and its properties.

Algorithm for Estimation of Feature Informative Weights

Step 1. Evaluation of the votes $\Gamma_u(o)$. We consider the set of training patterns $o_1, o_2, \ldots, o_m$ presented in the table $T_{mn}^{l}$. The voting procedure described above is applied to the rows of $T_{mn}^{l}$, and the results are arranged in the following $l \times m$ table:

$\begin{matrix} \Gamma_1(o_1) & \Gamma_1(o_2) & \ldots & \Gamma_1(o_m) \\ \Gamma_2(o_1) & \Gamma_2(o_2) & \ldots & \Gamma_2(o_m) \\ \vdots & \vdots & & \vdots \\ \Gamma_l(o_1) & \Gamma_l(o_2) & \ldots & \Gamma_l(o_m) \end{matrix}$
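Under the theorem's assumptions, Step 1 does not require enumerating the $2^n - 1$ support subsets: the votes can be obtained directly from Hamming distances, as in the sketch below (ours; the function and argument names are assumptions).

def votes_via_theorem(o, class_rows):
    """Formula (12): Gamma_u(o) = sum over o_q in k_u of (2**rho - 1),
    where rho = n - h(o, o_q) is the number of coinciding coordinates."""
    n = len(o)
    total = 0
    for oq in class_rows:
        hamming = sum(1 for a, b in zip(o, oq) if a != b)
        total += 2 ** (n - hamming) - 1
    return total

def vote_table(rows, classes):
    """Step 1: the l-by-m table of votes Gamma_u(o_q) for every class u and
    every training row o_q; `classes` is a list of per-class row lists."""
    return [[votes_via_theorem(o, class_rows) for o in rows]
            for class_rows in classes]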
Step 2. Transformation of the table $T_{mn}^{l}$. We delete the $j$-th column from $T_{mn}^{l}$ and denote the rows of the transformed table by $o_1^j, o_2^j, \ldots, o_m^j$. We then repeat the voting procedure of Step 1, calculating $\Gamma_u(o_q^j)$, $q = 1, \ldots, m$, $u = 1, \ldots, l$, $j = 1, \ldots, n$.

Step 3. Computation of the differences $\Gamma_u(o_q) - \Gamma_u(o_q^j)$, $q = 1, \ldots, m$, $u = 1, \ldots, l$, $j = 1, \ldots, n$.
Step 4. Computation of the information weight $w_u^j$ of the $j$-th feature for class $k_u$, $u = 1, \ldots, l$, in accordance with

$w_u^j = \frac{1}{m_u} \left\{ [\Gamma_u(o_{\tilde{m}+1}) - \Gamma_u(o_{\tilde{m}+1}^j)] + [\Gamma_u(o_{\tilde{m}+2}) - \Gamma_u(o_{\tilde{m}+2}^j)] + \ldots + [\Gamma_u(o_{\tilde{m}+m_u}) - \Gamma_u(o_{\tilde{m}+m_u}^j)] \right\}$   (13)

where $o_{\tilde{m}+1}, \ldots, o_{\tilde{m}+m_u}$ are the rows of class $k_u$.

Step 5. Computation of the information weight of the $j$-th feature, denoted $I_j$:

$I_j = \sum_{u=1}^{l} w_u^j = \sum_{u=1}^{l} \frac{1}{m_u} \sum_{o_q \in k_u} [\Gamma_u(o_q) - \Gamma_u(o_q^j)]$   (14)
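Putting Steps 1-5 together, the sketch below (ours; it adopts the closed-form votes of (12) rather than an arbitrary voting model, and all names are assumptions) computes the per-class weights $w_u^j$ of (13) and the overall weights $I_j$ of (14) by deleting one feature at a time and re-voting.

def feature_information_weights(classes):
    """Steps 1-5: information weights I_j, j = 0, ..., n-1, for training data
    given as `classes`, one list of rows (feature tuples) per class."""
    n = len(classes[0][0])

    def votes(o, class_rows, skip=None):
        # Votes of formula (12); `skip` excludes the j-th column (Step 2).
        total = 0
        for oq in class_rows:
            rho = sum(1 for i, (a, b) in enumerate(zip(o, oq))
                      if i != skip and a == b)
            total += 2 ** rho - 1
        return total

    weights = []
    for j in range(n):                        # Steps 2-3: drop feature j, re-vote
        i_j = 0.0
        for class_rows in classes:            # Step 4: per-class average difference
            m_u = len(class_rows)
            diff = sum(votes(o, class_rows) - votes(o, class_rows, skip=j)
                       for o in class_rows)
            i_j += diff / m_u                 # w_u^j of (13)
        weights.append(i_j)                   # Step 5: I_j of (14)
    return weights

Features with large $I_j$ lose many within-class votes when they are removed and would be retained; features whose removal barely changes the votes receive weights near zero and are candidates for elimination.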
Experiments and Discussion

The algorithms have been implemented in solving graph segmentation problems as well as in estimating the informative weights of objects belonging to a given complex. In both cases the proposed approach was used for initial reduction of the feature space. The presented hybrid algorithm for feature selection combines the competitiveness of the filter approach, which makes it independent of the nature of the pattern classifier, with the ability to be embedded in the classifier in order to increase the accuracy of a particular learning algorithm, as wrapper algorithms do. It can easily be parallelized since, at each step of the algorithm, the information weights of all individual features or of predefined subsets of features can be calculated independently. Our method demonstrates its efficiency and effectiveness for feature selection in supervised learning in domains where the data contain many irrelevant and/or redundant features. The algorithm is generalized for multiclass implementation. Future research includes an automatic analyzer to separate weakly relevant but non-redundant features from redundant ones.

References

[1] Duda RO, Hart PE, Stork DG, "Pattern Classification", 2nd edition, NY: Wiley-Interscience, 2000.
[2] Theodoridis S, Koutroumbas K, "Pattern Recognition", 4th edition, USA: Academic Press, 2008.
[3] Jain AK, Duin RP, Mao J, "Statistical Pattern Recognition: A Review", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22:4-37.
[4] P. A. Estévez and R. Caballero, "A niching genetic algorithm for selecting features for neural networks classifiers", in Perspectives in Neural Computation (ICANN'98), New York: Springer-Verlag, 1998, pp. 311–316.
[5] H. Liu, H. Motoda, and L. Yu, "Feature selection with selective sampling", in Proceedings of the Nineteenth International Conference on Machine Learning, pp. 395–402, 2002.
[6] Y. Kim, W. Street, and F. Menczer, "Feature selection for unsupervised learning via evolutionary search", in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 365–369, 2000.
[7] A. Miller, "Subset Selection in Regression", Chapman & Hall/CRC, 2nd edition, 2002.
[8] R. Kohavi and G. H. John, "Wrappers for feature subset selection", Artificial Intelligence, 97(1-2):273–324, 1997.
[9] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning", Artificial Intelligence, 97:245–271, 1997.
[10] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem", in Proceedings of the Eleventh International Conference on Machine Learning, pp. 121–129, 1994.
[11] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy", J. Mach. Learn. Res., vol. 5, pp. 1205–1224, Oct. 2004.
[12] J. Bins and B. Draper, "Feature selection from huge feature sets", in Proc. Int. Conf. Comput. Vis., Vancouver, BC, Canada, Jul. 2001, pp. 159–165.
[13] N. Kwak and C.-H. Choi, "Input feature selection for classification problems", IEEE Trans. Neural Netw., vol. 13, no. 1, pp. 143–159, Jan. 2002.
[14] P. A. Estévez and R. Caballero, "A niching genetic algorithm for selecting features for neural networks classifiers", in Perspectives in Neural Computation (ICANN'98), New York: Springer-Verlag, 1998, pp. 311–316.
[15] L. Breiman, "Random forests", Machine Learning, 45(1):5–32, 2001.
[16] F. Fleuret, "Fast binary feature selection with conditional mutual information", Journal of Machine Learning Research, 5:1531–1555, 2004.
[17] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines", Machine Learning, 46(1):131–159, 2002.