Proceedings of the 2nd IFAC Conference on Embedded Systems, Computational Intelligence and Telematics in Control
June 22-24, 2015. Maribor, Slovenia
Available online at www.sciencedirect.com
ScienceDirect
IFAC-PapersOnLine 48-10 (2015) 123–128

A Randomized Approximation Convex Hull Algorithm for High Dimensions

Antonio Ruano*. Hamid Reza Khosravani**. Pedro M. Ferreira***

* University of Algarve, 8005-139 Faro, Portugal (e-mail: aruano@ualg.pt).
** University of Algarve, 8005-139 Faro, Portugal (e-mail: [email protected])
*** LaSIGE, Faculty of Sciences, University of Lisbon, Portugal (e-mail: [email protected])
Abstract: The accuracy of classification and regression tasks based on data driven models, such as Neural Networks or Support Vector Machines, relies to a good extent on selecting proper data for designing these models, covering the whole input range in which they will be employed. The convex hull algorithm is applied as a method for data selection; however, the use of conventional implementations of this method in high dimensions is not feasible, due to its high complexity. In this paper, we propose a randomized approximation convex hull algorithm which can be used for high dimensions in an acceptable execution time.

Keywords: Convex Hull, Data Selection Problem, Classification, Regression, Neural Networks, Support Vector Machines.

2405-8963 © 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved. Peer review under responsibility of International Federation of Automatic Control. 10.1016/j.ifacol.2015.08.119
CESCIT 2015, June 22-24, 2015. Maribor, Slovenia
Antonio Ruano et al. / IFAC-PapersOnLine 48-10 (2015) 123–128

1 This work was supported by QREN SIIDT 38798, and IDMEC, under LAETA.

1. INTRODUCTION

Neural networks and Support Vector Machines (SVM), as well as other data driven machine learning approaches, are well established methods for classification and regression tasks. Since the models generated by these approaches are data driven, selecting suitable data from large datasets for the design phase is a crucial task, as the accuracy of these models is affected by the data in the training dataset. Data must be selected in such a way that it covers the whole input range in which the model is to be employed. To achieve this goal, (Malosek and Stopjakova, 2006, Wang et al., 2013) presented two different methods using Principal Components Analysis (PCA) and convex hull. In an on-line learning context, the convex hull was applied for sample reduction in classification and regression, where the existing model should be retrained with newly arriving samples along with a reasonable portion of the current training dataset (Wang et al., 2013, Lopez Chau et al., 2013).

The identification of the convex hull vertices is a time consuming task, as the complexity of real convex hull algorithms in high dimensions is $O(n^{\lfloor d/2 \rfloor})$ (Bayer, 1999), where $n$ and $d$ denote the number of samples and the sample dimension, respectively. In this paper, we propose a Randomized Approximation Convex Hull Algorithm to overcome both the high execution time and the memory requirements which result from the convex hull algorithm complexity for high-dimensional data. The proposed algorithm can be used not only for off-line training, but also for online model adaptation.

The rest of the paper is organized as follows: Section 2 gives a brief description of existing convex hull algorithms. Section 3 addresses our proposed algorithm for determining an approximation of the convex hull in high dimensions. Section 4 presents simulation results. Conclusions are presented in Section 5.

2. RELATED WORKS

2.1 Convex Hull Definition

From a computational geometry point of view, an object in Euclidean space is convex if, for every pair of points within the object, every point on the straight line segment that joins them is also within the object. A set $S$ is convex if, for every pair $u, v \in S$ and all $t \in [0, 1]$, the point $(1 - t)u + tv$ is in $S$. Moreover, if $S$ is a convex set, for any $u_1, u_2, \dots, u_r \in S$ and any nonnegative numbers $\{\lambda_1, \lambda_2, \dots, \lambda_r\}$ with $\sum_{i=1}^{r} \lambda_i = 1$, the vector $\sum_{i=1}^{r} \lambda_i u_i$ is called a convex combination of $u_1, u_2, \dots, u_r$. According to the definitions above, the convex hull or convex envelope of a set $X$ of points in the Euclidean space can be defined in terms of convex sets or convex combinations:

the minimal convex set containing $X$, or
the intersection of all convex sets containing $X$, or
the set of all convex combinations of points in $X$.

2.2 Convex Hull Algorithms

Convex hull algorithms can be categorized from three points of view. An algorithm can be deterministic or randomized depending on the order of vertices found. If the order is fixed from run to run, the algorithm is deterministic (Graham, 1972); otherwise, it is randomized (Clarkson and Shor, 1989). Furthermore, an algorithm can be considered as a real or an approximation algorithm. If it is capable of identifying all vertices of the real convex hull, the algorithm is real (Barber et al., 1996); otherwise, it is considered an approximation (Bentley et al., 1982, Khosravani et al., 2013). Finally, we can also classify convex hull algorithms into offline and online algorithms. The former use all the data to compute the convex hull, while the latter employ newly arrived points to adapt an already existing convex hull (Bayer, 1999).
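The definitions of Section 2.1 and the categories above can be made concrete in the plane. The following minimal sketch uses Andrew's monotone chain, a planar relative of Graham's scan that is not part of this paper, as an example of a deterministic, real, offline algorithm: it uses the whole dataset and returns every hull vertex in the same order on every run. The helper names (`monotone_chain`, `convex_combination`, `inside_hull`) and the sample points are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def monotone_chain(points: Sequence[Point]) -> List[Point]:
    """Return all hull vertices in counter-clockwise order (planar case only)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_combination(vertices: Sequence[Point], weights: Sequence[float]) -> Point:
    """Sum of lambda_i * u_i with lambda_i >= 0 and sum(lambda_i) = 1."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to 1")
    x = sum(w * v[0] for w, v in zip(weights, vertices))
    y = sum(w * v[1] for w, v in zip(weights, vertices))
    return (x, y)


def inside_hull(hull: Sequence[Point], p: Point) -> bool:
    """True if p lies in the polygon whose counter-clockwise vertices are `hull`."""
    m = len(hull)
    return all(cross(hull[i], hull[(i + 1) % m], p) >= 0 for i in range(m))


if __name__ == "__main__":
    samples = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5), (0.2, 0.7)]
    hull = monotone_chain(samples)
    q = convex_combination(hull, [0.1, 0.2, 0.3, 0.4])
    print(hull, q, inside_hull(hull, q))
```

Any convex combination of the returned vertices stays inside the hull, which is the third of the equivalent definitions listed in Section 2.1.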
Although many algorithms have been proposed for identifying the convex hull of datasets in low dimensions, there is still no efficient algorithm available to find the convex hull in higher dimensions. The time complexity of the majority of proposed algorithms for two or three dimensions is $O(n \log n)$, while for high dimensions the complexity is $O(n^{\lfloor d/2 \rfloor})$, where $n$ is the number of samples in the dataset and $d$ is the sample dimension. According to the upper bound theorem in computational geometry (Seidel, 1995), the maximum number of facets for a convex hull with $m$ vertices is $O(m^{\lfloor d/2 \rfloor})$, which reflects the large memory requirements of those algorithms that construct the convex hull by enumerating facets, e.g., the randomized incremental algorithm (Clarkson and Shor, 1989) and Quickhull (Barber et al., 1996).

Among all proposed algorithms, Quickhull is considered a quick deterministic real convex hull algorithm, which is faster than the other proposed algorithms in low dimensions. For dimensions $d \le 3$, Quickhull runs in time $O(n \log r)$, where $n$ and $r$ are the number of points in the underlying dataset and the number of processed points, respectively. For $d \ge 4$, Quickhull runs in time $O(n f_r / r)$, where $f_r$ is the maximum number of facets for $r$ vertices. Since $f_r = O(r^{\lfloor d/2 \rfloor} / \lfloor d/2 \rfloor !)$, a massive number of facets would be generated for $r$ vertices in high dimensions. Consequently, Quickhull is not feasible for high dimensions, in terms of both execution time and memory requirements; e.g., for $d > 8$ it suffers from insufficient memory problems.

Very recently, an on-line algorithm (Wang et al., 2013) has been proposed for application to SVMs. Its time complexity is at most $O(nd^4)$, which means that, for problems with $d \ge 8$, it has smaller complexity than the existing techniques. It incrementally forms an approximated convex hull of a dataset on the basis of two thresholds, $L$ and $M$. The algorithm starts from a d-simplex and ends with an approximated convex hull with at most $M$ vertices. Since a d-simplex has $d+1$ facets, it divides the space into $d+1$ partitions. In the first step, each partition whose number of samples is greater than $L$ is divided into $d$ new partitions, based on the furthest sample to the corresponding facet of each partition. This task is performed repeatedly for the newly generated partitions, and the furthest samples are marked as convex hull vertices. Afterwards, the sample whose distance to the current generated convex hull is maximum is marked as a vertex of the convex hull. The procedure is executed until the number of vertices reaches the threshold $M$. Although the algorithm proposed in (Wang et al., 2013) is feasible to execute in high dimensions, it incorporates vertices which do not belong to the set of vertices of the real convex hull, as will be demonstrated in the results.

3. PROPOSED ALGORITHM

In order to overcome the shortcomings of Quickhull and of the algorithm proposed in (Wang et al., 2013), we propose a randomized approximation algorithm which, on the one hand, treats memory complexity efficiently and, on the other hand, identifies only vertices that are vertices of the real convex hull. Moreover, this algorithm can be applied efficiently in high dimensions.

In order to explain the proposed algorithm, we first need to explain two notions in computational geometry: the hyperplane distance (Weisstein, 2014a, Weisstein, 2014b) and the convex hull distance.

3.1 Hyperplane Distance

Suppose $V = [v_1, v_2, \dots, v_n]^T$ is a point, $F$ is an n-vertex facet, and $H$ is the corresponding hyperplane of facet $F$ in an n-dimensional Euclidean space. Also assume that $a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b = 0$ is the equation of $H$, where $N = [a_1, a_2, \dots, a_n]^T$ and $b$ are the normal vector and the offset, respectively. The distance from $V$ to the hyperplane $H$ is given by (1).

$$d(V, H) = \frac{a_1 v_1 + a_2 v_2 + \dots + a_n v_n + b}{\sqrt{a_1^2 + a_2^2 + \dots + a_n^2}} \qquad (1)$$

3.2 Convex Hull Distance

Given a set $P = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^d$ and a point $x \in \mathbb{R}^d$, the Euclidean distance between $x$ and the convex hull of $P$, denoted by $\mathrm{conv}(P)$, can be computed by solving the quadratic optimization problem stated in (2).

$$\min_{a} \ \tfrac{1}{2} a^T Q a - c^T a \quad \text{s.t.} \quad e^T a = 1, \ a \ge 0 \qquad (2)$$

where $e = [1, 1, \dots, 1]^T$, $Q = X^T X$ and $c = X^T x$, with $X = [x_1, x_2, \dots, x_n]$.

Suppose that the optimal solution of (2) is $a^*$; then the distance of point $x$ to $\mathrm{conv}(P)$ is given by:

$$d_c(x, \mathrm{conv}(P)) = \sqrt{x^T x - 2 c^T a^* + {a^*}^T Q a^*} \qquad (3)$$

3.3 The proposed Algorithm
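The two distances of Sections 3.1 and 3.2 can be checked numerically before they are used by the algorithm. The sketch below is only an illustration under stated assumptions: the paper does not say which solver is used for problem (2), so a plain Frank-Wolfe iteration with exact line search is swapped in here because it needs no external library; the function names `hyperplane_distance`, `hull_distance` and `eq3` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def hyperplane_distance(v: Vector, normal: Vector, offset: float) -> float:
    """Signed distance of Eq. (1): (N^T v + b) / ||N||."""
    return (dot(normal, v) + offset) / math.sqrt(dot(normal, normal))


def hull_distance(x: Vector, points: List[Vector],
                  max_iter: int = 10000, tol: float = 1e-12) -> Tuple[float, List[float]]:
    """Approximate d_c(x, conv(P)) and the weights a of problem (2) by Frank-Wolfe."""
    n = len(points)
    a = [1.0] + [0.0] * (n - 1)              # feasible start: e^T a = 1, a >= 0
    p = list(points[0])                      # p = X a, a point of conv(P)
    for _ in range(max_iter):
        r = [pi - xi for pi, xi in zip(p, x)]             # residual X a - x
        grad = [dot(pt, r) for pt in points]              # gradient Q a - c
        j = min(range(n), key=grad.__getitem__)           # steepest feasible direction
        step = [sj - pi for sj, pi in zip(points[j], p)]  # move p towards x_j
        gap = -dot(r, step)                               # Frank-Wolfe optimality gap
        if gap <= tol:
            break
        gamma = min(1.0, gap / dot(step, step))           # exact line search on [0, 1]
        a = [(1.0 - gamma) * ai for ai in a]
        a[j] += gamma
        p = [pi + gamma * si for pi, si in zip(p, step)]
    return math.dist(p, x), a


def eq3(x: Vector, points: List[Vector], a: Sequence[float]) -> float:
    """Evaluate Eq. (3) from Q = X^T X and c = X^T x for given weights a."""
    c = [dot(pt, x) for pt in points]
    quad = sum(a[i] * dot(points[i], points[j]) * a[j]
               for i in range(len(a)) for j in range(len(a)))
    return math.sqrt(max(0.0, dot(x, x) - 2.0 * dot(c, a) + quad))


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    dist, weights = hull_distance((2.0, 0.5), square)
    print(dist, weights, eq3((2.0, 0.5), square, weights))
    print(hyperplane_distance((5.0, 5.0, 4.0), (0.0, 0.0, 2.0), -2.0))
```

Frank-Wolfe converges slowly in general, so a dedicated quadratic programming solver would be the more usual choice for large problems; the point here is only that Eq. (3) evaluated at the returned weights reproduces the Euclidean distance to the hull.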
The proposed algorithm consists of five main steps:

Step 1: Scaling each dimension to the range [-1, 1].
Step 2: Identifying the maximum and minimum samples with respect to each dimension. These samples are considered as vertices of the initial convex hull.
Step 3: Generating a population of $k$ facets based on the current vertices of the convex hull.
Step 4: Identifying the furthest points to each facet in the current population as new vertices of the convex hull, if they have not been detected before.
Step 5: Updating the current convex hull by adding the newly found vertices to the current set of vertices.

Steps 3 to 5 are executed iteratively until one of the following two termination criteria is met:

There are no newly found vertices in Step 4.
Let $dc$ be the maximum of the approximated distances of the furthest points to the current convex hull in each iteration. If there are new vertices as a consequence of Step 4, the difference between the maximum and minimum of $dc$ over the last $w$ iterations is less than a threshold (assume 0.1), and there is fluctuation in the value of $dc$ in this $w$-sliding window, the algorithm ends.

Since computing the distance from a point to the current convex hull is complex and time consuming in high dimensions, the approximated distance of a newly found vertex to the current convex hull is computed based on the $2 \times d$ vertices of the current convex hull which are the nearest neighbors of the newly found vertex, where $d$ denotes the dimension. The proposed algorithm is summarized in Algorithm 1.

Algorithm 1: The Proposed Algorithm
Input: $DS = \{x_i\}_{i=1}^{n} \subseteq \mathbb{R}^d$ as a set of samples, $k$ denotes the population size of facets in d-dimensional space and $w$ is an integer value as width of the sliding window.
Scale each dimension of DS to the range [-1, 1].
Let V denote the maximum and minimum samples with respect to each dimension in DS;
NoNewVertex = False; Done = False; iteration = 1; DC = {};
While (not NoNewVertex and not Done) do
    Let P be an empty population.
    For (i = 1; i ≤ k; i++) do
        Let F be an empty facet. j = 1;
        While (j ≤ d) do
            Select randomly a vertex v from V;
            If (v is not in F) then
                F = F ∪ {v}; j = j + 1
        P = P ∪ {F};
    newV = {};
    For each facet F in P do
        Let FP be the furthest points to facet F.
        For each point fp in FP do
            If (fp is not in V) then
                newV = newV ∪ {fp};
    If (newV = {}) then
        NoNewVertex = True;
    If (not NoNewVertex) then
        Let dc be the maximum of the approximated distances of vertices in newV to the current convex hull.
        DC = DC ∪ {dc}
        If (iteration ≥ w) then
            Let dc_min be the minimum of dc in DC over the last w iterations.
            Let dc_max be the maximum of dc in DC over the last w iterations.
            If (fluctuation observed in the value of dc over the last w iterations and (dc_max − dc_min) < 0.1) then
                Done = True;
        If (not Done) then
            V = V ∪ newV;
            iteration = iteration + 1
Output: V

4. SIMULATION RESULTS

Three experiments were executed to evaluate the performance of the proposed algorithm and its effect on the accuracy of classification or regression tasks. The algorithm has been implemented in the Python and C languages, and was executed on a computer with an Intel i5 CPU and 4 Gigabytes of RAM.

4.1 Experiment 1

In order to evaluate the proposed algorithm, it was applied to four artificial datasets named DS1, DS2, DS3 and DS4. All datasets are composed of 4000 random samples with 3, 4, 5 and 6-dimensional feature space, respectively. Since Quickhull is a deterministic real convex hull algorithm, in this experiment its result is employed as a reference for comparing the results achieved by the proposed algorithm and by the algorithm proposed in (Wang et al., 2013), both being approximation convex hull algorithms. We use two criteria, $P$ and $R$, which are defined in (4) and (5), to compare the results obtained by both algorithms to those obtained by Quickhull.

$$P = \frac{\#(V_R \cap V_P)}{\#V_P} \times 100 \qquad (4)$$

$$R = \frac{\#(V_R \cap V_P)}{\#V_R} \times 100 \qquad (5)$$

where $V_R$ is the set of vertices obtained by employing the Quickhull algorithm and $V_P$ is the set of vertices obtained by applying one of the other algorithms. Basically, criterion $P$ is the percentage of the vertices found by an algorithm that are also vertices found by Quickhull (precision), while criterion $R$ is the percentage of the vertices found by Quickhull that the algorithm also finds (recall).

In this experiment, both the proposed algorithm and the algorithm proposed in (Wang et al., 2013), denoted as Wang's algorithm, were executed for ten runs. For the latter, $L$ was set to $0.01n$ for all datasets, and $M$ was set as $M \ge 0.02n$, $M \ge 0.07n$, $M \ge 0.1n$ and $M \ge 0.14n$, for DS1, DS2, DS3 and DS4, respectively, where $n$ is the number of samples.

In the proposed algorithm the sliding window size, $w$, was set to 10 for all datasets, and $k$ was set to 4000, 5000, 6000, and 7000, for datasets DS1, DS2, DS3 and DS4, respectively. Figs. 1 and 2 show the average values of $P$ and $R$ obtained on datasets DS1 to DS4 by the proposed algorithm and by Wang's algorithm.
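The setup of Experiment 1 can be imitated at small scale. The sketch below is a simplified reading of Algorithm 1, not the authors' implementation, and rests on several assumptions that are not in the paper: "furthest points" is taken to mean the furthest sample on each side of a facet's hyperplane; only the first termination criterion is used, together with an iteration cap `max_iter`; and, instead of Quickhull, the reference set $V_R$ is known by construction (all points on the unit sphere are hull vertices, and the remaining points lie strictly inside the hull of the axis points). All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Set, Tuple

Vec = Sequence[float]


def dot(u: Vec, v: Vec) -> float:
    return sum(a * b for a, b in zip(u, v))


def scale(data: List[Vec]) -> List[List[float]]:
    """Step 1: map each dimension linearly onto [-1, 1] (no constant dimension assumed)."""
    lo = [min(col) for col in zip(*data)]
    hi = [max(col) for col in zip(*data)]
    return [[2.0 * (x - l) / (h - l) - 1.0 for x, l, h in zip(p, lo, hi)] for p in data]


def hyperplane(pts: List[Vec]) -> Tuple[List[float], float]:
    """Normal N and offset b of a hyperplane N.x + b = 0 through the given points."""
    d = len(pts[0])
    rows = [[a - b for a, b in zip(p, pts[0])] for p in pts[1:]]
    pivots: List[int] = []
    for c in range(d):                       # Gauss-Jordan elimination on the edge vectors
        r = len(pivots)
        if r == len(rows):
            break
        best = max(range(r, len(rows)), key=lambda i: abs(rows[i][c]))
        if abs(rows[best][c]) < 1e-12:
            continue
        rows[r], rows[best] = rows[best], rows[r]
        rows[r] = [v / rows[r][c] for v in rows[r]]
        for i in range(len(rows)):
            if i != r:
                f = rows[i][c]
                rows[i] = [vi - f * vr for vi, vr in zip(rows[i], rows[r])]
        pivots.append(c)
    free = next(c for c in range(d) if c not in pivots)
    normal = [0.0] * d
    normal[free] = 1.0                       # null-space vector: orthogonal to every edge
    for i, c in enumerate(pivots):
        normal[c] = -rows[i][free]
    return normal, -dot(normal, pts[0])


def approx_hull(data: List[Vec], k: int, rng: random.Random, max_iter: int = 100) -> Set[int]:
    """Steps 2-5, stopping when no new vertex is found; returns sample indices."""
    n, d = len(data), len(data[0])
    verts: Set[int] = set()
    for j in range(d):                                        # Step 2
        verts.add(min(range(n), key=lambda i: data[i][j]))
        verts.add(max(range(n), key=lambda i: data[i][j]))
    for _ in range(max_iter):
        pool, found = sorted(verts), set()
        for _ in range(k):                                    # Step 3
            facet = rng.sample(pool, min(d, len(pool)))
            normal, offset = hyperplane([data[i] for i in facet])
            signed = [dot(normal, x) + offset for x in data]
            found.add(max(range(n), key=signed.__getitem__))  # Step 4, one side
            found.add(min(range(n), key=signed.__getitem__))  # Step 4, other side
        found -= verts
        if not found:
            break
        verts |= found                                        # Step 5
    return verts


def precision_recall(v_real: Set[int], v_found: Set[int]) -> Tuple[float, float]:
    """Criteria P and R of Eqs. (4) and (5), in percent."""
    common = len(v_real & v_found)
    return 100.0 * common / len(v_found), 100.0 * common / len(v_real)


def make_data(d: int, n_vertices: int, n_inner: int, rng: random.Random):
    """Unit-sphere points (all of them hull vertices) plus strictly interior points."""
    def unit() -> List[float]:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(g, g))
        return [x / norm for x in g]
    axes = [[s if i == j else 0.0 for i in range(d)] for j in range(d) for s in (1.0, -1.0)]
    sphere = axes + [unit() for _ in range(n_vertices - 2 * d)]
    radius = 0.9 / math.sqrt(d)              # stays inside the hull of the axis points
    inner = []
    for _ in range(n_inner):
        shrink = radius * rng.random()
        inner.append([shrink * x for x in unit()])
    return sphere + inner, set(range(len(sphere)))


if __name__ == "__main__":
    rng = random.Random(0)
    data, v_real = make_data(d=4, n_vertices=80, n_inner=300, rng=rng)
    found = approx_hull(scale(data), k=100, rng=rng)
    p, r = precision_recall(v_real, found)
    print(f"P = {p:.1f}%, R = {r:.1f}%")
```

Because the sample that maximizes or minimizes a linear function over a finite point set always lies on the boundary of its convex hull, every point this sketch returns is a true vertex of this synthetic dataset, so $P$ stays at 100 while $R$ depends on $k$ and on the random facets drawn.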
Analysing Fig. 1, it may be observed that the proposed algorithm only identifies vertices that belong to the set of vertices of the real convex hull, whereas a significantly smaller fraction of the vertices selected by Wang's algorithm belong to that set. Moreover, according to Fig. 2, the proposed algorithm detects considerably more vertices of the real convex hull, in comparison to Wang's algorithm.

Fig. 1. Average value of criterion P for the proposed algorithm and Wang's algorithm on DS1-4.

Fig. 2. Average value of criterion R for the proposed algorithm and Wang's algorithm on DS1-4.

4.2 Experiment 2

In this experiment, the proposed algorithm was applied as a method for data selection in classification tasks. In order to evaluate the accuracy of the classification model, two cases were considered. In the first, ten training datasets were generated by random selection of samples from the whole dataset. In the second case, ten training datasets were generated, each one of them incorporating the vertices of the approximated convex hull (obtained by the proposed algorithm) as well as random samples from the remaining dataset. The algorithm was applied separately to the positive and negative classes. The datasets for classification were taken from (Frank and Asuncion, 2013). The built-in MATLAB SVM tool with a Gaussian RBF (Radial Basis Function) kernel was used to design a classifier in both scenarios. The description of each dataset, along with the corresponding parameter values for the SVM classifier, is given in Table 1.

Table 1. Description of datasets used in classification. #F, #DS, #TR, #TE are the number of features, total number of samples, number of training samples and test samples, respectively. C and γ are the SVM hyper-parameters.

Dataset        #F   #DS    #TR    #TE    C    γ
Breast Cancer  30   569    376    193    1    0.05
Parkinson      26   1040   686    354    200  0.1
Satellite      36   2033   1342   691    500  0.1142
Letter         16   1555   1026   529    1    0.6576
Cover Type     54   37877  24999  12878  1    0.5

In this experiment, the following classification rate criterion was used:

$$\text{Classification Rate} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$

where $TP$, $TN$, $FP$ and $FN$ denote the number of True Positives, True Negatives, False Positives and False Negatives, respectively. Table 2 shows the results obtained in the two cases for the datasets described in Table 1. According to the fourth column of Table 2, for all datasets except Cover Type the data selection mechanism employing the proposed algorithm has improved the accuracy of the corresponding classifiers in comparison with the random data selection method. The highest and lowest improvements were achieved for the Breast Cancer and the Letter datasets, respectively. For Cover Type, both selection methods achieve perfect classification. The average classification rate for datasets Satellite, Letter and Cover Type in the second case is equal to 1, which means that perfect classification is obtained for these datasets.

Table 2. Average classification rate for test dataset in two cases for all datasets in Table 1. CR_TE(1) and CR_TE(2) denote the classification rates for the test dataset in the first case (random selection) and second case (data selection by the proposed algorithm) respectively.

Dataset        CR_TE(1)  CR_TE(2)  CR_TE(2) − CR_TE(1)
Breast Cancer  0.963     0.981     0.018
Parkinson      0.656     0.667     0.011
Satellite      0.990     1.000     0.010
Letter         0.993     1.000     0.007
Cover Type     1.000     1.000     0.000
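As a small worked check, Eq. (6) and the difference column of Table 2 can be recomputed as follows. The rates are copied from Table 2; the confusion-matrix counts in the usage example are made up purely for illustration, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def classification_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Eq. (6): fraction of correctly classified test samples."""
    return (tp + tn) / (tp + tn + fp + fn)


# Average test classification rates from Table 2: (random selection, proposed algorithm)
TABLE_2: Dict[str, Tuple[float, float]] = {
    "Breast Cancer": (0.963, 0.981),
    "Parkinson": (0.656, 0.667),
    "Satellite": (0.990, 1.000),
    "Letter": (0.993, 1.000),
    "Cover Type": (1.000, 1.000),
}


def improvements(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Fourth column of Table 2: CR_TE(2) - CR_TE(1), rounded to three decimals."""
    return {name: round(cr2 - cr1, 3) for name, (cr1, cr2) in table.items()}


if __name__ == "__main__":
    # Illustrative counts only, not taken from the paper.
    print(classification_rate(tp=90, tn=95, fp=5, fn=10))
    for name, gain in improvements(TABLE_2).items():
        print(f"{name:14s} {gain:+.3f}")
```

The only zero difference is the one for Cover Type, where both selection methods already reach a classification rate of 1.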
4.3 Experiment 3

Experiment 3 was conducted to find out how much data selection using the proposed algorithm can improve the accuracy of regression models. As in Experiment 2, two approaches were analysed for comparison: 1) generating ten training datasets by random selection; 2) generating ten training datasets by applying the proposed algorithm together with random selection. The datasets used for regression were taken from (Frank and Asuncion, 2013, Rasmussen et al., 1996). The description of each dataset is given in Table 3. The MLP (Multilayer Perceptron Neural Network) implemented in MATLAB was employed with two hidden layers and an output layer with one linear neuron. For all datasets except the Concrete dataset, both hidden layers have ten sigmoidal neurons. For the Concrete dataset, both hidden layers contain five sigmoidal neurons. The training algorithm described in (Ruano et al., 2005) is employed, terminating if the early-stopping condition is met, the number of training iterations exceeds 100, or the optimization criterion described in (2.20-22) of the reference is met, where $\tau = 10^{-3}$ is a measure of the desired number of correct digits in the training criterion.
127
for vertices of convex hull for Bank is larger than those for Puma. This specific result related to datasets Puma and Bank reveals this fact that the distribution of samples can influence the run time. Table 3. Description of datasets used in regression. #F, #DS, #TR, #TE and #VAL are number of features, total samples, training samples, test samples and validation samples respectively. Dataset Puma Bank CompAct Concrete Skillcraft
The RMSE (Root Mean Squared Error) criterion is employed to evaluate the accuracy of the models. Tables. 4-5 show the results obtained in the two cases for the datasets described in Table 3.
#F 32 32 21 8 18
#DS 8192 8192 8192 1030 3338
#TR 4915 4915 4915 618 2003
#TE 1638 1638 1638 206 667
#VAL 1639 1639 1639 206 668
Table 4. Average RMSE for test dataset in two cases for all datasets in Table 3. 𝑬𝑬𝑻𝑻𝑻𝑻 (𝟏𝟏) and 𝑬𝑬𝑻𝑻𝑻𝑻 (𝟐𝟐) denote RMSE for test dataset in first case (random selection) and second case (data selection by the proposed algorithm) respectively.
Table 4 shows the average RMSEs for the test datasets in the two mentioned cases. As it may be seen in the fourth column, the regression model which resulted from the data selected by the proposed algorithm, has lower regression error. Table 5 shows the average RMSEs for the validation sets. Again, it may be concluded that the use of the proposed method in the data selection phase, decreases the error for all datasets except Skillcraft which has an identical value.
Dataset Puma Bank CompAct Concrete Skillcraft
To summarize the results, among the 15 performance values presented in Tables 2, 4 and 5, the use of the proposed algorithm for data selection achieves better results than those obtained by using random data selection in 13 cases, and achieves equal performance in 2 cases.
𝐸𝐸𝑇𝑇𝑇𝑇 (1) 0.076 0.209 0.082 0.161 0.404
𝐸𝐸𝑇𝑇𝑇𝑇 (2) 0.073 0.195 0.049 0.143 0.337
𝐸𝐸𝑇𝑇𝑇𝑇 (1) − 𝐸𝐸𝑇𝑇𝑇𝑇 (2) 0.003 0.014 0.033 0.018 0.067
Table 5. Average RMSE on the validation set in the two cases, for all datasets in Table 3. $E_{\mathrm{val}}(1)$ and $E_{\mathrm{val}}(2)$ denote the validation RMSE in the first case (random selection) and the second case (data selection by the proposed algorithm), respectively.

Dataset      $E_{\mathrm{val}}(1)$   $E_{\mathrm{val}}(2)$   $E_{\mathrm{val}}(1) - E_{\mathrm{val}}(2)$
Puma         0.076                   0.073                   0.003
Bank         0.209                   0.194                   0.015
CompAct      0.061                   0.048                   0.013
Concrete     0.162                   0.147                   0.015
Skillcraft   0.334                   0.334                   0.000

4.4 Run Time

The run time of the proposed algorithm depends on five factors: the size of the dataset (number of samples and features), the population size (input parameter $k$), the number of iterations, the number of convex hull vertices found, and the distribution of the samples in the dataset. To examine how run time depends on these factors, the algorithm was applied ten times to each of the datasets described in Tables 1 and 3. For all datasets, $k$ (population size) and $w$ (width of the sliding window) were set to 1000 and 5, respectively. Fig. 3 shows the average percentage of total samples identified as vertices of the convex hull for each dataset described in Tables 1 and 3.
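The measurement protocol described above (ten runs per dataset, with the run time and the percentage of samples reported as vertices averaged over the runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `benchmark` is a hypothetical harness, and `monotone_chain_hull` is an exact two-dimensional hull routine used purely as a stand-in for the proposed high-dimensional randomized algorithm, which is not reproduced here.

```python
import random
import time


def monotone_chain_hull(points):
    """Exact 2-D convex hull (Andrew's monotone chain); a stand-in only.

    Returns the indices in `points` of the hull vertices.
    """
    order = sorted(range(len(points)), key=lambda i: points[i])
    if len(order) < 3:
        return order

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn.
        (ox, oy), (ax, ay), (bx, by) = points[o], points[a], points[b]
        return (ax - ox) * (by - oy) - (ay - oy) * (bx - ox)

    def half(indices):
        chain = []
        for i in indices:
            # Drop points that would make a right turn or lie on the edge.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], i) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    lower = half(order)
    upper = half(reversed(order))
    # The last index of each chain is the first index of the other.
    return lower[:-1] + upper[:-1]


def benchmark(hull_fn, points, runs=10):
    """Average run time (seconds) and average percentage of samples
    returned as vertices, over `runs` repetitions (hypothetical harness)."""
    times, percentages = [], []
    for _ in range(runs):
        start = time.perf_counter()
        vertices = hull_fn(points)
        times.append(time.perf_counter() - start)
        percentages.append(100.0 * len(vertices) / len(points))
    return sum(times) / runs, sum(percentages) / runs


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample set is reproducible
    samples = [(rng.random(), rng.random()) for _ in range(2000)]
    avg_time, avg_pct = benchmark(monotone_chain_hull, samples)
    print(f"average run time: {avg_time:.4f} s, vertices: {avg_pct:.2f}% of samples")
```

Because the stand-in is deterministic, only the timing varies between runs here; with a randomized algorithm the number of vertices found would vary as well, which is why both quantities are averaged.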
Fig. 4 also illustrates the average number of iterations needed for the algorithm to terminate on each dataset. The corresponding average run time for each dataset is given in Table 6. As can be seen in this table, the highest and lowest average run times correspond to the Cover Type and Concrete datasets, respectively. Cover Type is the largest dataset in terms of number of samples and features, while Concrete has the fewest features and is the second smallest dataset in number of samples. Although the Bank and Puma datasets have the same size, the average run time for Bank is larger than that for Puma, because the average number of iterations and average percentage of total samples identified

5. CONCLUSIONS

This paper proposes a novel randomized approximation convex hull algorithm for high-dimensional data, which overcomes the limiting memory requirements and time complexity of conventional algorithms. According to the simulation results, the proposed algorithm finds significantly more vertices of the real convex hull than the algorithm proposed in (Wang et al., 2013). Moreover, the results obtained in classification and regression problems show that using the proposed algorithm as a data selection method improves the accuracy of the designed models.
Future work will address the use of the proposed approach for online model adaptation, by incorporating the proposed algorithm in the window management scheme described in (Ferreira and Ruano, 2009).
Fig. 3. Average percentage of total samples identified as vertices of the convex hull for each dataset described in Tables 1 and 3.

Fig. 4. Average number of iterations in the proposed algorithm for each dataset described in Tables 1 and 3.

Table 6. Average run time of the proposed algorithm on the datasets described in Tables 1 and 3.

Dataset          Average Run Time (seconds)
Concrete         11.78
Letter           19.13
Skillcraft       37.70
CompAct          37.39
Breast Cancer    8.10
Bank             257.34
Puma             174.24
Satellite        62.80
Cover Type       1280.16

6. REFERENCES

BARBER, C. B., DOBKIN, D. P. & HUHDANPAA, H. 1996. The Quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22, 469-483.
BAYER, V. 1999. Survey of Algorithms for the Convex Hull Problem. Department of Computer Science, Oregon State University.
BENTLEY, J. L., PREPARATA, F. P. & FAUST, M. G. 1982. Approximation algorithms for convex hulls. Commun. ACM, 25, 64-68.
CLARKSON, K. L. & SHOR, P. W. 1989. Applications of random sampling in computational geometry, II. Discrete & Computational Geometry, 4, 387-421.
FERREIRA, P. M. & RUANO, A. E. 2009. Online Sliding-Window Methods for Process Model Adaptation. IEEE Transactions on Instrumentation and Measurement, 58, 3012-3020.
FRANK, A. & ASUNCION, A. 2013. UCI Machine Learning Repository.
GRAHAM, R. L. 1972. An Efficient Algorithm for Determining the Convex Hull of a Finite Planar Set. Inf. Process. Lett., 1, 2.
KHOSRAVANI, H. R., RUANO, A. E. & FERREIRA, P. M. 2013. A simple algorithm for convex hull determination in high dimensions. Intelligent Signal Processing (WISP), 2013 IEEE 8th International Symposium on, 109-114.
LOPEZ CHAU, A., LI, X. & YU, W. 2013. Large data sets classification using convex-concave hull and support vector machine. Soft Computing, 17, 793-804.
MALOSEK, P. & STOPJAKOVA, V. 2006. PCA data preprocessing for neural network-based detection of parametric defects in analog IC. Proceedings of the 2006 IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems, 131-135.
RASMUSSEN, C. E., NEAL, R. M., HINTON, G. E., CAMP, D. V., REVOW, M., GHAHRAMANI, Z., KUSTRA, R. & TIBSHIRANI, R. 1996. Delve datasets [Online]. Available: http://www.cs.toronto.edu/~delve/data/datasets.html.
RUANO, A. E., FERREIRA, P. M. & FONSECA, C. M. 2005. An Overview of Nonlinear Identification and Control with Neural Networks. In: RUANO, A. E. (ed.) Intelligent Control Systems using Computational Intelligence Techniques. Institution of Electrical Engineers.
SEIDEL, R. 1995. The upper bound theorem for polytopes: an easy proof of its asymptotic version. Computational Geometry: Theory and Applications, 5, 115-116.
WANG, D., QIAO, H., ZHANG, B. & WANG, M. 2013. Online Support Vector Machine Based on Convex Hull Vertices Selection. IEEE Transactions on Neural Networks and Learning Systems, 24, 593-609.
WEISSTEIN, E. W. 2014a. Plane [Online]. MathWorld--A Wolfram Web Resource. Available: http://mathworld.wolfram.com/Plane.html.
WEISSTEIN, E. W. 2014b. Point-Plane Distance [Online]. MathWorld--A Wolfram Web Resource. Available: http://mathworld.wolfram.com/PointPlaneDistance.html.