A Neural Tree for classification using Convex Objective Function

Asha Rani, Gian Luca Foresti, Christian Micheloni

Accepted Manuscript

PII: S0167-8655(15)00276-7
DOI: 10.1016/j.patrec.2015.08.017
Reference: PATREC 6329

To appear in: Pattern Recognition Letters

Received date: 3 January 2015
Accepted date: 14 August 2015

Please cite this article as: Asha Rani, Gian Luca Foresti, Christian Micheloni, A Neural Tree for classification using Convex Objective Function, Pattern Recognition Letters (2015), doi: 10.1016/j.patrec.2015.08.017

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Research Highlights

• A neural tree based classifier is proposed.
• The network parameters are optimized using a convex objective function.
• Instead of an iterative gradient descent method, a matrix method is used to compute the weights.
• The proposed COF-NT is able to reduce the training time without decreasing classification accuracy.
• No user defined parameters are required.


A Neural Tree for classification using Convex Objective Function

Asha Rani a,*, Gian Luca Foresti b, Christian Micheloni b

a CVGIP Lab, Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee-247667, India
b AVIRES Lab, Department of Mathematics and Computer Science, University of Udine, Udine-33100, Italy

* Corresponding author: Tel.: +91-1332-285824; Email address: [email protected] (Asha Rani)

Abstract

In this paper, we propose a neural tree classifier, called the Convex Objective Function Neural Tree (COF-NT), which has a specialized perceptron at each node. The specialized perceptron is a single layer feed-forward perceptron that calculates the errors before the neuron's non-linear activation functions instead of after them. Thus, the network parameters are independent of the non-linear activation functions and, consequently, the objective function is convex. The solution can be obtained by solving a system of linear equations, which requires less computational power than conventional iterative methods. During training, the proposed neural tree classifier divides the training set into smaller subsets by adding new levels to the tree. Each child perceptron takes forward the task of training done by its parent perceptron on the superset of its subset. Thus, the training is done by a number of single layer perceptrons (each perceptron carrying forward the work done by its ancestors) that reach the global minimum in a finite number of steps. The proposed algorithm has been tested on available benchmark datasets and the results are promising in terms of classification accuracy and training time.


Keywords: Neural Tree, Artificial Neural Networks (ANNs), Mean squared error, Pattern Classification, Perceptron, Convex Optimization

1. Introduction


Artificial neural networks (ANNs) have been used in a number of scientific and engineering applications. ANNs [1] have proven to be very powerful for classification and regression tasks. Still, there is a major issue associated with their use: they are very much dependent on architecture. The ANN architecture is not unique for a given problem, as there may exist different ways of defining an architecture for a specific problem. Depending on the problem, it may require one or more hidden layers, feed-forward or feedback connections, and there may be direct connections between input and output nodes. Several solutions have been proposed in the literature to tackle this issue. Neural trees are one such solution [2, 3, 4, 5].

Neural trees have been used in a broad range of problems. Some of these problems include vowel recognition [6], character recognition [7], face recognition [8], image analysis [9], time series prediction [10], disease classification [11], outlier detection in stereo image matching [12], digital image watermarking [13, 14], novelty detection [15], traffic prediction [16], protein structure prediction [17], power signal pattern classification [18], and water stage forecasts in river basins during typhoons [19].

Neural trees are hybrid structures between decision trees and artificial neural networks that were developed to determine the structure of artificial neural networks automatically [4, 20, 21, 22]. Sethi [23] described a method for converting a univariate decision tree into a neural network and then retraining it, resulting in a tree structured entropy network with sigmoid splits. Guo and Gelfand [24] developed a decision tree with multi-layer perceptrons at each node, giving non-linear and multivariate splits.


In the last two decades, efforts have been made to optimize the structure of neural trees by pruning techniques [25, 26]. In [25], the tree is pre-pruned by removing some patterns that lead to over-fitting and add quite a few levels to the tree. In [26], a uniformity factor has been introduced to pre-prune the tree branches; such a factor stops the tree from growing further by accepting some error, bounded by a threshold. Different variants of stochastic search methods have been used for structure and parameter identification [8, 10, 16]. Based on the kind of optimization strategy, different kinds of neural trees have been distinguished in the literature, such as the balanced neural tree (BNT) [25], the flexible neural tree (FNT) [27], and the generalized neural tree (GNT) [28]. Different variants of artificial neural networks, such as high-order perceptrons (HOP) [29], multi-layer perceptrons (MLP) [26], and radial-basis functions (RBF) [30, 31], have been used as decision taking units at the nodes of the neural tree. Apart from the several kinds of hybrid neural trees, exploiting more than one type of classification unit has also been undertaken [8, 32, 33, 34, 35].

Although several kinds of neural tree classifiers employing various learning machines such as MLP, HOP, RBF, etc. have been proposed in the literature, the use of a single layer perceptron (SLP) as the decision taking unit has its own advantages. SLPs are simple to implement and computationally cheaper than the above-mentioned learning machines, and they operate with a minimum number of ad-hoc parameters. In the proposed COF-NT, the need for ad-hoc parameters such as the learning rate, the number of iterations, the error tolerance threshold, etc. is completely eliminated.

Neural trees based on single layer or multi-layer neural networks use conventional iterative learning schemes to optimize the mean squared error (MSE) objective function. The conventional MSE is a non-convex objective function because the non-linear activation functions appear inside the squared errors. As a result, the learning process has a tendency to get stuck in local minima and does not obtain the optimal solution. The consequence is an enlarged tree structure with a poor generalization capability, or a non-converging tree building process. In addition, due to the iterative learning schemes, these kinds of neural trees need longer training times. In this paper, we propose a new neural tree classifier, called the Convex Objective Function Neural Tree (COF-NT), that uses a specialized single layer perceptron with a convex objective function.


In this objective function, the mean squared error is computed before the non-linear activation function. As the new objective function is convex, the optimal solution can be obtained analytically by setting the derivatives to zero, which gives a system of linear equations. Thus, the weights of the COF-NT network are computed by solving a system of linear equations, which is much faster than iterative schemes.

2. Description of the COF-NT classifier


A single layer perceptron is easy to train compared to a multi-layer perceptron. However, although a single layer perceptron uses a non-linear activation function in its output layer, it still has only a linear discrimination capability [36]. Non-linear performance can be obtained by using single layer perceptrons in a non-linear structure, such as a tree structure. The proposed COF-NT classifier has a specialized single layer perceptron with a convex objective function at each node; this perceptron is discussed in later sections. A neural tree learns the training set (TS) by partitioning it into smaller subsets called local training sets (LTS). It has a unique root node, several internal nodes, and several leaf nodes.

2.1. Training phase


The training process starts at the root node with the whole training set as input. A single layer feed-forward neural network with a convex objective function (perceptron) is trained at the root node. The trained perceptron splits the training set into subsets depending on the activation values generated for each of the outputs. The winner-takes-all rule is applied to classify a pattern, i.e. a pattern belongs to the class having the highest activation value. Thus, all the patterns present in the current LTS are divided into groups (new LTSs). A child node is added to the tree at the next level for each LTS, and a single layer neural network is trained at each child node to learn the corresponding LTS. The tree keeps growing by adding child nodes, and the training process proceeds until all the LTSs become homogeneous. A homogeneous LTS is a set consisting of patterns that belong to a single class, i.e. no classes are mixed. When an LTS becomes homogeneous, the node corresponding to this LTS is marked as a leaf node and labeled by the class to which the patterns of this LTS belong. When all the nodes at the current (latest) level become leaf nodes, the tree stops growing and the training process is complete. Sometimes the perceptron is unable to divide the LTS into further groups (see discussion in later sections). In such a case, the perceptron is replaced by a binary classifier that divides the LTS based on the feature having maximum variance between the two dominating classes present in the LTS. Such a classifier splits the LTS into two LTSs. The tree model is then ready for classifying unseen patterns of a similar type. A flowchart describing the training phase of COF-NT is shown in Fig 1.

Figure 1: Flow chart diagram of COF-NT (Training phase)
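As a concrete illustration of this procedure (a minimal sketch, not the authors' implementation), the following Python/NumPy code grows a COF-NT recursively. It assumes integer-coded labels 0..m-1 and two hypothetical helpers: train_cof_perceptron, which one-hot encodes the labels and returns the (n+1) x m weight matrix of the convex-objective perceptron of Section 2.3 (a possible version is sketched there), and binary_split, the rule of Section 2.4.

```python
import numpy as np

class Node:
    """One COF-NT node: a perceptron node, a binary split node, or a leaf."""
    def __init__(self, kind, **kw):
        self.kind = kind          # 'leaf', 'perceptron' or 'binary'
        self.__dict__.update(kw)  # label, W, feature, threshold, children ...

def grow_cof_nt(X, y, train_cof_perceptron, binary_split):
    """Recursively grow a COF-NT from patterns X (P x n) and labels y (P,)."""
    classes = np.unique(y)
    if classes.size == 1:                       # homogeneous LTS -> leaf node
        return Node('leaf', label=int(classes[0]))

    W = train_cof_perceptron(X, y)              # assumed (n+1) x m weight matrix
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    winner = np.argmax(Xb @ W, axis=1)          # winner-takes-all rule

    groups = [np.where(winner == c)[0] for c in range(W.shape[1])]
    groups = [g for g in groups if g.size > 0]
    if len(groups) <= 1:                        # perceptron failed to split the LTS
        feat, thr = binary_split(X, y)          # Section 2.4 rule (sketched later)
        left = X[:, feat] > thr
        return Node('binary', feature=feat, threshold=thr,
                    children=[grow_cof_nt(X[left], y[left], train_cof_perceptron, binary_split),
                              grow_cof_nt(X[~left], y[~left], train_cof_perceptron, binary_split)])

    children = {}
    for g in groups:                            # one child node (and LTS) per winning class
        c = int(winner[g[0]])
        children[c] = grow_cof_nt(X[g], y[g], train_cof_perceptron, binary_split)
    return Node('perceptron', W=W, children=children)
```

The recursion mirrors the text: homogeneous LTSs become labeled leaves, winner-takes-all creates one child LTS per winning class, and a degenerate split falls back to the binary classifier.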




2.2. Classification phase

The already built tree model is traversed in a top-down manner to classify a test pattern. A test pattern starts traversing the tree at the root node and moves down until it reaches a leaf node. The label of the leaf node gives the class of the test pattern. Which of the several paths in the tree a pattern takes is decided by the weights of the single layer perceptron stored at each node during the training process. Using these weights, the perceptron generates an activation value for the pattern for each of the possible classes; the highest activation value determines the class and, in turn, the path taken by the test pattern. In cases where a test pattern reaches a binary split node, a prefixed threshold decides the path to be taken. A flowchart describing the classification phase of COF-NT is shown in Fig 2. Let us now explain the two types of nodes and the learning techniques used by COF-NT in more detail.

Figure 2: Flow chart diagram of COF-NT (Classification phase)
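The traversal can be sketched in a few lines under the same assumptions as the training sketch above (the Node layout is the hypothetical one introduced there, not the authors' data structure):

```python
import numpy as np

def classify(node, x):
    """Route a single pattern x down the COF-NT sketched in Section 2.1."""
    while node.kind != 'leaf':
        if node.kind == 'binary':                  # prefixed threshold decides the path
            node = node.children[0] if x[node.feature] > node.threshold else node.children[1]
        else:                                      # perceptron node: winner-takes-all
            act = np.hstack([1.0, x]) @ node.W     # one activation value per class
            order = np.argsort(act)[::-1]
            node = next(node.children[c] for c in order if c in node.children)
    return node.label
```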


2.3. Perceptron training rule

Let T_s = {X_1, X_2, ..., X_p, ..., X_P} be the training set containing P patterns, where X_p = {x_{1p}, x_{2p}, ..., x_{np}, c_i}, i ∈ [1, m], is an n-dimensional pattern belonging to one of the m classes. The architecture of the single layer feed-forward neural network used at each node is shown in Fig 3. Let x_{jp} be the inputs and y_{ip} the outputs of the ANN, where j = 0, 1, 2, ..., n, i = 1, 2, ..., m and p = 1, 2, ..., P. The ANN consists of a single layer of output neurons with non-linear activation functions f_1, f_2, ..., f_m. Input and output are related by the following equation:

    y_{ip} = f_i(z_{ip}) = f_i\Big( \sum_{j=0}^{n} w_{ij} x_{jp} \Big)    (1)

where w_{ij}, j = 1, 2, ..., n, are the weights and w_{i0} is the bias associated with the i-th of the m neurons.

Figure 3: Architecture of the adopted single layer feed-forward perceptron

The system given by eq. 1 has m × P equations and m × (n + 1) unknowns. Since the data set is usually large, i.e. P ≫ (n + 1), this system of equations does not have an analytic solution, so an estimated solution is computed using iterative techniques that optimize an objective function. Such an objective function is based on the differences between the desired targets and the computed responses. Several kinds of objective functions have been proposed by researchers; the one used in most iterative optimization techniques is based on the mean squared error (MSE) criterion. Let t_{ip} be the desired target of the i-th neuron for the p-th pattern; then the error between the desired target and the computed response of the neuron (from eq. 1) is defined as:

    e_{ip} = t_{ip} - y_{ip} = t_{ip} - f_i\Big( \sum_{j=0}^{n} w_{ij} x_{jp} \Big)    (2)

To estimate the weights w_{ij}, the mean squared error has to be minimized, which is defined as:

    MSE = \sum_{p=1}^{P} \sum_{i=1}^{m} e_{ip}^2 = \sum_{p=1}^{P} \sum_{i=1}^{m} \Big( t_{ip} - f_i\Big( \sum_{j=0}^{n} w_{ij} x_{jp} \Big) \Big)^2    (3)

Gradient descent methods can be used to find the stationary points of this function. The function given by eq. 3 is non-linear in the weights due to the presence of the non-linear functions f_i; therefore, the absence of local minima is not guaranteed. This fact has been demonstrated in [37]. Thus, the gradient descent method may get stuck in a local minimum while trying to reach the global optimum of this objective function.

In order to avoid the problem of getting stuck in local minima, we have adopted the objective function demonstrated in [38] for training the single layer feed-forward neural network at the nodes of the neural tree classifier. This particular objective function calculates the errors before the non-linear activation functions, as shown in Fig 3. The variable z_{ip} is used instead of y_{ip}, where z_{ip} is the output before the non-linearity and y_{ip} is the output after the non-linearity. Let f_i be an invertible activation function and let f_i^{-1} and f_i' be its inverse and its derivative. Then the minimization of the MSE between t_{ip} and y_{ip} at the output of the non-linearity is approximately equivalent, up to a first-order Taylor series expansion, to the minimization of the MSE before the non-linearity, i.e. between z_{ip} and \bar{t}_{ip} = f_i^{-1}(t_{ip}), where the errors are weighted by the value of the derivative of the non-linearity at the corresponding operating point. Mathematically, this is formulated as:

    \min_{w_{ij}} MSE_i(after) \approx \min_{w_{ij}} MSE_i(before)    (4)

or equivalently

    \sum_{p=1}^{P} \big( t_{ip} - f_i(z_{ip}) \big)^2 \approx \sum_{p=1}^{P} \big( f_i'(\bar{t}_{ip}) (\bar{t}_{ip} - z_{ip}) \big)^2    (5)

or equivalently

    \sum_{p=1}^{P} \Big( t_{ip} - f_i\Big( \sum_{j=0}^{n} w_{ij} x_{jp} \Big) \Big)^2 \approx \sum_{p=1}^{P} \Big( f_i'(\bar{t}_{ip}) \Big( f_i^{-1}(t_{ip}) - \sum_{j=0}^{n} w_{ij} x_{jp} \Big) \Big)^2    (6)

This theorem has been stated and proved in [38]; for more details on this result, readers are referred to [38]. Exploiting this result, it is possible to use either of the two objective functions for learning a feed-forward neural network. The advantage of using the right-hand-side objective function, defined before the non-linearity, is that it is easier to obtain the optimal solution, as this is a convex optimization problem and the absence of local minima is assured. The only requirement is that the activation function has to be invertible and differentiable. This requirement is not hard to fulfill, as several functions, such as sigmoid and logarithmic functions, have the required properties. In this case, the system of equations given by eq. 2 can be rewritten as:

    \bar{e}_{ip} = \bar{t}_{ip} - z_{ip} = f_i^{-1}(t_{ip}) - \sum_{j=0}^{n} w_{ij} x_{jp}    (7)

where p = 1, 2, ..., P and i = 1, 2, ..., m. It is worth noting that the network parameters w_{ij} are independent of the non-linear activation functions f_i, so the error is linear with respect to the network parameters. Now the aim is to minimize the following objective function MSE_i(before), for all i = 1, 2, ..., m:

    MSE_i(before) = \sum_{p=1}^{P} \big( f_i'(\bar{t}_{ip}) \bar{e}_{ip} \big)^2 = \sum_{p=1}^{P} \Big( f_i'(\bar{t}_{ip}) \Big( f_i^{-1}(t_{ip}) - \sum_{j=0}^{n} w_{ij} x_{jp} \Big) \Big)^2    (8)

The global optimum of this function can be obtained by computing the derivatives with respect to the parameters of the network (the weights) and setting them equal to zero:

    \frac{\partial MSE_i(before)}{\partial w_{ik}} = -2 \sum_{p=1}^{P} f_i'(\bar{t}_{ip}) \Big( f_i^{-1}(t_{ip}) - \sum_{j=0}^{n} w_{ij} x_{jp} \Big) x_{kp} f_i'(\bar{t}_{ip}) = 0    (9)

where k = 0, 1, 2, ..., n. This can further be written as:

    \sum_{j=0}^{n} A_{kj} w_{ij} = b_{ki}, \qquad k = 0, 1, 2, ..., n    (10)

where A_{kj} = \sum_{p=1}^{P} x_{jp} x_{kp} f_i'^2(\bar{t}_{ip}), b_{ki} = \sum_{p=1}^{P} \bar{t}_{ip} x_{kp} f_i'^2(\bar{t}_{ip}) and \bar{t}_{ip} = f_i^{-1}(t_{ip}).
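For concreteness (this worked instance is ours, not taken from the paper), with the logistic sigmoid activation, one of the two activations used in the experiments of Section 3, the quantities needed in eqs. 7-10 have closed forms:

    f_i(z) = \frac{1}{1 + e^{-z}}, \qquad \bar{t}_{ip} = f_i^{-1}(t_{ip}) = \ln\Big( \frac{t_{ip}}{1 - t_{ip}} \Big), \qquad f_i'(z) = f_i(z)\,(1 - f_i(z)) \;\Rightarrow\; f_i'(\bar{t}_{ip}) = t_{ip}(1 - t_{ip})

so the coefficients A_{kj} and b_{ki} of eq. 10 can be accumulated directly from the targets; the targets only need to be kept strictly inside (0, 1) for the inverse to exist.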

The system of linear equations given by eq. 10 has (n + 1) equations and (n + 1) unknowns for the i-th output, which leads to the existence of only one real solution (except for ill-conditioned problems). This unique solution corresponds to the global optimum of the objective function. Several computationally efficient methods can be used to solve such a system of equations [39, 40], with a complexity of order O(K^2) (where K = n + 1 is the number of weights of the network for the i-th output). A general triangular factorization computed by Gaussian elimination with partial pivoting has been used to solve the system of linear equations. No user-defined parameters such as the learning rate, the number of iterations, the error tolerance threshold, the initial weights or the network architecture (hidden layers and hidden nodes) are required. The proposed algorithm is therefore a parameter-free algorithm.
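The per-neuron computation is compact enough to state directly. The sketch below is an illustration under our own assumptions, not the authors' code: inv_act and d_act stand for f_i^{-1} and f_i', the targets T are assumed already encoded in the range of the activation, and numpy.linalg.solve (an LU factorization with partial pivoting) plays the role of the Gaussian elimination solver mentioned above.

```python
import numpy as np

def cof_weights(X, T, inv_act, d_act):
    """Weights of one convex-objective single layer perceptron (eq. 10).

    X : (P, n) input patterns; T : (P, m) targets in the range of the activation;
    inv_act, d_act : inverse and derivative of the activation function.
    Returns W of shape (n+1, m); column i solves sum_j A_kj w_ij = b_ki.
    """
    P, n = X.shape
    Xb = np.hstack([np.ones((P, 1)), X])        # x_0p = 1 carries the bias w_i0
    W = np.zeros((n + 1, T.shape[1]))
    for i in range(T.shape[1]):
        t_bar = inv_act(T[:, i])                # t_bar_ip = f_i^{-1}(t_ip)
        s = d_act(t_bar) ** 2                   # f_i'(t_bar_ip)^2 weights every pattern
        A = Xb.T @ (s[:, None] * Xb)            # A_kj = sum_p x_jp x_kp f'^2
        b = Xb.T @ (s * t_bar)                  # b_ki = sum_p t_bar_ip x_kp f'^2
        W[:, i] = np.linalg.solve(A, b)         # unique global optimum of eq. 10
    return W

# Possible usage with the logistic sigmoid:
#   inv_act = lambda t: np.log(t / (1 - t))
#   d_act   = lambda z: (1 / (1 + np.exp(-z))) * (1 - 1 / (1 + np.exp(-z)))
```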


2.4. Binary splitting rule

Training on smaller training sets is easier than training on bigger ones; splitting bigger training sets into smaller local training sets is therefore an essential function of neural tree training algorithms. Due to the linear behaviour of single layer perceptrons, sometimes the perceptron does not split the training set. In such a case, a binary splitting rule is used to divide the training set into two subsets. This splitting rule is based on two constraints: (1) the two subsets should have almost equal cardinality, and (2) the split should be based on the most discriminant feature, so that the binary split preserves the work done by the perceptrons and carries the training process forward in the right direction. To do so, first the two dominating classes (the classes with the highest cardinality) are located and their barycenters x̄_1 and x̄_2 are computed. Then, the component that maximizes the L1 distance between the two barycenters (the component with maximum variance between them) is identified; let k be this component. A hyperplane orthogonal to the k axis and passing through the median point of x̄_1 and x̄_2 is the required splitting hyperplane, as shown in Fig 4. Let the splitting hyperplane intersect the k axis at λ. Finally, a pattern X = {x_1, x_2, ..., x_k, ..., x_n} belongs to class 1 if x_k > λ; otherwise, it belongs to class 2.

Figure 4: Binary splitting rule
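A minimal sketch of this rule, matching the hypothetical binary_split helper assumed in the Section 2.1 sketch (again an illustration, not the authors' code):

```python
import numpy as np

def binary_split(X, y):
    """Binary splitting rule of Section 2.4: returns (feature index k, threshold lambda)."""
    labels, counts = np.unique(y, return_counts=True)
    c1, c2 = labels[np.argsort(counts)[-2:]]    # two dominating classes
    b1 = X[y == c1].mean(axis=0)                # barycenter of class c1
    b2 = X[y == c2].mean(axis=0)                # barycenter of class c2
    k = int(np.argmax(np.abs(b1 - b2)))         # component with the largest barycenter gap
    lam = 0.5 * (b1[k] + b2[k])                 # hyperplane through the median point
    return k, lam
```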

3. Experimental evaluation of the COF-NT classifier

The proposed COF-NT classifier has been tested on available benchmark data-sets [41, 42]. Classification accuracy, i.e. the percentage of correctly classified patterns, and the time taken for training (building the tree model) have been taken as the parameters for evaluating the proposed algorithm. These two parameters are selected for study because the purpose of the developed algorithm is to reduce the training time without reducing the classification accuracy. The proposed training algorithm is free of user-defined parameters such as the learning rate, the network topology (number of layers in the network and number of nodes in each layer), the number of iterations, the error tolerance threshold, etc. Thus there is no need to define ad hoc parameters as in [9, 25, 26].


Table 1: Description of the data-sets taken from [41, 42] (I/P: number of inputs; O/P: number of output classes; Attr: nu = numeric, no = nominal)

S.N. | Data-set             | size  | I/P | O/P | Attr
1    | Letter               | 20000 | 16  | 26  | nu
2    | Segment              | 2310  | 19  | 7   | nu
3    | Breast-W             | 799   | 9   | 2   | nu
4    | E.coli               | 314   | 7   | 8   | nu
5    | Hill-Valley          | 606   | 101 | 2   | nu
6    | Satellite landsat    | 6435  | 36  | 7   | nu
7    | Ionosphere           | 350   | 34  | 2   | nu
8    | Dermatology          | 336   | 35  | 7   | nu,no
9    | Glass identification | 214   | 9   | 7   | nu
10   | ViHASi               | 1440  | 72  | 20  | nu
11   | Iris                 | 150   | 4   | 3   | nu
12   | DNA                  | 3186  | 60  | 3   | no
13   | Waveform-5000        | 5000  | 40  | 3   | nu
14   | Heart-statlog        | 270   | 13  | 2   | nu,no
15   | Vehicle              | 946   | 18  | 4   | nu

The results have been compared with similar related algorithms reported in the literature: the Neural Tree (NT-SLP) using single layer perceptrons at each node [9], the Balanced Neural Tree (BNT) [25], the Neural Tree (NT-MLP) using a multi-layer perceptron at each node [26], the single layer neural network (SLNN) using a convex objective function [38], and Random Forest [43]. Out of these five classification algorithms used for comparison, four are tree based classifiers and SLNN is a linear classifier. Fourteen data-sets having different numbers of inputs and outputs have been taken from the machine learning repository of the University of California Irvine [41]. Another database, the Virtual Human Action Silhouette (ViHaSi), is taken from Kingston University's Digital Imaging Research Centre [42]; for feature extraction from ViHaSi, readers are referred to [44]. A brief description of the data-sets used for evaluation is given in Table 1; more details can be found in [41, 42, 44]. For the experiments, a ten-fold cross validation has been performed, i.e. the data-set is divided into ten subsets randomly. Each time, nine out of the ten subsets are used for training, with the remaining one used for testing.
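A short sketch of this evaluation protocol, reusing the hypothetical fit/predict functions from the earlier sketches (the random ten-way split is our assumption of how the folds were drawn; the paper does not state the exact splitting code):

```python
import numpy as np

def ten_fold_accuracy(X, y, fit, predict, seed=0):
    """Mean and standard deviation of accuracy over a random ten-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 10)
    accs = []
    for k in range(10):
        te = folds[k]
        tr = np.hstack([folds[j] for j in range(10) if j != k])
        model = fit(X[tr], y[tr])               # e.g. grow_cof_nt from the Section 2.1 sketch
        pred = np.array([predict(model, x) for x in X[te]])
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs)), float(np.std(accs))
```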


Mean classification accuracy over the ten experiments, along with the standard deviation, obtained by each of the algorithms listed above and by the proposed algorithm is given in Table 2. The experiments have been performed for two activation functions, i.e. the sigmoid function and the logarithmic function; the results obtained are similar for each, and we report here the results with the logarithmic activation function. The classification accuracy obtained with the proposed COF-NT algorithm is generally the best among all the algorithms: the proposed COF-NT algorithm gains 11 wins out of 15, whereas NT-MLP wins 1 out of 15 and Random Forest wins 3 out of 15. However, these missed wins can be explained by the reduced training time. Moreover, sometimes the classification accuracy is relatively low (below 80%) due to the relatively high magnitude of the error. A small error before the activation function propagates into a more accurate Taylor series expansion; therefore, when the error is high, the Taylor series approximation is not so exact. From the error analysis it has been observed that for an error higher than 2.224e-2, the proposed method does not perform so well.

The main objective of the proposed algorithm was to reduce the training time compared to neural tree based algorithms without reducing the classification accuracy, and this has been achieved. It has been established from the experimental results listed in Table 3 that the proposed algorithm takes less time to train a model on a given training set than the other neural tree algorithms (NT-SLP, BNT, NT-MLP). The NT-MLP algorithm wins once over the proposed COF-NT algorithm; this can be explained by the time taken to train the model and the standard deviation of the classification. Due to the requirement of more user-defined parameters, the results obtained by NT-MLP are not so stable; moreover, the CPU time needed for training is greater compared to the proposed COF-NT. Comparing the proposed COF-NT with SLNN [38], it is clear that when the SLNN is used multiple times in a non-linear tree structure, it is more powerful. The time taken to build the tree model is greater compared to SLNN, but still not high enough to negate the benefit of the increased classification accuracy. Random Forest is a well-established algorithm for classification; the results given by the proposed algorithm are either slightly better or very close to those of Random Forest, which wins three times over the proposed algorithm. A comparison in terms of tree size is also given in Table 4, where the size of COF-NT is compared with BNT and NT-SLP. From the table it is clear that the tree built using the COF-NT algorithm is more compact, with a smaller number of perceptron and binary split nodes. As a result, the classification accuracy is improved by avoiding over-fitting of the data, and the classification time improves due to the shorter traversal time needed to reach a leaf node.

A detailed statistical analysis has been done to investigate whether the results obtained with the proposed algorithm are significantly better than those obtained with the other algorithms. Two-tailed, pair-wise F-tests and t-tests have been conducted between COF-NT and BNT, COF-NT and NT-MLP, and COF-NT and Random Forest. A two-tailed F-test at the 5% significance level was conducted in order to check the equality of variances of the results over the 10 runs (10 fold cross validation). The computed values of the F-statistics are listed in Table 5 for all three pairs of algorithms. In the case of COF-NT and BNT, all the computed F-statistic values lie within the range of the two-tailed F-critical values except for two datasets; hence, the null hypothesis H0 of equal variances may be accepted in all cases except those two. In the case of COF-NT and NT-MLP, the F-statistic lies within the range of the F-critical values except for five datasets. As the variance of NT-MLP is higher than that of COF-NT, COF-NT is better than NT-MLP. In the case of COF-NT and Random Forest, the null hypothesis can be accepted for all datasets except three.


Table 4: Depth of the tree built, number of perceptron nodes and number of split nodes (reported as depth, perceptron nodes, split nodes) for COF-NT, NT-SLP and BNT

Sr.N | COF-NT          | NT-SLP            | BNT
1    | 15, 2523, 1910  | 24, 3435, 3310    | 21, 2921, 2401
2    | 11, 160, 101    | 16, 260, 215.3    | 12, 239, 201
3    | 6, 20.9, 4      | 8.9, 23.8, 9.2    | 7.4, 21.1, 7
4    | 4.1, 11, 0      | 5, 16, 0          | 3, 12
5    | 8.2, 188, 39    | 14.2, 262.7, 68   | 10.2, 252, 49
6    | 12.3, 459, 234  | 17, 752, 704      | 14, 592
7    | 3, 4, 0         | 3, 4, 0           | 3, 4, 0
8    | 9.2, 49, 35     | 12.3, 68.2, 56.7  | 9.9, 53.2, 38
9    | 9, 38, 19       | 11.2, 48.1, 23.2  | 9.8, 45, 21
10   | 9, 98.2, 39     | 12, 149.4, 58     | 11, 129.1, 49
11   | 7, 9.9, 3.8     | 7.5, 11.7, 4.5    | 3, 4
12   | 6, 25, 8        | 11, 42, 6         | 5, 19
13   | 13.2, 188, 27   | 17.1, 220.4, 59.9 | 15, 201.4, 41
14   | 6.2, 49.1, 19   | 9.2, 89.3, 26.9   | 7, 68.3, 37.1
15   | 7.3, 79.1, 29   | 11.3, 102.3, 39.1 | 8.1, 91.1, 48

Table 2: Mean classification accuracy along with standard deviation for the data-sets given in Table 1 for a ten fold cross validation performed by proposed COF-NT, BNT ([25]), NT-SLP ([9]), NT-MLP ([26]), SLNN ([38]) and Random Forest ([43])

Sr.N | COF-NT     | BNT        | NT-SLP     | NT-MLP     | SLNN        | Random Forest
1    | 87.50±1.59 | 83.99±1.02 | 84.82±1.90 | 82.90±2.12 | 77.33±0.89  | 93.29±1.29
2    | 93.76±1.13 | 91.47±1.99 | 90.51±1.49 | 95.06±1.23 | 85.21±0.95  | 96.39±1.17
3    | 96.56±0.61 | 94.86±0.90 | 94.23±2.12 | 93.27±2.10 | 95.63±0.19  | 96.13±0.76
4    | 82.90±0.29 | 82.42±0.36 | 82.33±1.39 | 81.71±2.12 | 80.41±1.20  | 82.44±0.41
5    | 60.03±0.45 | 58.91±0.60 | 58.03±0.59 | 57.24±0.89 | 57.13±12.11 | 60.56±0.53
6    | 86.99±1.45 | 84.01±0.91 | 82.60±1.69 | 86.10±2.49 | 79.58±2.10  | 86.01±1.40
7    | 94.76±0.39 | 94.16±0.59 | 94.04±2.12 | 91.16±3.13 | 86.60±0.56  | 92.80±0.89
8    | 95.14±0.01 | 94.02±0.20 | 89.94±2.17 | 94.72±2.13 | 91.06±0.29  | 94.05±0.69
9    | 72.23±0.78 | 70.23±0.39 | 69.23±2.10 | 73.10±2.10 | 69.29±0.48  | 66.60±0.88
10   | 99.08±0.49 | 98.19±0.98 | 98.01±0.90 | 97.11±0.97 | 97.14±1.11  | 98.11±1.29
11   | 98.09±0.39 | 96.07±0.39 | 96.5±1.49  | 97.33±0.23 | 96.12±0.29  | 95.33±0.33
12   | 95.11±1.90 | 93.16±1.49 | 90.52±2.45 | 94.2±2.12  | 92.29±1.23  | 89.96±1.39
13   | 81.19±1.11 | 79.11±1.69 | 81.14±2.11 | 80.56±1.56 | 80.11±0.20  | 81.11±1.23
14   | 80.15±1.23 | 78.12±1.10 | 78.81±1.29 | 79.10±0.89 | 77.23±0.39  | 78.14±1.19
15   | 82.67±1.89 | 78.98±0.89 | 79.88±0.78 | 81.10±1.29 | 76.89±1.01  | 77.04±1.88

Table 3: Mean training-time in milliseconds along with standard deviation for the data-sets given in Table 1 for a ten fold cross validation performed by proposed COF-NT, BNT ([25]), NT-SLP ([9]), NT-MLP ([26]), SLNN ([38]) and Random Forest ([43])

Sr.N | COF-NT      | BNT            | NT-SLP         | NT-MLP          | SLNN       | Random Forest
1    | 69889.5±4.9 | 2823458.3±5.9  | 2914468.8±4.9  | 7123456.6±31.3  | 7823.3±6.7 | 5541±11.21
2    | 933.4±4.5   | 85490.4±5.6    | 80567.5±7.4    | 262341.5±6.9    | 180.1±8.2  | 473±9.10
3    | 36.5±2.6    | 1994.8±7.5     | 1490.3±11.34   | 23456.9±12.4    | 14.0±4.7   | 52±3.9
4    | 28.9±2.1    | 1282.42±9.0    | 1094.4±9.3     | 19834.5±5.6     | 19.4±4.3   | 113±6.89
5    | 48.3±1.3    | 2123.10±3.2    | 2083.1±3.4     | 244681±2.4      | 19.4±2.1   | 413±12.10
6    | 55886.9±4.5 | 1589482.1±9.9  | 1448982.6±4.9  | 3672344.6±15.3  | 779.2±3.6  | 439.3±8.10
7    | 29.9±5.6    | 991.6±4.9      | 990.4±4.6      | 1256.4±2.8      | 19.8±4.9   | 162.3±7.12
8    | 95.1±7.1    | 1987.2±5.8     | 1984.4±1.9     | 23785.6±5.8     | 17.6±5.9   | 52±3.19
9    | 79.3±8.2    | 1329.3±4.9     | 1425.3±9.9     | 34124.4±3.9     | 20.2±4.3   | 53±6.10
10   | 98.9±7.34   | 2279.11±2.9    | 2312.8±5.2     | 46235.7±8.8     | 33.2±4.1   | 312±2.10
11   | 20.9±1.9    | 1096.7±8.1     | 1192.5±2.9     | 24675.7±7.5     | 9.2±1.4    | 23±1.9
12   | 589.1±6.9   | 34791.6±9.9    | 33891.5±8.9    | 123895.5±4.6    | 94.2±3.5   | 571±9.12
13   | 1981.9±9.1  | 689179.1±7.9   | 712343.4±9.9   | 1167451.1±11.2  | 580.1±2.9  | 1912±4.10
14   | 130.4±9.0   | 15678.4±9.3    | 16345.5±4.9    | 115634.5±6.9    | 26.5±4.5   | 91±3.11
15   | 240.5±3.59  | 23123.4±12.1   | 23567.6±8.4    | 106751.6±8.9    | 33.5±4.6   | 342±4.23

A two-tailed t-test assuming equal variances was performed at the 5% significance level; the results are shown in Table 6. It can be observed that the t-statistic values are higher than the t-critical value for all datasets in the case of COF-NT and BNT, so the means of COF-NT are significantly better than those of BNT. In the case of COF-NT and NT-MLP, the t-statistic is higher than the t-critical value for nine datasets, so for those nine datasets the means of COF-NT are significantly better than those of NT-MLP. In the case of Random Forest, the t-statistic is higher than the t-critical value for 12 datasets, so for those 12 datasets the means of COF-NT are better than those of Random Forest. Overall, the proposed algorithm is better than all the other algorithms tested in this paper.

4. Conclusions

The proposed COF-NT algorithm employs a single layer feed-forward neural network at each node. The adopted neural network is trained by a learning method that is based on a convex objective function. This objective function computes the errors before the non-linear activation functions instead of after them, as is usually the case. In this way, the convex objective function ensures the absence of local optima, and the global optimum is obtained by solving a square system of linear equations. As a result, the algorithm is computationally faster. The experiments carried out show that the proposed neural tree algorithm is faster, and its classification accuracy is in general higher, compared to existing neural tree based algorithms.

Acknowledgments

One of the authors, Asha Rani, is grateful to the Department of Science and Technology (DST), India, for financial support under grant number 'SR/WOS-A/ET96/2013'.

Table 5: Statistical analysis (F-test) for pairs of algorithms (H = 1 indicates rejection of the null hypothesis of equal variances at the 5% level)

Sr.N | COF-NT and BNT     | COF-NT and NT-MLP  | COF-NT and Random Forest
     | H   F-stat   DOF   | H   F-stat   DOF   | H   F-stat   DOF
1    | 0   2.4299   (9,9) | 0   1.7778   (9,9) | 0   1.5192   (9,9)
2    | 0   3.1013   (9,9) | 0   1.1848   (9,9) | 0   1.0720   (9,9)
3    | 0   2.1768   (9,9) | 1   11.8483  (9,9) | 0   1.5523   (9,9)
4    | 0   1.5410   (9,9) | 1   17.1233  (9,9) | 0   1.9988   (9,9)
5    | 0   1.7778   (9,9) | 0   3.9116   (9,9) | 0   1.3872   (9,9)
6    | 0   3.2045   (9,9) | 0   2.7927   (9,9) | 0   1.0727   (9,9)
7    | 0   2.2889   (9,9) | 1   64.5161  (9,9) | 1   5.2078   (9,9)
8    | 1   400      (9,9) | 1   5000     (9,9) | 1   0.0047   (9,9)
9    | 0   4        (9,9) | 1   7.2464   (9,9) | 0   1.2728   (9,9)
10   | 0   4        (9,9) | 0   3.9188   (9,9) | 1   6.9309   (9,9)
11   | 0   1        (9,9) | 0   2.8752   (9,9) | 0   1.3967   (9,9)
12   | 0   1.6261   (9,9) | 0   1.2450   (9,9) | 0   1.8684   (9,9)
13   | 0   2.3180   (9,9) | 0   1.9751   (9,9) | 0   1.2279   (9,9)
14   | 0   1.2503   (9,9) | 0   1.9100   (9,9) | 0   1.0684   (9,9)
15   | 1   4.5097   (9,9) | 0   2.1466   (9,9) | 0   1.0107   (9,9)

DOF: (dof in numerator, dof in denominator); F-critical = (0.2483, 4.02599)

Table 6: Statistical analysis (t-test) for pairs of algorithms (H = 1 indicates rejection of the null hypothesis of equal means at the 5% level)

Sr.N | COF-NT and BNT              | COF-NT and NT-MLP           | COF-NT and Random Forest
     | H   t-stat   DOF     s2P    | H   t-stat   DOF     s2P    | H   t-stat   DOF     s2P
1    | 1   5.8758   (18,18) 1.7843 | 1   5.4654   (18,18) 3.5113 | 1   -8.9425  (18,18) 2.0961
2    | 1   3.1645   (18,18) 1.6182 | 1   -2.4613  (18,18) 1.3949 | 1   -6.7436  (18,18) 0.7605
3    | 1   4.9445   (18,18) 0.5911 | 1   4.7576   (18,18) 2.3910 | 0   1.3953   (18,18) 0.4748
4    | 1   3.2835   (18,18) 0.1068 | 0   1.758    (18,18) 2.2893 | 1   2.8966   (18,18) 0.1261
5    | 1   4.7223   (18,18) 0.2813 | 1   8.8467   (18,18) 0.4973 | 1   -2.4106  (18,18) 0.2417
6    | 1   5.5048   (18,18) 1.4653 | 0   0.9767   (18,18) 4.1513 | 0   1.5376   (18,18) 2.0313
7    | 1   2.6827   (18,18) 0.2501 | 1   3.6092   (18,18) 4.9745 | 1   6.3786   (18,18) 0.4721
8    | 1   17.6867  (18,18) 0.0201 | 0   0.6235   (18,18) 2.2685 | 1   4.9950   (18,18) 0.2381
9    | 1   7.2524   (18,18) 0.3802 | 0   -1.2281  (18,18) 2.5092 | 1   15.1401  (18,18) 0.6914
10   | 1   2.5687   (18,18) 0.6002 | 1   5.7325   (18,18) 0.5905 | 1   2.2229   (18,18) 0.9521
11   | 1   11.5817  (18,18) 0.1521 | 1   5.3081   (18,18) 0.1025 | 1   17.0840  (18,18) 0.1305
12   | 1   2.5539   (18,18) 2.9150 | 0   1.0108   (18,18) 4.0522 | 1   6.9178   (18,18) 2.7711
13   | 1   3.2531   (18,18) 2.0441 | 0   1.0405   (18,18) 1.8329 | 0   0.1527   (18,18) 1.3725
14   | 1   3.8903   (18,18) 1.3615 | 1   2.1870   (18,18) 1.1525 | 1   3.7140   (18,18) 1.4645
15   | 1   5.5857   (18,18) 2.1821 | 1   2.1697   (18,18) 2.6181 | 1   6.6785   (18,18) 3.5532

DOF: degrees of freedom; s2P: pooled variance; t-critical = 2.1009

References

[1] J. A. Anderson, E. Rosenfeld, Neurocomputing: Foundations of Research, Vol. 78, MIT Press, Cambridge, MA, 1988.

[2] G. Deffuant, Neural units recruitment algorithm for generation of decision trees, in: Proceedings of the International Joint Conference on Neural Networks, Vol. 1, San Diego, CA, 1990, pp. 637-642.

[3] A. Sankar, R. Mammone, Neural Network: Theory and Application, Academic Press Professional, San Diego, CA, USA, 1992, Ch. Neural Tree Networks, pp. 281-302.

[4] J. Sirat, J. Nadal, Neural trees: A new tool for classification, Neural Network 1 (1990) 423-448.

[5] O. T. Yildiz, E. Alpaydin, Omnivariate decision trees, IEEE Transactions on Neural Networks 12 (6) (2001) 1539-1546.

[6] A. Sankar, R. J. Mammone, Speaker independent vowel recognition using neural tree networks, in: Int. Joint Conf. on Neural Networks, Seattle, 1991, pp. 809-814.

[7] T. Li, Y. Tang, S. Suen, L. Fang, A. Jennings, A structurally adaptive neural tree for the recognition of large character set, in: 11th Int. Conf. on Pattern Recognition, IAPR - Conference B: Pattern Recognition Methodology and Systems, 1992, pp. 187-190.

[8] Y.-Q. Pan, Y. Liu, Y.-W. Zheng, Face recognition using kernel PCA and hybrid flexible neural tree, in: Int. Conf. on Wavelet Analysis and Pattern Recognition - ICWAPR, 2007, pp. 1361-1366.

[9] G. L. Foresti, G. Pieroni, Exploiting neural trees in range image understanding, Pattern Recognition Letters 19 (1998) 869-878.

[10] S. Bouaziz, A. Alimi, A. Abraham, Evolving flexible beta basis function neural tree for nonlinear systems, Neural Networks (IJCNN) (2013) 1-8.

[11] F. Qi, X. Liu, Y. Ma, Neural tree network ensemble mode for disease classification, in: Frontier and Future Development of Information Technology in Medicine and Education, LNEE, Springer, 2014, pp. 1791-1796.

[12] S. Kumar, A. Rani, C. Micheloni, Application of balanced neural tree for classifying tentative matches in stereo vision, Optical Engineering 51 (8) (2012) 087202.

[13] A. Rani, B. Raman, S. Kumar, A robust watermarking scheme exploiting balanced neural tree for rightful ownership protection, Multimedia Tools and Applications 72 (3) (2013) 2225-2248.

[14] A. Rani, B. Raman, Kumar, A fragile watermarking scheme exploiting neural tree for image tamper detection, in: International Conference on Soft Computing for Problem Solving (SocProS), 2011, pp. 547-554.

[15] D. Martinez, Neural tree density estimation for novelty detection, IEEE Trans. Neural Networks 9 (1998) 330-338.

[16] Y. Chen, B. Yang, Q. Meng, Small-time scale network traffic prediction based on flexible neural tree, Applied Soft Computing 12 (2012) 274-279.

[17] H. Teng, S. Liu, Y. Chen, Protein tertiary structural prediction based on a novel flexible neural tree, Journal of Chemical and Pharmaceutical Research 5 (2013) 678-683.

[18] B. Biswal, M. Biswal, S. Mishra, Automatic classification of power quality events using balanced neural tree, IEEE Trans. Industrial Electronics 61 (2014) 521-530.

[19] C. C. Tsai, M. C. Lu, C. C. Wei, Decision tree-based classifier combined with neural-based predictor for water-stage forecasts in a river basin during typhoons: A case study in Taiwan, Environ Eng Sci 29 (2) (2012) 108-116.

[20] P. E. Utgoff, Perceptron tree: A case study in hybrid concept representation, Connection Science 1 (4) (1989) 377-391.

[21] K. J. Cios, A machine learning method for generation of a neural-network architecture: A continuous ID3 algorithm, IEEE Transactions on Neural Networks 3 (1992) 280-291.

[22] S. Bouaziz, A. M. Alimi, A. Abraham, Extended immune programming and opposite-based PSO for evolving flexible beta basis function neural tree, in: IEEE International Conference on Cybernetics (CYBCONF), 2013, pp. 13-18.

[23] I. K. Sethi, Entropy nets: From decision trees to neural networks, in: Proceedings of the IEEE, Vol. 78, 1990, pp. 1605-1613.

[24] H. Guo, S. B. Gelfand, Classification trees with neural-network feature extraction, IEEE Transactions on Neural Networks 3 (1992) 923-933.

[25] C. Micheloni, A. Rani, S. Kumar, G. Foresti, Balanced neural tree for pattern classification, Neural Networks 27 (2012) 81-90.

[26] P. Maji, Efficient design of neural network tree using a single splitting criterion, Neurocomputing 71 (2008) 787-800.

[27] Y. Chen, B. Yang, J. Dong, A. Abraham, Time-series forecasting using flexible neural tree model, Information Sciences 174 (2005) 219-235.

[28] G. L. Foresti, C. Micheloni, Generalized neural trees for pattern classification, IEEE Trans. on Neural Networks 13 (6) (2002) 1540-1547.

[29] G. L. Foresti, T. Dolso, An adaptive high-order neural tree for pattern recognition, IEEE Trans. on Systems, Man, and Cybernetics - Part B: Cybernetics 34 (2) (2004) 988-996.

[30] T. Chen, C. Chang, C. Wu, D. Lou, On the security of a copyright protection scheme based on visual cryptography, Computer Standards & Interfaces 31 (2009) 1-5.

[31] M. Kubat, Decision trees can initialize radial-basis function networks, IEEE Transactions on Neural Networks 9 (5) (1991) 813-821.

[32] A. Rani, S. Kumar, C. Micheloni, G.-L. Foresti, Incorporating linear discriminant analysis in neural tree for multidimensional splitting, Applied Soft Computing 13 (10) (2013) 4219-4228.

[33] A. Rani, S. Kumar, DF-LDA tree: a nonlinear multilevel classifier for pattern recognition, Journal of Experimental & Theoretical Artificial Intelligence 25 (2) (2012) 177-188.

[34] M. F. Amasyal, O. Ersoy, Cline: A new decision-tree family, IEEE Transactions on Neural Networks 19 (2) (2008) 356-363.

[35] S. Bouaziz, H. Dhahri, A. Alimi, A. Abraham, A hybrid learning algorithm for evolving flexible beta basis function neural tree model, Neurocomputing 117 (2013) 107-117.

[36] M. Minsky, S. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA, 1969.

[37] S. Sontag, H. Sussmann, Back propagation can give rise to spurious local minima even for networks without hidden layers, Complex Systems 3 (1989) 91-106.

[38] F. R. Oscar, G. B. Bertha, P. S. Beatriz, A. B. Amparo, A new convex objective function for the supervised learning of single-layer neural networks, Pattern Recognition 43 (2010) 1984-1992.

[39] A. Bojanczyk, Complexity of solving linear systems in different models of computation, SIAM Journal on Numerical Analysis 21 (13) (1984) 591-603.

[40] G. Carayannis, N. Kalouptsidis, D. Manolakis, Fast recursive algorithms for a class of linear equations, IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-30 (2) (1982) 227-239.

[41] A. Asuncion, D. J. Newman, UCI machine learning repository, http://www.ics.uci.edu/~mlearn/MLRepository.html (2007).

[42] H. Ragheb, S. Velastin, P. Remagnino, T. Ellis, ViHASi: Virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods, in: ACM Int. Workshop on Vision Networks for Behaviour Analysis, Vancouver, Canada, 2008, pp. 77-84.

[43] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations 11 (1) (2009).

[44] A. Rani, S. Kumar, C. Micheloni, G. Foresti, Human action recognition using a hybrid NTLD classifier, in: Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Boston, USA, 2010, pp. 262-269.