
SIM-ELM: Connecting the ELM model with Similarity-Function Learning

Paolo Gastaldo,¹ Federica Bisio,¹ Sergio Decherchi,² and Rodolfo Zunino¹

Neural Networks (2015), http://dx.doi.org/10.1016/j.neunet.2015.10.011

¹ Dept. of Electrical, Electronics, and Telecommunications Engineering and Naval Architecture (DITEN), University of Genoa, Genova, Italy
{paolo.gastaldo, rodolfo.zunino}@unige.it, [email protected]

² Istituto Italiano di Tecnologia, IIT, Morego, Genova, Italy
[email protected]

Abstract

This paper starts from the affinities between two well-known learning schemes that apply randomization in the training process, namely, Extreme Learning Machines (ELMs) and the learning framework using similarity functions. These paradigms share a common approach involving data re-mapping and linear separators, but differ in the role of randomization within the respective learning algorithms. The paper presents an integrated approach connecting the two models, which ultimately yields a new variant of the basic ELM. The resulting learning scheme is characterized by an analytical relationship between the dimensionality of the remapped space and the learning abilities of the eventual predictor. Experimental results confirm that the new learning scheme can improve over conventional ELM in terms of the trade-off between classification accuracy and predictor complexity (i.e., the dimensionality of the remapped space).

1. Introduction

Randomization has been getting increasing attention in the area of machine learning, mostly thanks to the resulting simplicity and speed of the empirical training process. The Extreme Learning Machine (ELM) [1-3], for example, has recently emerged as a powerful and flexible paradigm in that context. The most interesting feature of the ELM framework is its ability to attain notable generalization performance by deploying a Single-hidden-Layer Feedforward Network (SLFN) in which all hidden-node parameters are set randomly. When compared with conventional neural networks, this configuration simplifies the training procedure, since ELM training just requires solving a linear system. In addition, the model confirms that randomization can play a significant role in inductive learning [1, 4, 5, 6, 7].

This paper addresses the general question of how randomization supports the overall learning process. The analysis considers the relationship between the ELM model and the framework discussed in [4], where a learning theory based on similarity functions is formalized. The latter work points out two interesting aspects. First, by tackling inductive learning with similarity functions instead of (today's popular) kernel functions, one is no longer tied to functions that 1) span high-dimensional spaces implicitly, and 2) rely on positive semi-definite matrices. Secondly, similarity functions support a two-stage learning paradigm that shares a common formalism with the ELM model. The first stage involves an explicit mapping of data: in the case of the framework presented in [4], a subset of patterns (landmarks) is randomly extracted from the available dataset, and training samples are all remapped depending on the similarity of each sample to those landmarks. In the second stage, a conventional learning algorithm sets a linear classifier in the remapped space.

In principle, even in the presence of a common approach based on data re-mapping and linear separators, the theory presented in [4] and the ELM model seem to propose different solutions for exploiting randomization in the respective learning paradigms. This paper shows that the framework [4] can stimulate novel investigations on ELM, ultimately leading to the definition of a new variant of the original ELM model: the Similarity ELM (SIM-ELM). The novel learning scheme formalized in this paper exhibits an interesting feature: SIM-ELM fits well within the theoretical framework that sets the conditions for learning through similarity functions [4]. As a major result, it inherits the fruitful properties that set a formal relationship between the dimensionality of the remapped space and the learning abilities of the eventual predictor.

The experimental verification of SIM-ELM addressed binary classification and involved six real-world complex benchmarks, namely, Ionosphere [8], Heart [8], Glass Identification [9], USPS [9], Cod-RNA [8], and Covertype [8]. Experimental results confirmed the effectiveness of SIM-ELM and proved that the proposed learning scheme can outperform the conventional basic-ELM approach. Empirical evidence showed that SIM-ELM can attain a valuable trade-off between classification accuracy and the dimensionality of the hidden layer in the eventual predictor (i.e., the dimensionality of the remapped space). As a consequence, SIM-ELM often scored better classification accuracy than the basic ELM model, while requiring a smaller number of mapping units in the hidden layer. This feature becomes even more important when envisioning electronic implementations of the predictor, since both computational complexity and storage requirements strictly depend on the size of the hidden layer. Experimental evidence also pointed out that SIM-ELM compares satisfactorily with a state-of-the-art supervised-learning paradigm such as Support Vector Machines (SVMs) [10].

The paper is organized as follows. Section 2 reviews ELM and the theory of learning with similarity functions. Section 3 illustrates convergences and divergences between the two learning schemes, thus setting the basis for the development of a novel learning algorithm that inherits the advantages of both approaches. Section 4 introduces two different implementations of the novel variant of the original ELM model, i.e., SIM-ELM 1 and SIM-ELM 2. Section 5 presents the experimental results. Finally, some concluding remarks are made in Section 6.

2. Background

2.1 Extreme Learning Machine

The ELM model [1] implements a single-hidden-layer feedforward neural network (SLFN) with N mapping neurons. The neuron's response to an input stimulus, x ∈ ℝ^Z, is implemented by any nonlinear piecewise continuous function a(x, R), where R denotes the set of parameters of the mapping function. The overall output function is then expressed as

$$ f(\mathbf{x}) = \sum_{j=1}^{N} w_j h_j(\mathbf{x}) \qquad (1) $$

where w_j denotes the weight that connects the j-th neuron with the output, and h_j(x) = a(x, R_j). The peculiar aspect of ELM, though, is that the parameters R_j are set randomly. Hence, if one uses, for example, classical Radial Basis Functions (RBFs) to implement a(·):

$$ a(\mathbf{x}, R) = \exp(-\sigma \|\mathbf{x} - \mathbf{c}\|^2) \qquad (2) $$

the parameters to be set randomly are the coordinates of the centroid, c ∈ ℝ^Z, and the quantity σ. Table 1 reports the random parameters to be set for three very popular choices of a(·): sigmoid function, RBF, and polynomial function.

Table 1 Examples of functions that can implement the neuron's response to an input stimulus in ELM.

Function    | a(x, R)                    | Random Parameters
sigmoid     | 1 / (1 + exp(−(xᵗr + b)))  | r ∈ ℝ^Z, b ∈ ℝ
RBF         | exp(−σ‖x − c‖²)            | c ∈ ℝ^Z, σ ∈ ℝ
polynomial  | (xᵗr + b)^e                | r ∈ ℝ^Z, b ∈ ℝ

Accordingly, in general, the training process reduces to the adjustment of the output layer, i.e., setting the vector w ∈ ℝ^N in (1). As a result, training ELMs is equivalent to solving a regularized least squares (RLS) problem in a linear space [1, 11]. Hence, let H denote a P × N matrix, where h_ij = h_j(x_i) and P is the number of samples in the training set T = {(x, y)_i; i = 1,…,P}; then, the eventual minimization problem can be expressed as

$$ \min_{\mathbf{w}} \left\{ \|\mathbf{y} - H\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2 \right\} \qquad (3) $$

The vector of weights w is then obtained as follows:

$$ \mathbf{w} = (H^{t}H + \lambda I)^{-1} H^{t}\mathbf{y} \qquad (4) $$
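For concreteness, the sketch below instantiates this procedure for RBF neurons in Python. It is a minimal illustration written for this text, assuming widths and centroids drawn uniformly at random; it is not the authors' implementation.

    import numpy as np

    def train_elm_rbf(X, y, N, lam, rng):
        # Random parameters R_j: a centroid c_j and a width sigma_j per neuron, eq. (2)
        C = rng.uniform(-1.0, 1.0, size=(N, X.shape[1]))
        sigma = rng.uniform(0.1, 2.0, size=N)
        # Hidden-layer matrix H (P x N): h_ij = exp(-sigma_j * ||x_i - c_j||^2)
        H = np.exp(-sigma * ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
        # Output weights, eq. (4): w = (H^t H + lam I)^(-1) H^t y
        w = np.linalg.solve(H.T @ H + lam * np.eye(N), H.T @ y)
        return C, sigma, w

Here rng is, e.g., np.random.default_rng(0); a new pattern is classified by rebuilding its hidden-layer row from the stored C and sigma and taking the sign of the product with w.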

The ELM model can be conveniently described as a two-stage learning machine. In the first stage, the data originally lying in the Z-dimensional space are remapped into a new N-dimensional space (the ELM feature space) by exploiting as many 'random' neurons. Then, RLS is used for learning the linear classifier in the N-dimensional space. Obviously, the same framework can be easily extended to regression problems. The literature showed that this learning scheme can attain notable performances in terms of prediction accuracy on a wide range of real-world problems [12-17].

2.2 Theory of Learning with Similarity Functions

Similarity-based classifiers predict the class of a test pattern based on 1) the similarities between the pattern and a set of labeled training samples, and 2) the pairwise similarities between the training samples. Kernel-based learning machines [18] are a popular and powerful family of such classifiers; in this case, the notion of similarity is expressed by adopting positive-semidefinite, symmetric kernel functions, which exploit an implicit mapping of data into a Hilbert space. The theoretical framework formalized in [4] proved that similarity-based classifiers can guarantee tight generalization bounds even if one adopts similarity functions that do not strictly fit the assumptions imposed by kernel functions.

According to [4], a similarity function over an input space X is any pairwise function K: X × X → [−1, 1]; K is a symmetric similarity function if K(x, x′) = K(x′, x) for all x, x′ ∈ X. The framework is indeed based on the following definition of a "good" similarity function, which reasonably assumes that the two classes are evenly represented in the training set T; the formalism l(x) denotes the label of pattern x.

Definition [4]: a similarity function K is an (ε, γ)-good similarity function for a learning problem L if there exists a bounded weighting function ω over X (ω(x′) ∈ [0, 1] for all x′ ∈ X) such that at least a 1 − ε probability mass of examples x satisfies

$$ \mathbb{E}_{\mathbf{x}' \sim L}\left[\omega(\mathbf{x}')K(\mathbf{x}, \mathbf{x}') \mid l(\mathbf{x}') = l(\mathbf{x})\right] \geq \mathbb{E}_{\mathbf{x}' \sim L}\left[\omega(\mathbf{x}')K(\mathbf{x}, \mathbf{x}') \mid l(\mathbf{x}') \neq l(\mathbf{x})\right] + \gamma \qquad (5) $$

The crucial aspect is that the above definition requires the bounded weighting function ω to exist, but it does not require that such a function be known a priori. In practice, K is an (ε, γ)-good similarity function if a weighting scheme exists such that the following condition holds: a set of (1 − ε)·P training patterns are, on average, 'more similar' to random samples of the same class than to random samples of the opposite class.

As a major result, the definition of an (ε, γ)-good similarity function leads to the learning procedure outlined as pseudocode in Fig. 1, which mainly includes two steps. The first step exploits a similarity function, and a set of 2d samples drawn at random from the training set (landmarks), to construct a new feature space, ℝ^{2d}, in which the original samples x ∈ ℝ^Z are re-mapped; the 2d landmarks should include d samples for each class. For a given pattern, x, the remapping requires computing the similarity between the pattern itself and each landmark. In the second step, a linear predictor is trained in the new feature space. The learning abilities of this procedure have been formally analyzed in [4].

Algorithm 1

Inputs: a training set T = {(x, y)_i; i = 1,…,P}; a similarity function K; the number of landmarks per class, d.

1. Initialize
   a. draw d positive examples from T: S+ = {x+_m; m = 1,…,d}
   b. draw d negative examples from T: S− = {x−_n; n = 1,…,d}

2. Mapping
   remap all the patterns x ∈ T by using the following mapping function:
   φ(x) = {K(x, x+_1), …, K(x, x+_d), K(x, x−_1), …, K(x, x−_d)}

3. Learning
   train a linear predictor in the transformed space φ: X → ℝ^{2d}

Figure 1. The learning scheme that exploits the theory of learning with (ε, γ)-good similarity functions [4].
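In code, Algorithm 1 is equally compact. The following Python sketch is an illustrative reading of it: the similarity function K is left generic, and the linear stage is realized with regularized least squares, which is one admissible choice, since [4] leaves the linear learner unspecified.

    import numpy as np

    def learn_with_similarity(X, y, K, d, lam, rng):
        # Step 1: draw d landmarks per class from the training set itself
        pos = rng.choice(np.flatnonzero(y == +1), size=d, replace=False)
        neg = rng.choice(np.flatnonzero(y == -1), size=d, replace=False)
        landmarks = X[np.concatenate([pos, neg])]
        # Step 2: phi(x) = (K(x, x_1+), ..., K(x, x_d+), K(x, x_1-), ..., K(x, x_d-))
        Phi = np.array([[K(x, l) for l in landmarks] for x in X])
        # Step 3: a linear predictor in R^(2d); here regularized least squares
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2 * d), Phi.T @ y)
        return landmarks, w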

If K is an (ε, γ)-good similarity function and d = (4/γ)² ln(2/δ), then with probability at least 1 − δ there exists a low-error (≤ ε + δ), large-margin (≥ γ) separator in the new feature space. An (ε, γ)-good similarity function is indeed a strict generalization of the notion of a good kernel [4]. Such similarity functions need not subsume valid semi-positive definite kernels to support the learning procedure suggested in Algorithm 1.

The learning scheme outlined in Algorithm 1 provides a viable approach to deal with the definition of an (ε, γ)-good similarity function, which relies on a weighting scheme ω that does not need to be known a priori. This in turn means that, in principle, for a training set T any admissible similarity function K supports an (ε, γ)-good similarity function. The actual settings of the parameters {ε, γ} depend on the eventual solution supported by the linear predictor (as per Step 3 of Algorithm 1), which is entitled to set the weighting scheme. The optimal, yet ideal, similarity function clearly is the one that admits a remapping of the original data into a space φ where the two classes are linearly separable (ε = 0) with the largest possible margin γ.

The margin, γ, and the "performance" parameter, δ, set a reference value d = (4/γ)² ln(2/δ) for the number of landmarks to be adopted by the predictor. The number of required landmarks expectably decreases as the margin provided by the mapping increases; likewise, the number of required landmarks decreases as the bound on the expected performance loosens. It is interesting to quantify d based on realistic values of γ and δ. Figure 2 gives a contour plot that provides the values of d as a function of both γ (x axis) and δ (y axis). The plot shows that, for example, if one wants to set δ = 0.2 (i.e., the bound on the expected classification error is ε + 20%), then 6,000 landmarks (3,000 × 2) are required when the similarity function guarantees a margin γ between 0.1 and 0.15, i.e., a significant margin. The number of landmarks grows to more than 10,000 if one wants to set a tighter bound on the expected classification error (δ < 0.1). In general, one should take into account that only large-scale datasets can provide the amount of landmarks required to deal with configurations where a tiny margin is involved and a reasonable δ is addressed. On the other hand, the quantity (ε + δ) sets an upper bound; hence, in practical situations, one may expect that effective performance can be obtained even when the amount of available landmarks runs into the hundreds or the tens.

Figure 2. Number of landmarks d as a function of margin γ and δ.
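These figures follow directly from the reference value; a quick numerical check, using the expression for d reconstructed above:

    import math

    def landmarks_per_class(gamma, delta):
        # d = (4 / gamma)^2 * ln(2 / delta)
        return (4.0 / gamma) ** 2 * math.log(2.0 / delta)

    print(landmarks_per_class(0.11, 0.2))   # ~3045 per class, i.e., ~6,000 landmarks overall
    print(landmarks_per_class(0.10, 0.05))  # ~5902 per class, i.e., ~11,800 landmarks overall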

3. ELM and Learning with Similarity Functions

The theory of learning with (ε, γ)-good similarity functions stimulates new insights into the ELM model. The two learning schemes feature quite interesting affinities. Both frameworks implement a two-step learning procedure, where the first step applies a remapping of the original input space X and the second step adjusts a linear predictor in the new space. Moreover, both frameworks apply a randomization process in the first stage. Since, on the other hand, each framework possesses its own peculiarities, Section 3.1 will analyze convergences and divergences between the two paradigms; Section 3.2 will exploit a toy example to help the reader understand the peculiarities of the mapping scheme φ(x).

3.1 Learning in a Remapped Space: the Role of Landmarks and Similarity Functions

A convenient way to highlight the relationships between the theory of (ε, γ)-good similarity functions and the ELM model is to rewrite the mapping function, Φ: ℝ^Z → ℝ^N, that ELM applies to an input pattern x ∈ ℝ^Z:

$$ \Phi(\mathbf{x}) = \{h_1(\mathbf{x}), \ldots, h_N(\mathbf{x})\} = \{a(\mathbf{x}, R_1), \ldots, a(\mathbf{x}, R_N)\} \qquad (6) $$

A comparison between the ELM mapping, Φ(x), and the mapping function of Algorithm 1, φ(x), suggests that the function a(x, R_j) can indeed be interpreted as a function K that assesses the similarity between the pattern x and a random point that lies in ℝ^Z. For example, if one implements the function a(·) with RBFs, each element h_j(x) may be interpreted as the similarity between x and a randomly selected centroid c_j. From this viewpoint, the mapping function Φ(x) uses a similarity function K (i.e., the RBF) and N randomly selected landmarks (i.e., the centroids) to remap the original input space ℝ^Z into a new space ℝ^N. One may also draw similar conclusions for different implementations of the function a(x, R_j). As a general rule, any admissible a(x, R_j) can be a function that measures the similarity (according to a given criterion) between a pattern x ∈ ℝ^Z and a random vector that lies in the same space.

In summary, both Φ(x) and φ(x) can be viewed as mapping functions that exploit the similarity between a pattern x and a number of landmarks to remap the original input space into a new space. A more detailed discussion of the convergences and divergences between Φ(x) and φ(x) should now address two main issues:

1. Landmark placement. The mapping scheme implemented by φ(x) relies on randomness to draw, from the available dataset T, a subset of samples to be used as landmarks. This highlights the fruitful properties of (ε, γ)-good similarity functions: in the case of φ(x), landmarks are provided by valid samples of the (unknown) domain distribution, p(X, Y). By contrast, the mapping scheme, Φ(x), of ELM applies a random sampling of the overall input space, X, to create the landmarks. As a consequence, ELM landmarks might end up in regions of the data space X that are not covered by the actual patterns in the available dataset, since in those regions the original (unknown) data distribution might vanish. This discrepancy is relevant, since the design of φ(x) is matched by theoretical relationships between the number of landmarks (2d) and the representation ability of the eventual learning machine (as per Sec. 2), whereas a similar theoretical property is not available for Φ(x) in the ELM model. This aspect involves the critical issue of trading off the generalization ability of the eventual predictor against its complexity, which directly relates to the dimensionality of the mapping space (i.e., the number of hidden nodes in the ELM model).

2. Similarity functions. The set of admissible similarity functions also includes functions that are characterized by a parameterization (e.g., σ for the RBF, b for the sigmoid and the polynomial). This in turn implies that, for example, the pair K₁ = exp(−0.1·‖x − c‖²) and K₂ = exp(−0.3·‖x − c‖²) actually comprises two different similarity functions. The mapping scheme implemented by φ(x) relies on one specific (ε, γ)-good similarity function K. Balcan and Blum [4], though, proved that the learning abilities of the procedure reported in Algorithm 1 remain unchanged even when some convex combination of F similarity functions {K₁, …, K_F} satisfies the definition of an (ε, γ)-good similarity function (see Theorem 3 in [4] for further details). In this case, a new mapping scheme ψ(x): X → ℝ^{2Fd} replaces the original mapping scheme φ(x):

$$ \psi(\mathbf{x}) = \{K_1(\mathbf{x}, \mathbf{x}^+_1), \ldots, K_F(\mathbf{x}, \mathbf{x}^+_1), \ldots, K_1(\mathbf{x}, \mathbf{x}^+_d), \ldots, K_F(\mathbf{x}, \mathbf{x}^+_d), K_1(\mathbf{x}, \mathbf{x}^-_1), \ldots, K_F(\mathbf{x}, \mathbf{x}^-_1), \ldots, K_1(\mathbf{x}, \mathbf{x}^-_d), \ldots, K_F(\mathbf{x}, \mathbf{x}^-_d)\} \qquad (7) $$

The mapping scheme ψ(x) sets a link between the theory of similarity functions and the area of kernel learning [4] [19]. In this case the learning process is supported by stage 3 of Algorithm 1. In fact, Algorithm 1 requires one to define both the similarity function(s) to be adopted and the associated parameter settings. Moreover, the quantity F has a direct impact on the overall complexity of the eventual learning machine, as the mapping space is 2Fd-dimensional.

The ELM model addresses this issue by assigning random values to the functions' parameters; as a result, a specific, random parameterization characterizes each neuron in Φ(x). In practice, this means that the ELM model may use N different similarity functions in the mapping stage (where N is the number of neurons). However, the ELM model does not apply the mapping procedure implemented by ψ(x). The mapping scheme Φ(x) in ELMs mostly resembles φ(x), although one cannot strictly assume that every neuron adopts the same function K. In this sense, the process of assigning a random parameterization to each single neuron is the peculiar aspect that makes the mapping scheme implemented by ELM different from the one theoretically formalized in [4].
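To make the distinction concrete, the following sketch (illustrative code, with an RBF similarity assumed) builds the ψ-mapping of (7), where the F random parameterizations are shared by all landmarks:

    import numpy as np

    def psi_map(X, landmarks, sigmas):
        # ||x_i - l_q||^2 for every pattern/landmark pair
        sq = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)   # (P, 2d)
        # Landmark-major ordering, as in (7): K_1..K_F on landmark 1, then landmark 2, ...
        cols = [np.exp(-s * sq[:, q]) for q in range(landmarks.shape[0]) for s in sigmas]
        return np.stack(cols, axis=1)   # (P, 2*F*d)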

Finally, it is interesting to comment on the convergences and divergences between the role played by landmarks in the theory of learning with similarity functions and the role played by support vectors in SVMs. In the theory of learning with similarity functions [4], as in the SVM theory, landmarks (support vectors) are provided by valid samples of p(X, Y). Indeed, support vectors are landmarks that eventually stem from a learning process. Hence, landmarks are set a priori, by exploiting randomness, while support vectors are set a posteriori.

3.2 An example on synthetic data

A synthetic example can clarify the peculiarities of the two mapping strategies, Φ(x) and φ(x). Figure 3(a) presents a toy problem involving sixteen patterns lying in a 2-dimensional space. The patterns are evenly divided between two classes (crosses and squares, respectively). In both cases, patterns are random samples drawn from a multivariate normal distribution with a common covariance matrix Σ: the first class (crosses) is characterized by μ₁ = [−0.3 −0.3]; the second class (squares) is characterized by μ₂ = [0.3 0.3]. Each pattern is numbered to facilitate the exposition. The role played by the mapping stage in Algorithm 1 can be illustrated by considering two examples of similarity functions:

• K₁: a sigmoid function with b = 0;
• K₂: an RBF with σ = 1.

The definition of an (ε, γ)-good similarity function suggests that the first crucial step is to evaluate the ability, for each available pattern, to contribute to a consistent remapping of the whole dataset, i.e., the ability of each pattern to represent a "good" landmark. Thus, Figure 4(a) provides the outcome that one would obtain by using K₁. The x-axis sets sixteen points, which correspond to the sixteen available landmarks: accordingly, the sixteen marks with x = 1 plot (on the y-axis) the similarity between the sixteen patterns and pattern #1, which covers the role of the landmark. The same presentation applies to the remaining fifteen spots on the x-axis: for x = 2 the role of the landmark is covered by pattern #2, and so on. Figure 4(b) applies the same format to show the results that one would obtain by using K₂ as the similarity function.

Figure 4(a) refers to experiments involving K₁; every pattern, when used as a landmark, can by itself provide a complete separation of the two classes. Moreover, all samples are 2γ = 0.36 more similar to a landmark of the same class than to a landmark of the opposite class. According to the definition of an (ε, γ)-good similarity function, this means that K₁ can provide a (0, 0.18)-good similarity function even under the trivial case in which the weighting scheme ω assigns the same weight to each landmark. This in turn means that, in principle, one may set up an effective classifier without explicitly training the linear separator (as per Step 3 of Algorithm 1). Figure 4(b) shows that, on the other hand, K₂ may represent a less effective choice for the problem at hand: only eight patterns out of sixteen would provide valuable landmarks under a plain weighting scheme.

Figure 3. The toy bi-class problem adopted in the proposed example: (a) training patterns; (b) training patterns and test patterns.

Figure 4. Data remapping by exploiting available data as landmarks: (a) similarity function: sigmoid; (b) similarity function: RBF.

However, one may expect that a weighting scheme exists that can improve the effectiveness of this mapping. Thus, in this (more realistic) case, training a linear separator in the remapped space can lead to enhancements. In this regard, it is interesting to compare the performance of two predictors realized according to the setup formalized in Algorithm 1. In the example, each predictor uses two landmarks per class (i.e., d = 2), and an RLS loss function is exploited to train a linear classifier in the transformed space φ. Predictor #1 adopts K₁ as its similarity function, while predictor #2 adopts K₂.

Table 2 Classification accuracy (%) scored by the two predictors on the test patterns of the toy example.

Test Pattern |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  | 10  | 11  | 12  | 13  | 14  | 15  | 16
Predictor #1 | 100 | 100 | 100 | 100 | 100 |  75 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Predictor #2 | 100 | 100 | 100 | 100 |  80 |  25 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100

The sixteen samples of Fig. 3(a) form the training set. The test set includes eight patterns per class, as per Fig. 3(b); for each class, patterns have been obtained by randomly sampling the distribution that already characterized the training set. The performance of each predictor can be assessed by evaluating how its classification accuracy (on the test set) varies under the 784 possible configurations of the mapping stage (four landmarks among the sixteen available patterns, two landmarks per class). In this regard, Table 2 gives, for each test pattern, the percentage of configurations that would lead to a correct prediction for predictor #1 and predictor #2, respectively. The numerical results provide the following outcomes:

• Test pattern #6 represents a critical sample for both predictors. Indeed, in the case of predictor #1, 75% of the admissible configurations of φ would lead to a correct prediction; in the case of predictor #2, only 25% of the admissible configurations would lead to a correct classification.

• Test pattern #5 represents a critical sample only for predictor #2, which would misclassify the pattern under 20% of the admissible configurations.

Table 2 shows that the random selection of the landmarks in the training stage may have a considerable impact on the performance of the eventual classifier. This in turn confirms that one can take advantage of the availability of good landmarks, i.e., landmarks for which the similarity function becomes a (0, γ)-good similarity function. Nonetheless, the results of Table 2 also attest to the role played by the weighting scheme, which can improve the overall performance of a predictor based on a 'weak' mapping. Accordingly, the eventual gap between predictor #1 and predictor #2 is not wide, even though the latter could not rely on an effective similarity function for the problem at hand.

The last aspect to be addressed is a comparison between the mapping scheme φ(x) and the mapping scheme (6) supported by the ELM model. For the purpose of setting up a fair comparison, one can tackle the classification problem of Fig. 3 by building an ELM that maps the original two-dimensional space into a four-dimensional space (i.e., four neurons in the hidden layer). Accordingly, the new experiment involves two different implementations of predictor #1, which indeed share the dimensionality of the remapped space:

• SIM: this implementation uses the available training patterns as the source of possible landmarks, as above; thus, 784 admissible configurations of the mapping stage are available (under the hypothesis that two landmarks per class should be used).

• ELM: this implementation can exploit as possible landmarks sixteen random samples that lie in ℝ² (in the interval [−1, 1]). Thus, 1820 admissible configurations of the mapping stage are available (four landmarks among sixteen available samples).

Fig. 5 provides the results of this experiment.

Fig. 5(a) refers to the SIM predictor and plots five different lines on the original two-dimensional space:

• +1: this perimeter encloses the portion of the space in which a pattern would be invariantly assigned to class "+1"; thus, all 784 admissible configurations of the predictor would classify as "+1" a pattern lying in this area.

• weak +1: this perimeter delimits the portion of the space in which a pattern would be assigned to class "+1" by no more than 75% of the available configurations; thus, given the 784 admissible combinations of the four landmarks, up to 196 configurations would generate a predictor that classifies as "−1" a pattern lying in this area.

• 0: a pattern that lies on this contour would be assigned to class "+1" by 50% of the admissible predictors, while the remaining 50% of the admissible predictors would classify it as "−1".

• weak −1: this perimeter delimits the portion of the space in which a pattern would be assigned to class "−1" by no more than 75% of the available configurations; thus, given the 784 admissible combinations of the four landmarks, up to 196 configurations would generate a predictor that classifies as "+1" a pattern lying in this area.

• −1: this perimeter encloses the portion of the space in which a pattern would be invariantly assigned to class "−1"; thus, all 784 admissible configurations of the predictor would classify as "−1" a pattern lying in this area.

Fig. 5(b) refers to the ELM predictor and shows the results of the experiment under the same setup. Overall, Fig. 5(a) confirms that test pattern #6 is the only critical sample for the SIM predictor, as 196 combinations of landmarks exist that would generate a predictor that misclassifies that pattern. On the other hand, Fig. 5(b) proves that the number of critical patterns increases when the candidate landmarks are random samples (i.e., when adopting the ELM predictor). This outcome is interesting since it validates the effectiveness of the mapping strategy proposed in [4], which exploits available samples as landmarks.

A comparison between the performances attained by the mapping scheme Φ implemented by ELM and those scored by the mapping scheme φ seems to suggest that the latter scheme may be more effective in terms of the trade-off between the dimensionality of the mapping space and the representation ability. Hence, for a similarity function K* and a dimensionality 2d* of the mapping space, one expects that, on average, a predictor based on the mapping scheme φ attains a classification performance that a predictor applying the scheme Φ cannot improve.

Figure 5. The results of the experiment designed to provide a comparison between the two mapping schemes: (a) SIM predictor; (b) ELM predictor.

This hypothesis appears reasonable when considering that the representation ability of the learning machine based on the mapping scheme φ has been formally determined [4], whereas a corresponding theoretical framework does not exist for the mapping scheme Φ. On the other hand, it is worth noting that the toy example involved two non-parametric similarity functions (as the free parameter was set to a definite value in both functions). This is a crucial aspect, because the peculiarities of the mapping scheme applied by the ELM model also lie in the use of randomization for setting the neurons' parameters. So, although the example is useful to get insights into the characteristics of the learning framework of Algorithm 1, one cannot draw any general conclusion about the learning abilities of that framework as compared with the ELM model.

4. SIM-ELM: a novel learning scheme for ELM

The convergences between the ELM model and the theoretical framework based on (ε, γ)-good similarity functions stimulate the design of novel learning schemes that may benefit from the properties of both paradigms. The research presented here proposes two different learning schemes, which augment the basic ELM model by using, in the mapping stage, landmarks drawn from the empirical training set. The most remarkable aspect of the learning scheme implemented by Algorithm 1 is the availability of a formal relationship between the properties of the similarity function (i.e., the parameters ε and γ), the number of landmarks d, and the representation ability of the eventual classifier. On the other hand, the ELM model suggests a viable approach to the problem of parameterization of the similarity function(s): the use of random settings. In this regard, it seems interesting to address two topics:

• Can the ELM learning model benefit from a mapping scheme that uses only available training patterns as candidate landmarks?

• Can the mapping scheme ψ(x) formalized in (7) benefit from a random setting of the parameterization of the similarity function(s)?

The two novel learning schemes outlined in Fig. 6 and Fig. 7, respectively, are designed to tackle these topics. These learning schemes will be denoted in the following as Similarity-ELM (SIM-ELM).

Algorithm 2 – SIM-ELM 1

Inputs: a training set T = {(x, y)_i; i = 1,…,P}; a regularization parameter λ; a similarity function K(x_i, x_j, R); the number of landmarks per class, d; the number of similarity-function parameterizations, F.

1. Initialize
   a. draw d positive examples from T: S+ = {x+_m; m = 1,…,d}
   b. draw d negative examples from T: S− = {x−_n; n = 1,…,d}
   c. set F random parameterizations: R_1, …, R_F

2. Mapping
   remap all the patterns x ∈ T by using the following mapping function:
   ψ(x) = {K(x, x+_1, R_1), …, K(x, x+_1, R_F), …, K(x, x+_d, R_1), …, K(x, x+_d, R_F), K(x, x−_1, R_1), …, K(x, x−_1, R_F), …, K(x, x−_d, R_1), …, K(x, x−_d, R_F)}

3. Learning
   train the predictor in the transformed space ψ: X → ℝ^{2Fd}:
   w = (HᵗH + λI)⁻¹Hᵗy, where h_ij = ψ_j(x_i)

Figure 6 Pseudo-code of the learning scheme that supports SIM-ELM 1.
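Read as code, Algorithm 2 simply splices the two ingredients together. The sketch below is an illustrative rendering for an RBF similarity; the width range and function names are assumptions of this illustration, not the authors' implementation.

    import numpy as np

    def sim_elm1(X, y, d, F, lam, rng):
        # Steps 1a-b: landmarks are actual training samples, d per class
        pos = rng.choice(np.flatnonzero(y == +1), size=d, replace=False)
        neg = rng.choice(np.flatnonzero(y == -1), size=d, replace=False)
        L = X[np.concatenate([pos, neg])]
        # Step 1c: F random parameterizations, shared by all landmarks
        sigmas = rng.uniform(0.1, 2.0, size=F)
        # Step 2: psi-mapping into R^(2Fd); the per-width column blocks differ
        # from the landmark-major order of (7), which is immaterial for the linear stage
        sq = ((X[:, None, :] - L[None, :, :]) ** 2).sum(axis=2)           # (P, 2d)
        H = np.concatenate([np.exp(-s * sq) for s in sigmas], axis=1)     # (P, 2Fd)
        # Step 3: RLS output weights, w = (H^t H + lam I)^(-1) H^t y
        w = np.linalg.solve(H.T @ H + lam * np.eye(2 * F * d), H.T @ y)
        return L, sigmas, w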

Algorithm 2 in Fig. 6 gives the pseudo-code of the first learning scheme, SIM-ELM 1. The mapping stage strictly inherits the approach applied by ELM, which in practice relies on a combination of activation/similarity functions (see Sec. 3.1 for details). Therefore, the proposed predictor exploits the setup implemented by ψ(x), where a convex combination of F similarity functions {K₁, …, K_F} is adopted. The F similarity functions correspond to as many different parameterizations of the single parameterized similarity K(x_i, x_j, R); those F parameterizations are set by using randomization, thus replicating the ELM setup. Finally, the learning stage also fully reproduces the ELM model: the training procedure is based on the minimization problem (3).

The second learning scheme, SIM-ELM 2, is outlined in Fig. 7. The algorithm differs from the previous one in the approach adopted in the mapping stage.

Algorithm 3 – SIM-ELM 2

Inputs: a training set T = {(x, y)_i; i = 1,…,P}; a regularization parameter λ; a similarity function K(x_i, x_j, R); the number of landmarks per class, d; the number of randomizations, F.

1. Initialize
   a. draw d positive examples from T: S+ = {x+_m; m = 1,…,d}
   b. draw d negative examples from T: S− = {x−_n; n = 1,…,d}
   c. set Q = 2d groups of random parameterizations: R_1 = {R_11, …, R_1F}, R_2 = {R_21, …, R_2F}, …, R_Q = {R_Q1, …, R_QF}

2. Mapping
   remap all the patterns x ∈ T by using the following mapping function:
   ψ(x) = {K(x, x+_1, R_11), …, K(x, x+_1, R_1F), …, K(x, x+_d, R_d1), …, K(x, x+_d, R_dF), K(x, x−_1, R_{d+1,1}), …, K(x, x−_1, R_{d+1,F}), …, K(x, x−_d, R_Q1), …, K(x, x−_d, R_QF)}

3. Learning
   train the predictor in the transformed space ψ: X → ℝ^{2Fd}:
   w = (HᵗH + λI)⁻¹Hᵗy, where h_ij = ψ_j(x_i)

Figure 7 Pseudo-code of the learning scheme that supports SIM-ELM 2.
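In code, the only change with respect to the SIM-ELM 1 sketch above is the mapping stage: each landmark now carries its own F random parameterizations. An illustrative rendering, under the same RBF-similarity assumptions:

    import numpy as np

    def sim_elm2_map(X, L, sigmas):
        # L: (2d, Z) landmarks; sigmas: (2d, F) per-landmark random widths
        sq = ((X[:, None, :] - L[None, :, :]) ** 2).sum(axis=2)           # (P, 2d)
        # Landmark q is scored only by its own F parameterizations
        blocks = [np.exp(-sq[:, q:q + 1] * sigmas[q][None, :])            # each (P, F)
                  for q in range(L.shape[0])]
        return np.concatenate(blocks, axis=1)                             # (P, 2Fd)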

In both cases, a pattern is remapped by computing its similarity with respect to a set of landmarks. As opposed to SIM-ELM 1, though, in SIM-ELM 2 each landmark is endowed with its own set of (random) parameterizations for the similarity function at hand. As a result, there is no common set of parameterizations shared by all landmarks; this in turn means that each landmark implements its own similarity function. For the sake of clarity, Fig. 8 schematizes the two mapping approaches. The mapping scheme adopted in SIM-ELM 2 actually diverges from the basic formulation, ψ(x); as a consequence, the learning scheme may not inherit all the properties that characterize the general framework based on (ε, γ)-good similarity functions. On the other hand, in this case the approach adopted to build ψ(x) resembles the one used in the original ELM model, where each neuron implements a different similarity function.

Figure 8 The two mapping approaches exploited by SIM-ELM: (a) SIM-ELM 1; (b) SIM-ELM 2.

5. Experimental Results

The experimental section aims at evaluating the accuracy of the proposed learning schemes. The main goal is to compare SIM-ELM with the conventional ELM model. For the sake of completeness, the comparison also involves the most important paradigm within the kernel-machines family, i.e., Support Vector Machines. In this regard, SVM not only sets a reference in terms of classification accuracy; it also provides an interesting reference in terms of support vectors, i.e., landmarks that have been set a posteriori. The latter quantity can in turn stimulate an analysis of the ability of the three different predictors to address the trade-off between classification accuracy and the eventual complexity of the predictor itself.

To robustly assess the performance of SIM-ELM, six different benchmarks have been involved in the experimental evaluation: Ionosphere [8], Heart [8], Glass Identification [9], USPS [9], Cod-RNA [8], and Covertype [8]. Each experimental session has been designed to provide a fair comparison between the generalization performances. Therefore, in each experiment, SIM-ELM 1, SIM-ELM 2 and ELM have been compared by defining a common setup for both the range of admissible λs (i.e., the regularization parameter) and the dimensionality, N, of the remapped space. The latter quantity directly depends on the number of adopted landmarks, d, in the case of SIM-ELM (as per Algorithm 2 and Algorithm 3). Thus, in each experiment d has been set a priori so as to outline a suitable range for N. In the case of the basic ELM model, two different implementations have been considered, which differ in the choice of the activation function: sigmoid and RBF. Experiments on SIM-ELM adopted three different similarity functions: sigmoid (rescaled in the range [−1, 1]), RBF (rescaled in the range [−1, 1]), and Minkowski. The latter function has been defined as

$$ K(\mathbf{x}, \mathbf{x}') = \frac{2}{1 + mk(\mathbf{x}, \mathbf{x}', \alpha)} - 1 $$

where mk(x, x′, α) is the conventional Minkowski distance (of order α) between the vectors x and x′. In the present application, α represents the parameter to be set randomly. In the following, Secs. 5.1-5.6 discuss the results of the six experimental sessions.
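A direct transcription of this similarity function (illustrative Python; α is the randomly drawn order):

    import numpy as np

    def minkowski_similarity(x, xp, alpha):
        # K(x, x') = 2 / (1 + mk(x, x', alpha)) - 1, with values in (-1, 1]
        mk = np.sum(np.abs(x - xp) ** alpha) ** (1.0 / alpha)   # Minkowski distance of order alpha
        return 2.0 / (1.0 + mk) - 1.0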

5.1 Ionosphere dataset

The Ionosphere dataset included a total of 351 patterns, which lie in a 34-dimensional space; the original dataset was quite unbalanced, as one of the two classes only provided 126 patterns out of 351. In the experimental design, both the training set and the test set included 50 patterns per class; all 34 features were renormalized in the interval [−1, 1]. To avoid any bias due to the specific composition of the training and test sets, the experiments involved fifty different runs, which corresponded to fifty different training/test pairs.

The experimental session was organized as follows. SIM-ELM 1, SIM-ELM 2 and ELM were evaluated by using the following settings for λ (the regularization parameter) and N (the eventual dimension of the space in which the original data are remapped):

• λ ∈ {1·10⁻⁶, 1·10⁻⁵, 1·10⁻⁴, 1·10⁻³, 1·10⁻², 1·10⁻¹, 1, 1·10¹, 1·10², 1·10³, 1·10⁴, 1·10⁵, 1·10⁶}
• N ∈ {10, 20, 50, 100, 200, 500, 1000}

In the case of SIM-ELM 1 and SIM-ELM 2, the experiments focused on two specific settings of the parameter d: d = 5 (10 landmarks, 5 per class) and d = 25 (50 landmarks, 25 per class). These settings corresponded, respectively, to the following configurations: 10% of the available training patterns become landmarks, and 50% of the available training patterns become landmarks. In the SIM-ELM model the eventual dimensionality of the remapped space is N = 2·F·d; thus, according to the constraint imposed on N, the parameter F was allowed to take the following values:

• d = 5: F ∈ {1, 2, 5, 10, 20, 50, 100}
• d = 25: F ∈ {1, 2, 4, 10, 20}

Table 3 compares the performance scored by SIM-ELM and ELM in this experimental session. Fifteen different predictors are compared:

- three implementations of SIM-ELM 1 based on 10 landmarks (sigmoidal similarity function, RBF similarity function, and Minkowski similarity function);
- three implementations of SIM-ELM 1 based on 50 landmarks (sigmoidal similarity function, RBF similarity function, and Minkowski similarity function);
- three implementations of SIM-ELM 2 based on 10 landmarks (sigmoidal similarity function, RBF similarity function, and Minkowski similarity function);
- three implementations of SIM-ELM 2 based on 50 landmarks (sigmoidal similarity function, RBF similarity function, and Minkowski similarity function);
- two implementations of the conventional ELM model (sigmoidal activation function and RBF activation function).

Besides, the last row of the table provides the performance attained by SVM on the same testbed; the results refer to an implementation based on the RBF kernel and efficient strategies of model selection [20]. Table 3 reports, for each predictor, the configuration that led to the best performance in terms of classification error on the test set (average value over the fifty runs). Accordingly, the table provides, for each predictor, three quantities: the average classification error (in percentage), the expected standard error of the average, and the specific configuration of (λ, N) that characterizes the best predictor. In the case of SVM, the configuration is expressed in terms of the pair (C, γ) and the number of support vectors.

Table 3 Experimental results on the Ionosphere benchmark.

Predictor                 | Similarity        | Classification Error | λ       | N
SIM-ELM 1, 10 landmarks   | Sigmoid           | 20.2 ± 1.1           | n/a     | n/a
                          | RBF               | 10.5 ± 0.9           | 1·10⁻¹  | 100
                          | Minkowski         | 11.3 ± 0.8           | 1·10⁻³  | 100
SIM-ELM 1, 50 landmarks   | Sigmoid           | 15.2 ± 0.8           | n/a     | 200
                          | RBF               | 7.50 ± 0.70          | 1       | 200
                          | Minkowski         | 8.60 ± 0.70          | n/a     | 200
SIM-ELM 2, 10 landmarks   | Sigmoid           | 19.7 ± 1.0           | 1·10⁻⁴  | 100
                          | RBF               | 10.9 ± 1.00          | 1·10⁻¹  | 100
                          | Minkowski         | 11.5 ± 0.70          | 1·10⁻⁴  | 100
SIM-ELM 2, 50 landmarks   | Sigmoid           | 15.1 ± 0.80          | 1·10¹   | 200
                          | RBF               | 7.50 ± 0.70          | 1       | 200
                          | Minkowski         | 8.10 ± 0.80          | n/a     | 200
ELM                       | Sigmoid (activ.)  | 12.5 ± 0.70          | n/a     | 1000
                          | RBF (activ.)      | 13.7 ± 0.60          | n/a     | 200
SVM                       | RBF (kernel)      | 8.60 ± 0.60          | (C, γ) = (n/a, 1) | sv = 80

Overall, the results in Table 3 yield some interesting outcomes. First, it is convenient to take as a reference the best performance scored by the conventional ELM model, that is, a classification error of 12.5%, which was attained by adopting a sigmoidal activation function and 1000 neurons in the hidden layer (i.e., a 1000-dimensional remapped space). SIM-ELM 1 scored its best classification error of 7.50% when adopting the RBF as the similarity function; this result corresponds to a 200-dimensional remapped space generated by setting 50 landmarks (i.e., F = 4). It is worth noting, though, that SIM-ELM also improved over conventional ELM when adopting a 100-dimensional remapped space, as SIM-ELM 1 (based on 10 landmarks and an RBF similarity function) scored a classification error of 10.5%. Moreover, SIM-ELM 2 based on 50 landmarks and the RBF similarity function also scored a classification error of 7.50% in a 200-dimensional remapped space.

The experimental session also showed that SIM-ELM compared favourably with SVM on this benchmark, as the latter predictor achieved a best classification error of 8.60%. It is interesting to note that the SVM required 80 support vectors to score its best result; i.e., SVM used 80 samples among those included in the training set to set the decision function. On the other hand, SIM-ELM attained a classification error of 7.50% with 50 landmarks, which, as discussed in Sec. 3.1, actually play the role of support vectors. A final comment concerns the comparison between SIM-ELM 1 and SIM-ELM 2: in practice, the experimental results show that the two approaches attained comparable performance on this benchmark.

5.2 Heart dataset

The Heart dataset included a total of 270 patterns, which lie in a 13-dimensional space. The original dataset was moderately unbalanced, as one of the two classes covered 150 samples out of 270. In the proposed experiment, both the training set and the test set included 50 patterns per class; all 13 features were renormalized in the interval [−1, 1]. The experiments involved fifty different runs, which corresponded to ten different training/test pairs. SIM-ELM 1, SIM-ELM 2 and ELM were evaluated by using the following settings for λ and N:

• λ ∈ {1·10⁻⁶, 1·10⁻⁵, 1·10⁻⁴, 1·10⁻³, 1·10⁻², 1·10⁻¹, 1, 1·10¹, 1·10², 1·10³, 1·10⁴, 1·10⁵, 1·10⁶}
• N ∈ {10, 20, 50, 100, 200, 500, 1000}

In the case of SIM-ELM, the experiments focused on the following settings of the parameter d: d = 5 (10 landmarks, 5 per class) and d = 25 (50 landmarks, 25 per class). As above, these settings corresponded, respectively, to the following configurations: 10% of the available training patterns become landmarks, and 50% of the available training patterns become landmarks. Accordingly, the parameter F took the following values:

• d = 5: F ∈ {1, 2, 5, 10, 20, 50, 100}
• d = 25: F ∈ {1, 2, 4, 10, 20}

Table 4 reports on the performance scored by SIM-ELM, ELM and SVM in this experimental session; the table follows the same format as Table 3.

Table 4 Experimental results on the Heart benchmark.

Predictor                 | Similarity        | Classification Error | λ      | N
SIM-ELM 1, 10 landmarks   | Sigmoid           | 20.8 ± 0.90          | n/a    | 50
                          | RBF               | 19.7 ± 0.80          | 1·10⁻¹ | 50
                          | Minkowski         | 19.6 ± 0.80          | 1·10⁻¹ | 50
SIM-ELM 1, 50 landmarks   | Sigmoid           | 19.0 ± 0.70          | 1·10¹  | 100
                          | RBF               | 19.1 ± 0.60          | 1·10¹  | 500
                          | Minkowski         | 19.6 ± 0.70          | 1      | 200
SIM-ELM 2, 10 landmarks   | Sigmoid           | 20.8 ± 1.00          | n/a    | n/a
                          | RBF               | 19.4 ± 0.80          | 1·10⁻¹ | 50
                          | Minkowski         | 19.6 ± 0.80          | 1      | 200
SIM-ELM 2, 50 landmarks   | Sigmoid           | 18.9 ± 0.80          | 1·10¹  | 100
                          | RBF               | 19.2 ± 0.60          | 1·10¹  | 200
                          | Minkowski         | 19.0 ± 0.60          | 1      | 200
ELM                       | Sigmoid (activ.)  | 18.3 ± 0.70          | 1·10¹  | 200
                          | RBF (activ.)      | 18.1 ± 0.80          | n/a    | 200
SVM                       | RBF (kernel)      | 18.4 ± 0.80          | (C, γ) = (1·10⁻³, 1·10⁻³) | sv = 52

The conventional ELM model achieved a classification error of 18.1% when adopting the RBF activation function (200 neurons). Such a result is only slightly better than that scored by ELM with the sigmoid activation function (classification error of 18.3%). In practice, ELM offered a minor improvement over SIM-ELM in terms of classification error: the best SIM-ELM result was 18.9%, obtained by using 50 landmarks and a sigmoid similarity function. Nonetheless, this performance was scored by remapping data into a 100-dimensional space. Finally, it is interesting to note that SIM-ELM, ELM, and SVM attained comparable performance on this benchmark. Indeed, SIM-ELM achieved its best performance by using 50 landmarks, and SVM required 52 support vectors. Hence, in this particular case the two different strategies for the selection of the most useful samples led to similar outcomes.

5.3 Glass identification dataset

The Glass Identification dataset included 214 samples that lie in a 9-dimensional space. The benchmark involves a multi-class problem, as six different classes are represented in the dataset; the experiments presented here, though, only addressed a binary classification problem, namely, class 1 versus class 2. In the experimental design, both the training set and the test set included 30 patterns per class, randomly extracted from the original dataset. The experiments involved fifty different runs, which corresponded to ten different training/test pairs. All 9 features were renormalized in the interval [−1, 1]. SIM-ELM 1, SIM-ELM 2 and ELM were evaluated by using the following settings for λ and N:

• λ ∈ {1·10⁻⁶, 1·10⁻⁵, 1·10⁻⁴, 1·10⁻³, 1·10⁻², 1·10⁻¹, 1, 1·10¹, 1·10², 1·10³, 1·10⁴, 1·10⁵, 1·10⁶}
• N ∈ {6, 12, 30, 60, 90, 210, 510, 1020}

In the case of SIM-ELM, the experiments focused on the following settings of the parameter d: d = 3 (6 landmarks, 3 per class) and d = 15 (30 landmarks, 15 per class). As above, these settings corresponded, respectively, to the following configurations: 10% of the available training patterns become landmarks, and 50% of the available training patterns become landmarks. Accordingly, the parameter F took the following values:

• d = 3: F ∈ {1, 2, 5, 10, 15, 35, 85, 170}
• d = 15: F ∈ {1, 2, 3, 7, 17, 34}

Table 5 reports on the performance scored by SIM-ELM, ELM and SVM in this experimental session; the table follows the same format as Table 3. The experimental results confirm that SIM-ELM can compare positively with conventional ELM, which attained a classification error of 26.3% when adopting the sigmoid activation function and 90 neurons in the hidden layer. Actually, both SIM-ELM 1 and SIM-ELM 2 were able to achieve better results. Overall, the best performance is the classification error of 18.8% scored by SIM-ELM 2 when adopting the Minkowski similarity function; in this case, the predictor exploited 30 landmarks, and the data were remapped into a 510-dimensional space (i.e., F = 17). Indeed, SIM-ELM could improve over conventional ELM with other configurations as well. In general, this experimental session also seems to confirm that SIM-ELM 2 does not offer a significant improvement over SIM-ELM 1.

Table 5 Experimental results on the Glass Identification benchmark.

Predictor                 | Similarity        | Classification Error | λ      | N
SIM-ELM 1, 6 landmarks    | Sigmoid           | 28.9 ± 1.5           | 1·10⁻⁶ | 60
                          | RBF               | 24.4 ± 1.5           | 1      | 60
                          | Minkowski         | 24.4 ± 1.1           | 1      | 210
SIM-ELM 1, 30 landmarks   | Sigmoid           | 27.0 ± 1.1           | 1·10⁻³ | 90
                          | RBF               | 22.1 ± 1.1           | 1·10¹  | 210
                          | Minkowski         | 19.1 ± 1.0           | 1      | 510
SIM-ELM 2, 6 landmarks    | Sigmoid           | 28.8 ± 1.5           | 1·10⁻⁶ | 60
                          | RBF               | 23.9 ± 1.5           | 1·10⁻¹ | 60
                          | Minkowski         | 24.6 ± 1.1           | 1      | 210
SIM-ELM 2, 30 landmarks   | Sigmoid           | 27.0 ± 1.1           | 1·10⁻³ | 90
                          | RBF               | 22.7 ± 1.1           | n/a    | 210
                          | Minkowski         | 18.8 ± 0.9           | 1      | 510
ELM                       | Sigmoid (activ.)  | 26.3 ± 0.9           | 1·10⁻¹ | 90
                          | RBF (activ.)      | 26.3 ± 1.0           | n/a    | 1020
SVM                       | RBF (kernel)      | 21.9 ± 0.9           | (C, γ) = (1·10⁻², 1·10⁻¹) | sv = 60

SIM-ELM was also able to compare positively with SVM, which scored a classification error of 21.9% on this benchmark.

5.4 USPS dataset

The USPS dataset provides a training set of 7291 samples and a test set of 2007 samples, drawn from a 256-dimensional space. The original benchmark spans a multi-class problem (ten different classes are represented in this dataset). The present experimental session only addressed a binary classification problem: class 3 versus class 8. The training set included 500 patterns per class, randomly extracted from the original training database, while the test set included 100 patterns per class, randomly extracted from the original test database. The experiments involved fifty different runs, which corresponded to ten different training/test pairs. All 256 features were renormalized in the interval [−1, 1].

SIM-ELM 1, SIM-ELM 2 and ELM were evaluated by using the following settings for λ and N:

• λ ∈ {1·10⁻⁶, 1·10⁻⁵, 1·10⁻⁴, 1·10⁻³, 1·10⁻², 1·10⁻¹, 1, 1·10¹, 1·10², 1·10³, 1·10⁴, 1·10⁵, 1·10⁶}
• N ∈ {100, 200, 500, 1000, 2000}

In the case of SIM-ELM 1 and SIM-ELM 2, the experiments focused on two specific settings of the parameter d: d = 50 (100 landmarks, 50 per class) and d = 250 (500 landmarks, 250 per class). As above, these settings corresponded, respectively, to the following configurations: 10% of the available training patterns become landmarks, and 50% of the available training patterns become landmarks. Accordingly, the parameter F took the following values:

• d = 50: F ∈ {1, 2, 5, 10, 20}
• d = 250: F ∈ {1, 2, 4}

Table 6 reports on the performance scored by SIM-ELM, ELM and SVM in this experimental session; the table follows the same format as Table 3. The experimental results show that SIM-ELM can slightly improve over conventional ELM, which attained a classification error of 1.4% when adopting the sigmoid activation function and 500 neurons in the hidden layer. Both SIM-ELM 1 and SIM-ELM 2 were able to achieve better, or at least comparable, results. Overall, the best performance was the classification error of 0.09% scored by SIM-ELM 2 when adopting the RBF similarity function; in this case, the predictor exploited 100 landmarks, and the data were remapped into a 200-dimensional space (i.e., F = 2).

Table 6 also shows that SIM-ELM attained very poor performance when adopting the sigmoid similarity function. The reason should be found in the specific structure of the samples involved in the experiment, which include a considerable number of features that invariably take the value −1 (in the renormalized range). When adopting a sigmoid function as the notion of similarity between samples and landmarks (i.e., randomly selected samples), such a configuration leads to an altered scenario where almost all samples are dissimilar from each other. Incidentally, ELM did not suffer from this issue, as in that case the landmarks are not actual samples. The best performance on this benchmark was in fact attained by SVM, which scored a classification error of 0.06% by using 1000 support vectors. In this regard, one may note that SIM-ELM 2 achieved a slightly inferior performance by using only 100 landmarks.

Table 6 Experimental results on the USPS benchmark.

Predictor                  | Similarity        | Classification Error | λ      | N
SIM-ELM 1, 100 landmarks   | Sigmoid           | 42.0 ± 1.5           | n/a    | n/a
                           | RBF               | 1.00 ± 0.10          | 1·10⁻³ | 500
                           | Minkowski         | 1.20 ± 0.10          | 1·10⁻³ | 500
SIM-ELM 1, 500 landmarks   | Sigmoid           | 28.6 ± 1.9           | n/a    | 2000
                           | RBF               | 1.00 ± 0.10          | 1      | 2000
                           | Minkowski         | 1.30 ± 0.10          | 1·10⁻² | 2000
SIM-ELM 2, 100 landmarks   | Sigmoid           | 41.9 ± 1.5           | n/a    | 2000
                           | RBF               | 0.09 ± 0.01          | 1·10⁻² | 200
                           | Minkowski         | 1.20 ± 0.10          | 1·10⁻³ | 1000
SIM-ELM 2, 500 landmarks   | Sigmoid           | 28.3 ± 1.9           | 1·10⁻¹ | 2000
                           | RBF               | 0.09 ± 0.01          | 1·10⁻¹ | 1000
                           | Minkowski         | 1.30 ± 0.10          | n/a    | 1000
ELM                        | Sigmoid (activ.)  | 1.40 ± 0.20          | n/a    | 500
                           | RBF (activ.)      | 1.50 ± 0.10          | n/a    | 1000
SVM                        | RBF (kernel)      | 0.06 ± 0.01          | (C, γ) = (n/a, 1) | sv = 1000

5.5 Cod-RNA dataset

The Cod-RNA dataset provides a training set of 59,535 samples and a test set of 271,617 samples, drawn from an 8-dimensional space. In the proposed experiment, both the training set and the test set included 5,000 patterns per class; all 8 features were renormalized in the interval [−1, 1]. The experiments involved fifty different runs, which corresponded to fifty different training/test pairs. SIM-ELM 1, SIM-ELM 2 and ELM were evaluated by using the following settings for λ and N:

• λ ∈ {1·10⁻⁶, 1·10⁻⁵, 1·10⁻⁴, 1·10⁻³, 1·10⁻², 1·10⁻¹, 1, 1·10¹, 1·10², 1·10³, 1·10⁴, 1·10⁵, 1·10⁶}
• N ∈ {100, 200, 300, 500, 1000, 2000, 3000}

In the case of SIM-ELM 1 and SIM-ELM 2, the experiments focused on two specific settings of the parameter d: d = 50 (100 landmarks, 50 per class) and d = 500 (1,000 landmarks, 500 per class).

Table 7 Experimental results on the Cod-RNA benchmark.

Predictor                   | Similarity        | Classification Error | λ      | N
SIM-ELM 1, 100 landmarks    | Sigmoid           | 3.70 ± 0.00          | 1      | 200
                            | RBF               | 3.70 ± 0.00          | 1      | 500
                            | Minkowski         | 4.60 ± 0.10          | 1·10⁻¹ | 2000
SIM-ELM 1, 1000 landmarks   | Sigmoid           | 3.60 ± 0.00          | 1      | 2000
                            | RBF               | 3.70 ± 0.00          | 1      | 2000
                            | Minkowski         | 3.80 ± 0.10          | 1      | 3000
SIM-ELM 2, 100 landmarks    | Sigmoid           | 3.70 ± 0.00          | 1      | 200
                            | RBF               | 3.70 ± 0.00          | 1·10⁻⁴ | 100
                            | Minkowski         | 4.60 ± 0.10          | 1·10⁻³ | 1000
SIM-ELM 2, 1000 landmarks   | Sigmoid           | 3.60 ± 0.00          | 1      | 1000
                            | RBF               | 3.60 ± 0.00          | 1·10⁻³ | 1000
                            | Minkowski         | 3.80 ± 0.10          | 1      | 2000
ELM                         | Sigmoid (activ.)  | 3.60 ± 0.00          | 1·10⁻¹ | 1000
                            | RBF (activ.)      | 3.80 ± 0.00          | 1·10⁻⁶ | 1000
SVM                         | RBF (kernel)      | 3.60 ± 0.00          | (C, γ) = (1·10⁻¹, 1) | sv = 1417

These settings corresponded, respectively, to configurations in which 1% and 10% of the available training patterns served as landmarks. Accordingly, the parameter F took the following values:

• d = 50: F ∈ {1, 2, 3, 5, 10, 20, 30}
• d = 500: F ∈ {1, 2, 3}

Table 7 reports on the performance scored by SIM-ELM, ELM and SVM on this

experimental session; the table follows the same format applied to Table 3. Experimental results show that the three predictors attained almost the same performance on this dataset. Nonetheless, it is worth noting that ELM required a 1000-dimensional space to reach a classification error of 3.6%, while SVM used 1,417 support vectors to score the same result. SIM-ELM 2, when adopting the RBF similarity function, scored a classification error of 3.7% by using 100 landmarks and a 100-dimensional space. Hence, this session confirms that SIM-ELM can effectively balance the two fundamental quantities at stake: classification accuracy and predictor complexity.
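A rough back-of-the-envelope estimate clarifies the complexity gap (an illustrative count of multiply-accumulate operations per test pattern; constant factors and the cost of the nonlinearities are ignored):

    # D = 8 input features on Cod-RNA; each hidden unit / landmark / support
    # vector requires one D-dimensional inner product or distance, plus one
    # multiply-accumulate in the output combination.
    D = 8
    cost_sim_elm2 = 100 * D + 100     # 100 landmark similarities -> ~900 ops
    cost_elm      = 1000 * D + 1000   # 1000 hidden units         -> ~9000 ops
    cost_svm      = 1417 * D + 1417   # 1417 kernel evaluations   -> ~12,750 ops

Under these assumptions, the SIM-ELM 2 predictor is roughly an order of magnitude cheaper at prediction time for a comparable error rate.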

5.6 Covertype dataset

The Covertype dataset included 581,012 samples, drawn from a 54-dimensional space. In the proposed experiment, both the training set and the test set included 5,000 patterns per class; all 54 features were renormalized into the interval [-1; 1]. The experiments involved fifty different runs, which corresponded to fifty different training/test pairs. SIM-ELM 1, SIM-ELM 2 and ELM were evaluated by using the following settings for λ and N:

• λ ∈ {1∙10-6, 1∙10-5, 1∙10-4, 1∙10-3, 1∙10-2, 1∙10-1, 1, 1∙101, 1∙102, 1∙103, 1∙104, 1∙105, 1∙106}

• N ∈ {100, 200, 300, 500, 1000, 2000, 3000}

In the case of SIM-ELM 1 and SIM-ELM 2, the experiments focused on two specific settings of the parameter d: d = 50 (100 landmarks, 50 per class) and d = 500 (1,000 landmarks, 500 per class). These settings corresponded, respectively, to configurations in which 1% and 10% of the available training patterns served as landmarks (a code sketch of this class-balanced landmark draw is given below). Accordingly, the parameter F took the following values:

• d = 50: F ∈ {1, 2, 3, 5, 10, 20, 30}
• d = 500: F ∈ {1, 2, 3}

Table 8 reports on the performance scored by SIM-ELM, ELM and SVM on this

experimental session; the table follows the same format applied to Table 3. The results show that SIM-ELM achieved its best performance when adopting the RBF similarity function and 1,000 landmarks (classification error of 18.6% with SIM-ELM 1 and 18.7% with SIM-ELM 2). ELM indeed attained the same performance by using the sigmoid activation function. Both SIM-ELM and ELM remapped the original data into a 3000-dimensional space. It is interesting to note that the sigmoid function completely failed to provide acceptable performance when used as a similarity function in SIM-ELM. As in the case of the USPS dataset, the reason lies in the peculiar structure of the samples involved in the experiment: in almost all patterns, features 12 through 54 took the value -1 (in the renormalized range).
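For completeness, the class-balanced landmark draw used in these sessions can be sketched as follows (illustrative code; the balanced per-class sampling and the relation N = F∙2d follow the settings above, while the way the F similarity columns per landmark are generated is left abstract):

    import numpy as np

    def draw_landmarks(X, y, d, rng):
        # Randomly extract d landmarks per class (2d landmarks in total).
        pos = rng.choice(np.flatnonzero(y == +1), size=d, replace=False)
        neg = rng.choice(np.flatnonzero(y == -1), size=d, replace=False)
        return X[np.concatenate([pos, neg])]

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(10000, 54))   # Covertype-like, renormalized
    y = np.where(rng.random(10000) < 0.5, +1, -1)

    d, F = 500, 3
    landmarks = draw_landmarks(X, y, d, rng)        # 1,000 landmarks (10% of data)
    N = F * 2 * d                                   # remapped dimensionality: 3000

Drawing the same number of landmarks per class keeps the remapped representation balanced between the two classes, regardless of the class priors in the pool.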

Table 8 Experimental results on the Covertype benchmark.

Predictor                    Best model
                             Similarity    Classification error (%)   λ                  N
SIM-ELM 1, 100 landmarks     Sigmoid       50.0 ± 0.00                -                  -
                             RBF           21.2 ± 0.01                1∙10-6             2000
                             Minkowski     21.7 ± 0.01                1∙10-5             3000
SIM-ELM 1, 1000 landmarks    Sigmoid       50.0 ± 0.00                -                  -
                             RBF           18.6 ± 0.01                1∙10-6             3000
                             Minkowski     19.0 ± 0.01                1∙10-2             3000
SIM-ELM 2, 100 landmarks     Sigmoid       50.0 ± 0.00                -                  -
                             RBF           21.2 ± 0.01                1∙10-6             3000
                             Minkowski     21.7 ± 0.01                1∙10-6             2000
SIM-ELM 2, 1000 landmarks    Sigmoid       50.0 ± 0.00                -                  -
                             RBF           18.7 ± 0.01                1∙10-6             3000
                             Minkowski     19.1 ± 0.01                1∙10-2             2000
                             Activation    Classification error (%)   λ                  N
ELM                          Sigmoid       18.6 ± 0.01                1∙10-2             3000
                             RBF           22.2 ± 0.01                1∙10-6             3000
                             Kernel        Classification error (%)   (C, γ)             sv
SVM                          RBF           16.7 ± 0.01                (1∙101, 1∙10-1)    10000

Table 8 indeed shows that the best overall performance was scored by SVM, which obtained a classification error of 16.7%. Nonetheless, one should take into account that SVM needed 10,000 support vectors to reach that result, i.e., all the training patterns served as support vectors. In this regard, one may conclude that both SIM-ELM and ELM achieved a good trade-off between the dimensionality of the remapped space (3,000) and performance.

6. Conclusions

This research showed that the theory of learning with similarity functions can provide the basis for the development of a novel variant of the ELM model. The new learning scheme is characterized by the strategy adopted in the process of remapping the input space into a new space in which the learning takes place: the remapped space conveys the similarity between the input pattern and a number of landmarks, i.e., a subset of patterns randomly extracted from the training set. This approach allows one to set up a new version of ELM, the SIM-ELM, which inherits fruitful properties from similarity-function learning, such as the formal relationship between the dimensionality of the remapped space and the learning abilities of the eventual predictor.

The present work provides a few interesting outcomes that may represent a starting point for additional investigations on the proposed learning scheme. First, when addressing supervised classification, SIM-ELM compares positively with ELM in terms of representation ability. Second, SIM-ELM seems more effective than ELM in attaining a satisfactory trade-off between classification accuracy and the eventual dimensionality of the remapped space. Third, SIM-ELM seems to confirm that randomization can play a major role in machine learning. In this regard, the experimental sessions allowed one to compare the performance of SIM-ELM with that scored by SVM, which exploits landmarks (i.e., the support vectors) that are selected a posteriori. The results showed that SIM-ELM and SVM often achieved comparable performance in terms of classification accuracy; nonetheless, SIM-ELM in some cases proved able to exploit fewer landmarks than SVM.

Two future activities can stem from this research. The first activity addresses the extension of SIM-ELM to regression problems. The second activity aims at analyzing the ability of the proposed learning scheme to deal with semi-supervised learning. In this regard, one should take into account that, in principle, both labeled and unlabeled data may provide a useful source of landmarks. As a major consequence, SIM-ELM can also be designed to tackle those situations in which the availability of labeled examples is quite scarce, but a relatively large number of unlabeled examples can be gathered.
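As a purely speculative illustration of this semi-supervised direction, landmarks could be drawn from an unlabeled pool while only the linear stage consumes the few available labels (sketch under an assumed RBF similarity; not part of the experiments reported here):

    import numpy as np

    rng = np.random.default_rng(0)
    X_unlabeled = rng.uniform(-1, 1, size=(50000, 8))   # abundant unlabeled pool
    X_labeled = rng.uniform(-1, 1, size=(200, 8))       # scarce labeled samples
    y_labeled = np.where(rng.random(200) < 0.5, +1, -1)

    # Landmarks come from the unlabeled pool: only stage 2 needs the labels.
    idx = rng.choice(len(X_unlabeled), size=100, replace=False)
    L = X_unlabeled[idx]
    H = np.exp(-0.1 * ((X_labeled[:, None, :] - L[None, :, :]) ** 2).sum(-1))
    w = np.linalg.solve(H.T @ H + 1e-3 * np.eye(100), H.T @ y_labeled)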

References

[1] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 42 (2) (2012) 513-529
[2] E. Cambria, G.-B. Huang, et al., Extreme learning machines, IEEE Intelligent Systems 28 (6) (2013) 30-59
[3] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: A review, Neural Networks 61 (2015) 32-48
[4] M.-F. Balcan, A. Blum, On a theory of learning with similarity functions, in: W.W. Cohen, A. Moore (Eds.), Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), Pittsburgh, PA, USA, June 25-29, 2006, ACM International Conference Proceeding Series, ACM, 2006
[5] A. Rahimi, B. Recht, Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, in: D. Koller, D. Schuurmans, Y. Bengio, L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, Vancouver, British Columbia, Canada, December 8-11, 2008, Curran Associates, Inc., 2009
[6] D. Lowe, Adaptive radial basis function nonlinearities, and the problem of generalization, in: Proc. 1st IEE International Conference on Artificial Neural Networks, London, UK, October 16-18, 1989, IET
[7] Y.H. Pao, G.H. Park, D.J. Sobajic, Learning and generalization characteristics of the random vector functional-link net, Neurocomputing 6 (1994) 163-180
[8] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
[9] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
[10] V.N. Vapnik, Statistical Learning Theory, Wiley-Interscience (1998)
[11] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (1) (2000) 1-50
[12] S. Decherchi, P. Gastaldo, A. Leoncini, R. Zunino, Efficient digital implementation of extreme learning machines for classification, IEEE Transactions on Circuits and Systems II 59 (8) (2012) 496-500
[13] P. Gastaldo, L. Pinna, L. Seminara, M. Valle, R. Zunino, A tensor-based pattern-recognition framework for the interpretation of touch modality in artificial skin systems, IEEE Sensors Journal 14 (7) (2014) 2216-2225
[14] S. Poria, E. Cambria, G. Winterstein, G.-B. Huang, Sentic patterns: Dependency-based rules for concept-level sentiment analysis, Knowledge-Based Systems 69 (2014) 45-63
[15] H. Chen, J. Peng, Y. Zhou, L. Li, Z. Pan, Extreme learning machine for ranking: Generalization analysis and applications, Neural Networks 53 (2014) 119-126
[16] A. Grigorievskiy, Y. Miche, A.-M. Ventelä, E. Séverin, A. Lendasse, Long-term time series prediction using OP-ELM, Neural Networks 51 (2014) 50-56
[17] E. Cambria, P. Gastaldo, F. Bisio, R. Zunino, An ELM-based model for affective analogical reasoning, Neurocomputing 149A (2015) 443-455
[18] T. Hofmann, B. Schölkopf, A.J. Smola, Kernel methods in machine learning, The Annals of Statistics 36 (3) (2008) 1171-1220
[19] X. Liu, L. Wang, J. Yin, E. Zhu, J. Zhang, An efficient approach to integrating radius information into multiple kernel learning, IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 43 (2) (2013) 557-569
[20] Z. Xu, M. Dai, D. Meng, Fast and efficient strategies for model selection of Gaussian support vector machine, IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 39 (5) (2009) 1292-1307