ELM-SOM+: A continuous mapping for visualization




Neurocomputing 365 (2019) 147–156



Renjie Hu a,∗, Karl Ratner b, Edward Ratner b, Yoan Miche c, Kaj-Mikael Björk d,e, Amaury Lendasse d,f

a Department of Industrial and Systems Engineering, The University of Iowa, Iowa City, USA
b Edammo Inc, Iowa City, USA
c Nokia Bell Labs, Finland
d Arcada University of Applied Sciences, Helsinki, Finland
e Hanken School of Economics, Helsinki, Finland
f Department of Information and Logistics Technology, College of Technology, University of Houston, Houston, USA

∗ Corresponding author. E-mail addresses: [email protected], [email protected] (R. Hu).

https://doi.org/10.1016/j.neucom.2019.06.093
© 2019 Elsevier B.V. All rights reserved.

Article history: Received 29 December 2018; Revised 22 June 2019; Accepted 23 June 2019; Available online 20 July 2019. Communicated by Dr. Chi Man Vong.

Abstract: This paper presents a novel dimensionality reduction technique based on ELM and SOM: ELM-SOM+. This technique preserves the intrinsic qualities of the Self-Organizing Map (SOM): it is nonlinear and suitable for big data. It also brings continuity to the projection using two Extreme Learning Machine (ELM) models, the first to perform the dimensionality reduction and the second to perform the reconstruction. ELM-SOM+ is tested successfully on nine diverse datasets. Regarding reconstruction error, the new methodology shows considerable improvement over SOM and brings continuity.

Keywords: Dimensionality reduction techniques; Visualization; Extreme Learning Machines; Self-Organizing Maps; Machine learning; Neural networks

1. Introduction

In Machine Learning, dimensionality reduction is of great importance for several reasons. First, due to the curse of dimensionality, many machine learning techniques can suffer from overfitting, so high-dimensional data can be challenging to analyze [1–3]. Second, the computational load grows with the number of features of the data, so analyzing high-dimensional data can be computationally intensive [2]. Last, high-dimensional data cannot be visualized directly [2]. "Looking at the data" is crucial for data analysis, because it provides the interpretability that allows us to make some sense of the data before carrying out further analysis. Generally, when analyzing high-dimensional data, the dimensionality of the data is larger than necessary, especially when the variables are correlated. Another common assumption is that high-dimensional data is embedded on a lower-dimensional manifold [4]; therefore, the data can be transformed into a lower-dimensional space, and the transformed data still preserves the information of the original data, nearly without information loss [1].

If the manifold of the data is in 3-D or lower-dimensional space, the original data can be precisely visualized by the transformed data in the manifold space. Of course, in reality, the perfect manifold cannot always be found, so the transformation is always associated with some information loss; however, this loss can be minimized by searching for the optimal manifold for the data [5]. In machine learning, feature extraction is one of the main dimensionality reduction processes [6]: it builds derived values (features) from the original data. The derived features usually come from some transformation of the data, projecting it onto another (lower-dimensional) space. The transformation can be linear or nonlinear [2]. Linear feature extraction methods work well when the data lies on a linear subspace. Principal Component Analysis (PCA) [7] is a popular linear feature extraction method, which aims at maximizing the variance of the data. Multidimensional Scaling (MDS) [8] preserves the pairwise distances of the data, which, for linear MDS, yields the same results as PCA [1]. Both methods perform poorly when the data lies on a (curved) nonlinear manifold, which is often the case [2]. Methods for nonlinear dimensionality reduction can be further divided into two groups: distance-preserving methods and topology-preserving methods [9].



Fig. 1. ELM-SOM Algorithm.

Distance-preserving methods include Sammon's mapping [10], Curvilinear Component Analysis (CCA) [11], Isomap (IM) [12,13], and Curvilinear Distance Analysis (CDA) [14]. Another state-of-the-art dimensionality reduction tool is t-distributed Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction technique developed by van der Maaten and Hinton [15]. t-SNE preserves local structure by maintaining small pairwise distances and is particularly well suited for embedding high-dimensional data into a low-dimensional space. It allows meaningful interpretation of distances in the projected space, which indicate the degree of similarity between samples. Topology-preserving methods, such as Generative Topographic Mapping (GTM) [16], Laplacian Eigenmaps (LE) [17,18], Growing Neural Gas algorithms (GNG) [19], and Self-Organizing Maps (SOM) [20], are more powerful and, at the same time, more complex than distance-preserving methods [1]. Both GTM and SOM use pre-defined grids and create discrete projections. The Neural Gas algorithm applies a neural network structure and is inspired by SOM; it aims at finding an optimal data representation (an optimal manifold) [21]. LE creates continuous projections; however, the quality of the projection is generally poor [1]. In this paper, we propose a new topology-preserving nonlinear dimensionality reduction tool: ELM-SOM+. The new method is inspired by SOM and inherits the ability of SOM to capture the topology of the data. The projection of SOM is limited to the pre-defined network structure: multiple data points are projected onto the same point of the SOM grid, so the SOM projection does not preserve the information needed to distinguish them. Extreme Learning Machines (ELMs) [22–24] have distinct merits for coping with this problem. Thanks to their universal approximation ability [25], continuous output values, and fast training process [26,27], ELMs are ideal for learning the topology of the data and eliminating the discontinuity of the projection.

ELM-SOM+ uses a 2-D manifold to capture the data topology and then, with the help of two ELMs, creates a continuous projection and minimizes the reconstruction error.

In the next section, we give an overview of the methodology and explain the basic components of ELM-SOM+. Section 3 provides the detailed description of the ELM-SOM+ algorithm. In Section 4, we present and analyze the experimental results of ELM-SOM+ on nine diverse datasets. Conclusions and future work are given in Section 5.

2. Methodology

This paper presents a nonlinear dimensionality reduction method for visualization: ELM-SOM+. The method is based on both the Extreme Learning Machine and the Self-Organizing Map. The framework of the ELM-SOM+ algorithm is shown in Fig. 1. In the initialization phase of ELM-SOM+, the topology of the original data X ∈ R^{N×d} is captured by a SOM, creating a discrete projection X_p ∈ R^{N×2}. Then, the initial projection X_p is learned by an encoder ELM (ELM_ENC), which creates a continuous projection X̂_p ∈ R^{N×2}. Next, X̂ ∈ R^{N×d} is reconstructed from X̂_p by a decoder ELM (ELM_DEC), generating an approximation of the original data. This allows the reconstruction error between X̂ and X to be calculated. Lastly, the reconstruction error is minimized by optimizing the output weights of ELM_ENC, which improves both the quality of the projection X̂_p and the quality of the reconstruction X̂.

2.1. Extreme Learning Machine

Extreme Learning Machines [28–33] were proposed as generalized Single-Layer Feed-forward Networks (SLFNs) [25,34–36].



Fig. 3. Self-Organizing Maps, Graph from SDL component suite.1

Fig. 2. ELM Structure.

According to Huang et al. [28], ELM can produce good generalization performance in most cases and can learn thousands of times faster than conventional popular learning algorithms for feed-forward neural networks. Typically, the structure of an ELM contains three layers (see Fig. 2): the input layer, the hidden layer, and the output layer. The input-layer weights (w) and biases (b) are generated randomly and stay unchanged afterwards. The input data x is mapped to the L-dimensional ELM random feature space, and the hidden-layer output is:

$$ h_i(x) = \phi(x^\top w_i + b_i), \quad i \in [1, L], \tag{1} $$

where φ is the activation function (a sigmoid function is a common choice, but other activation functions are possible including linear) (see [34,36]). The ELM functional output is:

$$ f_{\mathrm{ELM}}(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)^\top \beta, \tag{2} $$

where $\beta = (\beta_1, \ldots, \beta_L)^\top$ is the vector of output weights. For a set of N distinct training samples $(x_j, t_j)$, $j \in [1, N]$, with $x_j \in \mathbb{R}^d$ the input and $t_j \in \mathbb{R}^c$ the corresponding target, the ELM model can be expressed in matrix form as $H\beta = T$, where

$$ H = \begin{pmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{pmatrix}, \tag{3} $$

$$ T = (t_1, \ldots, t_N)^\top, \tag{4} $$

and $\beta$ is the Ordinary Least Squares solution of $\arg\min_{\beta} \left\| H\beta - T \right\|_2^2$:

$$ \beta = H^{\dagger} T, \tag{5} $$

$$ H^{\dagger} = (H^\top H)^{-1} H^\top. \tag{6} $$
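To make the training procedure concrete, the following is a minimal NumPy sketch of an ELM as described by Eqs. (1)–(6). It is illustrative only: the function and variable names (train_elm, n_hidden, etc.) are ours rather than the authors', and np.linalg.pinv is used in place of the explicit (H^T H)^{-1} H^T of Eq. (6) for numerical robustness.

```python
# Minimal ELM sketch following Eqs. (1)-(6); illustrative, not the authors' code.
import numpy as np

def train_elm(X, T, n_hidden, seed=0):
    """Fit an ELM: random hidden layer (Eq. (1)) + least-squares output weights (Eqs. (5)-(6))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights w_i (kept fixed)
    b = rng.standard_normal(n_hidden)                 # random biases b_i (kept fixed)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # sigmoid hidden outputs h_i(x), Eq. (1)
    beta = np.linalg.pinv(H) @ T                      # OLS output weights, Eqs. (5)-(6)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """ELM output f_ELM(x) = h(x)^T beta, Eq. (2)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```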



2.2. Self-Organizing Maps

The SOM was introduced by the Finnish professor Teuvo Kohonen in the 1980s [20]. It is an unsupervised learning tool [37–39] and a popular nonlinear dimensionality reduction tool that uses a pre-defined 2-D lattice (see Fig. 3) to capture the topology of the data in the high-dimensional space [4]. Each node of the lattice carries a weight vector w_i living in the same d-dimensional space as the input vectors x.

At the beginning, all the weight vectors are randomly initialized. For each input vector x_k, k ∈ [1, N], the pairwise distances between x_k and every weight vector w_i are calculated. The Best Matching Unit (BMU) for x_k is the node whose weight vector w_u has the smallest distance to x_k. Once the BMU is found, the lattice weights are updated as:

$$ m_i(t+1) = m_i(t) - \epsilon(t)\,\lambda(m_{\mathrm{BMU}}, m_i, t)\,(m_i(t) - x_k), \tag{7} $$

where ε(t) is the adaptation rate and λ(m_BMU, m_i, t) is the neighborhood function that determines the range of influence of the update. After a sufficient number of iterations, the weight vectors converge and the SOM is trained. After training, according to the SOM algorithm, each input vector x_k is projected onto its corresponding BMU on the 2-D lattice. Self-Organizing Maps therefore perform a discrete nonlinear dimensionality reduction.

3. ELM-SOM+

The Self-Organizing Map is a powerful visualization tool for creating 2-D projections; nonetheless, the projection is discrete. The projection lies on the pre-defined grid, which has at most s × s possible values, where s is the size of the SOM grid. As a result, the reconstruction of the data can only be discrete [5]. This paper extends the original idea of SOM, creating a topology-preserving projection, ELM-SOM+, which is similar to SOM yet without the limitation of a discrete projection. In the ELM-SOM+ projection, points that are close in the original space are close in the projected space, but they do not overlap on the same BMU as they do with SOM. Therefore, the ELM-SOM+ projection is no longer discrete. This continuous projection allows a better reconstruction of the data. The reconstruction error is used to measure the quality of the projection [7]. Compared with SOM, ELM-SOM+ can largely decrease the reconstruction error. The next paragraphs (Phases I, II, III and IV) present the ELM-SOM+ algorithm in detail, in the order of Fig. 1.

Phase-I, learning the data topology with SOM: A SOM is built in the first step of ELM-SOM+ to preserve the topology of the data. For simplicity, the map size is defined to be s × s. For each data point x_i ∈ R^d, i ∈ [1, N], the BMU of x_i is c_i ∈ R^d. Each vector c_i is a node of the SOM and can be seen as a cluster center for the data points; c_i takes only s² unique values. For each center c_i, the corresponding projection is x_{p_i} ∈ R², i ∈ [1, N]. After the SOM training, the data topology is captured by the following transformation:

$$ P(x_i) = x_{p_i}, \quad i = 1, \ldots, N. \tag{8} $$

All the data points x_i that have the same BMU have the same projection. The projection x_{p_i} is discrete in space and has only s² unique values.
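As an illustration of Phase-I, the sketch below trains a small SOM with the update rule of Eq. (7) and returns the discrete projection P(x_i) = x_{p_i}. It is a simplified sketch, not the authors' implementation: the Gaussian neighborhood and the linearly decaying learning rate and radius are common choices that we assume here.

```python
# Simplified SOM sketch for Phase-I (Eq. (7)); the schedules below are assumptions.
import numpy as np

def train_som(X, s=10, n_iter=5000, seed=0):
    """Train an s x s SOM and return its weight vectors and grid coordinates."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    grid = np.array([(i, j) for i in range(s) for j in range(s)], dtype=float)  # 2-D lattice
    M = rng.standard_normal((s * s, d))                       # weight vectors m_i
    for t in range(n_iter):
        x = X[rng.integers(N)]
        bmu = np.argmin(((M - x) ** 2).sum(axis=1))           # Best Matching Unit
        eps = 0.5 * (1.0 - t / n_iter)                        # adaptation rate eps(t)
        sigma = max(0.5, (s / 2.0) * (1.0 - t / n_iter))      # neighborhood radius
        lam = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
        M -= eps * lam[:, None] * (M - x)                     # update rule, Eq. (7)
    return M, grid

def som_project(X, M, grid):
    """Phase-I discrete projection: each x_i is mapped to its BMU's grid coordinates."""
    bmu = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return grid[bmu]                                          # only s*s unique values
```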



In ELM-SOM+, we do not use x_{p_i} as the final projection; x_{p_i} only preserves the topology information of the data.

Phase-II, reconstructing the data with the decoder ELM: ELM_DEC reconstructs the data in R^d from the projection created by the SOM. The input is the SOM projection x_{p_i}, and the target is the corresponding data point x_i. The number of hidden neurons of ELM_DEC is n_B. The following transformation is learned by ELM_DEC:

$$ R(x_{p_i}) = \hat{x}_i, \quad i = 1, \ldots, N. \tag{9} $$

This allows the reconstruction error (Mean Square Error) to be calculated:

$$ E = \frac{1}{N} \sum_{i=1}^{N} \left\| R(x_{p_i}) - x_i \right\|^2. \tag{10} $$

The optimal complexity of ELM_DEC is determined through a validation process. The Leave-One-Out (LOO) error is used as the validation error and is computed for each value of n_B. The optimal ELM_DEC has n_B* neurons and gives the minimum LOO error.

Phase-III, creating the continuous projection with ELM_ENC: Based on the topology information learned from the SOM, the encoder ELM, ELM_ENC, is built to create the continuous projection. The number of hidden neurons of ELM_ENC is n_A. The inputs of ELM_ENC are the data points x_i; the targets are the corresponding projections x_{p_i} learned from the SOM in Phase-I. Thus, ELM_ENC is trained to learn the transformation from x_i ∈ R^d to x̂_{p_i} ∈ R²:

$$ \tilde{P}(x_i) = \hat{x}_{p_i}, \quad i = 1, \ldots, N. \tag{11} $$

For data points that have the same BMU, the projections learned by ELM_ENC are different. The reason is that points sharing the same SOM projection are originally different from each other: even though ELM_ENC has the same output target in this case, the input values are different, and therefore the projections produced by ELM_ENC differ. Similarly, the optimal complexity of ELM_ENC is determined through the validation process. The LOO error is used as the validation error and is computed for each value of n_A. The optimal ELM_ENC has n_A* neurons and gives the minimum LOO error. Too many neurons in ELM_ENC may lead to over-fitting: ELM_ENC then learns the transformation P(x_i) = x_{p_i} perfectly and projects every data point exactly onto its target x_{p_i}. In that case the result is the same as the SOM projection from Phase-I, which provides no improvement in the continuity of the projection and leads to the same reconstruction error as SOM. Too few neurons in ELM_ENC may lead to under-fitting: the model is not complex enough to approximate the transformation P(x_i) = x_{p_i}, loses the topology information of the data, and ends up with a large reconstruction error.

Phase-IV, optimizing the output weights of ELM_ENC using "fminunc" in Matlab: This phase involves both ELM_ENC and ELM_DEC. With the optimal size of the encoder, n_A*, and the optimal size of the decoder, n_B*, from the previous phases, ELM_ENC and ELM_DEC are assembled as an auto-encoder. Both the inputs and the targets are the original data x_i. The reconstruction error is calculated as:

$$ E = \frac{1}{N} \sum_{i=1}^{N} \left\| R(\tilde{P}(x_i)) - x_i \right\|^2. \tag{12} $$

Define the output weights of ELM_ENC as β_ENC. The reconstruction error E can then be written as a function of β_ENC:

$$ f(\beta_{\mathrm{ENC}}) = E. \tag{13} $$

The goal is to find the minimum reconstruction error E by solving min_{β_ENC} f. The "fminunc" function in Matlab is applied for this optimization. "fminunc" is a subspace trust-region method based on the interior-reflective Newton method; each iteration involves the approximate solution of a large linear system using the method of preconditioned conjugate gradients (PCG). Since the optimization process changes the output weights of the encoder, the encoder is no longer an ELM. The ELM-SOM+ algorithm is summarized in Algorithm 1.

Algorithm 1 ELM-SOM+ Algorithm.
1: Train a SOM on the dataset.
2: Build ELM_DEC of size n_B, with the SOM projection as the input and the original data x as the output target.
3: Find n_B* by optimizing the LOO error of ELM_DEC.
4: Build ELM_ENC of size n_A, with the original data x as the input and the SOM projection as the output target.
5: Find n_A* by optimizing the LOO error of ELM_ENC.
6: Project the original data with ELM_ENC.
7: Reconstruct the data from the projection with ELM_DEC.
8: Calculate the reconstruction error E.
9: Optimize the output weights of ELM_ENC to minimize E.

3.1. PRESS statistics for estimating the Leave-One-Out error

In the model selection process, the Leave-One-Out error is introduced to help fine-tune the auto-encoder structure and select the number of neurons in each layer. LOO is a type of validation error. For a linear system, the LOO error can be expressed by the PRESS formula. The original PRESS formula was proposed by Allen in [40] and used for randomized neural networks by Miche et al. [41].

Algorithm 2 Allen's PRESS algorithm, in a fast matrix form.
1: Compute the utility matrix C = (X^T X)^{-1};
2: and P = XC;
3: compute the pseudo-inverse w = C X^T T;
4: compute the denominator of the PRESS, D = 1 − diag(P X^T);
5: and finally the PRESS error ε = (T − Xw) / D;
6: reduced to an MSE: MSE^PRESS = (1/N) Σ_{i=1}^{N} ε_i².

The original PRESS formula for a linear relationship between X and T can be expressed precisely as

$$ \mathrm{MSE}^{\mathrm{PRESS}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{t_i - x_i (X^\top X)^{-1} x_i^\top t_i}{1 - x_i (X^\top X)^{-1} x_i^\top} \right)^2, \tag{14} $$

where T = (t_1, ..., t_N)^T, which means that each observation is "predicted" using the other N − 1 observations and the residuals are finally squared and summed up. Algorithm 2 implements this formula efficiently through matrix computations. Note that for a linear problem Hβ = T, as in the ELM, the PRESS formula becomes

$$ \mathrm{MSE}^{\mathrm{PRESS}}_{\mathrm{ELM}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{t_i - h_i (H^\top H)^{-1} h_i^\top t_i}{1 - h_i (H^\top H)^{-1} h_i^\top} \right)^2. \tag{15} $$
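As a minimal sketch, the PRESS leave-one-out error of Algorithm 2 can be computed for the ELM system Hβ = T as follows. It is illustrative only: the function name is ours, and np.linalg.pinv is used instead of an explicit inverse.

```python
# PRESS leave-one-out MSE for a linear system H @ beta = T (Algorithm 2); sketch only.
import numpy as np

def press_mse(H, T):
    T = T.reshape(len(T), -1)                   # ensure 2-D targets
    C = np.linalg.pinv(H.T @ H)                 # utility matrix (H^T H)^{-1}
    P = H @ C
    beta = C @ (H.T @ T)                        # ordinary least-squares weights
    D = 1.0 - np.einsum('ij,ij->i', P, H)       # 1 - diag(P H^T), the leverage terms
    eps = (T - H @ beta) / D[:, None]           # leave-one-out residuals
    return float(np.mean(eps ** 2))
```

In model selection, such a function would be evaluated on the hidden-layer matrix H obtained for each candidate number of neurons n_A or n_B, and the size giving the smallest value would be retained.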

The key observation here is that MSE^PRESS_ELM for different targets T with the same inputs X is calculated without changing H. It means that the matrix H is obtained only once for a dataset, and then MSE^PRESS_ELM for different targets can be estimated from that fixed H extremely fast.

4. Experiments

Several experiments are performed to examine the performance of the proposed method on diverse datasets. Our approach is implemented together with PCA and t-SNE, and the performances are compared: both the reconstruction errors and the actual projections. Table 1 summarizes the datasets used in the experiments.

Table 1. Datasets.

| Name of the dataset | Number of samples | Number of features | Target |
| Abalone | 2784 | 8 | Ring (age of abalone) |
| Countries Data | 149 | 17 | NA |
| Sculpture | 693 | 3915* | NA |
| Glass Identification | 214 | 9 | Type of glass |
| MNIST Handwritten Digits | 1000 | 580* | Numbers (0–9) |
| Wisconsin Breast Cancer | 569 | 30 | Benign or Malignant |
| SantaFeA | 988 | 12 | The next observation |
| Blood Transfusion | 748 | 4 | Donating or not |
| Wine Quality | 4898 | 11 | Ordered / not balanced |

Note*: The number of features is smaller than in the original dataset because the common empty columns in the pictures have been eliminated.

Table 2. Comparison of the reconstruction errors.

| Dataset name | PCA | SOM | ELM-SOM+ |
| Abalone | 0.088 | 0.059 | 0.023 |
| Countries | 0.351 | 0.0592 | 0.002 |
| Sculpture | 0.564 | 0.154 | 0.327 |
| Glass | 0.493 | 0.049 | 0.073 |
| Handwritten Digits | 0.883 | 0.527 | 0.577 |
| Breast Cancer | 0.368 | 0.278 | 0.264 |
| SantaFeA | 0.222 | 0.087 | 0.017 |
| Blood Transfusion | 0.089 | 0.064 | 0.020 |
| Wine Quality | 0.564 | 0.458 | 0.418 |

4.1. Data

Nine different and diverse datasets are selected for the experiments, in order to evaluate our methodology under different circumstances. These datasets are listed in Table 1.

4.1.1. Abalone data

The Abalone dataset is used to predict the age of abalone from various physical measurements [42]. It consists of 4177 instances with nine features: gender (male, female, and infant), length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings.

4.1.2. Countries data

This dataset is part of a larger dataset of total wealth estimates and per capita wealth estimates for 209 countries in different years, in which regional and income-group aggregates are computed for each year [43]. We only use the total wealth estimates of 2005 as our Countries dataset, which consists of 209 instances and 17 features, including, for example: population, net foreign assets, produced capital, crop, pasture land, oil, natural gas, hard coal, soft coal, minerals, and subsoil assets.

4.1.3. Sculpture data

This dataset, widely used as a benchmark for recovering neighborhood structure, for instance in [44], includes a set of 698 sculpture face images. These images are computer renderings of a 3-D sculpture head under different poses and lighting directions [44]. Each image is a 4096-dimensional vector built from an array of 64 by 64 pixel brightness values.

4.1.4. Glass identification data

The Glass Identification data [45], which is used in criminal investigation, consists of 214 instances and nine features. This data was created to classify different types of glass.

4.1.5. MNIST Handwritten Digits data

The Modified National Institute of Standards and Technology (MNIST) dataset consists of 60,000 images of handwritten digits in which the black-and-white digits are normalized in size and centered in a fixed-size image. In each image, the center of gravity of the digit lies at the center of the 28 by 28 pixel image. For simplicity, we use 1000 image samples; the dimensionality of each sample vector is 28 × 28 = 784 [46].

4.1.6. Wisconsin Breast Cancer data

The Wisconsin Breast Cancer Data (WBCD) [47] consists of 569 breast mass pattern instances and 30 measurement features computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image. Among these patterns, 2/3 are benign samples and 1/3 are malignant samples.

4.1.7. SantaFeA

The main benchmark of the Santa Fe Time Series Competition, time series A, is a clean, low-dimensional, nonlinear and stationary time series with 1000 observations [48]. Competitors were asked to correctly predict the next 100 observations (SantaFe.A.cont). The performance evaluation done by the Santa Fe Competition was based on the NMSE errors of the predictions submitted by the competitors.

4.1.8. Blood Transfusion

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan [49]. The center sends its blood transfusion service bus to one university in Hsin-Chu City to collect donated blood about every three months. To build an RFMTC model, 748 donors were selected at random from the donor database. Each of these 748 donor records includes R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable representing whether the donor gave blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

4.1.9. Wine Quality

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine [50]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, or wine selling price). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones). It is not certain that all input variables are relevant.

4.2. Performance criteria

We consider two different criteria, the reconstruction error and the visual projection performance, to evaluate each method on these nine datasets. The reconstruction error (defined in Eq. (12)) is calculated to measure how far, on average, the reconstructed data points are from the original data points. In other words, the smaller the reconstruction error, the higher the quality of the dimensionality reduction. Reconstruction quality is also assessed using the visual projection performance.
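To make the reconstruction-error criterion concrete, the sketch below computes Eq. (12) and runs a Phase-IV-style optimization of the encoder output weights. All names are ours, and scipy.optimize.minimize (here with L-BFGS-B) merely stands in for Matlab's fminunc used by the authors; treat it as an illustration rather than the reference implementation.

```python
# Sketch of Eq. (12) and of Phase-IV; scipy's minimize replaces Matlab's fminunc here.
import numpy as np
from scipy.optimize import minimize

def elm_predict(X, W, b, beta):
    """ELM forward pass (same form as in the earlier ELM sketch)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

def reconstruction_error(X, Xp_hat, dec):
    """Mean squared reconstruction error, Eq. (12); dec = (W_dec, b_dec, beta_dec)."""
    X_hat = elm_predict(Xp_hat, *dec)
    return float(np.mean(np.sum((X_hat - X) ** 2, axis=1)))

def optimize_encoder(X, enc, dec):
    """Phase-IV: tune the encoder output weights beta_ENC to minimize Eq. (12)."""
    W_enc, b_enc, beta_enc = enc

    def objective(flat_beta):
        beta = flat_beta.reshape(beta_enc.shape)
        Xp_hat = elm_predict(X, W_enc, b_enc, beta)     # continuous 2-D projection
        return reconstruction_error(X, Xp_hat, dec)

    res = minimize(objective, beta_enc.ravel(), method='L-BFGS-B')
    return W_enc, b_enc, res.x.reshape(beta_enc.shape)
```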



Table 3. Comparison of the time characteristics.

| Name of the dataset | Phase II | Phase III | Phase IV | PCA | SOM | t-SNE |
| Abalone | 1.7 | 0.9 | 360.1 | 0.006 | 0.2 | 15.5 |
| Countries | 0.2 | 0.02 | 8.3 | 0.004 | 0.08 | 0.7 |
| MNIST | 4.1 | 1.9 | 1184.9 | 0.04 | 0.6 | 5.3 |

Fig. 5. Countries ELM-SOM+ Visualization.

Fig. 4. Abalone ELM-SOM+ Visualization.

4.3. Procedure

To compare our novel ELM-SOM+ method with other dimensionality reduction methods (PCA), we perform experiments on nine diverse datasets. For each experiment, we compare the reconstruction errors. Because each dataset has feature variables with different meanings and scales, we normalize each dataset so that every variable has the same influence on the pairwise Euclidean distances between data points. Besides, for each dataset, we remove the response variable for regression problems and the class label for classification problems. This final pre-processing is done to avoid any impact of the response variable or class label on the visualization. Therefore, we can investigate the strength of the dimensionality reduction and visualization for regression/classification problems without relying on any predefined response variable or class label.

4.4. Results

The results of the experiments reveal that our ELM-SOM+ method outperforms PCA and, in most cases, outperforms SOM in terms of reconstruction error. Some reconstruction errors are higher than those of SOM because the encoder of the model is insufficiently optimized due to the long iteration time. Furthermore, ELM-SOM+ provides a continuous projection rather than the discrete one given by SOM. The reconstruction errors are listed in Table 2, and the time characteristics of each phase of ELM-SOM+ for selected datasets are listed in Table 3.

4.5. Visualizations

We first present the ELM-SOM+ visualization results. Next, we present a comparison of ELM-SOM+ with PCA for selected datasets. Finally, we compare the ELM-SOM+ visualization results with the state-of-the-art visualization technique t-SNE.

4.5.1. ELM-SOM+ visualizations

Fig. 4 shows ELM-SOM+ for the Abalone dataset. We can see that infant abalones are nicely visualized between females and males using ELM-SOM+. Fig. 6 shows ELM-SOM+ for the Sculpture dataset. Fig. 7 shows ELM-SOM+ for the Glass dataset.

Fig. 6. Sculpture faces on ELM-SOM+ results. Note: this figure is generated without the optimization process of ELM-SOM+. A clear transition pattern of the faces turning can already be observed in the figure.

Fig. 9 shows ELM-SOM+ for the Wisconsin dataset. Fig. 10 shows ELM-SOM+ for the SantaFe A dataset. Fig. 11 shows ELM-SOM+ for the Blood Transfusion dataset. Fig. 12 shows ELM-SOM+ for the Wine Quality dataset; the red color indicates wines of higher quality, with a quality score of at least 7, and the blue color indicates wines of lower quality, with a quality score below 7.

4.5.2. ELM-SOM+ vs PCA

In this section, ELM-SOM+ is compared with PCA. Fig. 4 shows the visualization results of ELM-SOM+. In the figure, it is not difficult to identify the three classes of abalone shell: male, female and infant. On top of that, unlike the PCA results in Fig. 13, where the three clusters are separated with almost equal distances, ELM-SOM+ is able to generate clusters that are well separated and, more importantly, to show the trend of the gender changes. The distances between different clusters vary. This reflects the uniqueness of individual abalone shells: for example, although some shells are female, they are large and look more like male shells than female shells. Fig. 5 shows the visualization result of the Countries dataset using ELM-SOM+. In comparison with the PCA visualization result in Fig. 14, we can see that ELM-SOM+ is able to preserve more information than PCA, while PCA can barely distinguish a large number of countries.


Fig. 7. Glass ELM-SOM+ Visualization.


Fig. 10. SantaFE A ELM-SOM+ Visualization.

Fig. 11. Blood Transfusion ELM-SOM+ Visualization. Fig. 8. MNIST ELM-SOM+ Visualization.

Fig. 9. Wisconsin ELM-SOM+ Visualization.

4.5.3. ELM-SOM+ vs t-SNE

In this section, the visualization results of ELM-SOM+ are compared with those of the state-of-the-art algorithm t-SNE. Fig. 15 shows the t-SNE visualization results for the Abalone dataset. Five separate clusters can be found in the figure. Compared with ELM-SOM+ in Fig. 4, extra smaller clusters are identified. Fig. 16 shows the t-SNE visualization results for the Countries dataset. Compared with the ELM-SOM+ results in Fig. 5, t-SNE fails to preserve information that ELM-SOM+ is capable of preserving: the consistency with geopolitical facts. For example, in the ELM-SOM+ results, oil producers are projected together; the United States is a clear outlier, followed by China, India and the Russian Federation, etc. Fig. 8 shows the ELM-SOM+ results for the MNIST dataset, and Fig. 19 shows the t-SNE results for the MNIST dataset.

Fig. 12. Wine ELM-SOM+ Visualization. The red color indicates wines of higher quality, with a quality score of at least 7; the blue color indicates wines of lower quality, with a quality score below 7.

Fig. 13. Abalone PCA Visualization.



Fig. 14. Countries PCA Visualization.

Fig. 18. Glass Identification t-SNE Visualization.

Fig. 15. Abalone t-SNE Visualization.

Fig. 19. MNIST t-SNE Visualization.

Fig. 16. Countries t-SNE Visualization.

Fig. 20. Wisconsin Breast Cancer t-SNE Visualization.

Fig. 17. Sculpture t-SNE Visualization.

Fig. 21. SantaFeA t-SNE Visualization.




Fig. 22. Blood Transfusion t-SNE Visualization.

Fig. 23. Wine Quality t-SNE Visualization.

The t-SNE visualization results are shown for Wine Quality in Fig. 23, for Wisconsin Breast Cancer in Fig. 20, for SantaFeA in Fig. 21, for Sculpture in Fig. 17, for Blood Transfusion in Fig. 22, and for Glass Identification in Fig. 18.

5. Conclusion

According to the experiments performed on diverse datasets, the ELM-SOM+ technique contributes to an efficient dimensionality reduction. Across the different datasets, ELM-SOM+ decreases the reconstruction error considerably compared to PCA. Furthermore, it can be concluded that ELM-SOM+ improves on the SOM algorithm: it not only retains the nonlinearity and the suitability for big data, but also eliminates the discontinuity of the SOM algorithm. It creates a continuous projection by using two Extreme Learning Machine models, the first to perform the dimensionality reduction and the second to perform the reconstruction. Although t-SNE also produces good visualization results, it does not preserve the topology of the data in the high-dimensional space. In the future, we will extend and test ELM-SOM+ for big data applications.

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property. We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from: [email protected].

References [1] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, 1st ed., Springer-Verlag New York, 2007. [2] S. Kaski, J. Peltonen, Dimensionality reduction for data visualization, IEEE Signal Process. Mag. 28 (2) (2011) 100–104. [3] J.A. Lee, A. Lendasse, M. Verleysen, Nonlinear projection with curvilinear distances: isomap versus curvilinear distance analysis, Neurocomputing 57 (2004) 49–76. New Aspects in Neurocomputing: 10th European Symposium on Artificial Neural Networks 2002 [4] A. Akusok, S. Baek, Y. Miche, K.-M. Bjork, R. Nian, P. Lauren, A. Lendasse, ELMVIS+: Fast nonlinear visualization technique based on cosine distance and Extreme Learning Machines, Neurocomputing 205 (2016) 247–263. [5] R. Hu, V. Roshdibenam, H.J. Johnson, E. Eirola, A. Akusok, Y. Miche, K.-M. Björk, A. Lendasse, Elm-SOM: A continuous self-organizing map for visualization, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IEEE, 2018, pp. 1–8. [6] E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2014. [7] K. Pearson, LIII. on lines and planes of closest fit to systems of points in space, Lond. Edinb. Dublin Philosoph. Mag. J. Sci. 2 (11) (1901) 559–572. [8] J.B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29 (1) (1964) 1–27. [9] N.V. Kireeva, S.I. Ovchinnikova, I.V. Tetko, A.M. Asiri, K.V. Balakin, A.Y. Tsivadze, Nonlinear dimensionality reduction for visualizing toxicity data: distance-based versus topology-based approaches, ChemMedChem 9 (5) (2014) 1047–1059. [10] J.W. Sammon, A nonlinear mapping for data structure analysis, IEEE Trans. Comput. C-18 (5) (1969) 401–409. [11] P. Demartines, J. Herault, Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets, IEEE Trans. Neural Netw. 8 (1) (1997) 148–154. [12] J.B. Tenenbaum, Mapping a manifold of perceptual observations, in: Proceedings of the Conference on Advances in Neural Information Processing Systems 10, in: NIPS ’97, MIT Press, Cambridge, MA, USA, 1998, pp. 682–688. [13] J.B. Tenenbaum, V.d. Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323. [14] J. Lee, A. Lendasse, M. Verleysen, Curvilinear distance analysis versus isomap, in: Proceedings of the European Symposium on Artificial Neural NetworksBruges (Belgium), 2002, pp. 185–192. [15] L.v.d. Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (Nov) (2008) 2579–2605. [16] C.M. Bishop, M. Svensen, C.K.I. Williams, GTM: the generative topographic mapping, Neural Comput. 10 (1) (1998) 215–234. [17] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Lapl. Eigenmaps Spect. Techn. Embedd. Cluster. (2001) 585–591. [18] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396. [19] T.M. Martinetz, S.G. Berkovich, K.J. Schulten, ’Neural-gas’ network for vector quantization and its application to time-series prediction, IEEE Trans. Neural Netw. 4 (4) (1993) 558–569. [20] T. Kohonen, Self-organized formation of topologically correct feature maps, Biolog. Cybernet. 43 (1) (1982) 59–69. [21] A. Qin, P. Suganthan, Robust growing neural gas algorithm with application in cluster analysis, Neural Netw. 17 (8) (2004) 1135–1148. [22] E. Cambria, G.B. Huang, L.L.C. Kasun, H. Zhou, C.M. Vong, J. Lin, J. Yin, Z. Cai, Q. Liu, K. Li, V.C.M. Leung, L. 
Feng, Y.S. Ong, M.H. Lim, A. Akusok, A. Lendasse, F. Corona, R. Nian, Y. Miche, P. Gastaldo, R. Zunino, S. Decherchi, X. Yang, K. Mao, B.S. Oh, J. Jeon, K.A. Toh, A.B.J. Teoh, J. Kim, H. Yu, Y. Chen, J. Liu, Extreme learning machines [trends controversies], IEEE Intell. Syst. 28 (6) (2013) 30–59. [23] Y. Miche, M. van Heeswijk, P. Bas, O. Simula, A. Lendasse, Trop-elm: a double-regularized elm using Lars and tikhonov regularization, Neurocomputing 74 (16) (2011) 2413–2421. Advances in Extreme Learning Machine: Theory and Applications Biological Inspired Systems. Computational and Ambient Intelligence [24] C.-M. Vong, J. Du, C.-M. Wong, J.-W. Cao, Postboosting using extended g-mean for online sequential multiclass imbalance learning, IEEE Trans. Neural Netw. Learn. Syst. (99) (2018) 1–15.



[25] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, Vol. 2, IEEE, 2004, pp. 985–990. [26] C.M. Vong, K.I. Tai, C.M. Pun, P.K. Wong, Fast and accurate face detection by sparse Bayesian extreme learning machine, Neural Comput. Appl. 26 (5) (2015) 1149–1156. [27] C. Chen, C.-M. Vong, C.-M. Wong, W. Wang, P.-K. Wong, Efficient extreme learning machine via very sparse random projection, Soft Comput. 22 (11) (2018) 3563–3574. [28] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501. Neural Networks. [29] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, Op-elm: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158–162. [30] A. Akusok, K.M. Björk, Y. Miche, A. Lendasse, High-performance extreme learning machines: a complete toolbox for big data applications, IEEE Access 3 (2015) 1011–1025. [31] A. Gritsenko, Z. Sun, S. Baek, Y. Miche, R. Hu, A. Lendasse, Deformable surface registration with extreme learning machines, in: Proceedings of the International Conference on Extreme Learning Machine, Springer, 2017, pp. 304–316. [32] Y. Peng, W. Kong, B. Yang, Orthogonal extreme learning machine for image classification, Neurocomputing 266 (2017) 458–464. [33] Y. Peng, B.-L. Lu, Discriminative extreme learning machine with supervised sparsity preserving for image classification, Neurocomputing 261 (2017) 242–252. [34] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B Cybern. 42 (2) (2012) 513–529. [35] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892. [36] G.-B. Huang, What are extreme learning machines? filling the gap between frank rosenblatt’s dream and john von neumann’s puzzle, Cogn. Comput. 7 (3) (2015) 263–278. [37] A. Lendasse, M. Cottrell, V. Wertz, M. Verleysen, Prediction of electric load using Kohonen maps - application to the polish electricity consumption, in: Proceedings of the American Control Conference (IEEE Cat. No.CH37301), Vol. 5, 2002, pp. 3684–3689. [38] S. Dablemont, G. Simon, A. Lendasse, A. Ruttiens, F. Blayo, M. Verleysen, Time series forecasting with som and local non-linear models - application to the dax30 index prediction, in: Proceedings of the Workshop on Self-Organizing Maps, Kitakyushu, Japan, 2003, pp. 340–345. [39] P. Merlin, A. Sorjamaa, B. Maillet, A. Lendasse, X-Som and L-Som: a double classification approach for missing value imputation, Neurocomputing 73 (7) (2010) 1103–1108. Advances in Computational Intelligence and Learning. [40] D.M. Allen, The relationship between variable selection and data agumentation and a method for prediction, Technometrics 16 (1) (1974) 125–127. [41] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: Optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158–162. [42] A. Asuncion, D. Newman, Abalone, (https://archive.ics.uci.edu/ml/datasets/ abalone). [43] Worldbank, Countries, (http://databank.worldbank.org/data/). [44] J. Venna, J. Peltonen, K. Nybo, H. Aidos, S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, J. Mach. Learn. Res. 
11 (Feb) (2010) 451–490. [45] B. German, V. Spiehler, Glass identification, (https://archive.ics.uci.edu/ml/datasets/glass+identification). [46] Y. LeCun, C. Institute, C. Cortes, C.J. Burges, MNIST, (http://yann.lecun.com/exdb/mnist/). [47] W.H. Wolberg, O. Mangasarian, Breast cancer, (https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)). [48] A.S. Weigend, N.A. Gershenfeld, Results of the time series prediction competition at the Santa Fe Institute, in: Proceedings of the IEEE International Conference on Neural Networks, IEEE, 1993, pp. 1786–1793. [49] I.-C. Yeh, K.-J. Yang, Ting, Tao-Ming, Blood transfusion, (https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center). [50] Y. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis, Wine quality, (https://archive.ics.uci.edu/ml/datasets/wine+quality).

Renjie Hu was born in 1992 in China. He is a Ph.D. candidate in the Department of Industrial and Systems Engineering at the University of Iowa. He is also working as a Research Assistant Professor in the Department of Information and Logistics Technology at the University of Houston. He received his B.S. degree in Management Information Systems from Beijing University of Posts and Telecommunications (BUPT), China, and the M.S. degree in Operations Research from Columbia University. His research interest mainly focuses on Machine Learning, in particular: Missing Value Imputation, Big Data Analytics, Feature Selection, Dimensionality Reduction, and Data Visualization as well as their applications in multiple disciplines including healthcare, transportation, and engineering. He is also a member of INFORMS, IISE, and IEEE.

Karl Ratner is an undergraduate student at the University of Minnesota studying economics with a strong interest in data science and analytics. In the summer of 2018, he was an undergraduate research assistant at the University of Iowa in the Industrial Engineering Department. Karl also works as a data science intern at Edammo Inc.

Edward Ratner has over 15 years of industry experience in creating novel algorithms from conception through development to product with several products deployed to a wide customer base. His work over the course of his career has resulted in over 30 U.S. patents. Ed was elevated to Senior Member of Institute of Electrical and Electronics Engineers in 2004. During his undergraduate studies at Caltech, he received the Froehlich Prize. He was also a Hertz Foundation Fellow at Stanford University where he received his Ph.D. In 2009, he was listed as one of Caltech’s notable alumni by Forbes magazine.

Yoan Miche was born in 1983 in France. He received an Engineer's Degree from Institut National Polytechnique de Grenoble (INPG, France), and more specifically from TELECOM, INPG, in September 2006. He also graduated with a Master's Degree in Signal, Image and Telecom from ENSERG, INPG, at the same time. He recently received his Ph.D. degree in Computer Science and Signal and Image Processing from both the Aalto University School of Science and Technology (Finland) and the INPG (France). His main research interests are steganography/steganalysis and machine learning for classification/regression.

Kaj-Mikael Björk received the master's and Ph.D. degrees in chemical engineering from Åbo Akademi University, in 1999 and 2002, respectively, and the Ph.D. degree in business administration (information systems) from Åbo Akademi University, in 2006. He was a Visiting Researcher with Carnegie Mellon University, Pittsburgh, USA, in 2000, the University of Linköping, Sweden, in 2001, and UC Berkeley, CA, USA, from 2005 to 2006. Before working as the Head of Department, he was a Principal Lecturer with Logistics (Arcada) and an Assistant Professor in Information Systems with Åbo Akademi University. He has held approximately 15 different courses in the fields of logistics and management science and engineering. Within the research projects, he has participated in approximately 60 scientific peer-reviewed articles and has an h-index of 10 (according to Google Scholar). His research interests are in information systems, analytics, supply chain management, machine learning, fuzzy logic, and optimization.

Amaury Lendasse was born in 1972 in Belgium. He received the M.S. degree in mechanical engineering from the Université catholique de Louvain (Belgium) in 1996, the M.S. in control in 1997 and the Ph.D. in 2003 from the same university. In 2003, he was a postdoctoral researcher in the Computational Neurodynamics Lab at the University of Memphis. Since 2004, he has been a senior researcher in the Adaptive Informatics Research Centre at the Helsinki University of Technology in Finland. He is leading the Time Series Prediction Group. He is the author or coauthor of 64 scientific papers in international journals, books or communications to conferences with reviewing committee. His research includes time series prediction, chemometrics, variable selection, noise variance estimation, determination of missing values in temporal databases, nonlinear approximation in financial problems, functional neural networks and classification.