Multiple Universum Empirical Kernel Learning

Engineering Applications of Artificial Intelligence 89 (2020) 103461 Contents lists available at ScienceDirect Engineering Applications of Artificia...

Download PDF

1MB Sizes 0 Downloads 103 Views

Report

PDF Reader
Full Text

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Contents lists available at ScienceDirect

Engineering Applications of Artificial Intelligence journal homepage: www.elsevier.com/locate/engappai

Multiple Universum Empirical Kernel Learning✩ Zhe Wang a,b ,∗, Sisi Hong b , Lijuan Yao b , Dongdong Li b ,∗, Wenli Du a ,∗, Jing Zhang b a

Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, PR China b Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, 200237, PR China

ARTICLE

INFO

Keywords: Multiple kernel learning Empirical kernel mapping Universum learning Imbalanced data Pattern recognition

ABSTRACT This paper proposes a novel framework called Multiple Universum Empirical Kernel Learning (MUEKL) that combines the Universum learning with Multiple Empirical Kernel Learning (MEKL) for the first time to inherit the advantages of both techniques. The proposed MUEKL not only obtained supplementary information of multiple feature spaces through MEKL, but also obtained priori information of samples by Universum learning. MUEKL incorporates a novel method, Imbalanced Modified Universum (IMU), to generate more efficient Universum samples by introducing the imbalanced ratio of data. MUEKL develops the basic multiple kernel learning framework by introducing a regularization of Universum data. The function of the introduced regularization is to adjust the classifier boundary closer to the Universum data to alleviate the influence of the imbalanced data. Moreover, MUEKL performs excellent generalization for both the imbalanced and balanced problems. Extensive experiments verify the effectiveness of the MUEKL and IMU.

1. Introduction Machine learning has been successfully applied in many field of pattern recognition (Sa-Couto and Wichert, 2019), i.e., data mining (Sparks et al., 2016), image processing (Li et al., 2015), and natural language processing (Chen et al., 2019). Specifically, various learning methods have been applied in many applications offering superior performance (Wu et al., 2019a; Zhu et al., 2018), such as using fuzzy information (Capuano et al., 2018) to determine linguistic preference relation (Liu et al., 2019; Wu et al., 2019b) and make group decisions (Wu et al., 2019, 2018), and using kernel methods to solve object recognition problems (Medjahed et al., 2017). Since numerous practical pattern recognition problems are highdimensional and complex, it is necessary and meaningful to study the analysis and classification methods of complex patterns. Kernelbased methods (Chong et al., 2018; Wang et al., 2019b) are relatively new learning methods developed from statistical learning theory, which effectively overcomes the shortcomings of local minimization and incomplete statistical analysis of traditional pattern recognition methods. They can effectively solve non-linear problems by mapping the input data 𝜒 into a new feature space 𝐹 , 𝛷 ∶ 𝜒 → 𝐹 (Muller et al., 2001). Here the mapping 𝛷 is represented by introducing a kernel. The kernel mapping involves the two forms of Implicit Kernel Mapping (IKM) (Bucak et al., 2014) and Empirical Kernel Mapping (EKM) (Xiong et al., 2005a). IKM is achieved by a kernel function

𝑘(𝒙𝑖 , 𝒙𝑗 ) = 𝛷𝑖 (𝒙𝑖 ).𝛷𝑖 (𝒙𝑗 ) by which the explicit form of 𝛷𝑖 (𝒙) is not required. In contrast, EKM must provide an explicit form of 𝛷𝑒 (𝒙) to obtain the exact features of the learning process. The explicit feature space can be obtained by EKM so that some effective processes and algorithms can be applied (Marukatat, 2016). From the perspective of the number of kernels, kernel-based learning also includes single kernel learning and multiple kernel learning (Wang et al., 2019a). Single kernel learning is for mapping the dataset into a new feature space by the specific kernel function. Because the characteristics of different kernel functions vary, the kernel function behaves unequally on different scenarios. In some cases, the choice of the optimal kernel is crucial to promote the performance for a specific task (Xiong et al., 2005b). Thus, instead of using a single kernel, multiple kernel learning (MKL) (Lewis et al., 2006) was proposed. Compared to kernel-based algorithms employing a fixed kernel, MKL exhibits flexibility for dealing with the problems arising from multiple and heterogeneous data sources (Nazarpour and Adibi, 2014). The input space can be mapped into multiple feature spaces where each shows the different information of the original data (Rakotomamonjy et al., 2007). Most existing MKL adopt IKM, thus, is called Multiple Implicit Kernel Learning (MIKL), while, in contrast, MKL with EKM is called Multiple Empirical Kernel Learning (MEKL). Considering the advantages of implicit kernel and empirical kernel learning, there exist many

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103461. ∗ Corresponding author. E-mail addresses: [email protected] (Z. Wang), [email protected] (D. Li), [email protected] (W. Du).

https://doi.org/10.1016/j.engappai.2019.103461 Received 19 March 2019; Received in revised form 17 October 2019; Accepted 24 December 2019 Available online xxxx 0952-1976/© 2019 Elsevier Ltd. All rights reserved.

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

2. Universum learning

studies based on these approaches. For example, Leski (2004) applied the IKM into the Modification of Ho–Kashyap algorithm with Squared approximation of the misclassification errors (MHKS) (Leski, 2003) algorithm. Alain et al. (2008) proposed an iterative algorithm named SimpleMKL, which is classified as MIKL, that implemented a linear approach with kernel weights in a simplex. Cortes et al. (2009) applied the MIKL into the Support Vector Machine (SVM) to propose the Nonlinear Multiple Kernel Support Vector Machine (NLMKSVM) algorithm. Wang et al. (2008) proposed a novel Multiple Kernel Learning Algorithm the multiple empirical kernel algorithm MultiK-MHKS where the input data maps into multiple different feature spaces by the corresponding empirical kernel. Moreover, they applied the Canonical Correlation Analysis (CCA) to the MultiK-MHKS to maximally correlate the multiple feature spaces. Since EKM can show the specific form of the sample in the kernel space, after mapping the sample to the kernel space, nearly all the classifiers can be directly applied to the kernel space. This enables a more straightforward process for analyzing the structure of the sample in the kernel space. The existing MEKL is considered an alternative approach of kernel learning (Gu et al., 2017), which introduces the existing techniques into MKL and illustrates the differences between EKM and IKM. However, little research concentrates on the inherent characteristics of EKM. The traditional MEKL problem optimizes the learning framework by minimizing the empirical risk, the regularization risk, and the loss term of the multiple feature spaces (Wang et al., 2008). It also ignores some information among the training samples, which may provide important contributions to the classification performance. Fortunately, Universum learning (Liu et al., 2017) meets this need. Universum learning provides a collection of ‘‘non-examples’’ that do not belong to either class (Weston et al., 2006), and these unlabeled samples, called the Universum samples, imply prior knowledge. This learning process is very effective due to its ability to capture prior knowledge in the data domain (Tencer et al., 2017) and greatly enrich the pattern representation. Leveraging the advantages of MEKL and Universum learning, this paper applies the Universum learning into the multiple empirical kernel framework to propose a novel algorithm called the Multiple Universum Empirical Kernel Learning (MUEKL). However, simply combining Universum learning with MEKL does not take full advantage of multiple kernels, and its performance improvement is still limited. By taking into account the imbalanced ratio (Fan et al., 2017; Wang and Cao, 2019) of the data set, we designed a new method named IMU to generate the Universum sample. Based on the newly generated samples, a new regularization term 𝑅𝑢𝑛𝑖 was introduced to the base framework MultiK-MHKS to combine Universum learning with each empirical kernel, which can work together to promote Universum learning in the optimization process of the objective function. Numerous experiments have demonstrated that the proposed MUEKL method not only performs well on imbalanced data sets, but also on balanced data sets. The contributions of this paper include the following. ∙ This paper proposes a novel framework called Multiple Universum Empirical Kernel Learning (MUEKL) that combines the Universum learning with multiple empirical kernel learning for the first time. ∙ MUEKL incorporates a novel method, Imbalanced Modified Universum (IMU), to generate more efficient Universum samples by introducing the imbalanced ratio of data. ∙ MUEKL develops the basic multiple kernel learning framework by introducing a regularization 𝑅𝑢𝑛𝑖 of Universum data. 𝑅𝑢𝑛𝑖 utilizes collaborative learning between multiple empirical kernels to optimize Universum learning and improve the classification performance of each empirical kernel. Extensive experiments verify the effectiveness of the MUEKL and IMU on balanced and imbalanced problems. The remainder of this paper is organized as follows. In Section 2, we describe the Universum learning. In Section 3, we introduce our MUEKL framework and the novel way that IMU generates Universum samples. Experimental results are shown in Section 4 with conclusions and future work summarized in Section 5.

We first review the development of Universum learning with a focus on the combination of Universum learning and kernel learning. Next, we introduce the generation method of Universum samples. 2.1. Development of universum learning The development of Universum learning began when Vapnik and Vapnik (1998) introduced an alternative capacity concept for large margin approaches. Weston et al. (2006) studied a new framework based on the work of Vapnik by providing a collection of ‘‘nonexamples’’ that do not belong to either class in binary classification. These unlabeled samples, called the Universum samples, implied prior knowledge to assist the formation of a classifier boundary. The proposed algorithm named USVM (Weston et al., 2006) applied the Universum samples into an SVM framework and employed prior knowledge to improve the performance of the classifier. The USVM also applied the Universum samples into supervised learning. Generally, Universum Learning offered two forms of non-kernelbased and kernel-based. For non-kernel-based Universum learning, Zhang et al. (2009) applied the Universum into semi-supervised learning by employing labeled data, unlabeled data, and Universum data to improve the classification performance. They also proposed a graphbased method to utilize the Universum data to depict the prior information for possible classifiers. Experiments demonstrated that the proposed method employing Universum data obtained superior performances over conventional supervised and semi-supervised methods proving the effectiveness of the approach. With this success, researchers began to apply the Universum learning into other classifier methods. For example, Shen et al. (2012) designed a novel boosting algorithm that takes advantage of the available Universum data, hence its name UBoost, and experimentally demonstrated that it could deliver improved classification accuracy over standard boosting algorithms that use labeled data alone. For kernel-based Universum learning, Qi et al. (2012) utilized Universum data to improve the performance of the Twin Support Vector Machine approach with the proposed new algorithm called U-TSVM. Xu et al. (2016) furthered this advance with a proposed least squares TSVM with Universum samples and implemented the structural risk minimization principle by introducing a regularization term. Tian et al. (2016) applied the Universum learning to a semisupervised classification problem and utilized the Universum data to improve its accuracy. Zhu and Gao (2015) introduced the Universum samples into implicit kernel learning and generated the Universum samples before the kernel mapping. As introduced above, in this paper, we combine the Universum learning with multiple empirical learning such that our MUEKL generates the corresponding Universum samples in each kernel feature space. 2.2. Universum samples Many study efforts have demonstrated the effectiveness and superiority of Universum learning by employing prior knowledge of unlabeled Universum data to improve classification performance. Generating the Universum data is a core requirement of the process. In the USVM framework proposed by Weston et al. (2006), four types of Universum examples were defined as 𝑈𝑁𝑜𝑖𝑠𝑒 , 𝑈𝑅𝑒𝑠𝑡 , 𝑈𝐺𝑒𝑛 , and 𝑈𝑀𝑒𝑎𝑛 . 𝑈𝑁𝑜𝑖𝑠𝑒 includes images of ‘‘random noise’’ by generating uniformly distributed pixel features. 𝑈𝑅𝑒𝑠𝑡 are the data that belong to other classes while 𝑈𝐺𝑒𝑛 generates artificial data. 𝑈𝑀𝑒𝑎𝑛 creates artificial data by randomly selecting two data belonging to different classes and then constructing their mean. Experiments showed that 𝑈𝑁𝑜𝑖𝑠𝑒 and 𝑈𝐺𝑒𝑛 are worse than 𝑈𝑅𝑒𝑠𝑡 and 𝑈𝑀𝑒𝑎𝑛 because of their randomness. Chen and Zhang (2009) also found that not all the Universum samples are helpful. Even worse, some Universum examples would play the role of outliers, which could 2

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

severely disrupt the training process. So, they proposed a method to select only the informative ones, i.e., the in-between Universum samples (IBU). Research shows that when the Universum samples are positioned in between different classes and close to the classification hyperplane, the classifier can obtain superior performance compared to other Universum data. The IBU data was generated by calculating the means of each pair between two patterns, respectively, from two classes of interest. In this paper, we introduce a novel method for the IMU to generate Universum data and adopt the IMU in subsequent experiments. In the experimental section, this paper also compares the performance of IBU and IMU proving that IMU offers a superior result while also improving the performance in imbalanced datasets.

equal to 1, and 𝒃𝑁×1 represents all entries equal to the nonnegative values 𝑏. The base classifier MultiK-MHKS used in this paper is different from the existing multi-kernel learning method by convexly combining several kernel matrices. MultiK-MHKS utilizes the principle of the motivating argument from Canonical Correlation Analysis (CCA) that maps input samples to different feature spaces through multiple kernel functions for generating different view modes. Then, the generated patterns are merged into a single learning process to ensure the output of the algorithm is as consistent as possible between the various perspectives. Finally, the output error under each feature space is minimized, and the value of the weight parameter is constrained until it is iterated to the optimal value to generate a discriminant function.

3. Method

3.2. Regularization term, 𝑅𝑢𝑛𝑖

This section presents how to introduce Universum learning into MEKL. We select the MultiK-MHKS (Wang et al., 2008) as the base classifier for developing our novel algorithm, MUEKL.

We propose a novel method, called IMU, to generate Universum samples depending on the imbalanced size of samples. The IMU is represented as

3.1. Multiple empirical kernel learning

𝒙𝑢 =

First, we review empirical kernel learning. Suppose there are 𝑁 training data, {(𝒙𝑖 , 𝜑𝑖 )}𝑁 , where 𝒙𝑖 ∈ R𝑑 and 𝜑𝑖 ∈ {+1, −1}. For the 𝑖=1 , 𝑿 denotes the 𝑁 × 𝑑 sample matrix where each sample set {(𝒙𝑖 , 𝜑𝑖 )}𝑁 𝑖=1 row represents a sample, 𝑑 denotes the dimension of the sample, and 𝜑𝑖 represents the class label with the positive class labeled as +1 and the negative class as −1. 𝑲 = [𝑘𝑒𝑟𝑖𝑗 ]𝑁×𝑁 denotes the 𝑁 × 𝑁 kernel matrix where 𝑘𝑒𝑟𝑖𝑗 = 𝛷(𝒙𝑖 ) ⋅ 𝛷(𝒙𝑗 ) = 𝑘𝑒𝑟(𝒙𝑖 , 𝒙𝑗 ). If the rank of 𝑲 is 𝑟, then the kernel matrix 𝑲 can be decomposed as

where 𝑁𝑝 denotes the number of positive samples, 𝑁𝑛 denotes the number of negative samples with 𝒙𝑖 and 𝒙𝑗 representing the positive and negative samples, respectively. When 𝑁𝑝 is equal to 𝑁𝑛 , IMU is degraded into IBU. The Universum samples generated by IMU satisfies the characteristics of the imbalanced data. When the imbalanced ratio is higher, the classifier boundary is closer to the minority class. However, the Universum samples generated by IMU also play a more critical role in adjusting the classifier boundary. We next leverage Universum learning to obtain prior knowledge of the data, so the regularization term 𝑅𝑢𝑛𝑖 is introduced into the MEKL , and framework. Suppose there are 𝑈 Universum samples, {(𝒙∗𝑖 , 𝜑𝑖 )}𝑈 𝑖=1 𝜑𝑖 ∈ {+1, −1}. The Universum regularization term 𝑅𝑢𝑛𝑖 is defined as

𝑲𝑁×𝑁 = 𝑸𝑁×𝑟 ∧𝑟×𝑟 𝑸𝑇𝑟×𝑁

(1)

where ∧ is a diagonal matrix consisting of the 𝑟 positive eigenvalues of 𝑲, and 𝑸 consists of the corresponding orthonormal eigenvectors. Thus, the explicit mapping, also called the EKM (Wang et al., 2008), is given as 𝛷𝑒 ∶ 𝜒 → ϝ𝑑𝑒 such that 𝜒 →∧

−1∕2

𝑇

𝑇

𝒀∗

Therefore, 𝑖 ) denotes the sample 𝒙𝑖 mapped into a new feature space by kernel 𝛷𝑒 . Each 𝛷𝑒 (𝒙𝑖 ), also called one view of the input space, and kernel correspond to one feature space. The mapped sample can then be obtained explicitly. Therefore, MEKL has multiple kernel feature spaces with the corresponding EKMs denoted as 𝑒 (𝒙 )}𝑀 . {𝛷1𝑒 (𝒙𝑖 ), … , 𝛷𝑖𝑒 (𝒙𝑖 ), … , 𝛷𝑀 𝑖 𝑖=1 Our approach adapts the MEKL, MultiK-MHKS, as our basic framework with 𝑀 ∑ [𝑅𝑒𝑚𝑝 + 𝑐𝑖 𝑅𝑟𝑒𝑔 ] + 𝛾𝑅𝐼𝐹 𝑆𝐿 ,

(3)

𝑖=1

where for the 𝑖𝑡ℎ view, 𝑅𝑒𝑚𝑝 is the empirical risk, 𝑅𝑟𝑒𝑔 denotes the regularization term that controls the generalization ability of the classifier, and 𝑐𝑖 ≥ 0 is a regularization parameter that controls the trade-off between 𝑅𝑒𝑚𝑝 and 𝑅𝑟𝑒𝑔 . The 𝑅𝐼𝐹 𝑆𝐿 plays an important role in the agreement penalization on the outputs of multiple kernel as the parameter 𝛾 controls the weight of its regularization. A detailed description of the three terms includes the following: 𝑅𝑒𝑚𝑝 = (𝒀 𝝎 − 𝟏𝑁×1 − 𝒃𝑁×1 )𝑇 (𝒀 𝝎 − 𝟏𝑁×1 − 𝒃𝑁×1 ),

(4)

̃𝑇 𝝎 ̃, 𝑅𝑟𝑒𝑔 = 𝝎

(5)

𝑅𝐼𝐹 𝑆𝐿 =

𝑀 ∑ 𝑖=1

(𝒀𝑖 𝝎𝑖 −

1 𝑀

𝑀 ∑ 𝑗=1;𝑗≠𝑖

𝒀𝑗 𝝎𝑗 )𝑇 (𝒀𝑖 𝝎𝑖 −

1 𝑀

𝑀 ∑

𝒀𝑗 𝝎𝑗 ),

𝑁𝑛 𝒙, 𝑁𝑝 + 𝑁𝑛 𝑗

(7)

[(𝒙∗1 , 1); ...; (𝒙∗𝑈 , 1)],

(8) R𝑑

̃∈ 𝝎 are the weight vectors with the where = augmented weight vector 𝝎 = [̃ 𝝎𝑇 , 𝝎0 ]. The Universum samples are near to the classifier boundary according to the generation criterion of Universum samples. The introduced regularization term 𝑅𝑢𝑛𝑖 reflects knowledge about the domain of the entire data distribution, so the Universum samples assist the classifier in developing a new boundary with the boundary perturbation. Universum learning can improve the performance of the classifier by utilizing the domain of the complete data distribution as well as solve the imbalanced problem. For the imbalanced dataset, the classifier is closer to the minority class so that its performance is not good for the imbalanced dataset. When the imbalanced ratio is higher, the Universum samples generated by IMU is closer to the majority class (Zhu et al., 2019). The Universum regularization term 𝑅𝑢𝑛𝑖 tries to locate the Universum samples at the classifier boundary, so the classifier boundary diverges from the minority class appropriately.

𝛷𝑒 (𝒙

𝑚𝑖𝑛𝐿 =

𝒙𝑖 +

𝑅𝑢𝑛𝑖 = (𝒀 ∗ 𝝎)𝑇 (𝒀 ∗ 𝝎),

(2)

𝑸 [𝑘𝑒𝑟(𝒙, 𝒙1 ), … , 𝑘𝑒𝑟(𝒙, 𝒙𝑁 )] .

𝑁𝑝 𝑁𝑝 + 𝑁𝑛

3.3. Proposed MUEKL method As discussed, the MUEKL framework combines MEKL with Universum learning. MUEKL inherits the advantages of MEKL while incorporating the superior Universum learning approach and being capable of solving the imbalanced problem. The novel method is very effective due to its ability to capture prior knowledge in the data domain (Tencer et al., 2017) and greatly enrich the pattern representation. Fig. 1 shows a flow chart for MUEKL. First, the original samples {𝒙𝑖 }𝑁 are subjected 𝑖=1 to empirical kernel mapping by the number of 𝑀 kernel functions to obtain more supplemental information of the training samples. For 𝑖𝑡ℎ kernel space, the mapped data is 𝒀𝑖 . Then, the unlabeled Universum

(6)

𝑗=1;𝑗≠𝑖

̃ ∈ R𝑑 in Eq. (5) are the weight vectors with the augmented where 𝝎 weight vector 𝝎 = [̃ 𝝎𝑇 , 𝝎0 ]. The matrix 𝒀 is defined as 𝒀 = [𝜑1 (𝒙𝑇1 , 1); ...; 𝜑𝑁 (𝒙𝑇𝑁 , 1)], 𝟏𝑁×1 denotes the vector of dimension with all entries 3

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Fig. 1. The flow chart for MUEKL. (First, the original samples {𝑥𝑖 }𝑁 are subjected to empirical kernel mapping by the number of 𝑀 kernel functions. For 𝑖𝑡ℎ kernel space, 𝑖=1 the mapped data is 𝑌𝑖 . Then, the unlabeled Universum samples generated by the IMU method from the mapped data. IMU takes the imbalance ratio into account and is more advantageous for the imbalanced problem. All mapped samples undergo Universum learning within different feature spaces. Finally, The Universum samples perturb the formation of the classification boundary to achieve a better classification result.)

samples generated by the IMU method from the mapped data in different feature spaces. IMU takes the imbalanced ratio into account and is more advantageous for the imbalanced problem. Finally, all mapped samples undergo Universum learning within different feature spaces and the Universum samples perturb the formation of the classification boundary to achieve a better classification result. Combining Eq. (3) with Eq. (8), the framework of MUEKL is defined as 𝑚𝑖𝑛𝐿 =

𝑀 ∑

where 𝒃𝑖 denotes the distance from the samples to the hyperplane. The components of 𝒃𝑖 need to be nonnegative, and we employ the iterative algorithm for selecting the optimal 𝝎𝑖 and 𝒃𝑖 . Thus, 𝒃𝑝𝑖 denotes vector 𝒃 in the 𝑖𝑡ℎ view of the 𝑝𝑡ℎ iteration, so we initialize 𝒃1𝑖 ≥ 0 and keep 𝒃𝑝𝑖 ≥ 0 at each iteration 𝑝. We then have { 1 𝒃𝑖 ≥ 0 (14) 𝑝+1 𝒃𝑖 = 𝒃𝑝𝑖 + 𝜌𝑖 (𝒆𝑝𝑖 + |𝒆𝑝𝑖 |), where 𝒆𝑝𝑖 = 𝒀𝑖 𝝎𝑖 − 𝟏𝑁×1 − 𝒃𝑖 , the parameter 𝜌𝑖 is the learning rate of the 𝑖𝑡ℎ view, and 𝜌𝑖 > 0. We also define a termination criterion 𝜉 such that the MUEKL stops iterating when ‖𝐿𝑖+1 − 𝐿𝑖 ‖2 ≤ 𝜉.

(9)

[𝑅𝑒𝑚𝑝 + 𝑐𝑖 𝑅𝑟𝑒𝑔 + 𝑢𝑖 𝑅𝑢𝑛𝑖 ] + 𝛾𝑅𝐼𝐹 𝑆𝐿 ,

𝑖=1

where 𝑢𝑖 is the regularization parameter that controls the impact on the regularization 𝑅𝑢𝑛𝑖 . Combining Eqs. (4), (5), (6), and (8), the optimization function of the proposed MUEKL becomes 𝑚𝑖𝑛𝐿 =

Finally, for the sample 𝒛, the decision function of MUEKL by using the obtained 𝝎𝑖 , 𝑖 = 1, … , 𝑀 is defined as

𝑀 ∑ ̃ 𝑇𝑖 𝝎 ̃𝑖+ [(𝒀𝑖 𝝎𝑖 − 𝟏𝑁×1 − 𝒃𝑖 )𝑇 (𝒀𝑖 𝝎𝑖 − 𝟏𝑁×1 − 𝒃𝑖 ) + 𝑐𝑖 𝝎 𝑖=1 𝑀 ∑

𝑢𝑖 (𝒀𝑖∗ 𝝎𝑖 )𝑇 (𝒀𝑖∗ 𝝎𝑖 )] + 𝜆

(𝒀𝑖 𝝎𝑖 −

𝑖=1

1 𝑀

𝑀 ∑

𝒀𝑗 𝝎𝑗 )𝑇 (𝒀𝑖 𝝎𝑖 −

𝑗=1;𝑗≠𝑖

1 𝑀

𝑀 ∑

𝐿(𝒛) =

𝑀 1 ∑ 𝑇 𝝎 [(𝛷𝑖𝑒 (𝒛))𝑇 , 1] 𝑀 𝑖=1 𝑖

{

>0 <0

𝑡ℎ𝑒𝑛 𝑡ℎ𝑒𝑛

𝒛 ∈ 𝑐𝑙𝑎𝑠𝑠 + 1 𝒛 ∈ 𝑐𝑙𝑎𝑠𝑠 − 1.

(15)

The algorithm of MUEKL is summarized in Table 1.

𝒀𝑗 𝝎𝑗 ).

𝑗=1;𝑗≠𝑖

(10) 3.4. Computational complexity of the MUEKL algorithm

Each 𝝎𝑖 can be optimized with a heuristic gradient descent of 𝐿. Then, taking the gradient of Eq. (10) with respect to 𝝎𝑖 , we obtain 𝑑𝐿 as 𝑑𝜔 𝑑𝐿 ̃ 𝑖 + 𝑢𝑖 (𝒀𝑖∗ )𝑇 𝒀𝑖∗ 𝝎𝑖 + = 𝒀𝑖𝑇 (𝒀𝑖 𝝎𝑖 − 𝟏𝑁×1 − 𝒃𝑖 ) + 𝑐𝑖 𝝎 𝑑𝜔 𝑀 1 ∑ 𝜆 𝑇 𝒀𝑖 )(𝒀𝑖 𝝎𝑖 − (𝜆𝒀𝑖𝑇 − 𝒀𝝎) 𝑀 𝑀 𝑗=1;𝑗≠𝑖 𝑗 𝑗

In MUEKL, to first decompose the kernel matrix 𝑲𝑁×𝑁 according to Eq. (1), the computational complexity is 𝑂(𝑀 × 𝑛3 ) where 𝑀 is the number of kernels and 𝑛 indicates the number of samples. Then, the empirical kernel mapping is performed according to Eq. (2), and its computational complexity is 𝑂(𝑀 × 𝑑 3 ) where 𝑑 represents the mapped sample dimension. Iteratively solving 𝜔 has a computational complexity of 𝑂(𝑀 × 𝑡 × 𝑛 × 𝑑) where 𝑡 is the number of iterations. Therefore, the computational complexity of MUEKL is

(11)

Finally, setting Eq. (11) to zero, we obtain 𝝎𝑖 = [(1 + 𝜆

𝑀 𝑀 −1 𝑇 𝑀 −1 ∑ )𝒀𝑖 𝒀𝑖 + 𝑐𝑖 𝑰̂ + 𝑢𝑖 (𝒀𝑖∗ )𝑇 𝒀𝑖∗ ]−1 𝒀𝑖𝑇 (𝒃𝑖 + 𝟏𝑁×1 + 𝜆 𝒀𝝎) 𝑀 𝑀 2 𝑗=1;𝑗≠𝑖 𝑗 𝑗

𝑂(𝑀 × 𝑛3 + 𝑀 × 𝑑 3 + 𝑀 × 𝑡 × 𝑛 × 𝑑).

(12)

Of course the proposed algorithm still needs to take into consideration its own computational complexity. However, the main focus of this paper is on the improvement of the classification accuracy obtained by these approaches. Reducing the complexity of the algorithm is an area that must be explored in future work.

where 𝑰̂ is a diagonal matrix with full 1𝑠 except the last element is set to zero. Similarly, continuing to set the gradient of Eq. (10) with respect to 𝒃𝑖 to zero, we see that 𝑑𝐿 = 𝒀𝑖 𝝎𝑖 − 𝟏𝑁×1 − 𝒃𝑖 = 0 𝑑𝑏

(16)

(13) 4

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461 Table 1 Algorithm of MUEKL. Input: {(𝒙𝑖 , 𝜑𝑖 )}𝑁 , 𝜑𝑖 ∈ {+1, −1} and 𝑀 kernels {𝑘𝑒𝑟𝑙 (𝒙𝑖 , 𝒙𝑗 )}𝑀 . 𝑖=1 𝑙=1 Output: The augmented weight vectors 𝝎𝑙 . 𝑒 1. Explicitly map {𝒙𝑖 }𝑁 into {𝛷1𝑒 (𝒙𝑖 ), … , 𝛷𝑖𝑒 (𝒙𝑖 ), … , 𝛷𝑀 (𝒙𝑖 )}𝑀 by 𝑀 kernels as shown in Eq. (2). 𝑖=1 𝑖=1

2. For the 𝑖𝑡ℎ kernel space, map the dataset 𝒀𝑖 = [(𝛷𝑖𝑒 (𝒙1 ), 𝜑1 ), … , (𝛷𝑖𝑒 (𝒙𝑁 ), 𝜑𝑁 )], 𝑖 = 2, … , 𝑀. 3. For the 𝑖𝑡ℎ kernel space, generate Universum samples with data 𝒀𝑖 according to Eq. (7), 𝑖 = 1, … , 𝑀. The Universum samples are referred to as 𝒀𝑖∗ . 4. Initialize 𝜌𝑖 > 0, 𝜉 > 0, 𝑐𝑖 ≥ 0, 𝑢𝑖 > 0, 𝒃1𝑖 ≥ 0, and 𝝎1𝑖 , 𝑖 = 1...𝑀. Set the iteration index 𝑝 = 1. 5. Calculate the 𝝎𝑝𝑖 according to Eq. (12). 6. 𝒆𝑝𝑖 = 𝒀𝑖 𝝎𝑖 − 𝟏𝑁×1 − 𝒃𝑖 , 𝑖 = 1, … , 𝑀. 7. 𝒃𝑝+1 = 𝒃𝑝𝑖 + 𝜌𝑖 (𝒆𝑝𝑖 + |𝒆𝑝𝑖 |), 𝑖 = 1, … , 𝑀. 𝑖 8. Iteratively calculate 𝝎𝑝+1 according to Eq. (12). 𝑖 If ‖𝐿𝑖+1 − 𝐿𝑖 ‖2 ≤ 𝜉, then 𝑝 = 𝑝 + 1 and go to Step 6; else stop.

4. Experiments

4.2. Experiments on artificial imbalanced dataset In order to illustrate the process of the proposed algorithms more clearly, the visualized data is used. We present the process of Universum learning on the artificial dataset and the effect of the Universum samples on classification boundary. Fig. 2 shows the classification process of Universum learning on the artificial imbalanced dataset. The dataset in the four figures are moontype distribution. In these sub-figures, red represents minority samples and blue denotes majority samples. The imbalanced ratio (IR) is 4. From sub-figures (b) to (d), the effect of Universum learning on the classification hyperplane is shown. In these figures, the classification boundary of original MultiK-MHKS bias to the minority class, thus giving disadvantages for the recognition of minority samples. However, the classification hyperplane of MUEKL-1 and MUEKL-2 are more biased to the majority samples. The reason is that the Universum samples perturb the classification hyperplane. In detailed, the original sample 𝒙𝑖 is mapped to 𝑀 feature spaces by 𝑀 RBF kernels. Fig. 2 shows the samples of 𝑖𝑡ℎ kernel space. According to sub-figure (b), the unlabeled Universum samples with black star generated by the IMU method from the positive and negative samples. The number is 25. In sub-figure (c), the classification boundary generated from the original frame MultiK-MHKS. It can be seen that the positive samples around the black stars are not correctly classified. In subfigure (d), the classification boundaries generated by the MUEKL with different regularization parameter. The larger the parameter value, the more biased toward the majority. So the classification hyperplane of MUEKL-1 and MUEKL-2 are more biased to the majority samples. In summary, the Universum samples can perturb the generation of classification boundary, which can better classify imbalanced data.

The proposed MUEKL can solve the balanced and imbalanced problems, and to validate its effectiveness, several experiments on UCI (Bache and Lichman, 2013) and KEEL datasets are conducted. Six representative classifiers were selected for comparison with the proposed MUEKL. Comparison algorithms include USVM, MultiK-MHKS, SimpleMKL, NLMKSVM, SVM and EasyMKL (Aiolli and Donini, 2015). The reason we chose these classifiers is as follows: The core idea of the USVM algorithm is to introduce Universum learning into the SVM framework, which is similar to the starting point of MUEKL. MultiK-MHKS is a classic and efficient multi-kernel algorithm, and the proposed algorithm is based on the MultiK-MHKS model framework. Therefore, the base algorithm needs to be compared to reflect the effectiveness of the proposed MUEKL. SimpleMKL, NLMKSVM and EasyMKL are all efficient algorithms based on multi-kernel, and SVM is considered to be the classic classifier in the field of pattern recognition. The related experimental settings are reported in Section 4.1. Section 4.2 shows the experiments on artificial imbalanced dataset, and Sections 4.3 and 4.4 show the experimental results on the KEEL and UCI datasets. The experiments on the Universum samples are overviewed in Section 4.5. Finally, further discussion on the characteristics of MUEKL is included in Section 4.6.

4.1. Experimental settings

In the experiments, we compare our proposed MUEKL with USVM, MultiK-MHKS, SimpleMKL, NLMKSVM, SVM, and EasyMKL on 38 KEEL datasets and 28 UCI datasets. We select the Radial Basis Function (RBF) ‖𝒙 −𝒙 ‖2 kernel 𝑘𝑒𝑟(𝒙𝑖 , 𝒙𝑗 ) = 𝑒𝑥𝑝(− 𝑖2𝜎 2𝑗 ) with different kernel parameters ∑ 2 𝜎 2 = 𝛽 ⋅ 𝑁12 𝑁 𝑖,𝑗=1 ‖𝒙𝑖 − 𝒙𝑗 ‖ as the candidate kernels, where 𝛽 ∈ −2 −1 [10 , 10 , 1, 10, 102 ] and 𝑁 is the number of the samples used to generate the feature space. For convenient comparison, all comparing kernel algorithms use the RBF kernel (Vatankhah et al., 2013) and its parameter is the same as the MUEKL algorithm. We simply set the number of the kernel 𝑀 to 3 in proposed MUEKL. The 𝜆, 𝑐, and 𝑢 are chosen from {10−2 , 10−1 , 1, 10, 102 }. The proposed IMU method that generates the Universum samples are applied to MUEKL algorithm. Tables 3 and 5 present the MUEKL results under the IMU method. The parameter 𝑘 is set to 5 when we generate the Universum samples such that there are at most 25 Universum samples in a binary classifier. A one-against-one strategy is adopted to decompose a multiclass problem into multiple binary class sub-problems, and the records are obtained by the best combination of these parameters chosen by 5-𝑓 𝑜𝑙𝑑 cross-validation (Georgios and Markatou, 2019). All algorithms are implemented on an Inter(R) Xeon(R) Core with 1.80 GHz, 8 G RAM, and a Windows 7 operating system within a MATLAB environment.

4.3. Experiments on the KEEL datasets In this section, we verify the performance of the proposed MUEKL on 38 KEEL datasets. The descriptions of the KEEL datasets are listed in Table 2, including the dimension, Imbalanced Ratio (IR), and size of each dataset. To validate the effectiveness of MUEKL in the imbalanced problem, we select the datasets with IR ranges from 0.54 to 43.80. Considering the characteristics of the imbalanced problem, the Average Accuracy (AAcc) (He and Garcia, 2009) is employed to evaluate the classification performance, which is defined as 1 + 𝑇 𝑃𝑟𝑎𝑡𝑒 − 𝐹 𝑃𝑟𝑎𝑡𝑒 , (17) 2 where 𝑇 𝑃𝑟𝑎𝑡𝑒 and 𝐹 𝑃𝑟𝑎𝑡𝑒 are the percentage of the positive samples classified correctly and the negative samples misclassified.

𝐴𝐴𝑐𝑐 =

4.3.1. Results of the KEEL datasets Table 3 shows the classification accuracy, the standard deviation of the proposed MUEKL, and the compared algorithms in detail. The best 5

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Fig. 2. The Universum learning process and the effect of Universum samples on classification boundary. (In these sub-figures, red represents minority samples and blue denotes majority samples. The original sample 𝑥𝑖 is mapped to 𝑀 feature spaces by 𝑀 RBF kernels. These sub-figures show the samples of 𝑖𝑡ℎ kernel space. According to sub-figure (b), the unlabeled Universum samples with black star generated by the IMU method from the positive and negative samples. In sub-figure (c), the classification boundary generated from the original frame MultiK-MHKS. It can be seen that the positive samples around the black stars are not correctly classified. In sub-figure (d), the classification boundaries generated by the MUEKL with different regularization parameter. The larger the parameter value, the more biased toward the majority. So the classification hyperplane of MUEKL-1 and MUEKL-2 are more biased to the majority samples, which can better classify imbalanced data.) (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2 KEEL datasets description. Dataset

Attributes

IR

Instances

Dataset

Attributes

IR

Instances

ecoli_01 glass1 wisconsin pima haberman vehicle2 vehicle1 vehicle3 vehicle0 ecoli1 new_thyroid1 new_thyroid2 ecoli2 segment0 glass6 yeast3 ecoli-0345 ecoli-02345 ecoli-0465

7 9 9 8 3 18 18 18 18 7 5 5 7 19 9 8 7 7 6

0.54 1.85 1.86 1.87 2.81 2.89 2.91 3.00 3.25 3.39 5.14 5.14 5.54 6.02 6.43 8.13 9.00 9.06 9.13

220 214 683 768 306 846 846 846 846 336 215 215 336 2308 214 1484 200 202 203

yeast-02563789 yeast-02579368 ecoli-03465 ecoli-034756 ecoli-01235 ecoli-06735 glass-045 ecoli-026735 vowel0 ecoli-0675 ecoli-01472356 ecoli-015 ecoli-014756 ecoli-01465 shuttle_c0c4 page_blocks134 abalone9_18 shuttle_c2c4 ecoli_013726

8 8 7 7 7 7 9 7 13 6 7 6 6 6 9 10 8 9 7

9.16 9.16 9.25 9.25 9.26 9.41 9.43 9.53 9.97 10.00 10.65 11.00 12.25 13.00 13.78 16.14 16.70 24.75 43.80

1004 1004 205 257 244 222 92 224 988 220 336 240 332 280 1829 472 731 129 281

6

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Table 3 Test result of all used comparison algorithms on the KEEL datasets are listed in the table. (The best result on each dataset is written in bold.) Dataset

MUEKL AAcc ± std (%)

USVM AAcc ± std (%)

MultiK-MHKS AAcc ± std (%)

SimpleMKL AAcc ± std (%)

NLMKSVM AAcc ± std (%)

SVM AAcc ± std (%)

EasyMKL AAcc ± std (%)

ecoli_01 glass1 wisconsin pima haberman vehicle2 vehicle1 vehicle3 vehicle0 ecoli1 new_thyroid1 new_thyroid2 ecoli2 segment0 glass6 yeast3 ecoli-0345 ecoli-02345 ecoli-0465 yeast02563789 yeast02579368 ecoli-03465 ecoli-034756 ecoli-01235 ecoli-06735 glass-045 ecoli-026735 vowel0 ecoli-0675 ecoli01472356 ecoli-015 ecoli-014756 ecoli-01465 shuttle_c0c4 page_blocks134 abalone9_18 shuttle_c2c4 ecoli013726

𝟗𝟖.𝟔𝟕 ± 𝟏.𝟖𝟑 𝟕𝟗.𝟔𝟔 ± 𝟕.𝟒𝟏 𝟗𝟕.𝟗𝟗 ± 𝟎.𝟔𝟏 𝟕𝟑.𝟎𝟑 ± 𝟑.𝟏𝟏 𝟔𝟒.𝟐𝟕 ± 𝟒.𝟑𝟓 𝟗𝟗.𝟏𝟓 ± 𝟎.𝟔𝟖 𝟕𝟕.𝟒𝟑 ± 𝟒.𝟏𝟓 𝟕𝟔.𝟒𝟕 ± 𝟒.𝟖𝟏 𝟗𝟗.𝟏𝟖 ± 𝟎.𝟔𝟔 𝟗𝟎.𝟒𝟖 ± 𝟔.𝟐𝟗 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟗𝟒.𝟑𝟏 ± 𝟒.𝟒𝟕 𝟗𝟗.𝟐𝟐 ± 𝟎.𝟗𝟎 93.06 ± 7.08 87.72 ± 2.18 92.22 ± 11.73 93.89 ± 7.89 𝟗𝟐.𝟐𝟑 ± 𝟏𝟎.𝟗𝟕 𝟖𝟎.𝟑𝟗 ± 𝟒.𝟖𝟒 𝟗𝟎.𝟔𝟓 ± 𝟒.𝟔𝟖 91.96 ± 6.76 𝟗𝟒.𝟒𝟗 ± 𝟓.𝟐𝟎 𝟗𝟎.𝟔𝟖 ± 𝟏𝟕.𝟔𝟕 𝟖𝟗.𝟓𝟎 ± 𝟏𝟔.𝟗𝟕 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 86.51 ± 11.90 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟗𝟏.𝟕𝟓 ± 𝟕.𝟎𝟓 𝟖𝟖.𝟎𝟎 ± 𝟕.𝟐𝟖 𝟗𝟏.𝟓𝟗 ± 𝟏𝟎.𝟕𝟕 𝟗𝟏.𝟓𝟏 ± 𝟒.𝟕𝟔 90.00 ± 13.69 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 84.21 ± 19.45 𝟕𝟓.𝟎𝟔 ± 𝟏𝟐.𝟓𝟑 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 85.00 ± 22.36

97.29 ± 2.50 67.64 ± 4.64 97.09 ± 1.77 70.16 ± 5.63 62.84 ± 4.56 83.59 ± 1.52 62.37 ± 5.14 65.08 ± 3.32 81.28 ± 6.51 87.16 ± 5.03 96.03 ± 3.70 94.37 ± 4.49 88.78 ± 5.23 89.02 ± 3.74 90.63 ± 6.33 𝟖𝟗.𝟔𝟎 ± 𝟐.𝟏𝟐 91.11 ± 11.65 88.93 ± 10.36 89.19 ± 11.15 78.46 ± 5.17 88.25 ± 4.07 88.24 ± 7.94 88.40 ± 11.89 85.27 ± 14.45 86.00 ± 16.62 88.13 ± 16.59 78.82 ± 12.03 93.61 ± 3.63 87.50 ± 7.55 86.73 ± 9.40 88.41 ± 9.62 88.13 ± 4.05 89.62 ± 11.34 99.77 ± 0.27 71.49 ± 16.64 69.92 ± 12.75 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟗𝟐.𝟖𝟖 ± 𝟏.𝟔𝟑

98.31 ± 1.67 76.11 ± 5.22 97.09 ± 1.28 72.07 ± 3.18 61.11 ± 8.14 96.54 ± 2.02 72.69 ± 4.74 71.74 ± 5.29 96.08 ± 1.12 88.94 ± 5.90 98.29 ± 3.82 98.57 ± 3.19 90.96 ± 3.29 99.04 ± 0.98 91.40 ± 5.92 85.70 ± 1.93 91.94 ± 11.88 89.44 ± 10.70 89.46 ± 10.24 71.91 ± 3.76 88.95 ± 4.78 89.46 ± 5.17 91.35 ± 8.26 87.82 ± 15.11 88.00 ± 21.68 100.00 ± 0.00 𝟗𝟎.𝟐𝟓 ± 𝟓.𝟔𝟐 100.00 ± 0.00 90.00 ± 5.59 87.52 ± 4.01 89.55 ± 10.27 91.84 ± 8.42 89.81 ± 16.16 99.97 ± 0.07 74.66 ± 11.79 73.81 ± 8.61 94.58 ± 10.98 84.81 ± 22.47

97.64 ± 1.48 71.95 ± 7.20 96.88 ± 0.96 69.51 ± 1.17 59.52 ± 8.38 68.09 ± 13.00 62.96 ± 5.35 53.11 ± 2.46 59.31 ± 18.45 89.25 ± 4.94 95.44 ± 3.57 96.87 ± 6.26 92.05 ± 2.06 77.57 ± 3.37 90.87 ± 9.01 89.52 ± 1.82 𝟗𝟑.𝟖𝟗 ± 𝟕.𝟕𝟕 𝟗𝟑.𝟗𝟎 ± 𝟏𝟏.𝟑𝟑 91.42 ± 7.30 74.16 ± 4.24 89.28 ± 3.97 90.88 ± 6.04 77.78 ± 10.66 82.50 ± 19.84 74.00 ± 19.89 85.00 ± 22.36 75.00 ± 9.52 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 87.00 ± 15.07 68.01 ± 12.56 85.00 ± 10.49 63.84 ± 11.66 𝟗𝟑.𝟔𝟓 ± 𝟕.𝟒𝟒 99.97 ± 0.07 70.99 ± 13.99 65.63 ± 17.35 95.00 ± 11.18 84.45 ± 22.16

98.31 ± 1.67 74.38 ± 10.92 97.09 ± 1.28 70.74 ± 2.19 58.80 ± 7.53 72.47 ± 12.73 62.96 ± 5.35 53.11 ± 2.44 59.67 ± 19.26 88.39 ± 4.63 96.87 ± 3.70 98.29 ± 3.10 92.14 ± 2.14 94.77 ± 1.98 90.86 ± 9.01 84.60 ± 3.14 𝟗𝟑.𝟖𝟗 ± 𝟕.𝟕𝟕 𝟗𝟑.𝟗𝟎 ± 𝟏𝟏.𝟑𝟑 91.42 ± 7.30 74.72 ± 4.30 89.51 ± 3.88 90.88 ± 6.04 87.57 ± 4.26 82.50 ± 19.84 79.00 ± 21.33 90.00 ± 13.69 83.00 ± 14.51 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 87.50 ± 8.84 76.85 ± 8.60 85.00 ± 10.46 79.68 ± 10.21 𝟗𝟑.𝟔𝟓 ± 𝟕.𝟒𝟒 99.97 ± 0.07 70.99 ± 13.99 64.78 ± 8.58 95.00 ± 11.18 84.45 ± 22.16

96.23 ± 4.05 75.54 ± 11.50 97.20 ± 1.16 69.74 ± 2.98 60.97 ± 11.01 81.35 ± 7.80 63.24 ± 6.76 66.24 ± 2.34 88.19 ± 3.65 88.24 ± 3.93 94.64 ± 4.40 96.63 ± 3.84 91.78 ± 3.86 87.23 ± 2.97 91.76 ± 6.74 87.55 ± 3.36 90.00 ± 10.64 90.29 ± 11.40 90.34 ± 11.88 78.90 ± 4.23 87.92 ± 1.71 90.61 ± 6.25 90.12 ± 7.82 87.09 ± 9.53 84.50 ± 11.20 94.41 ± 10.93 82.55 ± 13.59 94.77 ± 3.18 86.00 ± 8.54 83.83 ± 7.85 91.14 ± 6.60 89.75 ± 2.64 89.62 ± 11.10 99.82 ± 0.12 78.51 ± 16.22 65.07 ± 7.02 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 82.86 ± 19.42

94.79 ± 2.6 71.09 ± 5.27 91.38 ± 1.54 62.61 ± 4.20 56.85 ± 4.98 94.29 ± 1.69 64.85 ± 4.28 62.54 ± 4.23 92.41 ± 1.11 88.85 ± 2.91 94.05 ± 6.93 97.22 ± 2.91 85.75 ± 2.81 96.84 ± 2.73 𝟗𝟑.𝟔𝟓 ± 𝟒.𝟎𝟖 84.22 ± 5.39 91.67 ± 5.76 86.13 ± 10.58 84.18 ± 8.43 74.45 ± 5.79 87.82 ± 3.14 𝟗𝟑.𝟗𝟐 ± 𝟓.𝟐𝟔 74.27 ± 15.28 86.64 ± 13.85 83.75 ± 10.69 99.41 ± 1.18 84.60 ± 9.86 99.44 ± 2.35 90.75 ± 8.68 82.78 ± 9.28 84.77 ± 12.67 75.72 ± 16.99 81.92 ± 18.43 99.94 ± 0.12 𝟗𝟑.𝟕𝟐 ± 𝟑.𝟖𝟒 65.55 ± 11.37 99.60 ± 0.80 84.08 ± 19.79

Average

𝟗𝟎.𝟐𝟕 ± 𝟔.𝟒𝟓

85.31 ± 6.97

88.15 ± 6.66

81.89 ± 8.80

83.89 ± 8.08

85.91 ± 6.90

85.17 ± 6.63

result from each dataset is highlighted in 𝐁𝐎𝐋𝐃. The average AAcc over all datasets on each algorithm is respectively listed in the last one row of the table. As seen in Table 3, the proposed MUEKL performs best in 29 datasets of the 38 KEEL datasets, ranks second in 5 datasets, and ranked third in 4 datasets. MUEKL achieves 90.27% for the average AAcc. Since the proposed MUEKL generates Universum samples and introduces a regularization term to adjust the classifier boundary, it is expected to behave better than MultiK-MHKS and USVM. The average AAcc of the basic framework, MultiK-MHKS, is 88.15%, which is 2.12% less than that of MUEKL. USVM and NLMKSVM reach 85.31% and 83.89%, respectively, which were 4.96% and 6.38% lower than MUEKL. USVM is similar to SVM in performance because USVM introduces the Universum samples directly instead of through regularization. SimpleMKL behaves the worst in average AAcc. In summary, MUEKL behaves superior both with a low and high imbalanced ratio.

define that two classifiers whose mean difference is more than 1% are non-equivalent. So, the parameter 𝑞 is set in 1% in this analysis. From Fig. 3, it appears that the proposed MUEKL achieves significant advantages in AAcc, and it performs better than all other methods with 𝑞 = 1%. This result as well as the results from Table 3 suggest that the proposed MUEKL algorithm performs much better than all other comparison algorithms. 4.4. Experiments on the UCI datasets In this section, we verify the performance of the proposed MUEKL on 28 UCI datasets with a description of the UCI datasets listed in Table 4, including the dimension, classes, and size of the datasets. 4.4.1. Results of the UCI datasets Table 5 details the classification accuracy (ACC), the standard deviation of the proposed MUEKL, and the comparing algorithms. The best result for each dataset is highlighted in 𝐁𝐎𝐋𝐃. Furthermore, the average classification accuracies of each algorithm are respectively listed in the last row of the table. As shown in Table 5, the proposed MUEKL is superior as it behaves best in 18 of 28 UCI datasets. MUEKL achieves 88.27% for the average ACC, ranking first. The average ACC of the basic framework, MultiKMHKS, is 86.81%, which is 1.46% less than that of MUEKL. MUEKL also demonstrates a 10.03% increase in average accuracy over USVM as well as better performance than SimpleMKL, NLMKSVM, SVM, and EasyMKL.

4.3.2. Bayesian analysis of the KEEL datasets To further compare the performance of these classifiers on the KEEL datasets, Bayesian analysis (Benavoli et al., 2016) is considered here. Bayesian analysis is a conventional algorithm performance estimation method for machine learning classification tasks, which can take both magnitude and uncertainty into account (Benavoli et al., 2016). The Bayesian analysis assumes the difference between two estimated algorithms in a certain metric to be a normal distribution. In Fig. 3, the probability matrices obtained from a Bayesian signed rank test is shown, which indicates the probabilities that the difference of two methods is more than 𝑞. In classification, it is common to 7

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Fig. 3. The probability matrices obtained from a Bayesian signed rank test on the KEEL datasets. The value in row 𝑖𝑡ℎ and column 𝑗𝑡ℎ means the probability that 𝑚𝑒𝑡ℎ𝑜𝑑𝑖𝑡ℎ exceeds 𝑚𝑒𝑡ℎ𝑜𝑑𝑗𝑡ℎ with 𝑞 for the specific metric.

Table 4 Description of each UCI dataset involved in performance verification. Dataset

Attributes

Classes

Instances

Breast Cancer Wisconsin (BCW) Haberman House Votes (HV) Ionosphere Iris Blood Transfusion (BT) Pima Seeds Clean Sonar Water Wine Glass Housing Mammographic masses(Mam) Secom Semeion Spambase Dermatology Cmc TTT Segmentation Arrhythmia Page_blocks Vertebralcolumn Waveform Ecoli CNAE−9

9 3 16 33 4 4 8 7 166 60 38 12 9 13 6 590 256 57 34 9 9 18 279 10 6 21 7 856

2 2 2 2 3 2 2 3 2 2 2 3 6 2 2 2 10 2 6 3 2 7 13 5 2 3 8 9

683 306 435 234 150 748 512 210 476 208 116 178 214 506 961 1567 1593 4601 366 1473 958 2310 452 5472 310 5000 336 1080

Table 5 suggest that the proposed MUEKL algorithm performs much better than all other comparison algorithms. 4.5. Experiments on the Universum samples 4.5.1. Comparison of Universum sample generation As discussed above, we select In-Between Universum (IBU) samples by first calculating the distance between positive and negative samples. Second, we select 𝑘 nearest points. For example, if we select 5 positive samples and 5 negative samples, then 𝑘 is equal to 5. So, we can obtain 25 Universum samples on average from each positive and negative sample. The IBU is defined as 𝒙𝑖 + 𝒙𝑗

, (18) 2 where 𝒙𝑖 is a positive sample, 𝒙𝑗 is a negative sample, and 𝒙𝑢 denotes the corresponding Universum sample. 𝒙𝑢 =

4.5.2. Results of the Universum samples The proposed generation method of IMU for the Universum samples considers the characteristics of the imbalanced problem. To validate the effectiveness of the proposed IMU, we compare the MUEKL results with the IBU and IMU methods. As seen in Table 6, IMU is better than IBU on 31 of the 38 KEEL datasets. Also, the average AAcc of MUEKL with IMU reaches 90.27% and IBU reaches 89.52% representing a 0.75% increase over IBU. The generation of Universum samples plays an essential role in performance as the proposed IMU is superior because it introduces the imbalanced characteristics of the Universum samples. This result demonstrates the validity of the proposed generation method of IMU. In addition, MUEKL with IBU has approximately a 1.37% increase in performance compared with the basic framework, MultiK-MHKS. In summary, the Universum sample generation method has a good effect on the classification of imbalanced data. As shown in Table 7, IMU is similar to IBU in performance because when the data is balanced, IMU degrades to IBU. IMU performs the same as IBU on 6 UCI datasets with an average accuracy for IMU reaching 88.27% and IBU reaching 87.81%. IMU performs better

4.4.2. Bayesian analysis of the UCI datasets To further compare the performance of these classifiers on the UCI datasets, Bayesian analysis (Benavoli et al., 2016) is considered here. In Fig. 4, the probability matrices obtained from a Bayesian signed rank test is shown with the parameter 𝑞 set to 1%. From Fig. 4, it appears that the proposed MUEKL achieves significant advantages in ACC and is better than all other methods with 𝑞 = 1%. MultiK-MHKS also result in good performances for this metric. This result as well as the results from 8

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Fig. 4. Probability matrices obtained from a Bayesian signed rank test on the UCI datasets. The value in row 𝑖𝑡ℎ and column 𝑗𝑡ℎ represents the probability that 𝑚𝑒𝑡ℎ𝑜𝑑𝑖𝑡ℎ exceeds 𝑚𝑒𝑡ℎ𝑜𝑑𝑗𝑡ℎ with 𝑞 for the corresponding metric.

Table 5 Test result of all used comparison algorithms on the UCI datasets are listed in the table. (The best result on each dataset is written in bold.) Dataset

MUEKL ACC ± std (%)

USVM ACC ± std (%)

MultiK-MHKS ACC ± std (%)

SimpleMKL ACC ± std (%)

NLMKSVM ACC ± std (%)

SVM ACC ± std (%)

EasyMKL ACC ± std (%)

BCW Haberman HV Ionosphere Iris BT Pima Seeds Clean Sonar Water Wine Glass Housing Mam Secom Semeion Spambase Dermatology Cmc TTT Segmentation Arrhythmia Page blocks Vertebralcolumn Waveform Ecoli CNAE−9

𝟗𝟕.𝟓𝟏 ± 𝟎.𝟒𝟑 𝟕𝟕.𝟏𝟒 ± 𝟑.𝟓𝟐 94.46 ± 2.93 𝟗𝟒.𝟒𝟓 ± 𝟏.𝟏𝟐 𝟗𝟖.𝟎𝟎 ± 𝟏.𝟖𝟑 𝟕𝟗.𝟒𝟎 ± 𝟑.𝟐𝟎 𝟕𝟔.𝟓𝟑 ± 𝟑.𝟓𝟑 𝟗𝟓.𝟐𝟒 ± 𝟒.𝟒𝟓 𝟗𝟕.𝟎𝟕 ± 𝟏.𝟕𝟎 𝟖𝟖.𝟗𝟓 ± 𝟐.𝟕𝟑 95.69 ± 4.35 97.76 ± 2.44 𝟕𝟔.𝟖𝟏 ± 𝟑.𝟒𝟐 𝟗𝟑.𝟒𝟖 ± 𝟏.𝟔𝟕 𝟓𝟗.𝟓𝟐 ± 𝟏.𝟐𝟐 𝟗𝟑.𝟑𝟕 ± 𝟎.𝟓𝟎 𝟗𝟓.𝟔𝟏 ± 𝟎.𝟗𝟏 89.18 ± 0.72 97.29 ± 1.91 𝟓𝟕.𝟏𝟏 ± 𝟏.𝟔𝟔 98.85 ± 0.68 𝟗𝟖.𝟏𝟖 ± 𝟎.𝟕𝟕 71.14 ± 5.80 93.06 ± 0.86 𝟖𝟔.𝟕𝟕 ± 𝟑.𝟓𝟎 87.02 ± 0.74 𝟖𝟕.𝟕𝟓 ± 𝟒.𝟏𝟑 94.35 ± 1.55

97.37 ± 1.20 74.19 ± 5.41 91.47 ± 3.78 85.46 ± 3.23 96.67 ± 2.36 68.46 ± 5.22 67.92 ± 3.68 92.38 ± 5.16 75.04 ± 6.66 74.51 ± 7.81 93.08 ± 5.85 89.27 ± 2.53 60.63 ± 11.70 66.20 ± 13.40 54.21 ± 2.71 62.51 ± 9.39 89.83 ± 1.29 72.08 ± 5.31 93.67 ± 2.24 49.30 ± 1.69 63.98 ± 1.59 90.91 ± 1.28 61.46 ± 6.10 80.50 ± 5.71 79.03 ± 7.57 85.36 ± 1.46 84.55 ± 5.09 90.65 ± 1.65

96.91 ± 0.43 74.85 ± 2.20 𝟗𝟒.𝟗𝟕 ± 𝟐.𝟗𝟑 88.80 ± 2.02 𝟗𝟖.𝟎𝟎 ± 𝟏.𝟖𝟑 76.61 ± 1.43 75.83 ± 3.17 94.29 ± 4.45 92.24 ± 1.91 88.43 ± 2.73 𝟗𝟔.𝟓𝟔 ± 𝟒.𝟑𝟓 96.11 ± 3.61 67.74 ± 4.42 93.08 ± 1.47 58.27 ± 1.22 𝟗𝟑.𝟑𝟕 ± 𝟎.𝟓𝟓 93.90 ± 0.91 83.50 ± 0.76 97.30 ± 2.33 55.54 ± 1.59 98.32 ± 1.09 97.06 ± 0.90 𝟕𝟏.𝟑𝟔 ± 𝟔.𝟑𝟒 92.66 ± 0.54 86.45 ± 3.14 𝟖𝟕.𝟎𝟒 ± 𝟎.𝟖𝟑 𝟖𝟕.𝟕𝟓 ± 𝟒.𝟏𝟑 93.80 ± 1.37

96.93 ± 0.58 73.53 ± 0.53 94.24 ± 3.41 93.15 ± 2.86 96.00 ± 5.48 63.09 ± 22.40 73.21 ± 2.22 95.24 ± 3.76 56.51 ± 0.28 82.57 ± 7.17 93.08 ± 5.85 97.17 ± 2.09 69.70 ± 5.90 93.08 ± 0.03 55.36 ± 3.70 93.37 ± 0.50 94.41 ± 1.24 83.88 ± 1.22 97.30 ± 1.91 55.95 ± 0.98 𝟗𝟗.𝟓𝟖 ± 𝟎.𝟒𝟒 97.32 ± 0.95 67.02 ± 4.97 96.13 ± 0.51 84.84 ± 1.84 86.86 ± 1.20 86.26 ± 4.98 𝟗𝟒.𝟑𝟓 ± 𝟎.𝟖𝟑

97.09 ± 1.10 73.53 ± 0.53 93.31 ± 3.84 92.71 ± 2.53 96.00 ± 5.48 74.76 ± 3.91 75.38 ± 2.12 92.86 ± 4.12 56.51 ± 0.28 84.52 ± 5.26 93.08 ± 5.85 𝟗𝟖.𝟑𝟓 ± 𝟏.𝟓𝟐 71.09 ± 5.86 93.08 ± 0.03 56.92 ± 3.18 93.37 ± 0.50 94.74 ± 1.07 86.66 ± 1.29 𝟗𝟕.𝟓𝟖 ± 𝟎.𝟗𝟗 54.87 ± 2.25 99.38 ± 0.68 97.36 ± 0.92 70.76 ± 4.72 𝟗𝟕.𝟎𝟒 ± 𝟎.𝟓𝟔 85.16 ± 1.35 86.90 ± 1.24 73.35 ± 6.76 84.26 ± 1.35

97.36 ± 0.69 74.18 ± 1.36 91.45 ± 4.30 90.97 ± 3.98 98.00 ± 2.98 76.21 ± 0.68 71.83 ± 3.88 90.48 ± 4.45 57.35 ± 0.59 85.57 ± 6.23 56.05 ± 1.05 94.23 ± 3.09 70.22 ± 5.94 93.08 ± 0.03 56.19 ± 4.60 93.37 ± 0.50 91.58 ± 1.07 70.55 ± 1.25 96.28 ± 2.65 52.69 ± 2.06 81.64 ± 2.32 91.39 ± 0.90 63.83 ± 2.89 90.35 ± 0.46 84.19 ± 2.10 85.58 ± 1.01 85.99 ± 3.11 92.78 ± 1.25

92.66 ± 3.14 68.61 ± 8.92 93.38 ± 3.47 82.47 ± 4.77 96.67 ± 3.27 65.87 ± 13.14 72.04 ± 3.7 89.05 ± 9.11 65.37 ± 11.66 64.15 ± 12.76 91.30 ± 6.15 93.05 ± 7.41 61.81 ± 10.55 91.11 ± 1.24 52.55 ± 1.34 80.00 ± 4.58 84.98 ± 1.74 𝟖𝟗.𝟒𝟔 ± 𝟖.𝟓𝟔 96.56 ± 2.2 52.87 ± 2.61 87.23 ± 13.93 82.21 ± 0.99 56.76 ± 1.6 91.21 ± 3.11 82.58 ± 12.47 86.00 ± 0.5 82.32 ± 4.78 77.13 ± 3.97

Average

𝟖𝟖.𝟐𝟕 ± 𝟐.𝟐𝟐

78.24 ± 4.68

86.81 ± 2.24

84.65 ± 3.14

84.67 ± 2.47

81.55 ± 2.34

79.62 ± 5.77

than IBU on 16 UCI datasets and worse on 6 UCI datasets. These experimental results show that the IMU is also more advantageous for the classification of balanced datasets. Also, MUEKL with IBU results in approximately an 1% increase in average ACC compared to basic algorithm, MultiK-MHKS, which again suggests the superiority of the proposed MUEKL.

4.6. Discussions 4.6.1. The impact on parameter u In our proposed MUEKL, we introduce the parameter 𝑢 into the framework as regularization parameter that controls the impact of the regularization 𝑅𝑢𝑛𝑖 . In this section, we study 𝑢 on the classification 9

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Table 6 Comparing IMU with IBU on the KEEL datasets. (The best result on each dataset is written in bold.) Dataset

IBU AAcc ± std (%)

IMU AAcc ± std (%)

Dataset

IBU AAcc ± std (%)

IMU AAcc ± std (%)

ecoli_01 glass1 wisconsin pima haberman vehicle2 vehicle1 vehicle3 vehicle0 ecoli1 new_thyroid1 new_thyroid2 ecoli2 segment0 glass6 yeast3 ecoli-0345 ecoli-02345 ecoli-0465

𝟗𝟖.𝟔𝟕 ± 𝟏.𝟖𝟑 78.99 ± 7.56 𝟗𝟕.𝟗𝟗 ± 𝟎.𝟔𝟏 𝟕𝟑.𝟐𝟔 ± 𝟑.𝟓𝟓 𝟔𝟒.𝟗𝟐 ± 𝟓.𝟓𝟕 98.85 ± 1.04 76.94 ± 3.46 𝟕𝟔.𝟕𝟎 ± 𝟒.𝟐𝟐 𝟗𝟗.𝟐𝟔 ± 𝟎.𝟔𝟗 𝟗𝟎.𝟒𝟖 ± 𝟔.𝟐𝟗 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟗𝟒.𝟒𝟗 ± 𝟒.𝟕𝟐 99.07 ± 0.96 𝟗𝟑.𝟎𝟔 ± 𝟕.𝟎𝟖 85.71 ± 1.83 89.72 ± 10.96 89.72 ± 10.56 87.50 ± 15.31

𝟗𝟖.𝟔𝟕 ± 𝟏.𝟖𝟑 𝟕𝟗.𝟔𝟔 ± 𝟕.𝟒𝟏 𝟗𝟕.𝟗𝟗 ± 𝟎.𝟔𝟏 73.03 ± 3.11 64.27 ± 4.35 𝟗𝟗.𝟏𝟓 ± 𝟎.𝟔𝟖 𝟕𝟕.𝟒𝟑 ± 𝟒.𝟏𝟓 76.47 ± 4.81 99.18 ± 0.66 𝟗𝟎.𝟒𝟖 ± 𝟔.𝟐𝟗 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 94.31 ± 4.47 𝟗𝟗.𝟐𝟐 ± 𝟎.𝟗𝟎 𝟗𝟑.𝟎𝟔 ± 𝟕.𝟎𝟖 𝟖𝟕.𝟕𝟐 ± 𝟐.𝟏𝟖 𝟗𝟐.𝟐𝟐 ± 𝟏𝟏.𝟕𝟑 𝟗𝟑.𝟖𝟗 ± 𝟕.𝟖𝟗 𝟗𝟐.𝟐𝟑 ± 𝟏𝟎.𝟗𝟕

yeast02563789 yeast02579368 ecoli-03465 ecoli-034756 ecoli-01235 ecoli-06735 glass-045 ecoli-026735 vowel0 ecoli-0675 ecoli01472356 ecoli-015 ecoli-014756 ecoli-01465 shuttle_c0c4 page_blocks134 abalone9_18 shuttle_c2c4 ecoli_013726

78.05 ± 2.66 89.28 ± 3.84 89.73 ± 5.47 92.00 ± 8.37 86.45 ± 14.01 89.25 ± 16.81 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 85.50 ± 10.67 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 87.50 ± 8.84 𝟖𝟗.𝟏𝟗 ± 𝟔.𝟐𝟔 89.77 ± 10.20 90.00 ± 7.07 87.31 ± 17.51 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟗𝟑.𝟔𝟔 ± 𝟏𝟓.𝟎𝟏 73.80 ± 13.20 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟖𝟓.𝟎𝟎 ± 𝟐𝟐.𝟑𝟔

𝟖𝟎.𝟑𝟗 ± 𝟒.𝟖𝟒 𝟗𝟎.𝟔𝟓 ± 𝟒.𝟔𝟖 𝟗𝟏.𝟗𝟔 ± 𝟔.𝟕𝟔 𝟗𝟒.𝟒𝟗 ± 𝟓.𝟐𝟎 𝟗𝟎.𝟔𝟖 ± 𝟏𝟕.𝟔𝟕 𝟖𝟗.𝟓𝟎 ± 𝟏𝟔.𝟗𝟕 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟖𝟔.𝟓𝟏 ± 𝟏𝟏.𝟗𝟎 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟗𝟏.𝟕𝟓 ± 𝟕.𝟎𝟓 88.00 ± 7.28 𝟗𝟏.𝟓𝟗 ± 𝟏𝟎.𝟕𝟕 𝟗𝟏.𝟓𝟏 ± 𝟒.𝟕𝟔 𝟗𝟎.𝟎𝟎 ± 𝟏𝟑.𝟔𝟗 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 84.21 ± 19.45 𝟕𝟓.𝟎𝟔 ± 𝟏𝟐.𝟓𝟑 𝟏𝟎𝟎.𝟎𝟎 ± 𝟎.𝟎𝟎 𝟖𝟓.𝟎𝟎 ± 𝟐𝟐.𝟑𝟔

89.52 ± 6.54

𝟗𝟎.𝟐𝟕 ± 𝟔.𝟒𝟓

The average AAcc of IBU and IMU on the KEEL datasets is:

Fig. 5. AAcc values (%) of MUEKL with the variation of parameter 𝑢 on the KEEL datasets: segment0, ecoli_01, glass_045, shuttle_𝑐0𝑐4, haberman, ecoli_013726, vehicle3 and glass1.

Fig. 6. ACC values (%) of MUEKL with the variation of parameter 𝑢 on the UCI datasets: Secom, Semeion, BCW, Water, Spambase, Haberman, BT and Pima.

performance of the MUEKL algorithm. In the experiments, 𝑢 is selected from {0.01, 0.1, 1, 10, 100} for each datasets while the other parameters are set to the optimal value determined by 5-fold cross validation. The classification accuracies of MUEKL with various 𝑢 is then determined for the selected datasets. So that the datasets evaluated are representative, we select the datasets that have a maximum or minimum of dimension, classes, IR, and instances as well as some datasets at random.

datasets, Secom, Semeion, BCW, Water, Spambase, Haberman, BT and Pima are selected. Fig. 5 shows the AAcc values of MUEKL with the variation of parameter 𝑢 on the KEEL datasets and Fig. 6 shows the ACC values on the UCI datasets. As seen in Figs. 5 and 6, the parameter 𝑢 affects performance. In Fig. 5, the parameter 𝑢 also plays an important role in performance. With an increase of 𝑢 ∈ [0.01, 0.1], the AAcc value does not fluctuate, and with an increase of 𝑢, the AAcc value on ecoli_01 constantly decreases while increases for haberman. The AAcc of most datasets does not fluctuate with varying 𝑢 ∈ [0.1, 10].

For the KEEL datasets, we select segment0, ecoli_01, glass_045, shuttle_𝑐0𝑐4, haberman, ecoli_013726, vehicle3 and glass1. For the UCI 10

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461

Table 7 Comparing IMU with IBU on the UCI datasets. (The best result on each dataset is written in bold.) Dataset

IBU ACC ± std (%)

IMU ACC ± std (%)

BCW Haberman HV Ionosphere Iris BT Pima Seeds Clean Sonar water Wine Glass Housing Mam Secom Semeion Spambase dermatology cmc TTT segmentation Arrhythmia Page_blocks Vertebralcolumn Waveform Ecoli CNAE−9

96.70 ± 2.48 77.13 ± 2.46 𝟗𝟒.𝟔𝟔 ± 𝟑.𝟖𝟒 92.01 ± 4.81 𝟗𝟖.𝟎𝟎 ± 𝟏.𝟖𝟑 78.46 ± 2.86 76.36 ± 3.51 𝟗𝟓.𝟐𝟒 ± 𝟒.𝟒𝟓 𝟗𝟕.𝟎𝟕 ± 𝟏.𝟕𝟎 𝟖𝟖.𝟗𝟖 ± 𝟏.𝟖𝟎 𝟗𝟓.𝟔𝟗 ± 𝟒.𝟑𝟓 𝟗𝟕.𝟕𝟔 ± 𝟐.𝟒𝟒 75.94 ± 4.24 𝟗𝟑.𝟔𝟖 ± 𝟐.𝟐𝟕 57.13 ± 2.29 92.21 ± 2.15 94.29 ± 0.71 89.13 ± 0.79 97.01 ± 1.77 56.29 ± 2.76 𝟗𝟖.𝟗𝟓 ± 𝟎.𝟖𝟑 97.27 ± 0.82 𝟕𝟑.𝟔𝟏 ± 𝟒.𝟓𝟐 92.31 ± 1.68 83.55 ± 15.82 𝟖𝟕.𝟎𝟐 ± 𝟏.𝟏𝟐 𝟖𝟖.𝟎𝟖 ± 𝟐𝟕.𝟗𝟑 94.26 ± 1.87

𝟗𝟕.𝟓𝟏 ± 𝟎.𝟒𝟑 𝟕𝟕.𝟏𝟒 ± 𝟑.𝟓𝟐 94.46 ± 2.93 𝟗𝟒.𝟒𝟓 ± 𝟏.𝟏𝟐 𝟗𝟖.𝟎𝟎 ± 𝟏.𝟖𝟑 𝟕𝟗.𝟒𝟎 ± 𝟑.𝟐𝟎 𝟕𝟔.𝟓𝟑 ± 𝟑.𝟓𝟑 𝟗𝟓.𝟐𝟒 ± 𝟒.𝟒𝟓 𝟗𝟕.𝟎𝟕 ± 𝟏.𝟕𝟎 88.95 ± 2.73 𝟗𝟓.𝟔𝟗 ± 𝟒.𝟑𝟓 𝟗𝟕.𝟕𝟔 ± 𝟐.𝟒𝟒 𝟕𝟔.𝟖𝟏 ± 𝟑.𝟒𝟐 93.48 ± 1.67 𝟓𝟗.𝟓𝟐 ± 𝟏.𝟐𝟐 𝟗𝟑.𝟑𝟕 ± 𝟎.𝟓𝟎 𝟗𝟓.𝟔𝟏 ± 𝟎.𝟗𝟏 𝟖𝟗.𝟏𝟖 ± 𝟎.𝟕𝟐 𝟗𝟕.𝟐𝟗 ± 𝟏.𝟗𝟏 𝟓𝟕.𝟏𝟏 ± 𝟏.𝟔𝟔 98.85 ± 0.68 𝟗𝟖.𝟏𝟖 ± 𝟎.𝟕𝟕 71.14 ± 5.80 𝟗𝟑.𝟎𝟔 ± 𝟎.𝟖𝟔 𝟖𝟔.𝟕𝟕 ± 𝟑.𝟓𝟎 𝟖𝟕.𝟎𝟐 ± 𝟎.𝟕𝟒 87.75 ± 4.13 𝟗𝟒.𝟑𝟓 ± 𝟏.𝟓𝟓

Average

87.81 ± 3.86

𝟖𝟖.𝟐𝟕 ± 𝟐.𝟐𝟐

Fig. 8. Convergence analysis for the UCI datasets.

5. Conclusions and future work In this study, we propose a efficient framework that combines the Universum learning with multiple empirical kernel learning to inherit the advantages of both techniques. The proposed MUEKL incorporates a novel method, Imbalanced Modified Universum (IMU), to generate more effective Universum samples by introducing the imbalanced ratio of data. Moreover, MUEKL not only solves balanced problems but also has superior classification performance on imbalanced problems. Finally, extensive experiments verify the effectiveness of the MUEKL and IMU. The Universum learning provides a broader perspective for the future study of imbalanced problems. In the future, we will design a novel and efficient kernel space construction method along with the analysis of the setting of multikernel weights. We will also study the multi-kernel classifiers and Universum generation methods with lower computational complexity without a loss of classification performance. Acknowledgments

Fig. 7. Convergence analysis for the KEEL datasets.

This work is supported by Natural Science Foundation of China under Grant No. 61672227, ‘‘Shuguang Program’’ supported by Shanghai Education Development Foundation, PR China and Shanghai Municipal Education Commission, PR China, Natural Science Foundations of China under Grant No. 61806078, National Science Foundation of China for Distinguished Young Scholars under Grant 61725301, National Major Scientific and Technological Special Project for ‘‘Significant New Drugs Development’’ under Grant No. 2019ZX09201004, the Special Fund Project for Shanghai Informatization Development in Big Data under Grant 201901043, and National Key R&D Program of China under Grant No. 2018YFC0910500.

In Fig. 6, with an increase of 𝑢 ∈ [0.01, 1], the ACC value does not fluctuate, and with an increase of 𝑢 ∈ [1, 100], the value of 𝑢 begins to impact the ACC value. For Haberman, parameter 𝑢 enhances the performance, but for water, an increase in 𝑢 decreases ACC. Therefore, it is apparent that the parameter 𝑢 plays an important role in the algorithm performance.

References

4.6.2. Convergence analysis

Aiolli, F., Donini, M., 2015. Easymkl: a scalable multiple kernel learning algorithm. Neurocomputing 169, 215–224. Alain, R., Francis, B., Stephane, C., Yves, G., 2008. Simplemkl. J. Mach. Learn. Res. 9, 2491–2521. Bache, K., Lichman, M., 2013. UCI machine learning repository. http://archive.ics.uci. edu/ml. Benavoli, A., Corani, G., Demsar, J., 2016. Time for a change: a tutorial for comparing multiple classifiers through bayesian analysis. J. Mach. Learn. Res. 77, 1–36. Bucak, S.S., Jin, R., Jain, A.K., 2014. Multiple kernel learning for visual object recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell. 36 (7), 1354–1369.

For representative datasets, we select the same as used for the discussion of parameter 𝑢. As seen in Figs. 7 and 8, the log of the object function value at each iteration is presented for iterations 1 to 30. Both show that with more iterations, the log of the object function value decreases rapidly and stops falling around iteration 4 and holds steady. Experimental results demonstrate that MUEKL can converge in a finite iteration. According to the convergence analysis, we can set the maximum number of iterations of MUEKL in practice. 11

Z. Wang, S. Hong, L. Yao et al.

Engineering Applications of Artificial Intelligence 89 (2020) 103461 Shen, C., Wang, P., Shen, F., Wang, H., 2012. Uboost: boosting with the universum. IEEE Trans. Pattern Anal. Mach. Intell. 34 (4), 825–832. Sparks, T.D., Gaultois, M.W., Oliynyk, A., Brgoch, J., Meredig, B., 2016. Data mining our way to the next generation of thermoelectrics. Scr. Mater. 111, 10–15. Tencer, L., Reznakova, M., Cheriet, M., 2017. U fuzzy: Fuzzy models with universum. Appl. Soft Comput. 59, 1–18. Tian, Y., Zhang, Y., Liu, D., 2016. Semi-supervised support vector classification with self-constructed universum. Neurocomputing 189, 33–42. Vapnik, V.N., Vapnik, V., 1998. Statistical Learning Theory, Vol. 1. Wiley New York. Vatankhah, M., Asadpour, V., Fazel-Rezai, R., 2013. Perceptual pain classification using anfis adapted rbf kernel support vector machine for therapeutic usage. Appl. Soft Comput. 13 (5), 2537–2546. Wang, Z., Cao, C., 2019. Cascade interpolation learning with double subspaces and confidence disturbance for imbalanced problems. Neural Netw. 118, 17–31. Wang, Z., Chen, S., Sun, T., 2008. Multik-mhks: a novel multiple kernel learning algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 30 (2), 348–353. Wang, Z., Wang, B.L., Cheng, Y., Li, D.D., Zhang, J., 2019a. Cost-sensitive Fuzzy Multiple Kernel Learning for imbalanced problem. Neurocomputing 366, 178–193. Wang, Z., Zhu, Z., Li, D., 2019b. Collaborative and geometric multi-kernel learning for multi-class classification. Pattern Recognit. http://dx.doi.org/10.1016/j.patcog. 2019.107050, early access, 107050. Weston, J., Collobert, R., Sinz, F., Bottou, L., Vapnik, V., 2006. Inference with the universum. In: International Conference on Machine Learning, pp. 1009–1016. Wu, P., Liu, S.H., Zhou, L.G., Chen, H.Y., 2018. A fuzzy group decision making model with trapezoidal fuzzy preference relations based on compatibility measure and cowga operator. Appl. Intell. 48 (1), 46–67. Wu, P., Wu, Q., Zhou, L.G., Chen, H.Y., Zhou, H., 2019. A consensus model for group decision making under trapezoidal fuzzy numbers environment. Neural Comput. Appl. 31 (2), 377–394. Wu, P., Zhou, L.G., Chen, H.Y., Tao, Z.F., 2019a. Additive consistency of hesitant fuzzy linguistic preference relation with a new expansion principle for hesitant fuzzy linguistic term sets. IEEE Trans. Fuzzy Syst. 27, 716–730. Wu, P., Zhou, L.G., Chen, H.Y., Tao, Z.F., 2019b. Multi-stage optimization model for hesitant qualitative decision making with hesitant fuzzy linguistic preference relations. Appl. Intell. http://dx.doi.org/10.1007/s10489-019-01502-8 (in press), on-line. Xiong, H., Swamy, M.N.S., Ahmad, M.O., 2005a. Optimizing the kernel in the empirical feature space. IEEE Trans. Neural Netw. 16 (2), 460–474. Xiong, H., Swamy, M.N., Ahmad, M.O., 2005b. Optimizing the kernel in the empirical feature space. IEEE Trans. Neural Netw. 16 (2), 460–474. Xu, Y., Chen, M., Li, G., 2016. Least squares twin support vector machine with universum data for classification. Internat. J. Systems Sci. 47 (15), 3637–3645. Zhang, D., Wang, J., Wang, F., Zhang, C., 2009. Semi-supervised classification with universum. In: Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 323–333. Zhu, C., Gao, D., 2015. Improved multi-kernel classification machine with nyström approximation technique. Pattern Recognit. 48 (4), 1490–1509. Zhu, Z., Wang, Z., Li, D., Zhu, Y., Du, W., 2019. Geometric structural ensemble learning for imbalanced problems. IEEE Trans. Cybern. 1–13. http://dx.doi.org/10.1109/ TCYB.2018.2877663, early access. Zhu, Y., Wang, Z., Zha, H., Gao, D., 2018. Boundary-eliminated pseudoinverse linear discriminant for imbalanced problems. IEEE Trans. Neural Netw. Learn. Syst. 29 (6), 2581–2594.

Capuano, N., Chiclana, F., Fujita, H., Herrera-Viedma, E., Loia, V., 2018. Fuzzy group decision making with incomplete information guided by social influence. IEEE Trans. Fuzzy Syst. 26 (3), 1704–1718. Chen, L., Gu, Y., Ji, X., Sun, Z.Y., Li, H.D., Gao, Y., Huang, Y., 2019. Extracting medications and associated adverse drug events using a natural language processing system combining knowledge base and deep learning. J. Am. Med. Inform. Assoc. 27 (1), 56–64. http://dx.doi.org/10.1093/jamia/ocz141. Chen, S., Zhang, C., 2009. Selecting informative universum sample for semi-supervised learning. In: Proceedings of the 21st International Jont Conference on Artifical Intelligence, pp. 1016–1021. Chong, S.C., Ong, T.S., Teoh, A.B.J., 2018. Discriminative kernel-based metric learning for face verification. J. Vis. Commun. Image Represent. 56, 207–219. Cortes, C., Mohri, M., Rostamizadeh, A., 2009. Learning non-linear combinations of kernels. In: Advances in Neural Information Processing Systems. pp. 396–404. Fan, Q., Wang, Z., Li, D., Gao, D., Zha, H., 2017. Entropy-based fuzzy support vector machine for imbalanced datasets. Knowl.-Based Syst. 115, 87–99. Georgios, A., Markatou, M., 2019. Optimality of training/test size and resampling effectiveness in cross-validation. J. Statist. Plann. Inference 199, 286–301. Gu, Y., Chanussot, J., Jia, X., Benediktsson, J.A., 2017. Multiple kernel learning for hyperspectral image classification: a review. IEEE Trans. Geosci. Remote Sens. 55 (11), 6547–6565. He, H.B., Garcia, E.A., 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21 (9), 1263–1284. Leski, J., 2003. Ho–kashyap classifier with generalization control. Pattern Recognit. Lett. 24 (14), 2281–2290. Leski, J., 2004. Kernel ho-kashyap classifier with generalization control. Int. J. Appl. Math. Comput. Sci. 14 (1), 53–61. Lewis, D.P., Jebara, T., Noble, W.S., 2006. Nonstationary kernel combination. In: Proceedings of the 23rd International Conference on Machine Learning. ACM, pp. 553–560. Li, M., Fang, X., Wang, J.J., Zhang, H., 2015. Supervised transfer kernel sparse coding for image classification. Pattern Recognit. Lett. 68, 27–33. Liu, C.L., Hsaio, W.H., Lee, C.H., Chang, T.H., Kuo, T.H., 2017. Semi-supervised text classification with universum learning. IEEE Trans. Cybern. 46 (2), 462–473. Liu, H.B., Ma, Y., Jiang, L., 2019. Managing incomplete preferences and consistency improvement in hesitant fuzzy linguistic preference relations with applications in group decision making. Inf. Fusion 51, 19–29. Marukatat, S., 2016. Kernel matrix decomposition via empirical kernel map. Pattern Recognit. Lett. 77, 50–57. Medjahed, S.A., Saadi, T.A., Benyettou, A., Ouali, M., 2017. Kernel-based learning and feature selection analysis for cancer diagnosis. Appl. Soft Comput. 51, 39–48. Muller, K.-R., Mika, S., Ratsch, G., Tsuda, K., Scholkopf, B., 2001. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 12 (2), 181–201. Nazarpour, A., Adibi, P., 2014. Two-stage multiple kernel learning for supervised dimensionality reduction. Pattern Recognit. 48 (5), 1854–1862. Qi, Z., Tian, Y., Shi, Y., 2012. Twin support vector machine with universum data. Neural Netw. 36 (3), 112–119. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y., 2007. More efficiency in multiple kernel learning. In: Proceedings of the 24th International Conference on Machine Learning. ACM, pp. 775–782. Sa-Couto, L., Wichert, A., 2019. Attention Inspired Network: Steep learning curve in an invariant pattern recognition model. Neural Netw. 114, 38–46.

12

Multiple Universum Empirical Kernel Learning

Multiple Universum Empirical Kernel Learning

Recommend Documents