Separability of set-valued data sets and existence of support hyperplanes in the support function machine




Information Sciences 430–431 (2018) 432–443



Jiqiang Chen a,b, Xiaoping Xue a,∗, Litao Ma b, Minghu Ha b

a Department of Mathematics, Harbin Institute of Technology, Harbin 150001, PR China
b School of Science, Hebei University of Engineering, Handan 056038, PR China

Article history: Received 19 April 2017; Revised 27 November 2017; Accepted 29 November 2017

Keywords: Support vector machine; Set-valued data; Separability; Support hyperplane; Support function

Abstract: The support function machine (SFM) has been shown to be effective in separating set-valued data sets. However, the separability of set-valued data and the existence of support hyperplanes in SFM, which can provide useful guidance for improving algorithms in applications, have not yet been discussed in theory. Therefore, in this paper we first discuss whether linearly separable set-valued data in Rd remain linearly separable after being mapped into the infinite-dimensional Banach space C(S) by support functions. Second, we discuss whether linearly inseparable set-valued data in Rd become linearly separable after being mapped into C(S), and if not, in which situations they are linearly separable. Third, we discuss the existence of support hyperplanes in SFM. Finally, two experiments with set-valued data sets are provided to verify the reasoning in the above discussions and the correctness of their conclusions.

1. Introduction

In some practical problems, such as water quality evaluation [22] and gene expression experiments [35], multiple measurements or replicated experiments are often used to reduce the level of uncertainty, which leads to a new kind of learning task: set-based classification [2,13,33,37]. Currently, there are two approaches to set-based classification. The first computes statistics of the original data (such as the mean and median) and describes the input set with a single vector, as in CART [4], ID3 [24], C4.5 [24], and SVMs [7,9,14,20]. However, although the law of large numbers guarantees that the sample mean converges to the true value as the number of samples tends to infinity, in practice the number of measurements making up a set-valued datum is finite, so the sample mean cannot adequately represent the true value. Furthermore, when a set-valued datum is represented by a single vector, other information (such as the variance) may be lost in the preprocessing [6]. The second approach develops set-based classifiers directly [21,27,30,32]. However, these methods usually rely on assumptions stated in advance (for example, that the sample sets lie on certain manifolds, which holds for some classification problems but not others), and they do not work in some cases. Therefore, Chen et al. [6] presented the support function machine (SFM), a new learning machine for set-based classification.

∗ Corresponding author. E-mail addresses: [email protected] (J. Chen), [email protected] (X. Xue).

https://doi.org/10.1016/j.ins.2017.11.057 © 2017 Elsevier Inc. All rights reserved.


In SFM, the sets of feature vectors are mapped into an infinite-dimensional Banach space C(S) (whose elements are the continuous functions defined on the unit ball S in Rd) via support functions σ(x) [12], so that each set becomes a single point (namely, a function) in the Banach space C(S). The set-based tasks [16,26,28,38] in the d-dimensional Euclidean space Rd are thereby converted into function-based tasks in the Banach space C(S). As C(S) is not an inner product space, the separating hyperplane in SFM is defined via a Radon measure μ, whose theoretical basis is the Riesz representation theorem in Banach space [25]; this differs from the construction in support vector machines (SVMs) [1,3,7,9,10,17,18,23,29,36]. The maximal margin algorithm is then constructed in this new Banach space. Consequently, the SFM retains the classification information of the original set-valued data and can deal with set-based classification effectively. Moreover, SFM is able to handle function- (or distribution-) based classification [6,31] and learning tasks described by fuzzy sets, since fuzzy sets can be mapped into the Banach space C(S) through their membership functions [6,15]. After mapping, all of the above tasks become function-based classification tasks, and classifiers can be trained in C(S). In addition, since a vector x can be represented by the point set {x}, vector-based classification can also be handled by SFM. Therefore, the algorithm is powerful across different data representations.

However, the hard margin SFM designed for linearly separable set-valued data is constructed directly via the maximal margin algorithm, and the soft margin SFM targeting linearly inseparable set-valued data is constructed by introducing slack variables. That is to say, previous work only established the SFMs themselves and did not consider the separability of set-valued data sets or the existence of support hyperplanes from a theoretical standpoint. The following three arguments indicate that it is meaningful to investigate these questions. First, linearly separating finite function-valued data sets in C(S) with the Hahn–Banach theorem [8] is equivalent to linearly separating their convex hulls, and the convex hulls are infinite sets, so we implicitly consider the linear separability of special infinite data sets when separating finite sets linearly [5]. Second, if we prove that the data sets are linearly separable in theory but cannot separate them completely via the hard margin SFM, we can infer that there is some mistake in the experiments or in the source code; the discussion of separability therefore helps us analyze experimental results. Last, for a practical classification problem, we certainly wish to analyze the algorithm's complexity, including the number of support functions lying on the support hyperplanes, and this motivates the discussion of the existence of support hyperplanes. Thus, it is necessary to investigate the separability of set-valued data sets and the existence of support hyperplanes, at least from a theoretical viewpoint, and such an investigation can offer guidance for improving algorithms in practical problems [11,19,22,34].

Following the above considerations, this paper is organized as follows. Section 2 reviews some basic content related to SFM. Section 3 discusses the linear separability of two set-valued data sets.
In Section 4, we discuss the situations in which linearly inseparable data sets become linearly separable after being mapped into C(S). In Section 5, the existence of support hyperplanes is discussed. Section 6 provides two numerical experiments to verify the points made in the above discussions, and Section 7 draws conclusions and suggests future studies.

2. Some preliminaries about SFM

In order to make this paper self-contained, we provide some preliminaries about SFM in this section.

Definition 1 [8]. Let K be a normed linear space, let f be a continuous linear functional on K, let E and F be two subsets of K, and let L = H_f^r = {x ∈ K | f(x) = r} (r ∈ R) be a hyperplane. If for any x ∈ E we have f(x) ≤ r (or ≥ r), and for any x ∈ F we have f(x) ≥ r (or ≤ r), then we say that the hyperplane L separates the sets E and F. If for any x ∈ E we have f(x) < r (or > r), and for any x ∈ F we have f(x) > r (or < r), then we say that the hyperplane L strongly separates the sets E and F.

Theorem 1 [8] (Geometric Hahn–Banach theorem). Let B∗ be a normed linear space, let E1 and E2 be two non-empty disjoint convex sets in B∗, and let x0 be an interior point of E1 with x0 ∉ E2. Then there exist r ∈ R and a nonzero continuous linear functional f such that the hyperplane H_f^r separates E1 and E2. In other words, there exists a nonzero continuous linear functional f such that f(x) ≤ r for any x ∈ E1 and f(x) ≥ r for any x ∈ E2.

Remark 1. Theorem 1 is the key theorem for discussing the separability of function-valued data sets; it shows that for any two non-empty disjoint convex sets in B∗ there exists a hyperplane separating them completely (see Fig. 1).

Definition 2 [12]. The support function σ_A : Rd → R of a non-empty closed convex set A in Rd is given by σ_A(x) = sup_{y∈A} ⟨x, y⟩, x ∈ Rd. For simplicity, we also denote σ_A(x) by σ_A.

As the Banach space C(S) is not an inner product space, the hyperplane in SFM is defined via the following Riesz representation theorem in Banach space.

Theorem 2 [25]. Assume that X is a compact Hausdorff space. Then for any bounded linear functional Λ on C(X), there is one and only one complex regular Borel measure μ such that

Λ(σ) = ∫_X σ dμ,  σ ∈ C(X),

and

‖Λ‖ = |μ|(X),

where |μ|(X) = sup{ Σ_{i=1}^{n} |μ(Ai)| : A1, …, An is a partition of X, n ≥ 1 } is the total variation of μ, denoted simply by ‖μ‖.

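To make Definition 2 concrete, here is a small illustrative sketch (ours, not from the paper): when a set is given by finitely many sample points, co(A) is a polytope, the supremum in σ_A(x) = sup_{y∈A} ⟨x, y⟩ is attained at one of those points, and the support function can be evaluated on sampled directions of S as a maximum of inner products.

```python
import numpy as np

def support_function(points, directions):
    """Support function sigma_A(x) = sup_{y in A} <x, y> of the convex hull
    of a finite point set A, evaluated at each given direction x.

    points:     (m, d) array of sample points spanning co(A)
    directions: (k, d) array of directions x (typically on the unit sphere S)
    returns:    (k,) array of support values
    """
    # For a polytope co(A), the supremum of <x, y> is attained at one of the
    # generating points, so a maximum over the dot products suffices.
    return (directions @ points.T).max(axis=1)

# Example: an axis-aligned interval in R^2 (a set-valued datum as in Section 6),
# here the interval A4 = {(x, y) | 1.09 <= x <= 1.29, y = 2.63} from Table 2.
A = np.array([[1.09, 2.63], [1.29, 2.63]])
thetas = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)
S_sample = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # directions on the unit circle
print(support_function(A, S_sample))
```

Evaluating σ_A on a fixed grid of directions in this way is also how the later sketches turn set-valued data into finite-dimensional feature vectors.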

Fig. 1. The ellipses E1 and E2 are two non-empty convex sets satisfying E1 ∩ E2 = ∅, and H_f^r is a hyperplane separating E1 and E2.

Table 1
Differences between SFM and SVM [6].

Method | Original space and classifying objects | Feature space and classifying objects | Hyperplane
SFM | Rd, set-valued data | Banach space, function-valued data | M = {σ | ∫_S σ dμ = α, σ ∈ C(S)}
SVM | Rd, vector-valued data | Hilbert space, vector-valued data | M = {x | ⟨x, y⟩ = α, x ∈ Rd}

Fig. 2. [6] One marker type denotes the set-valued data co(Ai) with output +1 and the other the set-valued data co(Ai) with output −1; + indicates the support functions σi of the former and − the support functions σi of the latter. The + and − symbols are points in C(S).

Definition 3 [6]. Let C(S)∗ be the conjugate space of C(S). We define M ≜ {σ ∈ C(S) | ∫_S σ dμ = α}, where α ∈ R, as a hyperplane in C(S), and we define M⊥ ≜ {μ ∈ C(S)∗ | ∫_S σ dμ = 0, σ ∈ M} as the orthocomplement of M, where μ is the normal of the hyperplane M. For simplicity, we denote ∫_S σ dμ by μ(σ).

With this definition, the SFM is constructed in the Banach space C(S), which differs from the construction of SVMs (see Table 1). Suppose that the given training set is

T = {(A1, y1), (A2, y2), …, (Al, yl)},   (1)

where l is the number of samples, Ai ⊂ Rd are bounded, and yi ∈ Y = {+1, −1}, i = 1, 2, …, l. Then, with the help of the closed convex hulls co(Ai) of the sets Ai and the support functions σi(x), we have the training set

T = {(σ1, y1), (σ2, y2), …, (σl, yl)}   (2)

and the decision function

f(σ) = sgn(g(σ)),   (3)

where σi(x), also written σi, are the support functions of co(Ai), i = 1, 2, …, l. The set-based classification in Rd is thus converted into a function-based classification in C(S) (see Fig. 2).
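As a purely illustrative reading of Definition 3 and decision function (3): if the measure μ is approximated by an atomic measure with weights w_k placed at finitely many directions x_k ∈ S, then μ(σ) = ∫_S σ dμ collapses to a weighted sum and g(σ) = ∫_S σ dμ + α can be evaluated directly. The weights, directions, and offset below are hypothetical stand-ins, not quantities computed by the SFM of [6].

```python
import numpy as np

def decision_value(sigma_vals, weights, alpha):
    """g(sigma) = integral_S sigma d(mu) + alpha for an atomic measure
    mu = sum_k weights[k] * delta_{x_k}; the integral becomes a dot product.

    sigma_vals: (K,) values sigma(x_k) of a support function at the atoms x_k
    weights:    (K,) signed weights of the atomic measure mu (hypothetical)
    alpha:      hyperplane offset (hypothetical)
    """
    return float(np.dot(sigma_vals, weights)) + alpha

def classify(sigma_vals, weights, alpha):
    """Decision function f(sigma) = sgn(g(sigma)), cf. Eq. (3)."""
    return 1 if decision_value(sigma_vals, weights, alpha) >= 0 else -1
```

Under this discretization, ‖μ‖ = |μ|(S) is simply Σ_k |w_k|, which is the quantity minimized in the SFM problems given next.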

Definition 4 [6]. If g(σ), σ ∈ C(S), in decision function (3), which divides C(S) into two parts, is a hyperplane, then we say that training set (2) is linearly separable. If g(σ), σ ∈ C(S), in decision function (3), which divides C(S) into two parts, is a hypersurface, then we say that training set (2) is linearly inseparable.

Then, for the linearly separable training set and the linearly inseparable training set (2), we establish the hard margin SFM

min_{μ, α}  (1/2)‖μ‖   (4)

s.t.  yi ( ∫_S σi dμ + α ) ≥ 1,  i = 1, 2, …, l,   (5)

and the soft margin SFM

min_{μ, α, ξ}  (1/2)‖μ‖ + C Σ_{i=1}^{l} ξi   (6)

s.t.  yi ( ∫_S σi dμ + α ) ≥ 1 − ξi,   (7)

ξi ≥ 0,  i = 1, 2, …, l,   (8)

respectively [6], where ξ = (ξ1, ξ2, …, ξl)^T is the vector of slack variables and C > 0 is a penalty parameter. Objective function (6) implies that we should minimize both ‖μ‖ (i.e., maximize the margin 2/‖μ‖) and Σ_{i=1}^{l} ξi (i.e., minimize the extent to which the constraints yi ( ∫_S σi dμ + α ) ≥ 1 are violated).

Having established the above preliminaries, in the next sections we will discuss the following three questions: (1) Are linearly separable set-valued data in Rd still linearly separable after being mapped into C(S) by support functions? (2) Are linearly inseparable set-valued data in Rd linearly separable after being mapped into C(S)? If not, in which situations are they linearly separable? (3) Are there really support hyperplanes in SFM, and is there a theoretical analysis for this?
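The following sketch is not the algorithm of [6]; it only illustrates the structure of the soft margin SFM (6)-(8) under the simplifying assumption that μ is restricted to an atomic measure supported on K sampled directions x_k ∈ S. The total variation ‖μ‖ then becomes the ℓ1 norm of the atom weights, and (6)-(8) reduces to a linear program; the function name and the use of scipy are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

def train_soft_margin_sfm(sigma_vals, y, C=1.0):
    """Illustrative discretization of the soft margin SFM (6)-(8).

    The measure mu is restricted to an atomic measure sum_k w_k * delta_{x_k}
    on K sampled directions x_k in S, so ||mu|| = sum_k |w_k| and
    integral_S sigma_i d(mu) = sum_k sigma_i(x_k) * w_k.  Writing
    w = w_plus - w_minus with w_plus, w_minus >= 0 turns (6)-(8) into
    a linear program.

    sigma_vals: (l, K) matrix with entries sigma_i(x_k)
    y:          (l,) labels in {+1, -1}
    returns:    (w, alpha, xi)
    """
    sigma_vals = np.asarray(sigma_vals, dtype=float)
    y = np.asarray(y, dtype=float)
    l, K = sigma_vals.shape

    # variable layout: [w_plus (K), w_minus (K), alpha (1), xi (l)]
    c = np.concatenate([0.5 * np.ones(2 * K), [0.0], C * np.ones(l)])

    # constraints (7): y_i (sum_k sigma_ik (w+_k - w-_k) + alpha) >= 1 - xi_i,
    # rewritten in the form A_ub @ z <= b_ub
    A_ub = np.hstack([
        -y[:, None] * sigma_vals,   # w_plus block
        y[:, None] * sigma_vals,    # w_minus block
        -y[:, None],                # alpha column
        -np.eye(l),                 # xi block
    ])
    b_ub = -np.ones(l)

    bounds = [(0, None)] * (2 * K) + [(None, None)] + [(0, None)] * l  # (8): xi >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    w = z[:K] - z[K:2 * K]
    alpha = z[2 * K]
    xi = z[2 * K + 1:]
    return w, alpha, xi
```

A new support function σ would then be classified by f(σ) = sgn(Σ_k σ(x_k) w_k + α), in line with the decision rule sketched after Definition 3; the hard margin SFM (4)-(5) corresponds to forcing all ξi = 0, e.g., by taking C very large.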

3. Linearly separable set-valued data in Rd are still linearly separable in C(S)

In this section, we discuss the first question formulated above. As separating the sets Ai and Bj is equivalent to separating their closed convex hulls co(Ai) and co(Bj), we discuss the separability of co(Ai) and co(Bj) directly. Here, let co(Ai) be the closed convex hulls of Ai ⊂ Rd (i = 1, 2, …, l1) with output yi = +1, let co(Bj) be the closed convex hulls of Bj ⊂ Rd (j = 1, 2, …, l2) with output yj = −1, let A ⊂ Rd be the minimum convex set containing all the sets co(Ai) (i = 1, 2, …, l1), let B ⊂ Rd be the minimum convex set containing all the sets co(Bj) (j = 1, 2, …, l2), and let M = {σ_{co(Ai)}(x) | i = 1, 2, …, l1} and N = {σ_{co(Bj)}(x) | j = 1, 2, …, l2}. Denote by

co(M) = { Σ_{i=1}^{l1} λi σ_{co(Ai)}(x) | Σ_{i=1}^{l1} λi = 1, λi ≥ 0, σ_{co(Ai)}(x) ∈ M, i = 1, 2, …, l1, ∀ l1 ∈ N }

and

co(N) = { Σ_{j=1}^{l2} λj σ_{co(Bj)}(x) | Σ_{j=1}^{l2} λj = 1, λj ≥ 0, σ_{co(Bj)}(x) ∈ N, j = 1, 2, …, l2, ∀ l2 ∈ N }

the convex combinations of the support functions σ_{co(Ai)}(x) and σ_{co(Bj)}(x), respectively. In order to answer this question, some lemmas are given first.

Lemma 1. If the sets co(Ai), i = 1, 2, …, l1, and co(Bj), j = 1, 2, …, l2, are linearly separable, then A ∩ B = ∅.

Proof. If co(Ai), i = 1, 2, …, l1, and co(Bj), j = 1, 2, …, l2, are linearly separable, then from Theorem 1 we can conclude that there exists a hyperplane L separating the sets co(Ai) and co(Bj). Thus, A ∩ B = ∅. □

Remark 2. This lemma illustrates that if the data sets are linearly separable, then the convex sets A and B are disjoint and there exists a hyperplane L separating them correctly (see Fig. 3).

Lemma 2. Let P ⊂ Rd. If for any x ∈ S we have σ_P(x) ∈ co(M), then P ⊂ A.

Proof. As σ_P(x) ∈ co(M) for any x ∈ S, there exist N0 ∈ N, Σ_{i=1}^{N0} λi = 1, λi ≥ 0, and σ_{co(Ai)}(x) ∈ M satisfying

σ_P(x) = Σ_{i=1}^{N0} λi σ_{co(Ai)}(x).

As co(Ai) ⊂ A, i = 1, 2, …, N0, we have

σ_P(x) = Σ_{i=1}^{N0} λi σ_{co(Ai)}(x) ≤ max_{1≤i≤N0} σ_{co(Ai)}(x) = σ_A(x).


Fig. 3. One marker type denotes the sets co(Ai) (i = 1, 2, …, l1) with output yi = +1 and the other the sets co(Bj) (j = 1, 2, …, l2) with output yj = −1; L is a hyperplane separating co(Ai) and co(Bj). The left ellipse is the minimum convex set A containing all the sets co(Ai), the right ellipse is the minimum convex set B containing all the sets co(Bj), and A ∩ B = ∅.

Fig. 4. The two marker types denote the sets co(Ai) (i = 1, 2, …, l1) with output yi = +1 and the sets co(Bj) (j = 1, 2, …, l2) with output yj = −1, and they are separated by the hyperplane L. The red region (on the left in C(S)) is the convex combination co(M) of the support functions σ_{co(Ai)}(x) (i = 1, 2, …, l1), marked +; the blue region (on the right in C(S)) is the convex combination co(N) of the support functions σ_{co(Bj)}(x) (j = 1, 2, …, l2), marked −; and co(M) ∩ co(N) = ∅. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Therefore, P ⊂ A. □

Theorem 3. If the sets co(Ai), i = 1, 2, …, l1, and co(Bj), j = 1, 2, …, l2, are linearly separable (see Fig. 4), then for any x ∈ S we have (1) σ_{co(Bj)}(x) ∉ co(M), j = 1, 2, …, l2; and (2) co(M) ∩ co(N) = ∅.

Proof. The proof of this theorem can be found in Appendix A. □

Remark 3. This theorem shows that closed convex sets that are linearly separable in Rd are also linearly separable after being mapped into C(S) by their support functions (see Fig. 4), and it answers the first question posed at the end of Section 2.

4. Some linearly inseparable set-valued data in Rd are linearly separable in C(S)

In order to discuss the second question, we first give the following lemma.

Lemma 3. For any m ∈ N; λi ∈ R, i = 1, 2, …, m; and x ∈ S, we have

Σ_{i=1}^{m} σ_{λi co(Bi)}(x) = σ_{Σ_{i=1}^{m} λi co(Bi)}(x).

Proof. The proof of this lemma can be found in Appendix B. □

In particular, if λi ≥ 0, then we have

Σ_{i=1}^{m} σ_{λi co(Bi)}(x) = Σ_{i=1}^{m} λi σ_{co(Bi)}(x).
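A quick numerical sanity check of this identity in its λi ≥ 0 form (ours, not part of the paper): it uses finite point sets as stand-ins for the convex bodies co(Bi) and the fact that the support function of a Minkowski sum is the sum of the support functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def support(points, x):
    """sigma_{co(points)}(x) = max_{y in points} <x, y> for a finite point set."""
    return (points @ x).max()

# two random finite sets B1, B2 in R^2 and nonnegative coefficients
B1, B2 = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
lam1, lam2 = 0.3, 1.7

# Minkowski sum lam1*B1 + lam2*B2 formed pointwise; its convex hull is the
# Minkowski sum of the scaled convex hulls.
msum = (lam1 * B1)[:, None, :] + (lam2 * B2)[None, :, :]
msum = msum.reshape(-1, 2)

for _ in range(5):
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)          # a direction on the unit sphere S
    lhs = lam1 * support(B1, x) + lam2 * support(B2, x)
    rhs = support(msum, x)
    assert np.isclose(lhs, rhs), (lhs, rhs)
print("Lemma 3 (lambda_i >= 0 case) verified on random directions.")
```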

This lemma then yields the following theorem.


Fig. 5. The two marker types denote the sets co(Ai) (i = 1, 2, …, l1) with output yi = +1 and the sets co(Bj) (j = 1, 2, …, l2) with output yj = −1; co(Ai) and co(Bj) are linearly inseparable, and the curve L1 is a hypersurface separating them. The ellipse indicates the minimum convex set A containing all the sets co(Ai), the triangle is the convex combination co(T), and co(T) ∩ A = ∅. The line L2 is a hyperplane in C(S) separating the support functions marked + and −.

Fig. 6. ∫_S σ_{i0}^+ dμ∗ + α∗ = 1 − ξ_{i0}∗ and ∫_S σ_{i1}^− dμ∗ + α∗ = −1 + ξ_{i1}∗ are the support hyperplanes, and σ_{i0}^+ and σ_{i1}^− are the support functions lying on the support hyperplanes.

Theorem 4. Let co(Bi) ⊂ Rd, i = 1, 2, …, l2; let yi∗ ∈ co(Bi); let T = {yi∗ | i = 1, 2, …, l2}; and let co(T) = { Σ_{i=1}^{l2} λi yi∗ | Σ_{i=1}^{l2} λi = 1, λi ≥ 0, yi∗ ∈ T, i = 1, 2, …, l2, l2 ∈ N }. If co(T) ∩ A = ∅, then

co(M) ∩ co(N) = ∅.

Proof. The proof of this theorem can be found in Appendix C. □

Remark 4. This theorem tells us that closed convex sets co(Ai), i = 1, 2, …, l1, and co(Bj), j = 1, 2, …, l2, in Rd satisfying the conditions of this theorem (co(Ai) and co(Bj) not necessarily linearly separable) are linearly separable after being mapped into C(S) (see Fig. 5). Theorem 4 answers the second question posed at the end of Section 2.

5. The existence of support hyperplanes

In this section, we discuss the third question. As the hard margin SFM (namely, optimal problem (4)-(5)) is a special case of the soft margin SFM (namely, optimal problem (6)-(8)), we only discuss the existence of support hyperplanes in the soft margin SFM here.

Theorem 5. Let (μ∗, α∗, ξ∗) be an optimal solution of optimal problem (6)-(8). Then

(1) there exists an i0 ∈ {i | yi = +1} such that ∫_S σ_{i0}^+ dμ∗ + α∗ = 1 − ξ_{i0}∗;
(2) there exists an i1 ∈ {i | yi = −1} such that ∫_S σ_{i1}^− dμ∗ + α∗ = −1 + ξ_{i1}∗.

Proof. The proof of this theorem can be found in Appendix D. □

Remark 5. This theorem shows that the hyperplanes ∫_S σi dμ∗ + α∗ = ±1 ∓ ξi∗ obtained by the soft margin SFM [6] are the two support hyperplanes (see Fig. 6). That is to say, this theorem answers the third question posed at the end of Section 2.

6. Numerical experiments

In this section, we aim to verify the correctness of Theorems 3 and 4 from an experimental viewpoint. In order to display the set-valued data accurately and ensure universality, we take some set-valued data selected randomly in R2 as examples. The first experiment is provided to verify Theorem 3, namely that linearly separable set-valued data are still linearly separable after being mapped into C(S).


Fig. 7. Linearly separable set-valued data in R2, where the red line segments (above the line y = x − 1) represent the intervals Ai labeled with 1 and the blue line segments (below the line y = x − 1) represent the intervals Bi labeled with −1 (i = 1, 2, …, 250). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2
Some linearly separable set-valued data.

Data set | Label | Data set | Label
A1 = {(x, y) | −1.82 ≤ x ≤ −1.62, y = 0.69} | 1 | B1 = {(x, y) | −0.73 ≤ x ≤ −0.53, y = −3.81} | −1
A2 = {(x, y) | 2.27 ≤ x ≤ 2.47, y = 4.51} | 1 | B2 = {(x, y) | 0.21 ≤ x ≤ 0.41, y = −6.45} | −1
A3 = {(x, y) | 2.28 ≤ x ≤ 2.48, y = 6.86} | 1 | B3 = {(x, y) | 1.05 ≤ x ≤ 1.25, y = −6.18} | −1
A4 = {(x, y) | 1.09 ≤ x ≤ 1.29, y = 2.63} | 1 | B4 = {(x, y) | −1.60 ≤ x ≤ −1.40, y = −7.92} | −1
A5 = {(x, y) | 1.80 ≤ x ≤ 2.00, y = 6.43} | 1 | B5 = {(x, y) | 1.63 ≤ x ≤ 1.82, y = −2.68} | −1

Experiment 1. In this experiment, we randomly select 250 intervals Ai labeled with 1 and 250 intervals Bi labeled with −1 (i = 1, 2, …, 250) on the two sides of the line y = x − 1, respectively (see Fig. 7). Then we obtain 500 set-valued data, some of which are listed in Table 2. That is to say, the data set

T1 = {(A1, y1), (A2, y2), …, (A250, y250), (B1, y1), (B2, y2), …, (B250, y250)}   (9)

is a linearly separable set-valued data set in R2.
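For readers who want to reproduce the flavor of Experiment 1, the sketch below generates horizontal intervals on either side of the line y = x − 1; the sampling ranges and the interval width of 0.2 suggested by Table 2 are our own assumptions, not specifications from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_interval_data(n_per_class=250, width=0.2, margin=1.0):
    """Generate horizontal intervals on either side of the line y = x - 1.

    Each datum is an interval {(x, y) | a <= x <= a + width, y = b}, stored
    by its two endpoints; intervals above the line get label +1, those
    below get label -1.  Ranges, width, and margin are illustrative choices.
    """
    data, labels = [], []
    for label in (+1, -1):
        a = rng.uniform(-2.0, 3.0, size=n_per_class)          # left endpoints
        offset = rng.uniform(margin, 8.0, size=n_per_class)   # distance from the line
        b = (a - 1.0) + label * offset                         # y = x - 1 shifted up or down
        for ai, bi in zip(a, b):
            data.append(np.array([[ai, bi], [ai + width, bi]]))
            labels.append(label)
    return data, np.array(labels)

intervals, y = make_interval_data()
print(len(intervals), "set-valued data, labels:", np.unique(y))
```

Support functions of these intervals, sampled on unit-circle directions as in the earlier sketches, can then be fed to the discretized soft margin SFM above.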

Next, we randomly select 400 samples (200 samples labeled with 1 and 200 samples labeled with −1) as the training data set and designate the remaining 100 samples as the test data set. The experiment is then conducted with fivefold cross-validation and the hard margin SFM (namely, optimal problem (4)-(5) in Section 2) designed for linearly separable set-valued data sets. As the hard margin SFM is established in the infinite-dimensional Banach space C(S) and the classification results are obtained in C(S), we cannot draw a figure marking the precise locations of the points σ, so we simply give a diagram of the classification results for the five folds in Fig. 8.

Discussion 1. From Fig. 8, we can see that the test data of every fold can be correctly classified with a hyperplane, so the average accuracy is 1, which illustrates that the linearly separable set-valued data in R2 are still linearly separable after being mapped into C(S). That is to say, we have verified the correctness of Theorem 3 from an experimental viewpoint.

The following experiment is provided to show that linearly inseparable set-valued data satisfying the conditions of Theorem 4 become linearly separable after being mapped into C(S).

Experiment 2. In this experiment, we randomly select some set-valued data satisfying the conditions of Theorem 4 on the two sides of the parabola x = 3.15 − (1/2)(y − 2.5)², respectively (see Fig. 9). Then, we obtain a set-valued data set

T2 = {(A1, y1), (A2, y2), …, (A251, y251), (B1, y1), (B2, y2), …, (B251, y251)},   (10)

where the Ai are labeled with 1 and the Bi are labeled with −1 (i = 1, 2, …, 251); some of these data are listed in Table 3. Thus, T2 is a linearly inseparable set-valued data set in R2 satisfying the conditions of Theorem 4.


Fig. 8. Diagram of the classification results of the test set-valued data in C(S). Objects 1, 3, 5, 7, and 9 are the data of the five respective folds labeled with −1. Objects 2, 4, 6, 8, and 10 are the data of the five respective folds labeled with 1. The horizontal straight line is the decision hyperplane. The horizontal axis is the sequence number of the point σ, and the vertical axis is the value of the decision function f(σ).

Fig. 9. Some linearly inseparable set-valued data in R2 satisfying the conditions of Theorem 4, where the red line segments represent the intervals Ai labeled with 1 and the blue line segments represent the intervals Bi labeled with −1 (i = 1, 2, …, 251). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Next, we randomly select 402 samples (201 samples labeled with 1 and 201 samples labeled with −1) as the training data set and designate the remaining 100 samples as the test data set. The experiments are conducted using fivefold cross-validation and the hard margin SFM. Fig. 10 is a diagram of the classification results for the test set-valued data in C(S). Objects 1, 3, 5, 7, and 9 are the data of the five respective folds labeled with −1. Objects 2, 4, 6, 8, and 10 are the data of the five respective folds labeled with 1. The black line is the classification hyperplane. The horizontal axis is the sequence number of the point σ, and the vertical axis is the value of the decision function f(σ).

Discussion 2. In this experiment, the test data of every fold can also be correctly classified with a hyperplane (see Fig. 10), so we find again that the average accuracy is 1. This result illustrates that the linearly inseparable set-valued data set T2 in R2 is linearly separable after being mapped into C(S). Thus, from an experimental viewpoint, we have verified the correctness of Theorem 4.

Table 3
Some linearly inseparable set-valued data.

Data set | Label | Data set | Label
A1 = {(x, y) | 2.90 ≤ x ≤ 3.10, y = 2.50} | 1 | B1 = {(x, y) | 2.8 ≤ x ≤ 3.5, y = 3.5} | −1
A2 = {(x, y) | 1.95 ≤ x ≤ 2.15, y = 1.20} | 1 | B2 = {(x, y) | 2.72 ≤ x ≤ 2.92, y = 1.02} | −1
A3 = {(x, y) | 1.84 ≤ x ≤ 2.04, y = 1.06} | 1 | B3 = {(x, y) | 3.21 ≤ x ≤ 3.41, y = 1.31} | −1
A4 = {(x, y) | 0.86 ≤ x ≤ 1.06, y = 2.33} | 1 | B4 = {(x, y) | 5.18 ≤ x ≤ 5.38, y = 2.13} | −1
A5 = {(x, y) | 1.97 ≤ x ≤ 2.17, y = 1.59} | 1 | B5 = {(x, y) | 3.80 ≤ x ≤ 4.00, y = 3.02} | −1

Fig. 10. Diagram of the classification results of the testing set-valued data in C(S).

Remark 6. The above two experiments have been used to verify the correctness of the theoretical analysis. The value of the mathematical analysis is that if a set-valued data set satisfies the conditions of Theorem 3 or Theorem 4 but the classification accuracy obtained via the hard or soft margin SFM is not 1, then we can infer that there must be some mistake in the experiments or in the source code. Therefore, the discussions in this manuscript can help us analyze experimental results.

Remark 7. When the set-valued data reduce to vector-valued data, Theorem 6 below shows that Rd can be linearly and isometrically embedded in C(S) by the isometric embedding f. In this case, Rd is a subspace of C(S), and the above theorems about set-valued data still hold.

Theorem 6. Let f : Rd → C(S), x ↦ f(x) = σ_{x}(·), and let d1 and d2 be the distances in Rd and C(S), respectively. Then f is a linear isometric embedding.

Proof. First, for any x0, x1 ∈ Rd, we have

f(x0 + x1) ≜ σ_{x0+x1}(x) = sup_{y ∈ {x0+x1}} ⟨x, y⟩ = f(x0) + f(x1),   (11)

and for any λ ∈ R, we have

f(λx0) = ⟨x, λx0⟩ = λ f(x0).   (12)

Therefore, f is linear. Second, for any x0, x1 ∈ Rd,

d2(f(x0), f(x1)) = d2(σ_{x0}(x), σ_{x1}(x)) = sup_{‖x‖=1} |⟨x, x0⟩ − ⟨x, x1⟩| ≤ sup_{‖x‖=1} ‖x‖ · ‖x0 − x1‖ = d1(x0, x1),   (13)-(15)

and

d2(f(x0), f(x1)) = sup_{‖x‖=1} |⟨x, x0⟩ − ⟨x, x1⟩| ≥ |⟨(x0 − x1)/‖x0 − x1‖, x0 − x1⟩| = ‖x0 − x1‖ = d1(x0, x1).   (16)-(17)

Therefore, f is an isometric embedding. □
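A quick numerical illustration of Theorem 6 (ours, not from the paper): for the embedding f(x) = σ_{x}(·) = ⟨·, x⟩, the distance in C(S) is sup_{‖u‖=1} |⟨u, x0⟩ − ⟨u, x1⟩| = ‖x0 − x1‖, which can be checked approximately by sampling directions on the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(1)

# f(x) = sigma_{x}(.) = <., x>; the sup over the unit sphere of
# |<u, x0> - <u, x1>| equals ||x0 - x1||, so d2(f(x0), f(x1)) = d1(x0, x1).
x0, x1 = rng.normal(size=3), rng.normal(size=3)

directions = rng.normal(size=(100_000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # points of S

d2_approx = np.abs(directions @ (x0 - x1)).max()   # sampled sup norm in C(S)
d1 = np.linalg.norm(x0 - x1)
print(d2_approx, d1)   # d2_approx approaches d1 from below as sampling gets denser
```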

7. Conclusions and future work

In this work, we discussed the linear separability of set-valued data sets and the existence of support hyperplanes in the Banach space C(S). We proved that linearly separable set-valued data sets in Rd must be linearly separable in the infinite-dimensional feature space C(S). For linearly inseparable set-valued data sets, we developed a meaningful sufficient condition for judging their linear separability in C(S) using information about the original input data. Furthermore, we showed that there are support hyperplanes in SFM. We hope that these results will provide a theoretical basis for experimental analysis and can be applied to develop algorithms for set-based classification. Moreover, when the set-valued data reduce to vector-valued data, we showed that Rd can be linearly and isometrically embedded in C(S) by the isometric embedding f; in this case, Rd is a subspace of C(S), and the conclusions in this paper still hold.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (11671109, 11731010, 61222210, and 11626079), the Major Program of the National Natural Science Foundation of China (61432011), and the Science and Technology Support Program of Handan (1521109072-1).

Appendix A. Proof of Theorem 3

Proof. (1) Assume that there exists j0 ∈ {1, 2, …, l2} such that σ_{co(Bj0)}(x) ∈ co(M). From Lemma 2, we have co(Bj0) ⊂ A. As the sets co(Ai) and co(Bj) are linearly separable, by Lemma 1, for all j = 1, 2, …, l2 we have co(Bj) ∩ A = ∅, which contradicts co(Bj0) ⊂ A. Therefore, the original assumption is wrong; namely, for all j = 1, 2, …, l2, we have

σ_{co(Bj)}(x) ∉ co(M),  x ∈ S.

(2) Assume that co(M) ∩ co(N) ≠ ∅. Then there exists σ_{P0}(x) with σ_{P0}(x) ∈ co(M) and σ_{P0}(x) ∈ co(N), x ∈ S. From Lemma 2, we have P0 ⊂ A and P0 ⊂ B. Furthermore, A ∩ B ⊇ P0 ≠ ∅. As the sets co(Ai) and co(Bj) are linearly separable, by Lemma 1 we have A ∩ B = ∅, which contradicts A ∩ B ⊇ P0 ≠ ∅. Therefore, the assumption is wrong. Thus, we have co(M) ∩ co(N) = ∅. □

Appendix B. Proof of Lemma 3

Proof. Suppose that xi ∈ co(Bi), i = 1, 2, …, l2. For any x ∈ S, m ∈ N, and λi ∈ R, i = 1, 2, …, m,

σ_{Σ_{i=1}^{m} λi co(Bi)}(x) = sup_{Σ_{i=1}^{m} λi xi ∈ Σ_{i=1}^{m} λi co(Bi)} ⟨x, Σ_{i=1}^{m} λi xi⟩ ≤ Σ_{i=1}^{m} sup_{λi xi ∈ λi co(Bi)} ⟨x, λi xi⟩ = Σ_{i=1}^{m} σ_{λi co(Bi)}(x).   (B.1)

In addition, as co(Bi), i = 1, 2, …, l2, are all closed convex sets, for any co(Bi) there exists λi xi⁰ ∈ λi co(Bi) satisfying

sup_{λi xi ∈ λi co(Bi)} ⟨x, λi xi⟩ = ⟨x, λi xi⁰⟩;

thus,

Σ_{i=1}^{m} σ_{λi co(Bi)}(x) = ⟨x, Σ_{i=1}^{m} λi xi⁰⟩ ≤ sup_{Σ_{i=1}^{m} λi xi ∈ Σ_{i=1}^{m} λi co(Bi)} ⟨x, Σ_{i=1}^{m} λi xi⟩ = σ_{Σ_{i=1}^{m} λi co(Bi)}(x).   (B.2)

By inequalities (B.1) and (B.2), we obtain that for any m ∈ N, λi ∈ R, i = 1, 2, …, m, and x ∈ Rd, we have

Σ_{i=1}^{m} σ_{λi co(Bi)}(x) = σ_{Σ_{i=1}^{m} λi co(Bi)}(x). □


Appendix C. Proof of Theorem 4

Proof. Assume that co(M) ∩ co(N) ≠ ∅. Then there exists σ_{P0}(x) such that σ_{P0}(x) ∈ co(M) and σ_{P0}(x) ∈ co(N).

By σ_{P0}(x) ∈ co(M), there exist N0 ∈ N, Σ_{i=1}^{N0} λi = 1, λi ≥ 0, and σ_{co(Ai)}(x) ∈ M such that for any x ∈ S we have

σ_{P0}(x) = Σ_{i=1}^{N0} λi σ_{co(Ai)}(x) ≤ max_{1≤i≤N0} σ_{co(Ai)}(x) ≤ σ_A(x).   (C.1)

From σ_{P0}(x) ∈ co(N), there exist N1 ∈ N, Σ_{i=1}^{N1} λi = 1, λi ≥ 0, and σ_{co(Bi)}(x) ∈ N such that for any x ∈ S,

σ_{P0}(x) = Σ_{i=1}^{N1} λi σ_{co(Bi)}(x) = Σ_{i=1}^{N1} σ_{λi co(Bi)}(x).   (C.2)

As for any co(Bi), i = 1, 2, …, l2, there exists yi∗ ∈ co(Bi), we have

Σ_{i=1}^{N1} λi yi∗ ∈ co(T).

By Lemma 3, we can obtain that

Equality (C.2) = sup_{Σ_{i=1}^{N1} λi yi ∈ Σ_{i=1}^{N1} λi co(Bi)} ⟨x, Σ_{i=1}^{N1} λi yi⟩ ≥ ⟨x, Σ_{i=1}^{N1} λi yi∗⟩.   (C.3)

As co(T) ∩ A = ∅, Σ_{i=1}^{N1} λi yi∗ ∉ A, which implies that there exists an x∗ ∈ A such that the directions of the vector x∗ and Σ_{i=1}^{N1} λi yi∗ are the same and ⟨x∗, Σ_{i=1}^{N1} λi yi∗⟩ > 1 ≥ σ_A(x∗). In other words, there exists an x∗ ∈ S such that

σ_{P0}(x∗) > σ_A(x∗).   (C.4)

Therefore, inequalities (C.1) and (C.4) contradict each other, and the original assumption is wrong. Thus,

co(M) ∩ co(N) = ∅. □

Appendix D. Proof of Theorem 5

Proof. (1) As (μ∗, α∗, ξ∗) is an optimal solution of optimal problem (6)-(8), (μ∗, α∗, ξ∗) satisfies constraints (7) and (8), and therefore

∫_S σi^− dμ∗ + α∗ − ξi∗ ≤ −1,  i ∈ {i | yi = −1}.   (D.1)

Assume that for any i ∈ {i | yi = +1}, we have

∫_S σi^+ dμ∗ + α∗ + ξi∗ > 1.   (D.2)

Denote μ̃ = aμ∗, α̃ = a(α∗ + 1) − 1, and ξ̃i = aξi∗. If a ∈ (0, 1), inequality (D.1) is equivalent to

∫_S σi^− dμ̃ + α̃ − ξ̃i ≤ −1,  i ∈ {i | yi = −1}.   (D.3)

From the original assumption, we have that for any i ∈ {i | yi = +1},

lim_{a→1−} ( ∫_S σi^+ dμ̃ + α̃ + ξ̃i ) = lim_{a→1−} ( ∫_S σi^+ d(aμ∗) + a(α∗ + 1) − 1 + aξi∗ ) = ∫_S σi^+ dμ∗ + α∗ + ξi∗ > 1.

Thus, there exists an a ∈ (0, 1) such that

∫_S σi^+ dμ̃ + α̃ + ξ̃i > 1,  i ∈ {i | yi = +1}.   (D.4)

By inequalities (D.3) and (D.4), we can conclude that (μ̃, α̃, ξ̃) is a feasible solution of optimal problem (6)-(8), and the value of objective function (6) satisfies

(1/2)‖μ̃‖ + C Σ_{i=1}^{l1+l2} ξ̃i = a ( (1/2)‖μ∗‖ + C Σ_{i=1}^{l1+l2} ξi∗ ) < (1/2)‖μ∗‖ + C Σ_{i=1}^{l1+l2} ξi∗.

Thus, (μ∗, α∗, ξ∗) is not an optimal solution of optimal problem (6)-(8), which contradicts the known condition. That is to say, the original assumption (D.2) does not hold; namely, there exists an i0 ∈ {i | yi = +1} such that

∫_S σ_{i0}^+ dμ∗ + α∗ = 1 − ξ_{i0}∗.

The proof of conclusion (1) is complete.

(2) Conclusion (2) can be obtained in a manner similar to the proof of conclusion (1). □

References

[1] S. Abe, Fusing sequential minimal optimization and Newton's method for support vector training, Int. J. Mach. Learn. Cybern. 7 (3) (2016) 345–364.
[2] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, T. Darrell, Face recognition with image sets using manifold density divergence, IEEE Conf. Comput. Vis. Pattern Recognit. 1 (2005) 581–588.
[3] S. Bang, J. Kang, M. Jhun, E. Kim, Hierarchically penalized support vector machine with grouped variables, Int. J. Mach. Learn. Cybern. 8 (4) (2017) 1211–1221.
[4] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth and Brooks/Cole Advanced Books and Software, Monterey, CA, 1984.
[5] D. Chen, Q. He, X. Wang, On linear separability of data sets in feature space, Neurocomputing 70 (13) (2007) 2441–2448.
[6] J. Chen, Q. Hu, X. Xue, M. Ha, L. Ma, Support function machine for set-based classification with application to water quality evaluation, Inf. Sci. 388 (2017) 48–61.
[7] J. Chen, W. Pedrycz, M. Ha, L. Ma, Set-valued samples based support vector regression and its applications, Expert Syst. Appl. 42 (5) (2015) 2502–2509.
[8] D. Cohn, Measure Theory, second ed., Birkhäuser, Boston, 2013.
[9] C. Cortes, V.N. Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273–297.
[10] S. Ding, X. Zhang, J. Yu, Twin support vector machines based on fruit fly optimization algorithm, Int. J. Mach. Learn. Cybern. 7 (2) (2016) 193–203.
[11] A. Esteva, B. Kuprel, R. Novoa, J. Ko, S. Swetter, H. Blau, S. Thrun, Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (7639) (2017) 115–118.
[12] R. Gardner, Geometric Tomography, second ed., Cambridge University Press, New York, 2006.
[13] M. Hayat, M. Bennamoun, S. An, Deep reconstruction models for image set classification, IEEE Trans. Pattern Anal. Mach. Intell. 37 (4) (2015) 713–727.
[14] C. Ho, C. Lin, Large-scale linear support vector regression, J. Mach. Learn. Res. 13 (2012) 3323–3348.
[15] M. Kaufmann, A. Meier, K. Stoffel, IFC-filter: membership function generation for inductive fuzzy classification, Expert Syst. Appl. 42 (21) (2015) 8369–8379.
[16] M. Korytkowski, L. Rutkowski, R. Scherer, Fast image classification by boosting fuzzy classifiers, Inf. Sci. 327 (2016) 175–182.
[17] C. Li, Y. Huang, H. Wu, Multiple recursive projection twin support vector machine for multi-class classification, Int. J. Mach. Learn. Cybern. 7 (5) (2016) 729–740.
[18] Y. Li, I. Tsang, J. Kwok, Z. Zhou, Convex and scalable weakly labeled SVMs, J. Mach. Learn. Res. 14 (2013) 2151–2188.
[19] E. Maggiori, Y. Tarabalka, G. Charpiat, P. Alliez, Convolutional neural networks for large-scale remote-sensing image classification, IEEE Trans. Geosci. Remote Sens. 55 (2) (2017) 645–657.
[20] O. Mangasarian, D. Musicant, Lagrangian support vector machines, J. Mach. Learn. Res. 1 (2001) 161–177.
[21] D. Nguyen, L. Ngo, L. Pham, W. Pedrycz, Towards hybrid clustering approach to data classification: multiple kernels based interval-valued fuzzy c-means algorithms, Fuzzy Sets Syst. 279 (2015) 17–39.
[22] K. Nödler, M. Tsakiri, M. Aloupi, G. Gatidou, A. Stasinakis, T. Licha, Evaluation of polar organic micropollutants as indicators for wastewater-related coastal water quality impairment, Environ. Pollut. 211 (2016) 282–290.
[23] X. Peng, L. Kong, D. Chen, A structural information-based twin-hypersphere support vector machine classifier, Int. J. Mach. Learn. Cybern. 8 (1) (2017) 295–308.
[24] J. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[25] W. Rudin, Real and Complex Analysis, third ed., McGraw-Hill Book Company, New York, 1987.
[26] L. Shao, Z. Cai, L. Liu, K. Lu, Performance evaluation of deep feature learning for RGB-D image/video classification, Inf. Sci. 385 (2017) 266–283.
[27] L. Shou, X. Zhang, G. Chen, Y. Gao, K. Chen, Mud: mapping-based query processing for high-dimensional uncertain data, Inf. Sci. 198 (2012) 147–168.
[28] H. Tan, Z. Ma, S. Zhang, Z. Zhan, B. Zhang, C. Zhang, Grassmann manifold for nearest points image set classification, Pattern Recognit. Lett. 68 (2015) 190–196.
[29] M. Tanveer, K. Shubham, A regularization on Lagrangian twin support vector regression, Int. J. Mach. Learn. Cybern. 8 (3) (2017) 807–821.
[30] M. Černý, J. Antoch, M. Hladík, On the possibilistic approach to linear regression models involving uncertain, indeterminate or interval data, Inf. Sci. 244 (2013) 26–47.
[31] J. Wang, J. Chiou, H. Müller, Functional data analysis, Annu. Rev. Stat. Appl. 3 (2016) 257–295.
[32] R. Wang, S. Shan, X. Chen, Q. Dai, W. Gao, Manifold-manifold distance and its application to face recognition with image sets, IEEE Trans. Image Process. 21 (10) (2012) 4466–4479.
[33] W. Wang, X. Liu, Fuzzy forecasting based on automatic clustering and axiomatic fuzzy set classification, Inf. Sci. 294 (2015) 78–94.
[34] M. Yang, X. Wang, W. Liu, L. Shen, Joint regularized nearest points for image set based face recognition, Image Vis. Comput. 58 (2017) 47–60.
[35] Y. Yang, T. Speed, Design issues for cDNA microarray experiments, Nat. Rev. Genet. 3 (8) (2002) 579–588.
[36] Z. Yang, H. Wu, C. Li, Least squares recursive projection twin support vector machine for multi-class classification, Int. J. Mach. Learn. Cybern. 7 (3) (2016) 411–426.
[37] J. Yoneyama, Robust sampled-data stabilization of uncertain fuzzy systems via input delay approach, Inf. Sci. 198 (2012) 169–176.
[38] P. Zhu, W. Zuo, L. Zhang, S. Shiu, D. Zhang, Image set based collaborative representation for face recognition, IEEE Trans. Inf. Forensics Secur. 9 (7) (2014) 1120–1132.