A novel method for constructing the optimal hierarchical structure based on fuzzy granular space



Journal Pre-proof

PII: S1568-4946(19)30743-4
DOI: https://doi.org/10.1016/j.asoc.2019.105962
Reference: ASOC 105962
To appear in: Applied Soft Computing Journal

Received date: 17 May 2018
Revised date: 23 October 2019
Accepted date: 27 November 2019

Please cite this article as: X.-Q. Tang, Y. Li, W.-W. Li et al., A novel method for constructing the optimal hierarchical structure based on fuzzy granular space, Applied Soft Computing Journal (2019), doi: https://doi.org/10.1016/j.asoc.2019.105962.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Elsevier B.V. All rights reserved.


Xu-Qing Tang, Yang Li, Wei-Wei Li, Wanqiang Shen
School of Science, Jiangnan University, Wuxi 214122, China

Corresponding author: Xu-Qing Tang (e-mail: [email protected])

Abstract: Granular computing serves as a general framework for solving complex problems in broad scopes and at various levels. Granularity has been constructed in many ways; however, for complex systems two challenges remain: determining a reasonable granularity and extracting the hierarchical information. In this paper, a new method is presented for constructing the optimal hierarchical structure based on fuzzy granular space. Firstly, the inter-class deviations and intra-class deviations are introduced, and their properties are investigated in depth and proved mathematically. Secondly, the fuzzy hierarchical evaluation index is developed, and a novel model for extracting the globally optimal hierarchical structure is established. An algorithm is then proposed, which reliably constructs the multi-level structure of a complex system. Finally, to reduce the complexity, the granular signatures are extracted according to the nearest-to-center principle; using these signatures, a classifier is designed to verify our method. The validity of the method is demonstrated by an application to the H1N1 influenza virus system. The theories and methodologies on granular computing presented here are helpful for capturing the structural information of complex systems, especially for data mining and knowledge discovery.

Index Terms: granular computing; fuzzy hierarchical evaluation index; optimization model; multi-level structure.

1 Introduction

Granular computing (GrC) serves as a general framework for solving complex problems in broad scopes and at various levels [1-3]. Its core covers ways of cognizing and investigating problems, together with the methods and information computing used to solve them, and it comprises the relevant theories, methodologies, techniques and tools. In recent years, GrC theory has become a hotspot in the field of artificial intelligence and knowledge discovery [4-6]. In particular, the quotient space theory has provided solutions to both heuristic search and path planning, and makes it possible to solve problems in a structural (or granular) way [1,7]. In GrC, however, there are always two challenges. One is how to determine the granular size, and the other is how to extract the structural information of complex systems.

Clustering, especially hierarchical clustering, is an effective way both to generate granules and to extract structural information of complex systems [8-11]. Hartmann et al. [12] proposed a supervised hierarchical clustering in fuzzy model identification by using a hierarchical tree. Pedrycz and Song [13] gave an analytic hierarchy process based on information granularity. Zhou et al. [14] explored human motion by using hierarchical cluster analysis. These studies on GrC provided a new way of handling big data based on its structural information [15]. In hierarchical clustering, the challenge of determining the granular size can be converted into defining the number of clusters. Many effective indexes have been put forward to study this problem [16, 17]. Yu et al. [18] presented an automatic method that uses a function for evaluating clustering validity to obtain the right number of clusters. Kim et al. [19, 20] proposed a method for determining the optimal clustering based on the clustering structure.

Tang et al. [21, 22] introduced the granular space for describing hierarchical structural information by using algebraic topology based on the fuzzy quotient space theory [1], together with an optimized clustering method based on the sum of the inter-class deviations and intra-class deviations. They also applied the hierarchical clustering method to explore the intrinsic differences among breast tumor subtypes and to reveal the heterogeneity of breast tumor subtypes [23, 24]. Cabrerizo et al. [25, 26] discussed consensus and consistency in group decision making based on information granules. These studies help us to develop new methods for determining the granular size and obtaining the optimal clustering.

In this paper, with strict mathematical reasoning, our goal is to develop an optimization model for determining the granular size and extracting the structural information of complex systems based on the granular space theory [21, 22]. This paper is organized as follows. In Section 2, some preliminaries are given. In Section 3, a new index and an optimization model are proposed for obtaining the optimal clustering and constructing the hierarchical structure of complex systems, with the corresponding algorithm given. In Section 4, a classifier is designed to verify our method and model, and a data experiment is given for the worldwide H1N1 influenza viruses from 1902 to 2019. The final conclusions are drawn in Section 5.

2 Preliminaries

In this section, some basic concepts are introduced.

Definition 1 [22]: A fuzzy relation $R$ on a universe $X$ is a fuzzy proximity (FP) relation if it satisfies:
(1) $\forall x \in X$, $R(x, x) = 1$;
(2) $\forall x_i, x_j \in X$, $R(x_i, x_j) = R(x_j, x_i)$. █

$FP(X)$ stands for the set of FP relations on the universe $X$. Furthermore, if $R$ is an FP relation on the universe $X$ and satisfies the separable condition ($\forall x, y \in X$, $R(x, y) = 1 \Leftrightarrow x = y$), then $R$ is called a separable FP relation (or SFP relation), and the set of SFP relations on $X$ is denoted by $SFP(X)$. In this paper, the FP relation is considered separable.

In Ref. [22], the granular space of FP (or SFP) relations on the universe $X$ was introduced and its properties were studied. Given $R \in FP(X)$ (or $R \in SFP(X)$) and $\lambda \in [0,1]$, we define a relation

$$R_\lambda: \; (x, y) \in R_\lambda \Leftrightarrow R(x, y) \geq \lambda,$$

where $R_\lambda$ is a crisp proximity relation that satisfies reflexivity and symmetry. For each $x \in X$, the equivalence class of the transitive closure $tr(R_\lambda)$ corresponding to $x$ is denoted by $[x]_\lambda$. Let $X(\lambda) = \{[x]_\lambda \mid x \in X\}$; then $X(\lambda)$ is called the granularity corresponding to $\lambda$ derived from $R$. The set $\{X(\lambda) \mid \lambda \in [0,1]\}$ represents the fuzzy granular space on $X$ derived from $R$, denoted by $T_R(X)$.
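The construction of a granularity from a proximity relation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the relation is given as a symmetric matrix `R` over item indices, writes the threshold as `lam`, and obtains the classes of the transitive closure as connected components via a small union-find helper. The function name `granularity` and the toy matrix are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def granularity(R: Sequence[Sequence[float]], lam: float) -> List[Set[int]]:
    """Return the classes of the transitive closure of the cut relation.

    R is assumed to be a reflexive, symmetric similarity matrix over items
    0..n-1. Items i and j are linked when R[i][j] >= lam, and the classes
    are the connected components of the resulting link graph.
    """
    n = len(R)
    parent = list(range(n))  # union-find forest (illustrative helper)

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if R[i][j] >= lam:
                parent[find(i)] = find(j)

    classes: Dict[int, Set[int]] = {}
    for i in range(n):
        classes.setdefault(find(i), set()).add(i)
    return sorted(classes.values(), key=min)


# Toy relation on four items (made-up values, for illustration only).
R = [
    [1.0, 0.9, 0.4, 0.2],
    [0.9, 1.0, 0.6, 0.2],
    [0.4, 0.6, 1.0, 0.3],
    [0.2, 0.2, 0.3, 1.0],
]

for lam in (1.0, 0.9, 0.6, 0.3):
    print(lam, granularity(R, lam))
```

Lowering the threshold can only merge classes, so the granularities printed for decreasing `lam` form a chain from the finest partition to the coarsest one.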

Definition 2 [22]: Suppose that $X(\lambda_1)$ and $X(\lambda_2)$ are two granularities on $X$.
(1) If $\forall x \in X$, $[x]_{\lambda_1} \subseteq [x]_{\lambda_2}$, then the granularity $X(\lambda_2)$ is not finer than $X(\lambda_1)$, denoted by $X(\lambda_2) \leq X(\lambda_1)$.
(2) If $X(\lambda_2) \leq X(\lambda_1)$ and there exists $x_0 \in X$ such that $[x_0]_{\lambda_1} \subset [x_0]_{\lambda_2}$, then $X(\lambda_1)$ is finer than $X(\lambda_2)$, denoted by $X(\lambda_2) < X(\lambda_1)$. █

Lemma 1 [22]: If $R \in FP(X)$ (or $R \in SFP(X)$), then the derived fuzzy granular space $T_R(X)$ is an ordered set, which satisfies $\forall \lambda_1, \lambda_2 \in [0,1]$, $\lambda_1 \leq \lambda_2 \Rightarrow X(\lambda_1) \leq X(\lambda_2)$. In particular, if $\lambda_1 < \lambda_2$ and $X(\lambda_1) \neq X(\lambda_2)$, then $X(\lambda_1) < X(\lambda_2)$. █

Remark 1: The granular space derived from an SFP relation on the universe $X$ contains its finest granulation (i.e., $X(1) = \{\{x\} \mid x \in X\}$) [22]. In fact, an FP relation on the universe $X$ can be transformed into an SFP relation on $X$ by simplifying the system $X$ so that it meets the separable condition. In Lemma 1, the order relation among the granularities can be extended to the granular level and described by the notion of hierarchy. Mathematically, a hierarchy may be viewed as a partially ordered set [1], and the granules on a specific level are the composition units of those on the lower level, which corresponds to its coarser grain. █

Remark 2: If $R$ is an SFP relation on the finite set $X = \{x_1, x_2, ..., x_n\}$ and $T_R(X)$ is its granular space, then $|T_R(X)| \leq n$, where $|\cdot|$ denotes the number of elements of a set. According to Lemma 1, $\forall X(\lambda_1), X(\lambda_2) \in T_R(X)$, $\lambda_1 < \lambda_2 \Rightarrow X(\lambda_1) < X(\lambda_2)$. █



Let $R$ be an FP relation on a finite universe $X = \{x_1, x_2, ..., x_n\}$, where $X$ is a subset of a $K$-dimensional space, and let $T_R(X)$ be the derived granular space. The granularity at $\lambda$ is denoted by $X(\lambda) = \{a_1, a_2, ..., a_c\}$, where $a_i = \{x_{i1}, x_{i2}, ..., x_{iJ_i}\}$ satisfies $|a_i| = J_i$ and $\sum_{i=1}^{c} J_i = n$. The center of granule $a_i$ is $\bar{a}_i = \frac{1}{J_i} \sum_{k=1}^{J_i} x_{ik}$ ($i = 1, 2, ..., c$), and the center of $X$ is $\bar{a} = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{J_i} x_{ik}$.

Two indexes are introduced to measure the difference within each class and the deviation among the classes on the granulation $X(\lambda)$: the intra-class deviation $S_{intra}(X(\lambda))$ and the inter-class deviation $S_{inter}(X(\lambda))$. They are defined as follows:

$$S_{inter}(X(\lambda)) = \frac{1}{n} \sum_{i=1}^{c} J_i \left\| \bar{a}_i - \bar{a} \right\|_2^2, \qquad S_{intra}(X(\lambda)) = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{J_i} \left\| x_{ik} - \bar{a}_i \right\|_2^2,$$

where $\|\cdot\|_2$ stands for the 2-norm on the $K$-dimensional space.

According to the principles of statistics, the following lemma is obtained directly [27, 28].

Lemma 2: Assume that $X$ is a finite set, $R \in FP(X)$, and $T_R(X)$ is its derived fuzzy granular space. For $\lambda \in [0,1]$, the total deviation on $X(\lambda)$ ($\in T_R(X)$) is as follows:

$$S(X(\lambda)) = S_{inter}(X(\lambda)) + S_{intra}(X(\lambda)) = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \bar{a} \right\|_2^2.$$

In particular, when $\lambda = 1$: $S_{inter}(X(1)) = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \bar{a} \|_2^2$ and $S_{intra}(X(1)) = 0$; when $\lambda = 0$: $S_{inter}(X(0)) = 0$ and $S_{intra}(X(0)) = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \bar{a} \|_2^2$. █
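The two deviations and the constant-sum property of Lemma 2 can be checked numerically. The sketch below is illustrative only: the function name `deviations`, the helper names, and the four toy points are assumptions made for this example, and a partition is passed directly as a list of classes rather than derived from a proximity relation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean(points: Sequence[Vector]) -> List[float]:
    """Component-wise mean (the center) of a non-empty list of vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean (2-norm) distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def deviations(classes: Sequence[Sequence[Vector]]) -> Tuple[float, float]:
    """Return (S_inter, S_intra) for a partition given as lists of vectors."""
    all_points = [x for cls in classes for x in cls]
    n = len(all_points)
    center = mean(all_points)
    s_inter = sum(len(cls) * sq_dist(mean(cls), center) for cls in classes) / n
    s_intra = sum(sq_dist(x, mean(cls)) for cls in classes for x in cls) / n
    return s_inter, s_intra


# Toy data: four points forming a 4 x 2 rectangle (made-up values).
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
finest = [[p] for p in points]          # every point is its own class
middle = [points[:2], points[2:]]       # left pair and right pair
coarsest = [points]                     # a single class

for name, part in (("finest", finest), ("middle", middle), ("coarsest", coarsest)):
    s_inter, s_intra = deviations(part)
    print(name, s_inter, s_intra, s_inter + s_intra)
```

For these points the sum of the two deviations is the same for all three partitions, while the inter-class part shrinks and the intra-class part grows as classes are merged.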

3 The optimal structural clustering based on granular space

In this section, we propose a new index for extracting the optimal hierarchical structure of complex systems based on the fuzzy granular space, then construct the optimization model and present an algorithm to obtain the optimal hierarchical structure.

Theorem 1: Assume that $X$ is a finite set, $R \in FP(X)$ (or $R \in SFP(X)$), and $T_R(X)$ is its derived fuzzy granular space. Then $S_{inter}(X(\lambda))$ is monotonically increasing in $\lambda$; that is, $\forall \lambda_1, \lambda_2 \in [0,1]$, $\lambda_1 \leq \lambda_2 \Rightarrow S_{inter}(X(\lambda_1)) \leq S_{inter}(X(\lambda_2))$. In particular, when $\lambda_1 < \lambda_2$ and $X(\lambda_1) \neq X(\lambda_2)$, then $S_{inter}(X(\lambda_1)) < S_{inter}(X(\lambda_2))$.

Proof: Given an FP (or SFP) relation $R$ on a finite universe $X = \{x_1, x_2, ..., x_n\}$, the granular space $T_R(X)$ is ordered according to Lemma 1; that is, $\lambda_2 \leq \lambda_1 \Rightarrow X(\lambda_2) \leq X(\lambda_1)$ and $c_2 \leq c_1$, where $c_1 = |X(\lambda_1)|$ and $c_2 = |X(\lambda_2)|$. Let $X(\lambda_1) = \{a_1, a_2, ..., a_{c_1}\}$ and $X(\lambda_2) = \{b_1, b_2, ..., b_{c_2}\}$ be the granulations corresponding to $\lambda_1$ and $\lambda_2$, respectively. The inter-class deviations of $X(\lambda_1)$ and $X(\lambda_2)$ are as follows:

$$S_{inter}(X(\lambda_1)) = \frac{1}{n} \sum_{i=1}^{c_1} J_i \left\| \bar{a}_i - \bar{a} \right\|_2^2, \qquad S_{inter}(X(\lambda_2)) = \frac{1}{n} \sum_{j=1}^{c_2} J'_j \left\| \bar{b}_j - \bar{b} \right\|_2^2,$$

where $\bar{a} = \bar{b}$, $J_i = |a_i|$ ($i = 1, 2, ..., c_1$), and $J'_j = |b_j|$ ($j = 1, 2, ..., c_2$). If $X(\lambda_1) = X(\lambda_2)$, it is obvious that $S_{inter}(X(\lambda_1)) = S_{inter}(X(\lambda_2))$. When $X(\lambda_1) \neq X(\lambda_2)$, we have $X(\lambda_2) < X(\lambda_1)$ and $c_2 < c_1$ according to Lemma 1. The theorem can then be proved as follows.

(1) When $c_1 - c_2 = m$ ($m \geq 1$), suppose that $m + 1$ classes in $X(\lambda_1)$ are combined into one class in $X(\lambda_2)$, and the remaining classes in $X(\lambda_1)$ stay the same as in $X(\lambda_2)$. Suppose that $a_i, a_j \in X(\lambda_1)$ satisfy $R(a_i, a_j) = \max\{R(x, y) \mid x \in a_i, y \in a_j\} \geq \lambda_2$ ($a_1, a_2, ..., a_{m+1} \in X(\lambda_1)$) [22]; then $a_i$ and $a_j$ can be combined into one class $b_s \in X(\lambda_2)$, where $J'_s = \sum_{i=1}^{m+1} J_i$ and $\bar{b}_s = \frac{1}{J'_s} \sum_{i=1}^{m+1} J_i \bar{a}_i$. According to Lemma 2, the difference $\Delta S_{inter}$ between the inter-class deviations of the two granularities is

$$\Delta S_{inter} = S_{inter}(X(\lambda_1)) - S_{inter}(X(\lambda_2)) = \frac{1}{n} \left[ \sum_{i=1}^{m+1} J_i \left\| \bar{a}_i - \bar{a} \right\|_2^2 - J'_s \left\| \bar{b}_s - \bar{b} \right\|_2^2 \right] = \frac{1}{n} \sum_{i=1}^{m+1} J_i \left\| \bar{a}_i - \bar{b}_s \right\|_2^2.$$

Because $\bar{a}_1, \bar{a}_2, ..., \bar{a}_{m+1}$ are not all equal, $\Delta S_{inter} > 0$, and $S_{inter}(X(\lambda_1)) > S_{inter}(X(\lambda_2))$.

(2) More generally, when $c_1 = c_2 + m$ ($m \geq 1$), some classes in $X(\lambda_1)$ are merged into several classes in $X(\lambda_2)$, while the remaining classes in $X(\lambda_1)$ stay the same as in $X(\lambda_2)$. Suppose that $\{a_{11}, a_{12}, ..., a_{1m_1}\}, ..., \{a_{p1}, a_{p2}, ..., a_{pm_p}\}$ in $X(\lambda_1)$ are combined into $b_1, ..., b_p$ in $X(\lambda_2)$, respectively. Let $\bar{a}_{ij}$ and $\bar{b}_i$ denote the centers of the classes $a_{ij}$ and $b_i$, respectively, where

$$|a_{ij}| = m_{ij}, \quad |b_i| = n_i, \quad n_i = \sum_{j=1}^{m_i} m_{ij}, \quad \bar{b}_i = \frac{1}{n_i} \sum_{j=1}^{m_i} m_{ij} \bar{a}_{ij} \quad (i = 1, 2, ..., p, \; j = 1, 2, ..., m_i).$$

The problem is thus reduced to repeated application of case (1), and the difference $\Delta S_{inter}$ between the inter-class deviations of the two granularities is

$$\Delta S_{inter} = S_{inter}(X(\lambda_1)) - S_{inter}(X(\lambda_2)) = \frac{1}{n} \sum_{i=1}^{p} \left[ \sum_{j=1}^{m_i} m_{ij} \left\| \bar{a}_{ij} - \bar{a} \right\|_2^2 - n_i \left\| \bar{b}_i - \bar{b} \right\|_2^2 \right] = \frac{1}{n} \sum_{i=1}^{p} \sum_{j=1}^{m_i} m_{ij} \left\| \bar{a}_{ij} - \bar{b}_i \right\|_2^2.$$

Because the $\bar{a}_{ij}$ ($i = 1, 2, ..., p$, $j = 1, 2, ..., m_i$) are not mutually equal, it follows that $\Delta S_{inter} > 0$, and $S_{inter}(X(\lambda_1)) > S_{inter}(X(\lambda_2))$.

Summarizing (1) and (2), Theorem 1 is proved. █

According to Lemma 2 and Theorem 1, we directly obtain Theorem 2 below.



Theorem 2: Assume that $X$ is a finite set, $R \in FP(X)$ (or $R \in SFP(X)$), and $T_R(X)$ is its derived fuzzy granular space. Then $S_{intra}(X(\lambda))$ is monotonically decreasing in $\lambda$; namely, $\forall \lambda_1, \lambda_2 \in [0,1]$,

$$\lambda_1 \leq \lambda_2 \Rightarrow S_{intra}(X(\lambda_1)) \geq S_{intra}(X(\lambda_2)).$$

In particular, if $\lambda_1 < \lambda_2$ and $X(\lambda_1) \neq X(\lambda_2)$, then

$$S_{intra}(X(\lambda_1)) > S_{intra}(X(\lambda_2)).$$ █



Remark 3: Theorems 1 and 2 and Lemma 2 state clearly how the intra-class deviation $S_{intra}(X(\lambda))$ and the inter-class deviation $S_{inter}(X(\lambda))$ change with the granularity $X(\lambda)$. As the granularity changes from coarse to fine (i.e., as $\lambda$ increases), the intra-class deviation $S_{intra}(X(\lambda))$ is monotonically decreasing and the inter-class deviation $S_{inter}(X(\lambda))$ is monotonically increasing, but their sum is always constant. █

From the perspective of classification, any reasonable partition should reflect the greatest classifying capacity and satisfy the condition [29]

$$S_{inter}(X(\lambda)) \geq S_{intra}(X(\lambda)). \qquad (1)$$

Condition (1) means that the inter-class deviation $S_{inter}(X(\lambda))$ has a larger effect than the intra-class deviation $S_{intra}(X(\lambda))$ in classification problems. Therefore, we introduce a parameter $\alpha$ to measure their relative impacts, and a new evaluation index, termed the fuzzy hierarchical evaluation index (FHEI), is derived based on Theorems 1 and 2 and Lemma 2:

$$FHEI(X(\lambda), \alpha) = \left| \alpha S_{inter}(X(\lambda)) - S_{intra}(X(\lambda)) \right|,$$

where $\alpha \geq 1$ according to Condition (1). We establish an optimization model to determine the reasonable hierarchical clustering, requiring that $FHEI(X(\lambda), \alpha)$ reach its minimum on the granular space $T_R(X)$. The optimization model is as follows:

$$\min_{X(\lambda) \in T_R(X)} FHEI(X(\lambda), \alpha). \qquad (2)$$

Referring to Theorems 1 and 2 and Lemma 2, we can directly obtain the following theorem.

Theorem 3: Assume that $X$ is a finite set, $R \in FP(X)$ (or $R \in SFP(X)$), and $T_R(X)$ is the fuzzy granular space of $R$. Then there exists a unique $\lambda_0 \in [0,1]$ such that $X(\lambda_0)$ is the optimal solution of Model (2), namely,

$$X(\lambda_0) = \arg \min_{X(\lambda) \in T_R(X)} FHEI(X(\lambda), \alpha).$$ █

Fig. 1. Trends of the intra-class deviation, inter-class deviation, and fuzzy hierarchical evaluation index with respect to $\lambda$.

Theorem 3 states that Model (2) is a global optimization model on the finite set $X$. Based on Theorems 1, 2 and 3 and Lemma 2, the changes of the intra-class deviation, inter-class deviation, and hierarchical evaluation index as functions of the granularity are illustrated in Fig. 1. It is obvious that the larger $\alpha$ is, the smaller the number of classes is.

Given an FP relation (or an SFP relation) $R$ on the finite set $X = \{x_1, x_2, ..., x_n\}$, a granularity of $R$ can be denoted by $X(\lambda) = \{a_1, a_2, ..., a_c\}$, where $a_i = \{x_{i1}, x_{i2}, ..., x_{iJ_i}\}$ ($i = 1, 2, ..., c$). We define the similarity between the classes $a_i$ and $a_j$ as

$$R(a_i, a_j) = \frac{1}{|a_i| \cdot |a_j|} \sum_{x_{ik} \in a_i, \; x_{jl} \in a_j} R(x_{ik}, x_{jl}). \qquad (3)$$

According to Theorems 1, 2 and 3, we can design an algorithm (named Algorithm A) to obtain the optimal clustering based on the hierarchical structure (granular space) of complex systems [22]. In Algorithm A, the hierarchical structure (granular space) is obtained via Steps 3-7 by applying the average similarity. Algorithm A provides a way of deriving both the optimal structural clustering and the reasonable granularity, and its computational complexity is $O(n^2 K)$, where $K$ is the dimension of the vectors in $X$.

When dealing with concrete problems, the complexity is decomposed hierarchically, consistent with the core idea of GrC. Given an FP (or an SFP) relation on the finite set $X$, the optimal clustering of $X$ extracted by Algorithm A is called the first level structure of $X$. However, considering the structural complexity of large systems, it is still difficult to explore the intrinsic relationships of the system at the first level structure. If Algorithm A is further applied to each of the classes of the first level structure, then every class is divided into several sub-classes. The set composed of all the sub-classes is called the second level structure of $X$. Therefore, Algorithm A can be applied repeatedly to construct the multi-level structure of $X$ in practical applications.

Algorithm A
Step 1. Input $\alpha$.
Step 2. $i = 0$, $\lambda_i = 1$, $X(\lambda_i) = C = \{a_1, a_2, ..., a_N\}$ ($1 \leq N \leq n$), $s_0 = S_{inter}(C)$, $s_1 = S_{intra}(C)$, $S_0 = |\alpha s_0 - s_1|$.
Step 3. $i \leftarrow i + 1$, $A \leftarrow C$, $C \leftarrow \emptyset$, $\lambda_i = \max_{a_j \in A, \, a_l \in A, \, l \neq j} R(a_j, a_l)$.
Step 4. $B \leftarrow \emptyset$.
Step 5. Take $a_j \in A$, $B \leftarrow B \cup a_j$, $A \leftarrow A \setminus \{a_j\}$.
Step 6. For any $a_k \in A$, if $R(a_j, a_k) \geq \lambda_i$, then $B \leftarrow B \cup a_k$, $A \leftarrow A \setminus \{a_k\}$; then $C \leftarrow \{B\} \cup C$.
Step 7. If $A \neq \emptyset$, then go to Step 4; otherwise, $X(\lambda_i) = C$.
Step 8. If $X(\lambda_i) \neq X(\lambda_{i-1})$, calculate $s_0 = S_{inter}(C)$, $s_1 = S_{intra}(C)$, $S = |\alpha s_0 - s_1|$.
Step 9. If $S < S_0$, then $S_0 \leftarrow S$, go to Step 3.
Step 10. Output $\lambda_{i-1}$, $X(\lambda_{i-1})$ and $S_0$.
Step 11. End.
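The merge-and-evaluate loop of Algorithm A can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' code: it assumes the evaluation index is the absolute difference `abs(alpha * S_inter - S_intra)`, starts from singleton classes, merges classes whose average similarity to a seed class reaches the current maximum, and stops as soon as the index no longer decreases. The names `algorithm_a`, `fhei`, `avg_sim`, the toy points, and the similarity `1 / (1 + distance)` are all hypothetical choices for this example.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(pts: Sequence[Point]) -> List[float]:
    return [sum(c) / len(pts) for c in zip(*pts)]


def sq_dist(u: Point, v: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def fhei(classes: List[List[int]], points: Sequence[Point], alpha: float) -> float:
    """Assumed index |alpha * S_inter - S_intra| for classes of point indices."""
    n = len(points)
    center = centroid(points)
    s_inter = s_intra = 0.0
    for cls in classes:
        c = centroid([points[i] for i in cls])
        s_inter += len(cls) * sq_dist(c, center)
        s_intra += sum(sq_dist(points[i], c) for i in cls)
    return abs(alpha * s_inter - s_intra) / n


def avg_sim(a: List[int], b: List[int], R: Sequence[Sequence[float]]) -> float:
    """Average pairwise similarity between two classes."""
    return sum(R[i][j] for i in a for j in b) / (len(a) * len(b))


def algorithm_a(points, R, alpha) -> Tuple[float, List[List[int]], float]:
    classes = [[i] for i in range(len(points))]   # finest granularity
    lam, best = 1.0, fhei(classes, points, alpha)
    while len(classes) > 1:
        # Threshold for this round: largest average similarity between classes.
        new_lam = max(avg_sim(a, b, R) for a, b in combinations(classes, 2))
        remaining, merged = list(classes), []
        while remaining:                          # build merged classes one seed at a time
            seed = remaining.pop(0)
            block, keep = list(seed), []
            for other in remaining:
                if avg_sim(seed, other, R) >= new_lam:
                    block.extend(other)
                else:
                    keep.append(other)
            remaining = keep
            merged.append(block)
        score = fhei(merged, points, alpha)
        if score >= best:                         # index stopped decreasing: keep previous level
            break
        classes, lam, best = merged, new_lam, score
    return lam, classes, best


# Toy data (made-up): four points and a distance-based similarity.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
R = [[1.0 / (1.0 + sq_dist(p, q) ** 0.5) for q in points] for p in points]

print(algorithm_a(points, R, alpha=1.0))
print(algorithm_a(points, R, alpha=4.0))
```

On this toy input a weight of 1 stops at two classes, while a weight of 4 continues to a single class, which matches the qualitative statement that a larger weight yields fewer classes.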

4 Experiment and analysis

In this section, we employ the proposed model to construct the multi-level optimal structure of the H1N1 influenza virus system, which consists of several types of protein fragments.

4.1 Data Sources

The influenza virus has eight linear negative-strand RNA fragments, which encode 10 viral proteins including PB1, PB2, PA, HA, NP, NA, M1, M2, NS1, and NS2. Most of them are structural proteins except NS1 and NS2. Notably, HA and NA play direct and important roles in H1N1 influenza virus outbreaks [30]. In order to validate the optimization model and establish a reasonable hierarchical structure of the H1N1 influenza virus system, 8983 HA protein sequences and 8198 NA protein sequences collected worldwide from 1902 to 2019 were downloaded from the Protein Sequence in Molecular Databases of NCBI (https://www.ncbi.nlm.nih.gov/). Additional information on the H1N1 influenza viruses, such as viral host, occurrence time and location, is also included in the data. For 3573 influenza viruses both the HA and the NA protein sequences are available; these viruses form the virus dataset used below. In the following experiment, we test and validate Algorithm A on this virus dataset.

4.2 Feature information representation of a protein sequence

Proteins are made up of 20 amino acids. These acids are divided into four types according to their physicochemical properties, namely, polar and hydrophilic ($pq$), polar and hydrophobic ($pr$), non-polar and hydrophilic ($sq$), and non-polar and hydrophobic ($sr$). The four types of amino acids are denoted by $pq = \{G\}$, $pr = \{A, V, L, I, F, P\}$, $sq = \{S, C, N, E, T, D, K, R, H\}$, and $sr = \{W, Y, M\}$. Considering the adjacency statistical information, a 16-dimensional feature vector is constructed by calculating the frequency of each ordered pair of adjacent types. We compared the 16-dimensional feature vector [31] with the 40-dimensional vector [32, 33] of the H1N1 influenza virus; the results showed that the 16-dimensional feature vector not only saves computing time but is also much more effective in analyzing the nature of the H1N1 influenza virus. Therefore, the HA and NA protein sequences are characterized by their 16-dimensional feature vectors. An H1N1 influenza virus $x_i$ can be represented by a 32-dimensional feature vector that combines the two 16-dimensional feature vectors of its HA and NA proteins.

$\forall x_i, x_j \in X$, the similarity between $x_i$ and $x_j$ is defined as

$$R(x_i, x_j) = \frac{(x_i, x_j)}{\sqrt{(x_i, x_i) \cdot (x_j, x_j)}},$$

where $(x_i, x_j) = \sum_{k=1}^{32} x_{ik} \cdot x_{jk}$ stands for the inner product in the 32-dimensional space. It is obvious that $R$ is an SFP relation according to Definition 1.
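The feature construction and similarity described above can be sketched as follows. This is an illustrative reading, not the authors' pipeline: it assumes the 16 features are the relative frequencies of ordered pairs of adjacent residue types, skips any adjacent pair that contains a residue outside the four listed groups, and uses the normalized inner product (cosine) as the similarity. The function names and the two short toy sequences are hypothetical.

```python
from itertools import product
from math import sqrt
from typing import List, Sequence

# Residue groups as listed in the text (one-letter amino acid codes).
GROUPS = {
    "pq": "G",
    "pr": "AVLIFP",
    "sq": "SCNETDKRH",
    "sr": "WYM",
}
TYPE_OF = {aa: name for name, letters in GROUPS.items() for aa in letters}
PAIRS = list(product(GROUPS, repeat=2))  # 16 ordered pairs of types


def adjacency_features(seq: str) -> List[float]:
    """16-dimensional vector of adjacent type-pair frequencies (assumed form)."""
    counts = dict.fromkeys(PAIRS, 0)
    for a, b in zip(seq, seq[1:]):
        # Assumption: pairs with a residue outside the four groups are ignored.
        if a in TYPE_OF and b in TYPE_OF:
            counts[(TYPE_OF[a], TYPE_OF[b])] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[p] / total for p in PAIRS]


def virus_vector(ha_seq: str, na_seq: str) -> List[float]:
    """32-dimensional vector: HA features followed by NA features."""
    return adjacency_features(ha_seq) + adjacency_features(na_seq)


def similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product normalized by the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


# Toy sequences (made-up, far shorter than real HA/NA proteins).
x = virus_vector("GAVSW", "MKTGA")
y = virus_vector("MKTGA", "GAVSW")
print(len(x), similarity(x, x), similarity(x, y))
```

Because the vectors are normalized inside `similarity`, the relation is reflexive and symmetric by construction, which is what the proximity definition requires.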

4.3 Granular signature and validation

To facilitate the validation of our model, we introduce two concepts for evaluation: the granular signature and the accuracy ratio.

(1) Selection of granular signature

Once the optimal granularity (or hierarchical structure) of a complex system $X$ is determined, constructing information granules is crucial for abstracting the original data. Generally, the granules are obtained on the principle that samples with the same features assemble in a granule (i.e., a class). The average of all samples in a class, namely the center of the class, effectively represents the core information of the class. To begin with, consider a multi-level structure (or granularity) $X^* = \{a_1, a_2, ..., a_J\}$, where $J = |X^*|$. Feature information (or a signature) can be extracted to approximately represent the corresponding class, so that the complexity of the system is reduced and its structure can be analyzed. According to the nearest-to-center principle [34], an objective function for selecting the signature is established as follows:

$$p_i = \arg \max_{1 \leq k \leq J_i} \{R(x_{ik}, \bar{a}_i)\}, \qquad (4)$$

where $i = 1, 2, ..., |X^*|$, and $p_i$ is called a signature of the corresponding sub-class $a_i$, or a granular signature. The set $P = \{p_1, p_2, ..., p_J\}$ is called the signature set of the granularity $X^*$. To some degree, the granular signature set $P$ can be used to approximately represent the complex system $X$, and the complexity of the system is thus reduced.
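Signature selection by the nearest-to-center principle can be sketched as below. The example is a minimal illustration under assumptions: the similarity is taken to be the normalized inner product used for the samples, the class center is the component-wise mean, and the function name `select_signature` and the toy class are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def similarity(x: Vector, y: Vector) -> float:
    """Normalized inner product between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def center(cls: Sequence[Vector]) -> List[float]:
    """Component-wise mean of the samples in a class."""
    return [sum(c) / len(cls) for c in zip(*cls)]


def select_signature(cls: Sequence[Vector]) -> int:
    """Index of the sample most similar to the class center."""
    c = center(cls)
    return max(range(len(cls)), key=lambda k: similarity(cls[k], c))


# Toy class with three samples (made-up values).
cls = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(select_signature(cls))
```

The function returns an index rather than the vector itself so that the chosen sample can be traced back to its record (for instance, a virus name).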

(2) Validation of granular signature

To validate the selected granular signature set $P$, a classifier can be designed according to the principle of maximum similarity. The classifier based on $P$ is

$$p_{i_0} = \arg \max_{i} R(q, p_i), \qquad (5)$$

where $q \in X \setminus P$ and $p_i \in P$, $i = 1, 2, ..., |P|$.

Model (5) states that the remaining samples in $X \setminus P$ are assigned to $|P|$ classes according to their maximum similarity with the elements of $P$. For any $q \in X \setminus P$, if it satisfies (5), then the sample $q$ is placed in the class denoted by $b_{i_0}$. All samples in $X \setminus P$ are thus divided into $|P|$ classes, denoted by $b_k$, $k = 1, 2, ..., |P|$. The accuracy ratio $r$ ($r \in [0,1]$) is introduced to measure how well the granular signature set replaces the multi-level structure (or granularity) $X^*$. It is defined as

$$r = \frac{\sum_{k=1}^{|P|} |a_k \cap b_k|}{|X \setminus P|}. \qquad (6)$$

In general, when $|P|$ is fixed, the larger the value of $r$, the better the result. In particular, it is obvious that $r = 100\%$ when $|P| = 1$ or $|P| = n$. However, these two cases are excluded, since an approximate structure requires a proper number of signatures, which is larger than 1 but smaller than $n$. Therefore, in practice a proper value of $\alpha$ is to be found such that $|P|$ is large enough yet within a range convenient for further analysis, and the accuracy ratio $r$ is as large as possible.
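The maximum-similarity classifier and the accuracy ratio can be sketched together. This is an illustrative sketch, not the authors' code: it assumes the normalized inner product as similarity, takes the signature of each class as a given index into that class, and counts a non-signature sample as correct when the classifier returns its original class. The names `classify` and `accuracy_ratio` and the toy data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def similarity(x: Vector, y: Vector) -> float:
    """Normalized inner product between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def classify(q: Vector, signatures: Sequence[Vector]) -> int:
    """Index of the signature with maximum similarity to q."""
    return max(range(len(signatures)), key=lambda i: similarity(q, signatures[i]))


def accuracy_ratio(classes: Sequence[Sequence[Vector]], sig_idx: Sequence[int]) -> float:
    """Share of non-signature samples assigned back to their own class.

    classes[k] is the k-th class and sig_idx[k] the position of its
    signature inside that class.
    """
    signatures = [cls[s] for cls, s in zip(classes, sig_idx)]
    correct = total = 0
    for k, cls in enumerate(classes):
        for j, q in enumerate(cls):
            if j == sig_idx[k]:
                continue  # signatures themselves are not classified
            total += 1
            correct += classify(q, signatures) == k
    # Convention borrowed from the text: with no sample left to classify, r is 100%.
    return correct / total if total else 1.0


# Toy structure with two classes (made-up values); the first sample of each
# class is used as its signature.
classes = [
    [(1.0, 0.0), (0.9, 0.1), (0.6, 0.5), (0.4, 0.6)],
    [(0.0, 1.0), (0.1, 0.9)],
]
print(accuracy_ratio(classes, sig_idx=[0, 0]))
```

In this toy case the sample `(0.4, 0.6)` is closer to the second signature than to its own, so three of the four non-signature samples are recovered.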

4.4 Experiment Results and Analysis

For different values of $\alpha$, the first and the second level structures are constructed by employing Algorithm A, and the corresponding values of $r$ are obtained. Results are shown in Table 1. Table 1 shows that $r$ reaches its maximum when $\alpha = 1.5$. Therefore, we select $\alpha = 1.5$ in the following discussions. Additionally, we compare the accuracy with that of the K-means method on the second level structure at $\alpha = 1.5$. The result is 95.83% vs. 84.84%, indicating that our method achieves higher accuracy than K-means.

Table 1: $\alpha$, number of classes (NC) and accuracy $r$ on the multi-level structure with Algorithm A

$\alpha$    First level structure            Second level structure
            Number of classes    r           Number of classes    r
1           13                   87.89%      157                  92.07%
1.5         7                    91.84%      97                   95.83%
2           6                    88.93%      80                   69.85%

When $\alpha = 1.5$, the first level structure is determined by employing Algorithm A. Results show that the first level structure includes 7 classes, denoted by $a_i^*$, $i = 1, 2, ..., 7$. Using Model (4), the corresponding signature (or representative) viruses are obtained, and the results are shown in Table 2. Furthermore, the second level structure of the influenza virus system contains 97 sub-classes, denoted by $b_k^*$ ($k = 1, 2, ..., 97$). There are 42, 28, 4, and 20 sub-classes in $a_1^*$, $a_2^*$, $a_3^*$, and $a_4^*$, respectively; the other first level classes remain unchanged, each containing only 1 sub-class. The corresponding signature viruses are obtained based on Model (4) and are denoted by $p_k^*$ ($k = 1, 2, ..., 97$) (Appendix 1).

Remark 4: The accuracy rate is 95.83%, meaning that the corresponding error rate is 4.17%. The error is due to the approximation process, since all signature viruses were selected according to the nearest-to-center principle, which does not guarantee that every signature virus lies exactly at the center of its sub-class. However, from the perspective of approximation, 95.83% is a remarkable value; the granular signature set retains most of the information. Therefore, the signature virus set $P^*$ containing 97 viruses can be used to approximate the whole system containing 3573 viruses. This also indicates that both our model and our method are valid. █

Table 2: 7 signature viruses of the first level structure

No.    Virus
1      A/swine/Guangxi/NS1727/2010
2      A/Moscow/WRAIR4316N/2011
3      A/swine/Illinois/A01411937/2014
4      A/Denmark/16/2001
5      A/swine/South Dakota/A01267992/2012
6      A/swine/Indiana/A01076195/2010
7      A/Texas/9435/2019

As described above, the 97 signature viruses are divided into 7 classes. The first class contains 42 viruses, which were collected from 1987 to 2013. The second class contains 28 viruses, which were mostly collected from 2010 to 2016. The third contains four viruses (A/swine/Saraburi/NIAH100761-22/2009, A/swine/North Carolina/SG1172/2003, A/swine/Taiwan/NPUST0002/2013, and A/swine/Taiwan/NPUST0013/2013), which were all from subtropical and tropical zones. The fourth includes 20 viruses from the temperate zone. The other 3 viruses are all from America but from different years. In detail, A/swine/Italy/7704/2001 is contained in the fifth class, A/swine/Indiana/A01076195/2010 is in the sixth class, and A/Texas/9435/2019 is in the seventh class. In fact, most of the viruses in the first class emerged in Europe and Asia and most of the viruses in the second occurred in America, consistent with the fundamental roles of both similar climatic conditions and neighboring locations in affecting virus evolution. These results show that the outbreak time and location play important roles in the spread and evolution of the H1N1 virus.

Furthermore, we can obtain the phylogenetic tree of the signature virus set by applying the hierarchical clustering algorithm [22]; the results are shown in Fig. 2. A phylogenetic tree reflects the relationships of species during evolution. It paves the way to understanding the evolutionary history and the mechanism of evolution [35]. In Fig. 2, on the one hand, there are two vague boundaries among classes of the first level structure, which are produced by the approximation process according to Remark 4: one is located between classes 1 and 2, and the other between classes 2 and 3; on the other hand, the figure shows that the H1N1 influenza viruses have high homology [36].

To demonstrate the influence of outbreak time, the outbreak times of the 97 signature viruses are denoted by $T_i$ ($i = 1, 2, ..., 97$), and those of the viruses in the corresponding sub-classes by $T_{ij}$ ($i = 1, 2, ..., 97$, $j = 1, 2, ..., |b_i^*|$). Because some mutated viruses escaped antibody recognition and remained latent [37], we introduce the time interval $\Delta$ (in years). The number of viruses satisfying the condition $|T_i - T_{ij}| \leq \Delta$ ($j = 1, 2, ..., |b_i^*|$) is denoted by $n_i$ ($i = 1, 2, ..., 97$), and the influence of outbreak time is calculated by the ratio $\sum_{i=1}^{97} n_i / \sum_{i=1}^{97} |b_i^*|$; see Table 3. The results imply that, although latency is a factor that is hard to estimate, our model gives a consistency ratio of up to 71.06% with $\Delta = 3$.

Table 3: The influence of the outbreak time

$\Delta$ (year)    2         3
Ratio              62.86%    71.06%


Of note, some viruses that emerged at different places are classified into one class, as in the second and the third classes in Fig. 2. It has been shown that the H1N1 virus system has a complex structure [38, 39], and many factors influence this structure. Thus, to further extract the structural information, a potentially significant approach is to additionally adopt the feature representation information of the other proteins.

Fig. 2. Phylogenetic tree of the signature viruses obtained by hierarchical clustering, where viruses shown in the same color belong to the same class of the first level structure. The first, second, third, fourth, fifth, sixth, and seventh classes are represented by red, black, blue, yellow, dark green, green, and dark red, respectively.

5 Conclusions

Jo

Granular computing theory provides methods for dealing with problems hierarchically in the field of data mining. In this paper, a novel method is proposed for constructing the optimal hierarchical structure of a complex system based on fuzzy granular space. The approach has three main components. Firstly, the inter-class and intra-class deviations were introduced, and their properties were investigated in depth and proved mathematically (Theorems 1 and 2). Secondly, the fuzzy hierarchical evaluation index was developed, and a new model (Model (2)) for extracting the optimal hierarchical structure of a complex system was established. Moreover, the model is globally optimal (Theorem 3). The corresponding algorithm (Algorithm A) is given, which reliably constructs the multi-level structure of a complex system. Finally, we introduced the granular signature according to the nearest-to-center principle and constructed the granular signature set of a multi-level structure of a complex system. This set can be used to approximate the whole complex system and to reduce its complexity. Moreover, a classifier based on the granular signature set was designed to verify our method and model. Data experiments on the worldwide H1N1 influenza virus systems from 1902 to 2019 showed that the proposed method is effective. These theories and methodologies of granular computing help capture more details about the structure of a complex system and provide a research basis for networks composed of complex systemic structures. In future studies, we will investigate how and why the parameter affects our model, which likely reflects an intrinsically unknown mechanism.
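To make the nearest-to-center principle and the signature-based classifier summarized above more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that samples are numeric feature vectors, that the class center is the arithmetic mean, and that distance is Euclidean; the function names `granular_signatures` and `classify` and the toy data are hypothetical.

```python
import numpy as np

def granular_signatures(X, labels):
    """Map each class label to the index of the sample nearest to the class mean.

    Assumption: the "center" is the arithmetic mean and distance is Euclidean.
    """
    signatures = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)          # members of class c
        center = X[idx].mean(axis=0)               # assumed class center
        dists = np.linalg.norm(X[idx] - center, axis=1)
        signatures[int(c)] = int(idx[np.argmin(dists)])
    return signatures

def classify(x, X, signatures):
    """Assign x to the class whose signature sample is closest (assumed rule)."""
    classes = list(signatures)
    dists = [np.linalg.norm(x - X[signatures[c]]) for c in classes]
    return classes[int(np.argmin(dists))]

# Hypothetical toy data: two well-separated classes in the plane.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0],
              [5.0, 5.0], [5.2, 5.0], [6.0, 5.0]])
labels = np.array([1, 1, 1, 2, 2, 2])

sig = granular_signatures(X, labels)
pred_a = classify(np.array([0.5, 0.1]), X, sig)
pred_b = classify(np.array([4.0, 4.0]), X, sig)
```

In this sketch each class is represented by a single existing sample rather than by a synthetic centroid, which mirrors the idea of approximating the whole system by a small signature set; the actual distance and center definitions used in the paper may differ.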

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 11371174, 11271163, 61402201) and the International Technology Collaboration Research Program of China (Grant No. 2011DFA70500).

References

[1] B Zhang and L Zhang, “Theory of Problem Solving and Its Application: The Theory and Methods of Quotient Granular Computing”, 2nd ed., Beijing: Tsinghua University Press, 2007.
[2] W Pedrycz, “Granular Computing: Analysis and Design of Intelligent Systems”, CRC Press, 2013.
[3] J T Yao, A V Vasilakos, W Pedrycz, “Granular computing: perspectives and challenges”, IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 1977-1989, 2013.
[4] X Wang, W Pedrycz, A Gacek, et al, “From numeric data to information granules: a design through clustering and the principle of justifiable granularity”, Knowledge-Based Systems, vol. 101, pp. 100-113, 2016.
[5] W Pedrycz, A Gacek, X Wang, “Clustering in augmented space of granular constraints: a study in knowledge-based clustering”, Pattern Recognition Letters, vol. 67, pp. 122-129, 2015.
[6] L Livi, A Rizzi, A Sadeghian, “Granular modeling and computing approaches for intelligent analysis of non-geometric data”, Applied Soft Computing, vol. 27, pp. 567-574, 2015.
[7] L Zhang, B Zhang, “The structure analysis of fuzzy sets”, International Journal of Approximate Reasoning, vol. 40, pp. 92-108, 2005.
[8] A Gacek, W Pedrycz, “Clustering granular data and their characterization with information granules of higher type”, IEEE Transactions on Fuzzy Systems, vol. 23, no. 4, pp. 850-860, 2015.
[9] X Hu, W Pedrycz, O Castillo, et al, “Fuzzy rule-based models with interactive rules and their granular generalization”, Fuzzy Sets and Systems, vol. 307, pp. 1-28, 2017.
[10] C Chen, X Zhu, P Shen, et al, “A hierarchical clustering method for big data oriented ciphertext search”, Proc. 2014 IEEE Conference on Computer Communication Workshops, pp. 559-564, 2014.
[11] A Fahad, N Alshatri and Z Tari, “A survey of clustering algorithms for big data: Taxonomy and empirical analysis”, IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.
[12] B Hartmann, O Banfer, O Nelles, et al, “Supervised hierarchical clustering in fuzzy model identification”, IEEE Transactions on Fuzzy Systems, vol. 19, no. 6, pp. 1163-1176, 2011.
[13] W Pedrycz and M Song, “Analytic hierarchy process (AHP) in group decision making and its optimization with an allocation of information granularity”, IEEE Transactions on Fuzzy Systems, vol. 19, no. 3, pp. 527-539, 2011.
[14] F Zhou, F De la Torre, and J K Hodgins, “Hierarchical aligned cluster analysis for temporal clustering of human motion”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 3, pp. 582-596, 2013.
[15] W Pedrycz, S-M Chen, “Information Granularity, Big Data, and Computational Intelligence”, Springer, 2015.
[16] J C Bezdek and N R Pal, “Some new indexes of cluster validity”, IEEE Transactions on Systems, Man, and Cybernetics, vol. 28, pp. 301-315, 1998.
[17] X Gao, J Li, D Tao, “Fuzziness measurement of fuzzy sets and its application in cluster validity analysis”, International Journal of Fuzzy Systems, vol. 9, no. 4, pp. 188-197, 2007.
[18] H Yu, Z Liu and G Wang, “An automatic method to determine the number of clusters using decision-theoretic rough set”, International Journal of Approximate Reasoning, vol. 55, no. 1, pp. 101-115, 2014.
[19] D W Kim, K H Lee and D Lee, “On cluster validity index for estimation of optimal number of fuzzy clusters”, Pattern Recognition, vol. 37, no. 10, pp. 2009-2024, 2004.
[20] J Yu, Q S Cheng, “The search scope of optimal cluster number in fuzzy clustering methods”, Science in China: Series E, vol. 32, no. 2, pp. 274-280, 2002.
[21] X-Q Tang, P Zhu and J X Cheng, “The structural clustering and analysis of metric based on granular space”, Pattern Recognition, vol. 43, no. 11, pp. 3768-3786, 2010.
[22] X Q Tang and P Zhu, “Hierarchical clustering problems and analysis of fuzzy proximity relation on granular space”, IEEE Transactions on Fuzzy Systems, vol. 21, no. 5, pp. 814-824, 2013.
[23] X Dai, Y Li, Z Bai, X-Q Tang, “Molecular portraits revealing the heterogeneity of breast tumor subtypes defined using immunohistochemistry markers”, Scientific Reports, 2015, doi: 10.1038/srep14499.
[24] Y Li, X Q Tang, Z Bai, X Dai, “Exploring the intrinsic differences among breast tumor subtypes defined using immunohistochemistry markers based on the decision tree”, Scientific Reports, 2016, doi: 10.1038/srep35773.
[25] F J Cabrerizo, R Ureña, W Pedrycz, et al, “Building consensus in group decision making with an allocation of information granularity”, Fuzzy Sets and Systems, vol. 255, pp. 115-127, 2014.
[26] F J Cabrerizo, J A Morente-Molinera, W Pedrycz, et al, “Granulating linguistic information in decision making under consensus and consistency”, Expert Systems With Applications, vol. 99, pp. 83-92, 2018.
[27] M R Anderberg, “Cluster Analysis for Applications: Probability and Mathematical Statistics”, New York: Academic Press, 2014.
[28] M H DeGroot and M J Schervish, “Probability and Statistics”, Beijing: China Machine Press, 2012.
[29] J Han, M Kamber, and J Pei, “Data Mining: Concepts and Techniques”, Elsevier, 2011.
[30] A Rambaut, O G Pybus, M I Nelson, et al, “The genomic and epidemiological dynamics of human influenza A virus”, Nature, vol. 453, pp. 615-619, 2008.
[31] W W Li, Y Li, X-Q Tang, “A new representation method of H1N1 influenza virus and its application”, Proc. 11th International Conference on Intelligent Computing (ICIC 2015), D.-S. Huang et al (Eds.), LNCS 9226, pp. 342-350, 2015.
[32] W Hu, “Computational study of interdependence between hemagglutinin and neuraminidase of pandemic 2009 H1N1”, IEEE Transactions on NanoBioscience, vol. 14, no. 2, pp. 157-166, 2015.
[33] P P Qian, “A novel representation of protein sequences”, M.S. dissertation, China: Shandong University, 2011.
[34] Y Li, Q-H Liang, M-M Sun, et al, “Construction of multi-level structure for avian influenza virus system based on granular computing”, BioMed Research International, 2017, https://doi.org/10.1155/2017/5404180.
[35] Z W Chen, X Q Li, “Whole-genome phylogeny based on protein domain information”, China Journal of Bioinformatics, vol. 10, no. 1, pp. 31-36, 2012.
[36] C Wu, A Kalyanaraman and W R Cannon, “pGraph: Efficient parallel construction of large-scale protein sequence homology graphs”, IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 10, pp. 1923-1933, 2012.
[37] C T T Su, S D Handoko, C K Kwoh, et al, “A possible mutation that enable H1N1 influenza A virus to escape antibody recognition”, Proc. 2010 IEEE International Conference on Bioinformatics and Biomedicine, Hong Kong, China, pp. 81-84, 2010.
[38] K Shinya, M Ebina, S Yamada, et al, “Avian flu: influenza virus receptors in the human airway”, Nature, vol. 440, pp. 435-436, 2006.
[39] A J Lee, S R Das, W Wang, et al, “Diversifying selection analysis predicts antigenic evolution of 2009 pandemic H1N1 influenza A virus in humans”, Journal of Virology, vol. 89, no. 10, pp. 5427-5440, 2015.

Appendix 1: 97 signature viruses of the second level structure

| No. | Virus | Class No. |
|---|---|---|
| 1 | A/swine/Guangxi/NS1727/2010 | 1 |
| 2 | A/American_green-winged_teal/Wisconsin/2743/2009 | 1 |
| 3 | A/swine/Hong_Kong/72/2007 | 1 |
| 4 | A/swine/Italy/671/1987 | 1 |
| 5 | A/swine/England/010402/2003 | 1 |
| 6 | A/swine/Italy/303612/2011 | 1 |
| 7 | A/swine/Italy/1369-7/1994 | 1 |
| 8 | A/swine/England/254/2002 | 1 |
| 9 | A/swine/Belgium/1/1998 | 1 |
| 10 | A/swine/OMS/2112/1995 | 1 |
| 11 | A/swine/Netherlands/Hoogeloon-167C/2012 | 1 |
| 12 | A/swine/Denmark/10-2526-1/2011 | 1 |
| 13 | A/swine/Spain/33903/2012 | 1 |
| 14 | A/swine/Spain/29113/2012 | 1 |
| 15 | A/swine/France/01-130054/2013 | 1 |
| 16 | A/swine/Germany/Vi5698/1995 | 1 |
| 17 | A/swine/Spain/32738/2012 | 1 |
| 18 | A/swine/Guangdong/1605/2010 | 1 |
| 19 | A/swan/Hokkaido/55/1996 | 1 |
| 20 | A/swine/Guangxi/S2/2013 | 1 |
| 21 | A/swine/Italy/319102/2010 | 1 |
| 22 | A/swine/Guangdong/1/2010 | 1 |
| 23 | A/swine/Scotland/034632/2012 | 1 |
| 24 | A/swine/Haseluenne/IDT2617/2003 | 1 |
| 25 | A/swine/Mexico/AVX23/2012 | 1 |
| 26 | A/swine/England/33780/2006 | 1 |
| 27 | A/turkey/MO/21939/1987 | 1 |
| 28 | A/swine/England/35320/1999 | 1 |
| 29 | A/swine/England/024079/2013 | 1 |
| 30 | A/swine/Hong_Kong/275/2005 | 1 |
| 31 | A/swine/Finland/si723/2009 | 1 |
| 32 | A/swine/England/68327/1998 | 1 |
| 33 | A/swine/Guangdong/SS1/2012 | 1 |
| 34 | A/swine/England/1251/2011 | 1 |
| 35 | A/Poland/wild_boar/1951713/2013 | 1 |
| 36 | A/swine/Italy/317775/2010 | 1 |
| 37 | A/swine/Hong_Kong/NS2761/2010 | 1 |
| 38 | A/swine/Denmark/10169-3/2012 | 1 |
| 39 | A/swine/Hubei/S1/2009 | 1 |
| 40 | A/swine/Mexico/AVX47/2013 | 1 |
| 41 | A/swine/Eire/89/1996 | 1 |
| 42 | A/Muscovy duck/New York/97382-2/2005 | 1 |
| 43 | A/Moscow/WRAIR4316N/2011 | 2 |
| 44 | A/swine/Iowa/SG1401/2011 | 2 |
| 45 | A/swine/Alberta/SG1415/2006 | 2 |
| 46 | A/swine/Indiana/A01202622/2011 | 2 |
| 47 | A/swine/Virginia/A01202625/2011 | 2 |
| 48 | A/Mexico_City/INER13/2009 | 2 |
| 49 | A/swine/Argentina/CIP112-C102/2015 | 2 |
| 50 | A/SW/MB/5-5/2009 | 2 |
| 51 | A/swine/Manitoba/D0270/2013 | 2 |
| 52 | A/swine/Oklahoma/A01134906/2011 | 2 |
| 53 | A/swine/Indiana/A01944001/2015 | 2 |
| 54 | A/swine/Nebraska/10/2012 | 2 |
| 55 | A/swine/Alberta/SD0014/2013 | 2 |
| 56 | A/swine/Thailand/CB001/2009 | 2 |
| 57 | A/swine/Minnesota/A01267908/2012 | 2 |
| 58 | A/swine/USA/1976/1931 | 2 |
| 59 | A/swine/Chonburi/NIAH977/2004 | 2 |
| 60 | A/swine/Minnesota/02475/2008 | 2 |
| 61 | A/swine/Oklahoma/A01729565/2016 | 2 |
| 62 | A/swine/Ratchaburi/NIAH1481/2000 | 2 |
| 63 | A/swine/Kentucky/SG1167/2003 | 2 |
| 64 | A/swine/Manitoba/D0175/2012 | 2 |
| 65 | A/swine/Canada/01093/2006 | 2 |
| 66 | A/swine/Mexico/AVX30/2012 | 2 |
| 67 | A/swine/Pingtung/64-26/2007 | 2 |
| 68 | A/swine/Korea/GBCG01/2010 | 2 |
| 69 | A/swine/Minnesota/5/2012 | 2 |
| 70 | A/swine/South_Dakota/A01267895/2012 | 2 |
| 71 | A/swine/Saraburi/NIAH100761-22/2009 | 3 |
| 72 | A/swine/North_Carolina/SG1172/2003 | 3 |
| 73 | A/swine/Taiwan/NPUST0002/2013 | 3 |
| 74 | A/swine/Taiwan/NPUST0013/2013 | 3 |
| 75 | A/Hong_Kong/117/1977 | 4 |
| 76 | A/Khorasan/512/2008 | 4 |
| 77 | A/Fort_Warren/1/1950 | 4 |
| 78 | A/Malaysia/1954 | 4 |
| 79 | A/WS/1933 | 4 |
| 80 | A/Wilson-Smith/1933 | 4 |
| 81 | A/Kw/1/1957 | 4 |
| 82 | A/swine/France/56-110525/2010 | 4 |
| 83 | A/swine/Italy/7704/2001 | 4 |
| 84 | A/AA/Marton/1943 | 4 |
| 85 | A/swine/Oklahoma/A01290605/2013 | 4 |
| 86 | A/swine/Oklahoma/A02214419/2017 | 4 |
| 87 | A/swine/France/22-120067/2012 | 4 |
| 88 | A/swine/Oklahoma/00801/2005 | 4 |
| 89 | A/Cameron/1946 | 4 |
| 90 | A/swine/Morbihan/0163/2010 | 4 |
| 91 | A/Melbourne/1/1946 | 4 |
| 92 | A/Puerto_Rico/8/34/Mount_Sinai_1934 | 4 |
| 93 | A/swine/Ratchaburi/NIAH550/2003 | 4 |
| 94 | A/swine/Oklahoma/01139/2006 | 4 |
| 95 | A/swine/South_Dakota/A01267992/2012 | 5 |
| 96 | A/swine/Indiana/A01076195/2010 | 6 |
| 97 | A/Texas/9435/2019 | 7 |

Note: “Class No.” means the class number that a signature virus belongs to in the first level structure.

Highlights

(1) The fuzzy hierarchical evaluation index is developed, and a new model with a global optimum is established for extracting the optimal hierarchical structure of a complex system.
(2) An algorithm is proposed, which reliably constructs the multi-level structure of a complex system.
(3) Using the signatures, a classifier is designed to verify our method.

Declaration of Interest Statement

X Q Tang designed the study. Y Li and W Li implemented the analysis and prepared the draft. W Shen contributed to finalizing the manuscript. All authors have read and approved the final manuscript.