
Communicated by Dr. Nianyin Zeng

Journal Pre-proof

On a granular functional link network for classification
Francesco Colace, Vincenzo Loia, Witold Pedrycz, Stefania Tomasiello

PII: S0925-2312(20)30286-1
DOI: https://doi.org/10.1016/j.neucom.2020.02.090
Reference: NEUCOM 21958

To appear in: Neurocomputing

Received date: 14 January 2020
Revised date: 13 February 2020
Accepted date: 21 February 2020

Please cite this article as: Francesco Colace, Vincenzo Loia, Witold Pedrycz, Stefania Tomasiello, On a granular functional link network for classification, Neurocomputing (2020), doi: https://doi.org/10.1016/j.neucom.2020.02.090

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier B.V.

On a granular functional link network for classification

Francesco Colace^a, Vincenzo Loia^b, Witold Pedrycz^c, Stefania Tomasiello^d

^a Dipartimento di Ingegneria Industriale (DIIN), Università degli Studi di Salerno, Fisciano, Italy
^b Dipartimento di Scienze Aziendali - Management & Innovation Systems (DISA-MIS), Università degli Studi di Salerno, Fisciano, Italy
^c Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
^d Institute of Computer Science, University of Tartu, Tartu, Estonia

Abstract

In this paper, we present a new granular classifier in two versions (iterative and non-iterative), adopting some ideas originating from a kind of Functional Link Artificial Neural Network and from the Functional Network scheme. These two architectures are substantially the same: they both use a function basis instead of the usual activation function, but they differ in the learning algorithm. We augment them from the perspective of Granular Computing and information granules, designing a new kind of classifier and two learning algorithms that take the granularity of information into account. The proposed classifier exhibits the advantages of granular architectures, namely higher accuracy and transparency. We formally discuss the convergence of the iterative learning scheme. We carry out several numerical experiments using publicly available data, comparing the results against those produced by state-of-the-art methods. In particular, we achieved sound results by invoking the iterative learning scheme.

Keywords: high-dimensional data; fuzzy sets; granulation; B-spline; convergence.

Email addresses: [email protected] (Francesco Colace), [email protected] (Vincenzo Loia), [email protected] (Witold Pedrycz), [email protected] (Stefania Tomasiello)

Preprint submitted to journal

February 25, 2020

1. Introduction

Originally, Functional Link Artificial Neural Networks (FLANNs) were proposed by Pao et al. [28, 29] as flat networks without hidden layers, with the usual activation function processing a linear combination of the input values with a bias, and trained using the delta rule. Such networks appeared very promising for function approximation and pattern classification, apparently resulting in a faster convergence rate and a lower computational cost than multi-layer perceptrons. In [32] a variant of this scheme was discussed, adopting a functional expansion (e.g. by a polynomial basis) and processing the input directly by means of a function basis. This is different from passing a linear combination of the input values to an activation function, which in [28, 29] and related works represents an enhanced pattern in a flat network. The learning algorithm in [32] was based on back-propagation. The architecture with a functional expansion inspired several studies (e.g. [10, 41, 44]). This scheme has not been extensively applied to classification problems. In this context, it has mainly been combined with evolutionary or similar techniques, in order to optimize the topology, such as genetic algorithms [10], particle swarm optimization and harmony search [27]. Some recent works use similar ideas [26, 9]. In particular, in [9], FLANNs are combined with chemical reaction optimization for classification problems dealing with datasets with missing values, inconsistent records, and noisy instances.

There is another architecture very similar to the FLANN based on functional expansion, namely the Functional Network (FN) [7]: it uses a function basis as in [32], but, unlike that scheme, the learning algorithm is a least squares procedure. A recent review on FNs can be found in [45]. We introduce into such an architecture a granular layer, built by means of fuzzy granules, giving rise to a Granular Functional Network (GFN) or a Granular Functional Link Artificial Neural Network (GFLANN), depending on the learning algorithm. The kind of granules adopted here was recently introduced in [21], for a granular FN with delay. The study of the latter architecture is motivated by the need to limit the number of adopted information granules when the number of features is high. Hence we develop a delta rule to achieve the needed accuracy. For this newly proposed scheme, we formally discuss the convergence. In our algorithm, the weights connecting the granular and the internal layer are randomly generated (under a suitable condition), affecting the membership degrees of the input values into the fuzzy sets forming the granules. This strategy avoids a constrained optimization as in [20].

This is different from the original idea of the random vector functional-link (RVFL) net as proposed in [30, 31], where the weights and biases affecting the input values are randomly selected [31]. Interested readers can find a comprehensive discussion on RVFL networks in [42]. Proposing a new kind of randomized mechanism against the state of the art is beyond the scope of this paper. The aim here is to exploit granularity to get an accurate, interpretable model.

Interpretable models are particularly important in the medical field [14]. The interpretability of fuzzy systems, meant as the ability to explain the behavior of the system in an understandable way, is not a new topic. It has attracted many researchers in the last decade (e.g. see [36] and references therein), with the age-old question of the accuracy achieved by Takagi-Sugeno-Kang systems versus the interpretability of Mamdani systems [8]. Over the last decade, the formal definition of interpretability has been under development [19, 14]. Only recently have some features of interpretability seemed to become fixed in the literature. Those features are: transparency, intelligibility, simplicity, comprehensibility and meaningfulness [14]. In a transparent model, it is possible to deduce clearly how the model works; this requires a clear mathematical model. In intelligible models, the influence of each model input on the final decision can be deduced (as in additive models, for instance). Regarding simplicity, the model should involve as few characteristic parameters as possible. There exist very simple models with one or a low number of attributes, even though simplicity does not guarantee comprehensibility [14]. By following the "comprehensibility postulate", a model should be interpretable in natural language. This is somewhat related to the information granules and to the fact that fuzzy sets are suitable either for a qualitative representation (in linguistic terms) or a quantitative one [24]. Meaningfulness mostly refers to the knowledge of the experts, who may choose a qualitative or quantitative representation [24].

It is clear that the proposed model, based on information granules with the fuzzy formalism, allows both a qualitative and a quantitative representation. Besides, it also addresses the transparency and the intelligibility issues, being an additive model with a clear mathematical formulation. The "link" between classical mathematics, covering transparency and intelligibility, and fuzzy mathematics, covering the other above-mentioned aspects with the extraction of IF-THEN rules, is represented by a newly defined information granule [21]. With regard to simplicity, we tried to keep the number of involved parameters as small as possible. This has motivated the iterative computing scheme (GFLANN).

As mentioned before, when the number of input features is high, the number of granules needed to get an accurate solution may grow proportionally. In order to achieve a good accuracy while containing the number of granules, an iterative scheme based on a delta rule has been developed. On the other hand, there are classifiers which are considered interpretable to some extent, but they mostly try to find a kind of interpretation of deep learning, leaving some open mathematical issues, e.g. [11], or are based on linearized systems with empirical evaluation (e.g. [3, 18]). Finally, it is worth mentioning that some types of Granular Neural Networks were proposed for classification (e.g. [12, 43]), but their convergence was not discussed.

The main contribution of this paper lies in the theoretical framework defining a new granular classifier. Thanks to the adopted formalism, we have been able to introduce a mathematical model that can be formally investigated, by proving its convergence. Even though granular computing is meant as a design methodology, the mathematics behind the granular schemes has not found a deeper discussion yet. Several numerical experiments have been performed, moving from limited datasets, for the sake of comparison with former FLANN-based classifiers, towards high-dimensional datasets, considering state-of-the-art techniques used with those datasets, such as [1, 40, 35]. The emphasis is on the GFLANN classifier, since, as mentioned before, the GFN classifier exhibits some limitations when dealing with a high number of features. This is reflected in the numerical results, which show the better performance of the GFLANN classifier.

The paper is structured as follows: Section 2 is devoted to some preliminary notation; in Section 3, the new classifier model is introduced; in Section 4, the learning algorithm and its accuracy are formally discussed; Section 5 reports the numerical experiments; finally, in Section 6, some conclusions are offered.

2. Preliminaries

In this section we recall the notions related to fuzzy sets, since in this paper we refer to data granulation with fuzzy sets.

Let I = [ξ_0, ξ_{m+1}] be a closed interval and ξ = {ξ_0, ξ_1, ..., ξ_{m+1}}, with m ≥ 3, be a sequence of points of I, called nodes, such that ξ_0 < ξ_1 < ... < ξ_{m+1}.

A fuzzy partition of I is defined as a sequence A = {A_1, A_2, ..., A_m} of fuzzy sets A_i : I → [0, 1], with i = 1, ..., m, such that:

• A_i(ξ) ≠ 0 if ξ ∈ (ξ_{i−1}, ξ_{i+1}) and A_i(ξ_i) = 1;
• A_i is continuous and has its unique maximum at ξ_i;
• ∑_{i=1}^{m} A_i(ξ) = 1, ∀ξ ∈ I.

The fuzzy sets {A_1, A_2, ..., A_m} are called basic functions. They form a uniform fuzzy partition if the nodes are equidistant. The norm of the partition is in general h = max_i |ξ_{i+1} − ξ_i|. The set of nodes ξ is said to be sufficiently dense with respect to the fixed partition A if

$$\forall k, \; \exists j \;\; \text{s.t.} \;\; A_k(\xi_j) > 0. \tag{1}$$

A fuzzy partition can be obtained by means of several basic functions. Typical basic functions are the triangular ones

$$A_j(\xi) = \begin{cases} (\xi - \xi_{j-1})/(\xi_j - \xi_{j-1}), & \xi \in [\xi_{j-1}, \xi_j] \\ (\xi_{j+1} - \xi)/(\xi_{j+1} - \xi_j), & \xi \in [\xi_j, \xi_{j+1}] \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

In order to satisfy (5), in this study we will also consider fuzzy partitions with small support. A fuzzy partition with small support is a fuzzy partition which has the following additional property: there exists r ≥ 1 such that supp(A_i) = {ξ ∈ I : A_i(ξ) > 0} ⊆ [ξ_i, ξ_{i+r}]. Possible basic functions producing the above-mentioned fuzzy partitions are Bernstein basis polynomials and B-splines (e.g. see [6]). Provided that m ≥ p + 2, B-splines of degree p over the node sequence ξ can be defined. This means that some auxiliary points both on the left and on the right of the considered interval are required. An explicit form of the scaled cubic B-splines (CBS), for j = 0, ..., m, is given as follows (e.g. see [20])

$$A_j(\xi) = \frac{1}{4h^3}\begin{cases} (\xi - \xi_{j-2})^3, & \xi \in [\xi_{j-2}, \xi_{j-1}) \\ (\xi - \xi_{j-2})^3 - 4(\xi - \xi_{j-1})^3, & \xi \in [\xi_{j-1}, \xi_j) \\ (\xi_{j+2} - \xi)^3 - 4(\xi_{j+1} - \xi)^3, & \xi \in [\xi_j, \xi_{j+1}) \\ (\xi_{j+2} - \xi)^3, & \xi \in [\xi_{j+1}, \xi_{j+2}) \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
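For illustration only (this is not the authors' code), a short NumPy sketch of how the scaled cubic B-splines in (3) could be evaluated over a uniform node sequence; the interval [a, b], the uniform spacing h = (b − a)/(m + 1) and the evaluation grid are assumptions made here for concreteness.

```python
import numpy as np

def scaled_cubic_bsplines(x, m, a=0.0, b=1.0):
    """Evaluate the scaled cubic B-splines (3), j = 0, ..., m, on a uniform node
    sequence xi_0, ..., xi_{m+1} over [a, b]; auxiliary nodes outside [a, b]
    follow from the uniform spacing. Returns an array of shape (m + 1, len(x))."""
    x = np.asarray(x, dtype=float)
    h = (b - a) / (m + 1)
    xi = lambda j: a + j * h          # node xi_j (also for auxiliary indices j < 0, j > m + 1)
    out = np.zeros((m + 1, x.size))
    for j in range(m + 1):
        pieces = [
            ((x >= xi(j - 2)) & (x < xi(j - 1)), (x - xi(j - 2)) ** 3),
            ((x >= xi(j - 1)) & (x < xi(j)),     (x - xi(j - 2)) ** 3 - 4 * (x - xi(j - 1)) ** 3),
            ((x >= xi(j))     & (x < xi(j + 1)), (xi(j + 2) - x) ** 3 - 4 * (xi(j + 1) - x) ** 3),
            ((x >= xi(j + 1)) & (x < xi(j + 2)), (xi(j + 2) - x) ** 3),
        ]
        for mask, value in pieces:
            out[j, mask] = value[mask] / (4 * h ** 3)   # scaling 1/(4h^3) as in (3)
    return out

# example: evaluate the basic functions on a grid over the unit interval
values = scaled_cubic_bsplines(np.linspace(0.0, 1.0, 200), m=11)
```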


Figure 1: A uniform fuzzy partition by quintic B–splines.

An explicit form of the scaled quintic B-splines (QBS), for j = 0, ..., m, is given by

$$A_j(\xi) = \frac{1}{66h^5}\begin{cases} (\xi - \xi_{j-3})^5, & \xi \in [\xi_{j-3}, \xi_{j-2}) \\ (\xi - \xi_{j-3})^5 - 6(\xi - \xi_{j-2})^5, & \xi \in [\xi_{j-2}, \xi_{j-1}) \\ (\xi - \xi_{j-3})^5 - 6(\xi - \xi_{j-2})^5 + 15(\xi - \xi_{j-1})^5, & \xi \in [\xi_{j-1}, \xi_j) \\ (\xi_{j+3} - \xi)^5 - 6(\xi_{j+2} - \xi)^5 + 15(\xi_{j+1} - \xi)^5, & \xi \in [\xi_j, \xi_{j+1}) \\ (\xi_{j+3} - \xi)^5 - 6(\xi_{j+2} - \xi)^5, & \xi \in [\xi_{j+1}, \xi_{j+2}) \\ (\xi_{j+3} - \xi)^5, & \xi \in [\xi_{j+2}, \xi_{j+3}) \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

Notice that the B-spline of degree 1 is the hat function. In Figure 1, an example of a uniform fuzzy partition (with m = 11) by means of quintic B-splines is depicted; the points on the abscissa axis are the nodes. For more details on B-splines and their approximation properties, one can refer to [22].

3. The proposed model

Before introducing the model, we recall some basic notions on information granules, FNs and FLANNs.

3.1. Information granules

We recall the principle of justifiable granularity to introduce the notion of information granule. Briefly, we can say that an information granule should be representative of as many data as possible (coverage), though the granule should be quite specific (specificity). Formally, the coverage is expressed as the cardinality of the information granule. The specificity can be regarded as the inverse of its size. These two conflicting requirements can be expressed through the following performance index [34]:

$$\text{maximize} \;\; \sum_{k} A(x_k)/\mathrm{supp}(A), \tag{5}$$

where A is the information granule, which may belong to a certain family of fuzzy sets, and supp(A) is its support. Actually, fuzzy sets seem to constitute the natural formalism to form the information granules, by recalling the energy measure of fuzziness, which can be meant as cardinality, and the measure of specificity [4]. The maximization in (5) is sought with respect to the parameters of the information granules, that is, the bounds of the interval information granule.
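As a rough, hypothetical illustration of the index (5) (not the authors' procedure), the following sketch evaluates it for a one-dimensional interval granule with a triangular membership and grid-searches the interval bounds; the data, the membership shape and the grid are illustrative choices.

```python
import numpy as np

def justifiable_granularity_index(data, a, b):
    """Performance index (5): coverage (sum of memberships) divided by the support length."""
    if b <= a:
        return 0.0
    c = 0.5 * (a + b)  # triangular membership peaked at the interval midpoint
    mu = np.clip(np.minimum((data - a) / (c - a), (b - data) / (b - c)), 0.0, 1.0)
    return mu.sum() / (b - a)

rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=0.15, size=300)

# grid search over candidate interval bounds [a, b]
grid = np.linspace(data.min(), data.max(), 60)
best = max((justifiable_granularity_index(data, a, b), a, b)
           for a in grid for b in grid if b > a)
print(f"best index {best[0]:.2f} for bounds [{best[1]:.2f}, {best[2]:.2f}]")
```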

where A is the information granule, which may belong to a certain family of fuzzy sets, and supp(A) is its support. Actually, fuzzy sets seem to constitute the natural formalism to form the information granules, by recalling the energy measure of fuzziness, which can be meant as cardinality, and the measure of specificity [4]. The maximization in (5) is sought with respect to the parameters of the information granules, that is the bounds of the interval information granule. 3.2. Functional Networks and Functional Link Artificial Neural Networks The FN architecture usually consists of: • a layer of input units (IL), handling the input signals; • a layer of output units (OL), containing the output data;

• a layer of inner neurons (NL), processing a collection of input signals and providing a collection of output signals by means of suitable activation functions. The above-mentioned units are connected by means of directed links. Unlike the traditional Artificial Neural Networks (ANNs), the activation functions can be multi-argument, multivariate and different for any input node. This results in a more efficient learning (least-squares based) and no large data set seems to be required for it. FLANNs, meant as in [32], have the same architecture and features, but a different learning, that is gradient based. As pointed out in [45], FNs (and consequently FLANNs) can also be used as black boxes, like any ANN. This makes the interpretability an issue also in such kind of architectures. 7
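As a toy illustration of the functional-expansion idea recalled above (not the granular model introduced next), the sketch below expands a scalar input with a polynomial function basis and fits the readout weights by least squares, i.e. FN-style learning; a gradient-based (BP) fit of the same weights would correspond to the FLANN variant of [32]. The basis, its degree and the data are illustrative assumptions.

```python
import numpy as np

def functional_expansion(x, degree=3):
    """Expand each input value with a polynomial function basis (1, x, x^2, ...)."""
    return np.vander(x, N=degree + 1, increasing=True)

# toy data: approximate a nonlinear target from the expanded inputs
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
t = np.sin(np.pi * x) + 0.05 * rng.standard_normal(x.size)

Phi = functional_expansion(x, degree=5)
# FN-style learning: closed-form least squares on the expanded features
w = np.linalg.lstsq(Phi, t, rcond=None)[0]
print("training MSE:", np.mean((Phi @ w - t) ** 2))
```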

3.3. Granular Functional Network and Granular Functional Link Artificial Neural Network

A granular FN (GFN), or similarly a granular FLANN (GFLANN), has an intermediate layer, namely a granular layer (GL), located between the layers IL and NL (see Figure 2). Let x^T = (x_1, x_2, ..., x_n) and y^T = (y_1, y_2, ..., y_p) be the input and the output vector, respectively. The granular layer consists of a collection of fuzzy sets A_ik, with i = 1, ..., m, k = 1, ..., n. The fuzzy sets A_ik, with i = 1, ..., m, form a fuzzy partition of the kth input variable domain. Fuzzy sets are usually employed to build an information granule. A granule may be regarded as a fuzzy relation [33], formulated in several ways (e.g. see [23], [17], [13]). Here we use the interpretation of the granule proposed in [21], which we briefly recall below. Let A_l be normal and convex fuzzy sets, for l = 1, ..., m. We assume that for each granule Γ_r there exists a possibility distribution ψ_Γ ∈ R^n such that

$$\max_{x_j,\; j=1,\ldots,n,\; j\neq r} \psi_\Gamma(x_1(i), x_2(i), \ldots, x_n(i)) = A_l(x_r(i)). \tag{6}$$

Then, for each information granule, we have the following approximation:

$$\phi^l(A_l(x_1(i), x_2(i), \ldots, x_n(i))) = \bigvee_{k=1}^{n} \left(A_l(x_k(i)) * w_{kl}\right), \tag{7}$$

where * is a t-norm, ∨ is the maximum operator, A_l(x_k(i)) is the membership degree of the data coming from the kth internal domain, and w_kl ∈ [0, 1] are the weights, here assumed to be different for every granule. The output of each granule can be thought of as given by some functions of the input variables, here denoted as φ^1, φ^2, ..., φ^m for short, and approximated by means of the basic functions A_i, as follows

$$\phi^i = \mathbf{A}_i\,\mathbf{w}_i, \qquad i = 1, 2, \ldots, m, \tag{8}$$

where (A_i)^T = (A_{i1}, A_{i2}, ..., A_{in}), being A_{ik} = A_i(x_k), and w_i is the vector of the granular weights. Each output y_j is processed through some invertible function F_j, that is F_j(y_j), with j = 1, ..., p. Unlike [20], the output y is computed as follows

$$\mathbf{F}(\mathbf{y}) = \mathbf{W}^T \boldsymbol{\phi}, \tag{9}$$

where F(y)^T = (F_1(y_1), ..., F_p(y_p)), φ^T = (φ_1, ..., φ_m) and W is the m × p matrix of the unknown output weights w_kj. The proposed granular classifier architecture, unlike [20], has two classes of weights; more details about them will be provided in the next section. We just point out that in the GFLANN the weights are updated through a delta rule (see the dashed part in Figure 2). From the proposed scheme it is possible to extract fuzzy rules of the type:

R_k^i: IF x_1 ∈ supp(A_i) AND x_2 ∈ supp(A_i) AND ... AND x_n ∈ supp(A_i) THEN y_k = F_k^{-1}(φ^i(x_1, ..., x_n)),

which are Takagi–Sugeno–Kang-like rules. In what follows, we will consider F_k, for any k = 1, ..., p, as the identity function (e.g. see [7, 20]). Besides, for convenience we change the superscripts of (8) to subscripts.

Figure 2: The proposed granular classifier

4. Methodology and properties of the proposed computing scheme

4.1. The learning algorithm

As mentioned before, the proposed granular classifier has two types of weights. The first ones, here named the granular weights and denoted as w_ij, are randomly chosen, as detailed in the next section. The second ones, here called the output weights and denoted as w_lm, are unknown. In the GFN scheme, they are learnt through a least squares (LS) approach, as explained in the following. The random assignment of some weights combined with the LS approach may also be found in deep architectures (e.g. see [15]).

Let N_p be the number of patterns in the training data. Let A be the N_p × mn matrix of the values assumed by the basic functions over the N_p patterns, written as follows

$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_1(\mathbf{x}^{(1)})^T & \ldots & \mathbf{A}_m(\mathbf{x}^{(1)})^T \\ \mathbf{A}_1(\mathbf{x}^{(2)})^T & \ldots & \mathbf{A}_m(\mathbf{x}^{(2)})^T \\ \vdots & \ddots & \vdots \\ \mathbf{A}_1(\mathbf{x}^{(N_p)})^T & \ldots & \mathbf{A}_m(\mathbf{x}^{(N_p)})^T \end{pmatrix}, \tag{10}$$

with the row vectors A_i(x^{(j)})^T = (A_i(x_1^{(j)}), ..., A_i(x_n^{(j)})), where x_k^{(j)} denotes the kth input value in the jth pattern. Let $\overline{\mathbf{W}}$ be the mn × m matrix of granular weights, which can be written as follows



$$\overline{\mathbf{W}} = \begin{pmatrix} \mathbf{w}_1 & \mathbf{0} & \mathbf{0} & \ldots & \mathbf{0} \\ \mathbf{0} & \mathbf{w}_2 & \mathbf{0} & \ldots & \mathbf{0} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \ldots & \mathbf{0} & \mathbf{w}_m \end{pmatrix}, \tag{11}$$

where 0 is an n-sized vector with null elements and w_i^T = (w_{i1}, ..., w_{in}). Finally, let V denote the N_p × p matrix of target values, expressed in its compact form as V^T = (y(x^{(1)}), ..., y(x^{(N_p)})), y(x^{(j)}) being the p-dimensional vector of target values, for j = 1, ..., N_p.

Then it is

$$\mathbf{V} = \mathbf{P}\mathbf{W}, \tag{12}$$

where $\mathbf{P} = \mathbf{A}\overline{\mathbf{W}}$. The unknown weights matrix W can be computed as follows

$$\mathbf{W} = \mathbf{M}^{-1}\mathbf{P}^T\mathbf{V}, \tag{13}$$

where M = P^T P. By following this approach, the computational cost is substantially given by the matrix inversion. The theoretical lower bound for the computational complexity of the inversion of an m × m matrix is O(m^2 log(m)) [37].

In FLANNs, a back-propagation (BP) algorithm is used for training [32]. Hence, we develop the following delta learning rule for the GFLANN, in order to adjust the weights at each iteration s:

$$w_{ji}(s+1) = w_{ji}(s) + \Delta w_{ji}(s), \qquad \Delta w_{ji}(s) = -\eta \frac{\partial E}{\partial w_{ji}}, \tag{14}$$

for j = 1, ..., p, i = 1, ..., m. In (14), η ∈ (0, 1) is the learning rate to be fixed (as will be discussed in the next subsection), and E(s) = (1/2) e^T(s) e(s) is the error functional collecting the distances between the computed values y_j and the target values ȳ_j, with e_j(s) = y_j(s) − ȳ_j(s).

$$\frac{\partial E}{\partial w_{ji}}(s) = \frac{\partial E}{\partial e_j}\,\frac{\partial e_j}{\partial y_j}\,\frac{\partial y_j}{\partial w_{ji}} = -e_j(s)\,\phi_i, \tag{15}$$

where $\phi_i = \sum_{k=1}^{n} w_{ik} A_i(x_k)$, i = 1, ..., m. Whence it follows that

$$w_{ji}(s+1) = w_{ji}(s) + \eta\,(y_j(s) - \bar{y}_j)\,\phi_i. \tag{16}$$
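For illustration, a minimal NumPy sketch of one delta-rule sweep over the training patterns is given below; the matrix Phi of granule outputs φ_i (fixed over the iterations, as noted next) and the targets are assumed to be precomputed, and the update is written here in the standard error-descent form rather than in the authors' exact sign bookkeeping.

```python
import numpy as np

def delta_rule_sweep(W_out, Phi, Y_bar, eta):
    """One delta-rule iteration over all patterns.
    W_out: (p, m) output weights; Phi: (Np, m) granule outputs (fixed);
    Y_bar: (Np, p) targets; eta: learning rate."""
    Y = Phi @ W_out.T                   # computed outputs, one row per pattern
    E = Y - Y_bar                       # errors e_j
    W_out = W_out - eta * (E.T @ Phi)   # accumulate the weight corrections over the patterns
    return W_out, E
```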

Notice that in our computing scheme the target values ȳ_j and the functions φ_i do not change over the iterations.

4.2. Properties

In order to prove convergence, one has to show that the error e_j(s), for s → ∞, does not grow indefinitely. To that end, we first rewrite (16) as follows


$$w_{ji}(s+1) = w_{ji}(s) + \eta\left(\sum_{k=1}^{m} w_{jk}(s)\,\phi_k - \bar{y}_j\right)\phi_i, \tag{17}$$

and hence the error

$$e_j(s) = \sum_{q=1}^{m}\left[w_{jq}(s-1) + \eta\left(\sum_{k=1}^{m} w_{jk}(s-1)\,\phi_k - \bar{y}_j\right)\phi_q\right]\phi_q - \bar{y}_j. \tag{18}$$

By means of successive substitutions into (18), we finally get

$$e_j(s) = w_{0j} + \eta\,\omega_{1j} + \ldots + \eta^{s-1}\omega_{(s-1)j} + O(\phi_i^{2s}), \tag{19}$$

where

$$w_{0j} = \sum_{q=1}^{m} w_{jq}(0)\,\phi_q - \bar{y}_j, \tag{20}$$

with w_{jq}(0) being the initial weights, and, for any 1 ≤ r ≤ s − 1,

$$\omega_{rj} = s\left(\underbrace{\sum_{q=1}^{m}\ldots\sum_{k=1}^{m}}_{r+1}\,\sum_{l=1}^{m} w_{jl}(0)\,\underbrace{\phi_q^2\ldots\phi_k^2}_{r+1}\,\phi_l \;-\; \bar{y}_j\,\underbrace{\sum_{q=1}^{m}\ldots\sum_{k=1}^{m}}_{r}\,\underbrace{\phi_q^2\ldots\phi_k^2}_{r}\right). \tag{21}$$

We now prove a theorem giving some restrictions on η. We do not recall here the definition of uniform convergence of series for the sake of brevity; the interested reader can refer to [46], for instance.

Theorem 1. Let n and m be the size of the input vector and the number of granules, respectively. Suppose $0 < \sum_{k=1}^{n} w_{ki} \le n^{1/4}$, for every i = 1, ..., m, and $0 < \eta < \frac{1}{m\,n^{1/2}}$. Then the error e_j converges uniformly, for every j ∈ {1, ..., p}.

Proof. First notice that (19) is a power series. The radius of convergence of such a series is given by the reciprocal of the following limit (if it exists):

$$\lim_{r\to\infty} \frac{|\omega_{(r+1)j}|}{|\omega_{rj}|}.$$


Let ȳ_max be the maximum value among the target absolute values. By recalling the properties of the basic functions and the first hypothesis, we then have

$$\frac{|\omega_{(r+1)j}|}{|\omega_{rj}|} \le \frac{m^{r+1}\left(\sum_{k=1}^{m}|w_{jk}(0)|\,n^{\frac{r+2}{2}} + \bar{y}_{\max}\,n^{\frac{r+1}{2}}\right)}{m^{r}\left(\sum_{k=1}^{m}|w_{jk}(0)|\,n^{\frac{r+1}{2}} + \bar{y}_{\max}\,n^{\frac{r}{2}}\right)} = m\,n^{\frac{1}{2}}.$$

The conclusion readily follows.

The iterations stop when |e_j(s+1) − e_j(s)| ≤ ε, with ε being a positive, arbitrarily small real number. Theorem 1 states that the error does not grow over the iterations. Obviously, this does not settle the question of the accuracy of the proposed approach. Hence, in order to complete the discussion on the convergence, we will prove that, by increasing the number of granules m in the limit to infinity, and under some assumptions, the proposed model can approximate any real continuous vector-valued function. To this end, we recall that the computed p-dimensional solution

$$\overline{\mathbf{y}}(\mathbf{x}) = \mathbf{W}^T \overline{\mathbf{W}}^T \mathbf{A}(\mathbf{x}) \tag{22}$$

can be conveniently rewritten as

$$\overline{\mathbf{y}}(\mathbf{x}) = \left[\sum_{i=1}^{N_p} \mathbf{y}(\mathbf{x}^{(i)})\,\mathbf{A}^T(\mathbf{x}^{(i)})\right] \mathbf{D}\,\mathbf{A}(\mathbf{x}), \tag{23}$$

where A(x)^T = (A_1(x)^T, ..., A_m(x)^T) and $\mathbf{D} = \overline{\mathbf{W}}(\mathbf{M}^{-1})^T\overline{\mathbf{W}}^T$. In what follows, ‖·‖_p and ‖·‖_F will denote the p-norm and the Frobenius norm, respectively. Besides, σ_max(Q) and σ_min(Q) will denote the largest and the smallest singular values of a matrix Q, and I_m the identity matrix of order m. Without loss of generality, but for the sake of readability, we assume that all the data are normalized into an interval I, that is {x_i^{(1)}, ..., x_i^{(N_p)}} ⊂ I, for i = 1, ..., n.

Assumption 1. Let T = (1/n) I_{mn} − D. We assume T ≠ 0.

Theorem 2. Let I^n denote an n-dimensional hypercube and A = {A_1, ..., A_m} be a sequence of basic functions which form a uniform fuzzy partition with norm h in every closed interval I. Suppose that the set of nodes {x_i^{(1)}, ..., x_i^{(N_p)}} is sufficiently dense with respect to the fuzzy partition A, for any i = 1, ..., n. Then, for any given real continuous vector-valued function y : I^n → R^p, there exists ȳ in the form (23) such that

$$\lim_{m\to\infty} \|\overline{\mathbf{y}}(\mathbf{x}) - \mathbf{y}(\mathbf{x})\|_2 = 0, \tag{24}$$

for any x ∈ I^n, with x_i ∈ supp(A_k) for each i = 1, ..., n and k ∈ {1, ..., m}.

Proof. Let $\mathbf{S}^{(k)} = \sum_{i=1,\, i\neq k}^{N_p} \mathbf{y}(\mathbf{x}^{(i)})\,\mathbf{A}^T(\mathbf{x}^{(i)})$ and let 1 be the mn-dimensional vector with all unit entries. Let ω(y, h) denote the modulus of continuity [2]:

$$\omega(\mathbf{y}, h) := \sup\{\|\mathbf{y}(\mathbf{x}) - \mathbf{y}(\mathbf{z})\|_2 : \forall\, \mathbf{x}, \mathbf{z} \in I^n,\; \|\mathbf{x} - \mathbf{z}\|_1 \le h\}. \tag{25}$$

Besides, we recall that for any m × m matrix Q it is [25]

$$\frac{\sigma_{\max}(\mathbf{Q})}{\sigma_{\min}(\mathbf{Q})} \le \frac{\|\mathbf{Q}\|_F^m}{m^{\frac{m}{2}}\,|\det \mathbf{Q}|}. \tag{26}$$

Finally, we observe that $\frac{1}{n}\,\mathbf{A}^T(\mathbf{x})\,\mathbf{I}_{mn}\,\mathbf{1} = 1$, for any x ∈ I. Then, since the sequence {x_i^{(1)}, ..., x_i^{(N_p)}} is sufficiently dense with respect to A, for any i = 1, ..., n, we can choose x^{(k)} such that

$$\|\overline{\mathbf{y}}(\mathbf{x}) - \mathbf{y}(\mathbf{x})\|_2 \le \|\mathbf{y}(\mathbf{x}) - \mathbf{y}(\mathbf{x}^{(k)})\|_2 + \|\mathbf{y}(\mathbf{x}^{(k)}) - \mathbf{A}^T(\mathbf{x}^{(k)})\mathbf{D}\mathbf{A}(\mathbf{x})\|_2 + \|\mathbf{S}^{(k)}\mathbf{D}\mathbf{A}(\mathbf{x})\|_2 \tag{27}$$

$$\le \|\mathbf{y}(\mathbf{x}) - \mathbf{y}(\mathbf{x}^{(k)})\|_2 + \|\mathbf{y}(\mathbf{x}^{(k)})\|_2\,\|\mathbf{A}^T(\mathbf{x}^{(k)})\,\mathbf{T}\,\mathbf{1}\|_2 + \|\mathbf{S}^{(k)}\mathbf{D}\mathbf{1}\|_2$$

$$\le \omega(\mathbf{y}, h) + \frac{\sigma_{\min}(\mathbf{T})\,\|\mathbf{T}\|_F^{mn}}{|\det \mathbf{T}|\,(mn)^{\frac{mn}{2}-1}}\,\|\mathbf{y}(\mathbf{x}^{(k)})\|_2 + \frac{\sigma_{\min}(\mathbf{D})\,\|\mathbf{D}\|_F^{mn}}{|\det \mathbf{D}|\,(mn)^{\frac{mn}{2}}}\,\|\mathbf{S}^{(k)}\|_2.$$

Hence, for m → ∞ (implying h → 0), the conclusion follows.

Theorem 2 clearly states the approximation ability of the proposed scheme. Because of the structure of the matrices P and $\overline{\mathbf{W}}$, Assumption 1 is reasonable. However, Theorem 2 does not provide any condition which can be turned into a practical clue for fixing the matrix $\overline{\mathbf{W}}$.

Moreover, increasing m to get a better accuracy is only partially feasible, because of the computational cost of the matrix inversion. Instead, from Theorem 1, we know that $0 < \sum_{k=1}^{n} w_{ki} \le n^{1/4}$, and a practical way to fulfill this condition is choosing w_ki randomly in [0, n^{−3/4}], for k = 1, ..., n and i = 1, ..., m. In the following, we will refer to Method I, that is, computing the unknown weights by (13), and to Method II, meaning first computing the unknown weights by (13), using higher-order basic functions with a relatively small m, and then adjusting them by means of BP (a schematic sketch of this pipeline is given at the end of Section 5.1).

5. Numerical studies

The numerical experiments are divided into two parts: in the first part, we consider small datasets in order to compare the results against the ones obtained by some recent variants of FLANN (based on certain function bases), such as [10, 27]. In the second part, we consider medium and high-dimensional datasets, referring to the state-of-the-art techniques for the considered datasets, such as [1, 40, 35]. Even though our approach cannot be regarded as a variant of the RVFL, we also discuss a comparison with some published results from the most recent work on the RVFL and its variants, RFL (RVFL with univariate trees) and obRFL (RVFL with oblique trees), for multi-class classification [16], since the authors used some of the above-mentioned datasets. The numerical experiments were performed by using a CPU clocking in at 2.5 GHz.

5.1. Datasets description

The performance of the proposed method was evaluated using some publicly available datasets coming from the UCI machine learning repository. In particular, we started off with the following ones, considered in [1, 10, 27, 16]:

• IRIS: a popular and simple classification dataset based on the multivariate characteristics of a plant species (length and thickness of its petal and sepal) divided into 3 distinct classes (Iris Setosa, Iris Versicolor and Iris Virginica); there are 150 instances (uniformly distributed among the 3 classes) and 4 predicting attributes;

• WINE: a dataset from a chemical analysis of wines grown; the total number of instances is 178, distributed as 59 for class 1, 71 for class 2 and 48 for class 3; the number of attributes is 13;

• ECOLI: a dataset referring to protein localization sites; there are 336 instances and 7 predictive attributes; the samples are distributed into 8 classes and the class distribution is highly unbalanced.

Afterwards, we considered some datasets with a higher number of data and features and, to the best of our knowledge, the most recent reference considering them for a comparison [1]:

• SPAMBASE: a dataset referring to a collection of spam e-mails, presenting 4,601 instances and 57 attributes, distributed into 2 classes;

• WAVEFORM: where 3 classes of waves have been considered, with 5,000 instances and 40 attributes;

• HILL-VALLEY: a dataset where each record represents 100 points on a two-dimensional graph; the points, plotted in an ordered way, create either a hill or a valley in the terrain; there are 606 instances, 101 attributes and 2 classes;

• MADELON: an artificial dataset for a multivariate and highly non-linear problem; there are 4,400 instances, 500 attributes and 2 classes;

• LETTER: a dataset created to identify the 26 capital letters in the English alphabet; each letter was randomly distorted to produce a file of 20,000 unique stimuli, and each one of them was converted into 16 primitive numerical attributes (statistical moments and edge counts);

and the following datasets with a higher number of instances:

• DLA (PUC-Rio): a dataset [39] for human activity recognition (namely sitting, sitting down, standing, standing up and walking); it contains 165,633 instances, with 17 attributes;

• SKIN: a dataset [5] with randomly sampled RGB values from faces of different age, race and gender; the dataset is univariate (skin or non-skin), and it has 245,057 training examples.
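Before discussing the results, the following self-contained sketch (not the authors' code) illustrates how a Method I-style experiment could be assembled and evaluated with k-fold cross-validation on the IRIS data; triangular basic functions (2) stand in here for the B-splines actually used in the paper, the number of granules m = 35 mirrors one of the settings in Table 1, and the one-hot encoding and min-max normalization are illustrative choices. The BP refinement of Method II is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

def triangular_basis(x, m):
    """Evaluate m triangular basic functions (2), uniformly spaced on [0, 1]."""
    centers = np.linspace(0.0, 1.0, m)
    h = centers[1] - centers[0]
    return np.clip(1.0 - np.abs(x[:, None] - centers[None, :]) / h, 0.0, 1.0)

def granule_outputs(X, W_gran):
    """phi_i = sum_k w_ki * A_i(x_k), as in (8); X is Np x n, W_gran is n x m."""
    n, m = W_gran.shape
    Phi = np.zeros((X.shape[0], m))
    for k in range(n):
        Phi += triangular_basis(X[:, k], m) * W_gran[k]   # broadcast w_ki over the granules
    return Phi

def fit_method_I(Phi, Y):
    """Least-squares output weights, the closed-form step corresponding to (13)."""
    return np.linalg.lstsq(Phi, Y, rcond=None)[0]          # m x p

rng = np.random.default_rng(0)
data = load_iris()
X, labels = data.data, data.target
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # normalize into [0, 1]
Y = np.eye(labels.max() + 1)[labels]                       # one-hot targets
n, m = X.shape[1], 35                                       # attributes, granules per attribute

scores = []
for train, test in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, labels):
    W_gran = rng.uniform(0.0, n ** (-0.75), size=(n, m))   # granular weights in [0, n^(-3/4)]
    W_out = fit_method_I(granule_outputs(X[train], W_gran), Y[train])
    pred = granule_outputs(X[test], W_gran) @ W_out
    scores.append(np.mean(pred.argmax(axis=1) == labels[test]))
print(f"3-fold CV accuracy: {np.mean(scores):.3f}")
```

Here np.linalg.lstsq plays the role of the closed-form solution (13), avoiding an explicit inversion of M = P^T P; a delta-rule refinement in the spirit of Method II could be layered on top of the same Phi matrices.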


| | IRIS | WINE | ECOLI |
| HFLNN [10] - training | 98.667 | 100 | 59.829 |
| HFLNN [10] - test | 97.33 | 91.011 | 54.701 |
| HS-FLANN [27] - training | 97.857 | 97.597 | – |
| HS-FLANN [27] - test | 99.472 | 95.570 | – |
| Method I (CBS, m = 100) - training | 98.667 | 98.667 | 87.5 |
| Method I (CBS, m = 100) - test | 81.33 | 78.65 | 52.38 |
| Method I (CBS, m = 150–400) - training | 100 | 100 | 90.47 |
| Method I (CBS, m = 400) - test | 97.33 | 93.25 | 57.52 |
| Method I (QBS, m = 35) - training | 97.33 | 97.75 | 91.37 |
| Method I (QBS, m = 35) - test | 91.33 | 61.24 | 39.58 |
| Method I (QBS, m = 100) - training | 98.667 | 99.44 | 91.37 |
| Method I (QBS, m = 100) - test | 94 | 83.15 | 41.37 |
| Method II (QBS-BP, m = 35) - training | 100 | 100 | 98.667 |
| Method II (QBS-BP, m = 35) - test | 98.667 | 97.33 | 78.65 |

Table 1: Classification accuracy (%) by the proposed classifier against [10, 27]

5.2. Numerical results and discussion

In order to compare our results with those in the reference papers [10] and [1], a 2-fold cross validation (2CV) and a 3-fold cross validation (3CV), as in [10] and [1] respectively, were used. Results for 2CV on the first three datasets listed in the previous section are shown in Table 1. These results are compared against the ones in [10] and the ones in [27], even though the latter authors used 5CV. Using higher-order B-splines in Method I allows better results to be obtained with a lower m, but not the best ones; besides, the convergence seems to be slow. The best results are achieved by means of Method II. With regard to 3CV on the same datasets, in [1] the authors reported an accuracy equal to 100% for the three datasets (even though, for all the results in [1], it is not clarified whether the accuracy refers to the training or the testing set). By adopting QBS-BP, we found the same value (for training and test) by using m = 35 for the IRIS and the WINE datasets, and 97.57% with m = 60 for the ECOLI dataset.

Regarding the medium-sized datasets, we performed 3CV for the sake of comparison with the results in [1]. The related results are covered in Table 2. For the LETTER dataset, we got very good results by means of Method I. For the remaining datasets, more satisfactory results were obtained by using Method II.

| | SPAM | WAVE | HILL | MADEL | LETTER |
| Approach in [1] | 98.70 | 96.54 | 96.39 | 92.88 | 93.50 |
| m | 250 | 250 | 400 | 600 | 200 |
| Method I (CBS) - training | 88.26 | 86.54 | 80.46 | 87.33 | 92.14 |
| Method I (CBS) - test | 47.33 | 70.13 | 39.52 | 49.61 | 89.21 |
| m | 150 | 150 | 350 | 550 | 100 |
| Method I (QBS) - training | 93.33 | 90.23 | 94.55 | 97.91 | 96.04 |
| Method I (QBS) - test | 59.33 | 79.58 | 44.55 | 53.13 | 95.93 |
| Method II (QBS-BP) - training | 100 | 98.67 | 97.33 | 98.18 | 100 |
| Method II (QBS-BP) - test | 98.27 | 96.33 | 89.11 | 92.4 | 98.05 |

Table 2: Classification accuracy (%) by the proposed classifier against [1]

| Dataset | ECOLI | IRIS | LETTER | WAVEFORM | WINE |
| RFL [16] | 88.99 | 95.95 | 96.36 | 84.76 | 98.86 |
| obRFL [16] | 86.9 | 97.3 | 96.28 | 86.06 | 99.43 |
| RVFL [16] | 87.8 | 98.65 | 79.6 | 86.8 | 100 |
| Method II (3CV) | 97.57 | 100 | 98.05 | 96.33 | 100 |

Table 3: Classification accuracy (%) by Method II against [16]

For the sake of completeness, we compare the test results by Method II with the ones in [16], even though those authors used 4CV (Table 3). The authors in [16] listed the training time for some datasets. Among them, the training time for the ECOLI dataset by obRFL is 50.52 s. Our approach (Method II) seems to be slower, because it takes almost four times as long (even though in [16] the processor details are missing). Let Na denote the number of attributes, Nc the number of classes and Ni the number of iterations in Method II. Figure 3 shows how Ni varies by changing the ratios Na/Np and Nc/Np, following the completed experiments. By looking at the plot, it seems that the lower numbers of iterations occur when the ratios Na/Np and Nc/Np differ by no more than one order of magnitude. Let τ denote the ratio between the testing accuracy and the training accuracy and Nw denote the number of unknown parameters in the network. Figure 4 shows how τ varies with Nw/Np and Na/Np in our experiments. By observing the plot, it seems that the best accuracy ratios can be achieved when Na/Np is at least one order of magnitude lower than Nw/Np.

Figure 3: Ni vs (Na /Np , Nc /Np )

Figure 4: τ vs (Nw/Np, Na/Np)

Regarding the DLA dataset, the first classification results were presented in [39], where the overall average recognition performance was 99.4%, using a 10-fold cross validation (10CV) testing mode. The most recent article discussing numerical experiments with this dataset is [38], but a slightly earlier one [40] is preferable for comparison purposes, because running times are indicated. The average performance in [38] is 99% using a 2CV, and 99.92% in [40] using a 10CV. As mentioned before, we refer to [40], by using 10CV. In [40], the authors used active learning through density clustering (ALEC), and the numerical experiments were performed by means of a CPU clocking in at 2.83 GHz. The ALEC training time in Table 4 has been deduced from a graph in [40], which makes the comparison stricter for our results.

Regarding the SKIN dataset, the first classification results appeared in [5], where the authors used a fuzzy decision tree model, obtaining an average accuracy equal to 94.1%. The most recent paper using the SKIN dataset and also discussing the training time is [35], where the authors developed a method to pre-select support vector candidates (SVC) for training support vector machines. As in [5], a 10CV has been adopted in [35]; the experiments were performed by means of an Intel Core i7-7700HQ (2.80–3.80 GHz). We refer to this method for a comparison (Table 4).

| Dataset | Method | Accuracy (%) | Time (s) |
| DLA | ALEC [40] | 99.92 | 4 × 10^3 |
| DLA | Method II (QBS-BP, m = 45) | 99.3 | 4.5 × 10^3 |
| SKIN | SVC [35] | 92.56 | 8.9 × 10^3 |
| SKIN | Method II (QBS-BP, m = 100) | 89.7 | 9.3 × 10^3 |

Table 4: Results for some high-dimensional data

With regard to the DLA dataset, the accuracy by Method II and by the reference approach is comparable, while the training time seems to be slightly higher. For the SKIN dataset, the accuracy is slightly lower and the training time slightly higher than those of the reference method.

6. Conclusions

In this paper, we revised FLANNs and FNs (two computational schemes using function bases) from a granular perspective. We defined a new architecture, by introducing a granular layer. We deduced a delta rule based learning algorithm as an alternative to a least squares one, leading to an iterative computing scheme in the first case. We formally proved the convergence, under some conditions. The aim here is not to propose a new, more accurate classifier, since there is nowadays a copious literature on that, but to adapt an existing scheme in the granular setting to address the interpretability issues. We performed a number of numerical experiments on benchmark cases, finding satisfactory results especially for the iterative scheme. As future work, we will investigate strategies to improve the proposed scheme, obtaining a faster training, especially for high-dimensional data. In this case, feature reduction techniques may be useful. In particular, we would like to study a combined scheme, integrating reduction techniques which preserve the similarity between the original and the reduced domain.

References

[1] Abpeykar, S., Ghatee, M., Zare, H. (2019) Ensemble decision forest of RBF networks via hybrid feature clustering approach for high-dimensional data classification. Computational Statistics and Data Analysis, 131, 12–36.
[2] Anastassiou, G.A. (1995) Comparison theorems on moduli of continuity. Computers and Mathematics with Applications, 30(3-6), 15–21.
[3] Azmi, M., Runger, G.C., Berrado, A. (2019) Interpretable regularized class association rules algorithm for classification in a categorical data space. Information Sciences, 483, 313–331.
[4] Bargiela, A., Pedrycz, W. (2003) Granular Computing: An Introduction. Springer, New York.
[5] Bhatt, R.B., Sharma, G., Dhall, A., Chaudhury, S. (2009) Efficient skin region segmentation using low complexity fuzzy decision tree model. In: Proceedings of the 2009 Annual IEEE India Conference, pp. 1–4.
[6] Bede, B., Rudas, I.J. (2011) Approximation properties of fuzzy transforms. Fuzzy Sets and Systems, 180, 20–40.
[7] Castillo, E. (1998) Functional networks. Neural Processing Letters, 7, 151–159.
[8] Cpalka, K. (2017) Design of Interpretable Fuzzy Systems. Studies in Computational Intelligence, 684, Springer, Cham, Switzerland.
[9] Dash, C.S.K., Behera, A.K., Nayak, S.C., Dehuri, S., Cho, S.-B. (2019) An integrated CRO and FLANN based classifier for a non-imputed and inconsistent dataset. International Journal on Artificial Intelligence Tools, 28(3), 1950013.
[10] Dehuri, S., Cho, S.-B. (2010) Evolutionarily optimized features in functional link neural network for classification. Expert Systems with Applications, 37, 4379–4391.
[11] de la Torre, J., Valls, A., Puig, D. (2019) A deep learning interpretable classifier for diabetic retinopathy disease grading. Neurocomputing, in press.
[12] Dick, S., Kandel, A. (2001) Granular computing in neural networks. In: W. Pedrycz (Ed.), Granular Computing: An Emerging Paradigm (pp. 275–305). Heidelberg: Physica-Verlag.
[13] Ganivada, A., Raya, S.S., Pal, S.K. (2013) Fuzzy rough sets, and a granular neural network for unsupervised feature selection. Neural Networks, 48, 91–108.
[14] Itani, S., Lecron, F., Fortemps, P. (2019) Specifics of medical data mining for diagnosis aid: A survey. Expert Systems with Applications, 118, 300–314.
[15] Jaeger, H. (2002) Adaptive nonlinear system identification with echo state networks. In: Proceedings of the 15th International Conference on Neural Information Processing Systems - NIPS'02 (pp. 593–600).
[16] Katuwal, R., Suganthan, P.N., Zhang, L. (2018) An ensemble of decision trees with random vector functional link networks for multi-class classification. Applied Soft Computing, 70, 1146–1153.
[17] Leite, D., Costa, P., Gomide, F. (2013) Evolving granular neural networks from fuzzy data streams. Neural Networks, 38, 1–16.
[18] Le Nguyen, T., Gsponer, S., Ilie, I., O'Reilly, M., Ifrim, G. (2019) Interpretable time series classification using linear models and multi-resolution multi-domain symbolic representations. Data Mining and Knowledge Discovery, 33(4), 1183–1222.
[19] Lipton, Z.C. (2018) The mythos of model interpretability. ACM Queue, 16(3), 1–27.
[20] Loia, V., Tomasiello, S. (2017) Granularity into functional networks. In: Proceedings of the Third IEEE International Conference on Cybernetics - CYBCONF 2017 (pp. 1–6).
[21] Loia, V., Parente, D., Pedrycz, W., Tomasiello, S. (2018) A granular functional network with delay: Some dynamical properties and application to the sign prediction in social networks. Neurocomputing, 321, 61–71.
[22] Lyche, T., Schumaker, L.L. (1975) Local spline approximation methods. Journal of Approximation Theory, 15(4), 294–325.
[23] Lu, W., Pedrycz, W., Liu, X., Yang, J., Li, P. (2014) The modeling of time series based on fuzzy information granules. Expert Systems with Applications, 41, 3799–3808.
[24] Mencar, C., Castellano, G., Fanelli, A.M. (2007) On the role of interpretability in fuzzy data mining. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 15(05), 521–537.
[25] Merikoski, J.K., Urpala, U., Virtanen, A. (1997) A best upper bound for the 2-norm condition number of a matrix. Linear Algebra and its Applications, 254, 355–365.
[26] Mishra, A., Dehuri, S. (2019) Real-time online fingerprint image classification using adaptive hybrid techniques. International Journal of Electrical and Computer Engineering, 9(5), 4372–4381.
[27] Naik, B., Nayak, J., Behera, H.S., Abraham, A. (2016) A self adaptive harmony search based functional link higher order ANN for non-linear data classification. Neurocomputing, 179, 69–87.
[28] Pao, Y.-H., Beer, R.D. (1988) Functional link net: A unifying network architecture incorporating higher order effects. Neural Networks, 1(1 SUPPL), 40.
[29] Pao, Y.-H., Phillips, S.M., Sobajic, D.J. (1992) Neural-net computing and intelligent control systems. International Journal of Control, 56(2), 263–289.
[30] Pao, Y.-H., Takefuji, Y. (1992) Functional-link net computing: Theory, system architecture, and functionalities. Computer, 25(5), 76–79.
[31] Pao, Y.-H., Park, G.-H., Sobajic, D.J. (1994) Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2), 163–180.
[32] Patra, J.C., Pal, R.N., Chatterji, B.N., Panda, G. (1999) Identification of nonlinear dynamic systems using functional link artificial neural networks. IEEE Transactions on Systems, Man, and Cybernetics B, 29(2), 254–262.
[33] Pedrycz, W., Vukovich, W. (2001) Abstraction and specialization of information granules. IEEE Transactions on Systems, Man, and Cybernetics B, 31(1), 106–111.
[34] Pedrycz, W., Vukovich, W. (2001) Granular neural networks. Neurocomputing, 36, 205–224.
[35] Reeberg de Mello, A., Stemmer, M.R., Oliveira Barbosa, F.G. (2018) Support vector candidates selection via Delaunay graph and convex-hull for large and high-dimensional datasets. Pattern Recognition Letters, 116, 43–49.
[36] Riid, A., Rüstern, E. (2014) Adaptability, interpretability and rule weights in fuzzy rule-based systems. Information Sciences, 257, 301–312.
[37] Tveit, A. (2003) On the complexity of matrix inversion. Mathematical Note, Norwegian University of Science and Technology, Trondheim (pp. 1).
[38] Uddin, M.Z., Hassan, M.M., Alsanad, H., Savaglio, C. (2020) A body sensor data fusion and deep recurrent neural network-based behavior recognition approach for robust healthcare. Information Fusion, 55, 105–115.
[39] Ugulino, W., Cardador, D., Vega, K., Velloso, E., Milidiu, R., Fuks, H. (2012) Wearable computing: Accelerometers' data. In: Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pp. 52–61.
[40] Wang, M., Min, F., Zhang, Z.-H., Wu, Y.-X. (2017) Active learning through density clustering. Expert Systems with Applications, 85, 305–317.
[41] Weng, W.-D., Yang, C.-S., Lin, R.-C. (2007) A channel equalizer using reduced decision feedback Chebyshev functional link artificial neural networks. Information Sciences, 177, 2642–2654.
[42] Zhang, L., Suganthan, P.N. (2016) A comprehensive evaluation of random vector functional link networks. Information Sciences, 367–368, 1094–1105.
[43] Zhang, Y.Q., Jin, B., Tang, Y. (2008) Granular neural networks with evolutionary interval learning. IEEE Transactions on Fuzzy Systems, 16(2), 309–319.
[44] Zhao, H., Zeng, X., He, Z., Yu, S., Chen, B. (2016) Improved functional link artificial neural network via convex combination for nonlinear active noise control. Applied Soft Computing, 42, 351–359.
[45] Zhou, G., Zhou, Y.-Q., Huang, H., Tang, Z. (2019) Functional networks and applications: A survey. Neurocomputing, 335, 384–399.
[46] Rudin, W. (1976) Principles of Mathematical Analysis. New York: McGraw-Hill.


Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT author statement

F. Colace: Conceptualization, Methodology, Software, Validation, Investigation.
V. Loia: Conceptualization, Methodology.
W. Pedrycz: Conceptualization, Methodology.
S. Tomasiello: Conceptualization, Methodology, Formal analysis, Writing - Original Draft, Writing - Review.


Francesco Colace is Associate Professor in Computer Science in the Department of Industrial Engineering, University of Salerno, Salerno, Italy. His main research directions involve Knowledge Management, Recommender Systems, Context-Aware Computing, Affective Computing and Sentiment Analysis, e-Learning, and ICT for Cultural Heritage. He has published numerous papers in these areas. He is also an author of research monographs and edited volumes covering various aspects of ICT and Cultural Heritage, Pervasive Computing and Sentiment Analysis. Professor Colace is managing, as principal investigator, various research projects funded by national and international institutions. He serves as a reviewer for many international journals, such as IEEE Transactions on Knowledge and Data Engineering, Knowledge-Based Systems and IEEE Transactions on Education, and is a member of various editorial boards of international journals.

Vincenzo Loia received the B.S. degree in computer science from the University of Salerno, Italy, in 1985 and the M.S. and Ph.D. degrees in computer science from the University of Paris VI, France, in 1987 and 1989, respectively. Since 1989 he has been a faculty member at the University of Salerno, where he teaches Safe Systems, Situational Awareness, and IT Project & Service Management. His current position is President of the University of Salerno, Italy. He is the founder and editor-in-chief of the Journal of Ambient Intelligence and Humanized Computing (Springer) and editor-in-chief of Evolutionary Intelligence (Springer). He is an Associate Editor of various IEEE Transactions journals. His research interests include soft computing, agent technology for technologically complex environments, Web intelligence, Situational Awareness, Cognitive Defense, and Artificial Intelligence. In recent years he has held several roles in the IEEE Computational Intelligence Society (Chair of the Emergent Technologies Technical Committee, IEEE CIS European Representative, Vice-Chair of the Intelligent Systems Applications Technical Committee).

Witold Pedrycz (IEEE Fellow, 1998) is Professor and Canada Research Chair (CRC) in Computational Intelligence in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. He is also with the Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland. In 2009 Dr. Pedrycz was elected a foreign member of the Polish Academy of Sciences. In 2012 he was elected a Fellow of the Royal Society of Canada. In 2007 he received the prestigious Norbert Wiener Award from the IEEE Systems, Man, and Cybernetics Society. He is a recipient of the IEEE Canada Computer Engineering Medal, a Cajastur Prize for Soft Computing from the European Centre for Soft Computing, a Killam Prize, a Fuzzy Pioneer Award from the IEEE Computational Intelligence Society, and the 2019 Meritorious Service Award from the IEEE Systems, Man and Cybernetics Society. His main research directions involve Computational Intelligence, fuzzy modeling and Granular Computing, knowledge discovery and data science, pattern recognition, knowledge-based neural networks, and control engineering. He has published numerous papers in these areas; his current h-index is 111 (Google Scholar) and 82 on the list of top-h scientists for computer science and electronics (http://www.guide2research.com/scientists/). He is also an author of 21 research monographs and edited volumes covering various aspects of Computational Intelligence, data mining, and Software Engineering. Dr. Pedrycz is vigorously involved in editorial activities. He is Editor-in-Chief of Information Sciences, Editor-in-Chief of WIREs Data Mining and Knowledge Discovery (Wiley), and Co-Editor-in-Chief of Int. J. of Granular Computing (Springer) and J. of Data, Information and Management (Springer). He serves on the Advisory Board of IEEE Transactions on Fuzzy Systems and is a member of a number of editorial boards of international journals.

Stefania Tomasiello, Ph.D. in computer science (University of Salerno, Italy), is currently a Lecturer of Artificial Intelligence at the Institute of Computer Science, University of Tartu, Estonia, where she is responsible for the course on Fuzzy Logic and Soft Computing. Formerly, she was a permanent researcher with CO.RI.SA. (Research Consortium on Agent Systems), University of Salerno, Italy, and a Senior Research Fellow with the Department of Management and Innovation Systems (DISA-MIS), University of Salerno. She holds the Italian Scientific Qualification to serve as associate professor in computer science and in numerical analysis. She has been a work package leader in several funded projects and an expert evaluator (ex-ante and ex-post) of applied research projects for the Italian Ministry of Economic Development and, formerly, the Italian Ministry of University and Research. She has been adjunct professor of Fundamentals of Computer Science, Human-Computer Interaction, Computational Methods and Finite Element Analysis at the University of Basilicata, Italy. She has been a TPC member of many international conferences, including ACM- and IEEE-sponsored events. Her research interests lie in scientific and soft computing, AI, fuzzy mathematics, and nonlinear dynamics. She has authored and co-authored numerous papers in the above-mentioned areas. She is managing editor of Evolutionary Intelligence (Springer), associate editor of the International Journal of Computer Mathematics (Taylor & Francis), and an editorial board member of some journals. She is an ACM, ECMI, EUSFLAT, IEEE and INNS member.
