Gaussian Mixture Descriptors Learner


Breno L. Freitas a,b, Renato M. Silva b, Tiago A. Almeida b

a Shopify Inc., Ottawa, Ontario, Canada
b Department of Computer Science, Federal University of São Carlos (UFSCar), Sorocaba, São Paulo, Brazil

Article history: Received 29 March 2019; Received in revised form 31 July 2019; Accepted 12 September 2019.

Keywords: Minimum description length; Classification; Machine learning

Abstract

In recent decades, various machine learning methods have been proposed to address classification problems. However, most of them do not support incremental (or online) learning and therefore are neither scalable nor robust to dynamic problems that change over time. In this study, a classification method was introduced based on the minimum description length principle, which offered a very good trade-off between model complexity and predictive power. The proposed method is lightweight, multiclass, and online. Moreover, despite its probabilistic nature, it can handle continuous features. Experiments conducted on real-world datasets with different characteristics demonstrated that the proposed method outperforms established online classification methods and is robust to overfitting, which is a desired characteristic for large, dynamic, and real-world classification problems.

1. Introduction

Machine learning methods search, among a set of potential models, for one that fits the training data well, and then use it to predict unseen data. It is desirable to obtain models with high generalization capability, that is, models capable of generating adequate and consistent outputs for both training and unseen data. As there is a large number of candidate models, each machine learning method uses a specific selection strategy (e.g., probability, optimization, distance) to select the best possible model.

A simple criterion is to choose the model that offers the most compact description of the observed data. This selection strategy is based on the principle of parsimony, also known as Occam's razor. It was devised by William of Occam, an English Franciscan friar and scholastic philosopher, in the Middle Ages, as a criticism of scholastic philosophy for its complex theories. More recently, Occam's razor has been used in scientific methodology, stating that if there are two or more hypotheses that can explain a phenomenon, the simplest is probably the best [1].

Rissanen [2] proposed the minimum description length (MDL) principle in the context of information theory: a model selection

strategy based on Kolmogorov complexity [3], which is considered a formalization of Occam's razor. The MDL principle prioritizes models that fit the data well and simultaneously have low complexity. This characteristic naturally avoids overfitting and is therefore desirable for classification methods.

Over the last few years there has been an increasing interest in classification methods that are scalable and dynamic. Several traditional methods fail to meet at least one of these requirements: they usually offer offline learning only, requiring that all training data be presented at once. Therefore, they cannot be used in problems involving continuous data flow, where the prediction model should be updated as new samples are presented for training. Currently, in the big data era, where a large number of problems contain a massive amount of data, incremental (or online) learning methods are desirable owing to their scalability [4].

MDL-based online learning methods have been proposed in some studies, such as Bratko et al. [5], Braga and Ladeira [6], Almeida and Yamakami [7], and Silva et al. [8]. However, all these methods were specifically designed for text classification. Later, Silva et al. [9] proposed a more generic MDL-based classifier, but it can only be trained online when the features are categorical, since it needs to discretize the data.

To circumvent these limitations, in this study we propose the Gaussian Mixture Descriptors Learner (GMDL): a multinomial and online classification method that incorporates the theoretical advantages offered by the MDL principle. The main novelty in relation to existing MDL-based online learning approaches is that GMDL can be used in any binary or multiclass classification problem whose data can be represented by categorical or numerical features (discrete or continuous). Moreover, GMDL has the inherent ability to prevent overfitting, since it is based on the MDL principle. To support our


claims, we performed a comprehensive performance evaluation with 16 public and well-known datasets and compared GMDL with benchmark online learning classification methods.

The remainder of this paper is organized as follows. We briefly summarize related work in Section 2. In Section 3, we present the basic concepts related to the MDL principle. In Section 4, we explain the proposed classification method. The experimental settings are detailed in Section 5. Section 6 presents all the results and the related analysis. Finally, we give our main conclusions and suggestions for future research in Section 7.

2. Brief related work

In some real-world problems, data arrive continuously. Other problems contain a massive amount of data, which prevents the use of offline learning methods, as all the examples would have to be processed at the same time. Online learning methods are more appropriate for these types of problems because they can incrementally update their predictive model [4]. Therefore, they are suitable for large-scale problems, they are efficient in handling dynamic changes in data distribution, and, in general, they require less training time and memory than offline learning methods [4,10,11].

Online learning has been extensively studied in machine learning [12–18]. Among the existing online learning methods, one of the most widely used is the multilayer perceptron (MLP) algorithm [19]. There are also online methods based on gradient descent, such as stochastic gradient descent (SGD) [20]. Passive–aggressive (PA) is another established online learning method [11]. MLP, SGD, and PA have been used as baselines in several studies [4,9,21,22].

The MDL principle is a versatile and powerful model selection method. The consolidation of Occam's razor into a mathematical model is attractive in several contexts. Since its introduction in Rissanen [2], the MDL principle has been used in several studies [23–27]. However, to the best of our knowledge, there are few online classification methods based on the MDL principle and, in general, they were specifically designed for text classification. For example, Bratko et al. [5], Braga and Ladeira [6], and Almeida and Yamakami [7] proposed spam classification methods based on the MDL principle. Bratko et al. [5] proposed two spam classification methods: one is based on dynamic Markov compression and the MDL principle, whereas the other is based on the combination of minimal cross-entropy with the MDL principle. Braga and Ladeira [6] also proposed a spam classification method, but their strategy was based on the MDL principle and adaptive Huffman coding. Almeida and Yamakami [7] proposed a spam classification method based on the combination of a feature selection technique called confidence factors (CF) and the MDL principle. CF was used to obtain the relevance of the terms in each class; thus, the most relevant terms contribute more to the calculation of the description length, which is used by the MDL principle as a criterion for model selection. According to the authors, their method obtained better results than benchmark methods for spam classification.

Silva et al. [9] extended the method proposed by Almeida and Yamakami [7] so that it can be used in other binary and multiclass text classification problems. Later, Silva et al. [8] extended the method proposed by Silva et al. [9] so that it can be applied to other classification problems. However, that method is naturally adapted only to problems with categorical features.
Thus, continuous features are discretized and, therefore, incremental training is not possible. As is the case with the methods described above, our proposed method is based on the MDL principle; however, it can be employed in any classification problem with categorical and continuous features.

3. MDL principle

The MDL principle was originally proposed in Rissanen [2,28] based on the idea that, in a problem of model selection, the model with the smallest description length should be selected. The MDL principle originates in information theory and states that the data compression capacity of a model is directly proportional to its knowledge about the data [29]. MDL also draws ideas from the Kolmogorov complexity [3], which is defined as the length of the shortest program that prints a given sequence and then halts. The smaller the Kolmogorov complexity of a given sequence, the greater the regularity found by the corresponding program and, therefore, the greater the knowledge about the sequence. Thus, the smallest program that can reproduce a given sequence should be selected to represent it [30]. As the MDL principle is based on both Occam's razor and Kolmogorov complexity, MDL-based classification methods can obtain a desirable trade-off between model complexity and overfitting, thus providing a model that generalizes well and is not overly complex.

3.1. Terminology and definitions

Let $\mathcal{X}$ be a finite set of symbols and $\mathcal{X}^n$ the set of all finite sequences of $n$ symbols. We denote by $x^n$ a sequence of symbols $(x_1, x_2, \ldots, x_n)$, where $x_i \in \mathcal{X}$; whenever it is clear from the context, the notation $\mathcal{X}$ will be used rather than $\mathcal{X}^n$. In this study we focus on binary sequences; therefore, bits are used as the information units and, for the sake of simplicity, the notation $\log$ is used for $\log_2$.

A probability source $P$ is a sequence $P^{(1)}, P^{(2)}, \ldots$ defined on $\mathcal{X}^1, \mathcal{X}^2, \ldots$ such that $P^{(n)}$ is the marginal probability of $P^{(n+1)}$; whenever it is clear from the context, the notation $P$ will be used rather than $P^{(n)}(x)$ with $x \in \mathcal{X}^n$. Let $P$ be a probability distribution defined on $\mathcal{X}$; $P(x)$ is called the probability of $x$.

An encoding is a function that maps one set of sequences into another, usually in a different domain, in a more compact form. An encoding $C$ is said to be prefix-free if no code in $C$ is a prefix of another; thus, decoding can be performed as soon as the code is received. Every code produced by an encoding has a corresponding length associated with it. We say $L_C$ is a function that describes the length of a symbol encoded by $C$. Kraft [31] described a relation between an encoding function and code lengths as follows.

Theorem 3.1 (The Kraft Inequality). There is a prefix-free encoding with cardinality $D$ and code lengths $l_1, \ldots, l_n$ if and only if $\sum_{i=1}^{n} D^{-l_i} \leq 1$.

In the MDL principle, we are more interested in the length of a symbol encoding than in the code itself. Ultimately, we look for an encoding $C$ of a given dataset (or set of symbols) $D$ such that $L_C(D)$ is as short as possible.

3.2. Information theory

The MDL principle originates in information theory [2,28,32], which is closely related to probability theory. Information theory is concerned with the transfer and reception of messages and their efficient understanding. In probability theory, the expected value of a distribution is approached by averaging successive trials over that distribution; this property is commonly known as "the law of large numbers". Thus, this theorem allows for an incremental approximation of a probability distribution. It is still necessary to define the ideal minimum


length for $L$ based on the information presented by Theorem 3.1. It is possible to show that the optimal length is bounded below by $-\log P(x)$ for a given sequence of symbols $x$ [29, Section 2.2.2]. A direct consequence of this property is that

$$\hat{L} = E[\hat{l}(X)] = \sum_{x \in \mathcal{X}} P(x) \cdot \hat{l}(x) \;\ge\; -\sum_{x \in \mathcal{X}} P(x) \cdot \log P(x);$$

i.e., from an information theory perspective, the expected length of a prefix-free encoding is bounded below by the entropy of the source. It is evident that if a distribution generates a given encoding, then its entropy is smaller than that of any other distribution that attempts to describe the same encoding. This property is known as the information inequality and can be defined as follows.

Proposition 3.1 (Information Inequality). If $P$ and $Q$ are probability distributions such that $P \neq Q$, then $E_P[-\log P(Y)] < E_P[-\log Q(Y)]$ for a random variable $Y$.

With all these pieces it is then possible to build a method for selecting a model that is simple yet a good descriptor of a set: Theorem 3.1 establishes the existence of a prefix-free encoding $C$ with an associated probability mass distribution $P$; the law of large numbers shows that $L_C$ approaches its optimum value as the number of samples increases; and Proposition 3.1 implies that if $P$ is in fact the distribution that should be approximated, then $L_C(z) = \lceil -\log P(z) \rceil$ for all $z$ in a set of symbol sequences $Z$ defined by $P$.

The MDL principle was originally proposed as a two-part method (also called crude MDL), defined by Eq. (1), which combines the Kraft inequality, the law of large numbers, and the information inequality. In this version, given a set of probability sources (candidate models) $\mathcal{M}$, the model $M$ that minimizes Eq. (1) should be chosen:

$$M_{\mathrm{MDL}} := \arg\min_{M \in \mathcal{M}} \left[ L(M) + L(D \mid M) \right] \tag{1}$$

The MDL is present in several studies, particularly in those related directly to information theory. In some studies, the MDL principle is interpreted from a Bayesian perspective [33, Chapter 28.3] or is compared to the maximum a posteriori estimation method [34, Chapter 20]. A large part of the MDL literature is focused on two-part MDL by defining a model complexity function. This comes from the fact that universal encodings are computationally prohibitive, and thus, it is more feasible to obtain the model itself than its encoding [35]. According to Van Leeuwen and Siebes [35], there are barriers to implementing MDL in real-world problems. Therefore, in this study, we focus on the two-part version, as in other studies in the literature.
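To make the two-part criterion of Eq. (1) concrete, the following is a minimal sketch, not taken from the paper: it scores a small family of hypothetical candidate models by a crude description length, charging each model a fixed number of bits for L(M) and the rounded-up ideal code length of the data for L(D|M). The model costs and probability functions are illustrative choices only.

```python
import math

def crude_mdl(models, data):
    """Pick the model minimizing L(M) + L(D|M) (two-part MDL, Eq. (1)).

    `models` maps a name to (model_cost_bits, prob_fn), where prob_fn(x)
    returns the probability the model assigns to a single observation.
    """
    best_name, best_length = None, math.inf
    for name, (model_bits, prob_fn) in models.items():
        # L(D|M): ideal code length of the data under the model, in bits.
        data_bits = sum(math.ceil(-math.log2(prob_fn(x))) for x in data)
        total = model_bits + data_bits
        if total < best_length:
            best_name, best_length = name, total
    return best_name, best_length

# Toy example: binary data described either by a fair coin (cheap model)
# or by a biased coin whose parameter costs a few extra bits to describe.
data = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
models = {
    "fair":   (1.0, lambda x: 0.5),
    "biased": (6.0, lambda x: 0.8 if x == 1 else 0.2),
}
print(crude_mdl(models, data))
```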

4. Gaussian mixture descriptors learner (GMDL)

Eq. (1) can be interpreted as the search for a model M that provides the most compact description of D, where M induces a probability distribution P for a given data encoding. As mentioned in Section 3.2, the shortest codification possible for a given code induced by a probability distribution P is bounded by ⌈− log P⌉. From a Bayesian perspective, by knowing in advance the actual underlying probability distribution describing a dataset, it would be possible to perfectly predict an unseen sample. However, in real-world problems, the actual probability distribution is not known, and therefore it is necessary to compute an estimate.

Based on the aforementioned concepts, we propose GMDL as a new classification method based on the MDL principle. The following sections present the building blocks of the method in a bottom-up fashion: Section 4.1 explores a means to incrementally approximate a probability distribution; Section 4.2 defines the mathematical bases for the proposed method, connecting the information theory concepts from Section 3.2 to the incremental probability distribution estimation; and finally, Section 4.3 presents the pseudo-code for the proposed method.

4.1. Online kernel density estimation

In real applications, it is desirable to have a probability distribution estimation technique that (i) does not require discretization of the data, as in the method proposed in Silva et al. [8], and (ii) is incremental, so that it can learn from new samples in a continuous data stream. Thus, Eq. (1) can be rewritten to be used in a classification method based on probability distributions where each model induces a distribution for each class of the problem.

A Gaussian mixture is defined as the weighted sum of G Gaussian components, each with its own mean and covariance matrix. Formally, a Gaussian mixture p(X) for a random variable X is defined as

$$p(X) := \sum_{i=1}^{G} w_i \cdot \mathcal{N}(\mu_i, \Sigma_i), \tag{2}$$

where $w$ is a weight vector such that $\sum_{i=1}^{G} w_i = 1$ with $w_i > 0$, and $\mathcal{N}(\mu_i, \Sigma_i)$ denotes a normal distribution with mean $\mu_i$ and covariance $\Sigma_i$. Assigning different weights to each component ($w_i$) allows greater prediction flexibility, because the mean and covariance pair of a given component may be more descriptive than that of other components and therefore has a greater influence on the computation of the density value for the random variable.

A classic and parametric method of estimating a probability density function for a set of instances is to assume that the data are normally distributed [36]. However, such techniques require prior definition of the parameters of the mixture [36,37]. Estimating such parameters is not a trivial task; for example, using the wrong number of components may cause the method not to properly represent the real function [38]. In Silverman [39], a non-parametric method called kernel density estimation (KDE) was proposed for estimating the probability density function of random variables. This method tries to approximate a probability distribution given a set of samples drawn from an unknown distribution, with the help of a function Φ called the kernel. A parameter b, called the bandwidth, is used to smooth out the approximate distribution. Both the kernel and the bandwidth have a great influence on the estimation of the function [39]. There are several kernel functions in the literature, but the Gaussian kernel (Eq. (3)) is the most widely used, owing to its mathematical simplicity:

$$\Phi_{\Sigma}^{\mathrm{gauss}}(x, \mu) := \frac{1}{\sqrt{2\pi} \cdot |\Sigma|^{\frac{1}{2}}} \cdot e^{-\frac{(x-\mu)\cdot(x-\mu)^{T}}{2\Sigma}} \tag{3}$$

KDE has already been used to extend Bayesian inference methods [40,41] and obtained a good trade-off between classification accuracy and learning curves. Furthermore, this method has one of the desired characteristics regarding the distribution, namely, it does not require the data to be discretized. Thus, the probability distributions can be estimated without suffering from the information loss caused by data discretization, and we are then able to extend Eq. (1) into a classification method.

A major difficulty in transforming KDE into an incremental method is that the number of components used for the distribution estimate grows linearly with each new instance [38], because a sufficient amount of information is required for generalizing to unseen instances without revisiting previous training instances [42]. Recently, Kristan et al. [38] proposed a method called oKDE to incrementally compute the kernel density estimate. It is therefore well suited to our second desired characteristic for the probability distribution function estimation.
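For intuition, the following is a minimal batch Gaussian KDE sketch (one-dimensional, equal weights). It is illustrative only and is not the oKDE/xokde++ algorithm, which additionally compresses the mixture so it does not grow with every sample; the function name and bandwidth value are arbitrary.

```python
import numpy as np

def gaussian_kde(samples, bandwidth):
    """Batch KDE with the Gaussian kernel of Eq. (3), for illustration only."""
    samples = np.asarray(samples, dtype=float)

    def density(x):
        z = (x - samples) / bandwidth
        kernels = np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)
        return kernels.mean()  # equal weights w_i = 1/m over all kernels

    return density

p_hat = gaussian_kde([1.2, 0.8, 1.1, 3.0, 2.9], bandwidth=0.4)
print(p_hat(1.0), p_hat(2.0))
```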


In oKDE estimation, a parameter f, called the forgetting factor, is also used to assign a weight to old samples, causing them to have less impact on a data stream (this is ideal for scenarios where the data stream is temporal). A forgetting factor equal to 1 implies that all temporal samples are weighted equally.

Ferreira et al. [42] proposed a version of oKDE called xokde++ to resolve numerical stability issues and obtain a more robust computational approach in terms of memory usage and processing. This approach uses a normalized distribution, that is, all estimates have a density bounded by one (a most useful property, as it can yield a probability estimate). For simplicity, the acronym oKDE will henceforth be used for xokde++.

4.2. Incremental learning

Proposition 3.1 implies that no distribution represents a dataset as well as its true distribution. Therefore, by using oKDE to obtain an approximation of this distribution, according to the law of large numbers, it is expected that as the number of samples increases, increasingly representative estimates of the true distribution will be obtained by assuming a normal distribution on the samples. This allows us to take a step further and define an incremental learning methodology that does not rely on the discretization of the samples.

Using the MDL equation, we define the description length L̂ of an instance as the sum of the description lengths of its features when encoded by the approximation of their density functions p′ for each class. This can be formalized as

$$\hat{L}(\vec{x} \mid c) := \sum_{i=1}^{n} \lceil -\log p'_{(i,c)}(x_i) \rceil, \tag{4}$$

where ⃗x is a vector (x_1, ..., x_n) of features and c ∈ K is the class. It is important to note that L̂(⃗x|c) ∈ ℕ. Furthermore, although oKDE was used as a means of obtaining p′, any online estimator of a probability distribution could have been used. We opted for oKDE because both oKDE and its non-incremental counterpart (KDE) have already been used in classification problems with good results [38,40,41].

One of the caveats of using oKDE is that the value of p′ can tend to infinity when there is only one sample being evaluated, and can be zero if there are insufficient instances to measure the density at a point. Therefore, Eq. (4) can be rewritten as

$$\hat{L}(\vec{x} \mid c) := \sum_{i=1}^{n} \lceil -\log p_{(i,c)}(x_i) \rceil, \tag{5}$$

where

$$p_{(i,c)} := \begin{cases} 2^{-\Omega}, & p'_{(i,c)} \to \infty \ \vee \ p'_{(i,c)} = 0 \\ p'_{(i,c)}, & \text{otherwise.} \end{cases} \tag{6}$$

In Eq. (6), Ω is a meta-parameter that acts as a regularizer when there is insufficient information to compute a probability. It is also possible to expose the forgetting factor f of oKDE as a meta-parameter of p′, but this is omitted for notational simplicity.

As p′_(·,c) is a density function approximated by oKDE, it is a Gaussian mixture. Therefore, it is composed of a finite number of distributions; a larger number of Dirac-delta functions implies a larger number of components. This makes |p′_(·,c)| = G_c (the number of components of the Gaussian mixture for class c) a sensible choice, as a larger number of components implies a larger description length for the model. Hence, there is a candidate for L(M) of a given model M. Thus, GMDL′ can be defined as follows:

$$\mathrm{GMDL}'(\vec{x}, K) := \arg\min_{c \in K} \left[ \hat{L}(\vec{x} \mid c) + G_c \right]. \tag{7}$$

Eq. (7) can be formalized to provide better numerical stability and obtain a pseudo-probability when it is evaluated as 1 − GMDL(⃗x). Therefore, the first version of GMDL can be defined as follows:

$$\mathrm{GMDL}(\vec{x}) := \begin{bmatrix} \mathrm{GMDL}'(\vec{x}, \{c_1\}) \\ \mathrm{GMDL}'(\vec{x}, \{c_2\}) \\ \vdots \\ \mathrm{GMDL}'(\vec{x}, \{c_{|K|}\}) \end{bmatrix} \Bigg/ \sum_{k=1}^{|K|} \mathrm{GMDL}'(\vec{x}, \{c_k\}). \tag{8}$$

The optimization proposed in Eq. (8) is similar to the maximum likelihood estimation method [29]. However, this estimate is limited by logarithms and also by a normalization among the classes of the problem. Therefore, this main equation of GMDL may be considered similar to a normalized maximum likelihood estimate.

4.2.1. Alleviating degenerate distributions on KDE

The method proposed in Kristan et al. [38] is susceptible to degenerate distributions, which can be briefly defined as low-variance distributions. Formally, they can be defined as follows.

Definition 4.1. Let X ∼ N(µ, Σ) be a random variable following a distribution p_X. If ∥diag(Σ)∥_∞ ≈ 0, then p_X is a degenerate distribution.

The method in Ferreira et al. [42] was an attempt to mitigate the issues of numerical stability and degenerate distributions. In the proposed technique, a decomposition into eigenvalues and eigenvectors is obtained, and then the eigenvalues are analyzed to determine those that are smaller than 10^{-9} (fixed by 1% of the average of the eigenvalues). However, this technique is ineffective when there is only one dimension because it is highly dependent on the order in which the instances are presented.

To mitigate this problem, Gaussian noise drawn from N(0, σ̃²) is added to the input in the probability density estimation, which keeps the distribution mean and only changes its standard deviation. This is a useful property, given that the central point, estimated as a Dirac-delta function by oKDE [38], has the highest density of the distribution. However, changing the standard deviation also changes the density of the data around the mean. As proposed by Ferreira et al. [42], noise is then only used when the variance of an attribute in a given class is less than the threshold of 10^{-9}. This process, given the meta-parameter σ̃², can be described as

$$\vec{x}_i := \begin{cases} \vec{x}_i + n, \ n \sim \mathcal{N}(0, \tilde{\sigma}^2), & \sigma^2_{ic} < 10^{-9} \\ \vec{x}_i, & \sigma^2_{ic} \ge 10^{-9}. \end{cases} \tag{9}$$
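The two stabilizers above are simple to express in code; the following is a minimal sketch. The constant names and the particular value chosen for Ω are illustrative, not the paper's settings; only the 10^{-9} threshold comes from the text.

```python
import math
import random

OMEGA = 32            # meta-parameter Ω: regularized probability is 2**-Ω (illustrative value)
SIGMA2_NOISE = 2.0    # meta-parameter σ̃² used for the stabilizing noise
VAR_THRESHOLD = 1e-9  # variance threshold below which noise is injected

def regularized_density(p_prime):
    """Eq. (6): fall back to 2**-Ω when the raw oKDE density is unusable."""
    if p_prime == 0.0 or math.isinf(p_prime):
        return 2.0 ** -OMEGA
    return p_prime

def stabilize_feature(x_i, var_ic):
    """Eq. (9): add Gaussian noise only for near-degenerate (low-variance) features."""
    if var_ic < VAR_THRESHOLD:
        return x_i + random.gauss(0.0, math.sqrt(SIGMA2_NOISE))
    return x_i
```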

The incremental calculation of the variance is performed by an approximation based on the accumulation of the mean. This method was proposed in Welford [43] and subsequently refined in Ling [44] and Chan et al. [45]. It is based on a continuous data stream from a random variable.

4.3. Overview of the proposed method

Algorithms 1 and 2 summarize the training and classification stages of GMDL. The source code is publicly available on GitHub (https://github.com/brenolf/gmdl, accessed on Jul 28, 2019). The proposed method requires kernel density estimations only in the training stage (Algorithm 1). These estimates are computed


incrementally by oKDE without discretizing the values of the features, which provides robustness and speed in training. At the same time, the approximation of the variance is computed to provide stability to the method by adding Gaussian noise whenever necessary, as explained in Section 4.2.1.

Algorithm 1 Training stage for GMDL.
Require: Instance ⟨⃗x, c⟩, and parameters ⟨Ω, σ̃², f⟩
 1: c̃ ← p_c(⃗x; f)
 2: for i ∈ {1, ..., n} do
 3:   if p′_(i,c) → ∞ ∨ p′_(i,c) = 0 then
 4:     p_(i,c) ← 2^{−Ω}
 5:   else
 6:     while σ²_ic < 10^{−9} do
 7:       ⃗x_i ← X_1, X ∼ N(0, σ̃²)
 8:     end while
 9:     p_(i,c) ← p′_(i,c)(⃗x_i; f)   {Equation (6)}
10:   end if
11:   n_ic ← n_ic + 1
12:   M_{2,n_ic} ← M_{2,n_ic−1} + (⃗x_i − x̄_{n_ic−1}) · (⃗x_i − x̄_{n_ic})
13:   x̄_{n_ic} ← x̄_{n_ic−1} + (⃗x_i − x̄_{n_ic−1}) / n_ic
14:   σ²_ic ← M_{2,n_ic} / n_ic   {Iterative approximation of the variance}
15: end for
16: return p, G
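Lines 11–14 of Algorithm 1 are the Welford-style running update of the per-feature, per-class variance. A minimal stand-alone sketch of that update (illustrative only; the class name is ours):

```python
class RunningVariance:
    """Welford's online mean/variance, as used in lines 11-14 of Algorithm 1."""

    def __init__(self):
        self.n = 0       # n_ic: number of values seen
        self.mean = 0.0  # x̄: running mean
        self.m2 = 0.0    # M2: running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses both the old and new mean
        return self.m2 / self.n              # σ²: population variance estimate

rv = RunningVariance()
for value in [2.0, 2.1, 1.9, 2.05]:
    var = rv.update(value)
print(rv.mean, var)
```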

Algorithm 2 Classification stage for GMDL.
Require: Instance ⃗x, and parameters ⟨p, G, K⟩
 1: for c ∈ K do
 2:   for i ∈ {1, ..., n} do
 3:     L̂(⃗x, c) ← L̂(⃗x, c) + ⌈− log p_(i,c)(⃗x_i)⌉   {Equation (5)}
 4:   end for
 5:   L̂(⃗x, c) ← L̂(⃗x, c) + G_c   {Equation (7)}
 6: end for
 7: for c ∈ K do
 8:   L(⃗x, c) ← L̂(⃗x, c) / Σ_{k=1}^{|K|} L̂(⃗x | c_k)   {Equation (8)}
 9: end for
10: return ĉ ← arg min_{c∈K} L(⃗x | c)
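A compact sketch of this scoring (Eqs. (5), (7), and (8)) is given below. It assumes per-class, per-feature density functions `densities[c][i]` and component counts `G[c]` maintained by the training stage; both data structures, the function name, and the default Ω are illustrative, not the reference implementation.

```python
import math

def gmdl_predict(x, densities, G, omega=32):
    """Score every class by description length and return the cheapest one.

    densities[c][i] : callable estimating the density of feature i under class c
    G[c]            : number of mixture components for class c (model cost)
    """
    raw = {}
    for c, feature_densities in densities.items():
        length = 0
        for i, density in enumerate(feature_densities):
            p = density(x[i])
            if p == 0.0 or math.isinf(p):          # Eq. (6) regularizer
                p = 2.0 ** -omega
            length += math.ceil(-math.log2(p))     # Eq. (5)
        raw[c] = length + G[c]                     # Eq. (7)
    total = sum(raw.values())
    scores = {c: v / total for c, v in raw.items()}  # Eq. (8) normalization
    return min(scores, key=scores.get), scores
```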

The prediction stage (Algorithm 2) of the method estimates the density for each feature of the instance by using the distributions obtained during training, before selecting the class that best describes the instance.

4.3.1. Asymptotic complexity

The training stage (Algorithm 1) of GMDL is based on the computation of density estimates for a single class for the n features of the instance. The complexity of oKDE is $O\!\left(\frac{G^2 + G}{2}\right)$, where G is the number of components of the Gaussian mixture. Therefore, the training stage has quadratic complexity with respect to the number of components in the mixture. According to Kristan et al. [38], the number of components is low, on the order of O(10¹). Furthermore, the classification stage (Algorithm 2) has a pseudo-linear step for the estimation of the description length, with complexity O(n · |K|).


5. Experimental setup

To evaluate GMDL, we consider the online learning scenario where the instances are presented one at a time to the classifier and feedback may or may not be submitted to the method, as in some real-world applications. To give credibility to the results and to make the experiments reproducible, all the tests were performed using 16 datasets publicly available on the UCI repository (https://archive.ics.uci.edu/ml/datasets.html). Table 1 summarizes the main statistics for each dataset used in this study, where m is the number of instances, n is the number of features, and |K| is the number of classes. The last column presents the number of instances in each class. The datasets have distinct characteristics aimed at evaluating the robustness of GMDL from different aspects, such as the number and the type of features (categorical and continuous), the class balancing, the number of classes, and the number of instances. Datasets 1, 2, 4, 5, 7, 8, 9, 10, and 13 were used in Fernández-Delgado et al. [46]. Some of these datasets were also previously used in Kristan and Leonardis [47], namely datasets 3, 11, 14, 15, and 16. We also used datasets 6 and 12 because of their large number of instances, which makes them good representatives of real-world scenarios with a large amount of data.

In this study, we converted the categorical values to numeric values using one-hot encoding. Moreover, in all experiments, we applied Z-score normalization using information from the training examples. To compare the results, we employed the well-known macro-averaged F-measure [9]. Additionally, to ensure that the results were not obtained randomly, we performed a statistical analysis of the results using the nonparametric Friedman test, carefully following the methodology described in Zar [48, Section 12.7]. If the null hypothesis of the Friedman test was rejected, we performed a pairwise comparison using the Wilcoxon post-hoc test [49].

We compared the results obtained by GMDL with the following established online learning methods: multilayer perceptron (MLP) [19], passive–aggressive (PA) [11], and stochastic gradient descent (SGD) [20]. We implemented SGD with the standard SVM loss (the hinge loss) and L2 regularization, which is similar to a linear SVM that can be updated online [20,50,51]. We used the implementations of all baseline methods from the scikit-learn library (http://scikit-learn.org/), with their default parameters. For GMDL, as in Srivastava et al. [52], we applied the following parameters: σ̃² = 2 and f = 1. The forgetting factor f was set to 1 because the datasets do not contain temporal information such as trend and temporal noise.

5.1. Evaluation

To simulate a real scenario of online learning, we consider that only a small number of samples (10%) are available to initially train the classifier, as in Kristan and Leonardis [47]. Then, one instance at a time is presented to the classifier, which makes its prediction. Subsequently, the classifier can receive feedback for incremental learning. To simulate real-world applications, we considered two scenarios where the delay in the feedback and the number of instances presented to the classifier can vary, as detailed below; a sketch of this protocol follows the list.

• Scenario 1 — immediate feedback: when an instance is misclassified, the classifier receives feedback and its predictive model is immediately updated with the true label.
• Scenario 2 — limited feedback: when an instance is misclassified, there is a 50% chance of the classifier receiving feedback to update its predictive model.
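The following is a minimal sketch of this prequential protocol. The use of SGDClassifier as the online learner and the function name are illustrative, not the paper's exact setup; only the 10% warm start, the Z-score normalization from training data, and the feedback probabilities come from the text.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

def online_evaluation(X, y, feedback_prob=1.0, seed=0):
    """Prequential evaluation: 10% warm start, then predict-then-maybe-train.

    feedback_prob=1.0 corresponds to Scenario 1 (immediate feedback);
    feedback_prob=0.5 corresponds to Scenario 2 (limited feedback).
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    split = max(1, int(0.1 * len(y)))

    scaler = StandardScaler().fit(X[:split])          # Z-score from training data only
    clf = SGDClassifier(loss="hinge", penalty="l2")   # stand-in online learner
    clf.partial_fit(scaler.transform(X[:split]), y[:split], classes=classes)

    predictions = []
    for xi, yi in zip(X[split:], y[split:]):
        xi = scaler.transform(xi.reshape(1, -1))
        pred = clf.predict(xi)[0]
        predictions.append(pred)
        # Feedback only on mistakes, possibly with limited probability,
        # as described in the two scenarios above.
        if pred != yi and rng.random() < feedback_prob:
            clf.partial_fit(xi, [yi])
    return np.array(predictions)
```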

6. Results

In the following, we describe the results obtained in the experiments based on the online learning scenarios.


Table 1. Datasets used in the experiments.

#  | Dataset     | m         | n   | Classes | Class size
1  | adult       | 32,561    | 109 | 2       | 7,841; 24,720
2  | contrac     | 1,473     | 21  | 3       | 333; 511; 629
3  | covertype   | 581,012   | 10  | 7       | 2,747; 9,493; 17,367; 20,510; 35,754; 211,840; 283,301
4  | fertility   | 100       | 40  | 2       | 12; 88
5  | hill-valley | 1,212     | 100 | 2       | 600; 612
6  | ht-sensor   | 928,991   | 11  | 3       | 276,967; 305,444; 346,580
7  | iris        | 150       | 4   | 3       | 50 (each class)
8  | letter      | 20,000    | 16  | 26      | 734; 734; 736; 739; 747; 748; 752; 753; 755; 758; 761; 764; 766; 768; 773; 775; 783; 783; 786; 787; 789; 792; 796; 803; 805; 813
9  | libras      | 360       | 90  | 15      | 24 (each class)
10 | miniboone   | 130,064   | 50  | 2       | 36,499; 93,565
11 | skin        | 245,057   | 3   | 2       | 50,859; 194,198
12 | susy        | 5,000,000 | 18  | 2       | 2,287,827; 2,712,173
13 | wdbc        | 569       | 30  | 2       | 212; 357
14 | wine        | 178       | 13  | 3       | 48; 59; 71
15 | wine-red    | 1,599     | 11  | 6       | 10; 18; 53; 199; 638; 681
16 | wine-white  | 4,898     | 11  | 7       | 5; 20; 163; 175; 880; 1,457; 2,198

Table 2. F-measure obtained by each method in the immediate feedback scenario.

Dataset     | GMDL  | MLP   | PA    | SGD
adult       | 1.000 | 0.989 | 1.000 | 0.999
contrac     | 0.465 | 0.449 | 0.393 | 0.410
covertype   | 0.400 | 0.526 | 0.294 | 0.377
fertility   | 0.537 | 0.477 | 0.626 | 0.650
hill-valley | 0.492 | 0.487 | 0.639 | 0.528
ht-sensor   | 0.587 | 0.943 | 0.463 | 0.482
iris        | 0.947 | 0.590 | 0.795 | 0.843
letter      | 0.670 | 0.726 | 0.446 | 0.569
libras      | 0.514 | 0.471 | 0.373 | 0.291
miniboone   | 0.817 | 0.833 | 0.756 | 0.765
skin        | 0.835 | 0.996 | 0.811 | 0.802
susy        | 0.687 | 0.667 | 0.677 | 0.703
wdbc        | 0.924 | 0.906 | 0.938 | 0.944
wine        | 0.969 | 0.755 | 0.947 | 0.934
wine-red    | 0.299 | 0.251 | 0.238 | 0.251
wine-white  | 0.243 | 0.232 | 0.199 | 0.200

6.1. Scenario 1 — immediate feedback

Table 2 presents the macro F-measure obtained by each evaluated method for each dataset in the immediate feedback scenario. GMDL obtained the best macro F-measure for seven datasets and was the second best method for the other seven datasets. Moreover, GMDL presented greater robustness to the problem of imbalanced classes than PA and SGD, as it obtained better results than both methods in the experiments with the datasets covertype, miniboone, skin, wine-red, and wine-white.

MLP also achieved good performance for some datasets, such as miniboone and letter. These datasets have a large number of instances, which allows better convergence in the optimization of the parameters of the neural network. This conclusion is reinforced by the results of MLP on some datasets with few instances (e.g., fertility, hill-valley, iris, wine, and wdbc), which ranked lower than those obtained by all other methods. GMDL exhibited greater robustness in problems with a small number of instances than MLP and the other evaluated methods. Note that in datasets with fewer than 2,000 samples, the macro F-measure obtained by GMDL was on average about 17%, 6%, and 4% higher than that obtained by MLP, SGD, and PA, respectively.

We performed a statistical analysis of the results using the non-parametric Friedman test based on the average ranking of each classification method shown in Fig. 1, where a lower ranking implies better performance.

Fig. 1. Average ranking of each method in the immediate feedback scenario.
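The Friedman test and the Wilcoxon post-hoc comparisons used throughout this section can be run with standard SciPy routines. A minimal sketch follows; for brevity the score arrays contain only the first four rows of Table 2, whereas the paper uses one macro F-measure per dataset for all 16 datasets.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# One macro F-measure per dataset for each method (first four rows of Table 2).
gmdl = [1.000, 0.465, 0.400, 0.537]
mlp  = [0.989, 0.449, 0.526, 0.477]
pa   = [1.000, 0.393, 0.294, 0.626]
sgd  = [0.999, 0.410, 0.377, 0.650]

stat, p_value = friedmanchisquare(gmdl, mlp, pa, sgd)
print("Friedman:", stat, p_value)

if p_value < 0.05:
    # Pairwise post-hoc comparisons of GMDL against each baseline.
    for name, scores in [("MLP", mlp), ("PA", pa), ("SGD", sgd)]:
        w_stat, w_p = wilcoxon(gmdl, scores)
        print("GMDL vs", name, w_p)
```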

For a confidence interval α = 0.05, the Friedman test indicated that the null hypothesis, which states that all methods exhibited equivalent performance, was rejected. Therefore, we performed a pairwise comparison using the Wilcoxon post-hoc test, which, for a confidence interval α = 0.05, indicated that the performance of GMDL was significantly better than that of SGD and PA. However, although GMDL obtained the best average ranking (Fig. 1), there is insufficient statistical evidence to infer that the performance obtained by GMDL was significantly better than that of MLP.

6.2. Scenario 2 — limited feedback

Table 3 presents the results obtained by each evaluated method for each dataset in the limited feedback scenario. GMDL obtained the best results in six datasets and was the second best method in five datasets. The only experiment in which GMDL obtained the lowest macro F-measure was that involving the skin dataset. However, even in this experiment, the F-measure achieved by GMDL was equal to that obtained by SGD and close to the score obtained by PA.

In four of the five datasets with the most samples, MLP obtained the best macro F-measure. However, in four of the five datasets with the fewest samples, MLP obtained a lower performance than the other methods. Therefore, we can conclude that MLP generally needs more samples than GMDL, PA, and SGD to obtain competitive predictive performance. GMDL, on the other hand, performed well in both small and large datasets.


Table 3. F-measure obtained by each method in the limited feedback scenario.

Dataset     | GMDL  | MLP   | PA    | SGD
adult       | 1.000 | 0.989 | 1.000 | 0.999
contrac     | 0.453 | 0.383 | 0.375 | 0.396
covertype   | 0.413 | 0.521 | 0.283 | 0.315
fertility   | 0.492 | 0.477 | 0.607 | 0.518
hill-valley | 0.497 | 0.493 | 0.627 | 0.525
ht-sensor   | 0.563 | 0.943 | 0.430 | 0.459
iris        | 0.939 | 0.566 | 0.789 | 0.747
letter      | 0.670 | 0.710 | 0.402 | 0.560
libras      | 0.392 | 0.392 | 0.328 | 0.258
miniboone   | 0.770 | 0.831 | 0.749 | 0.762
skin        | 0.800 | 0.995 | 0.801 | 0.800
susy        | 0.690 | 0.672 | 0.668 | 0.703
wdbc        | 0.924 | 0.906 | 0.929 | 0.944
wine        | 0.932 | 0.755 | 0.947 | 0.934
wine-red    | 0.306 | 0.252 | 0.231 | 0.240
wine-white  | 0.235 | 0.208 | 0.198 | 0.226

Fig. 2. Average ranking of each method in the limited feedback scenario.

Both GMDL and PA obtained the best possible score for the dataset adult. From a mathematical point of view, this demonstrates that GMDL and PA were able to perfectly estimate the distribution of the attributes even in this experimental scenario, where the model update occurs with only a 50% chance when the classifier makes a prediction error.

In this scenario, we also performed a statistical analysis of the results using the non-parametric Friedman test based on the average ranking of each classification method (Fig. 2). For a confidence interval α = 0.05, the Friedman test rejected the null hypothesis that all the algorithms compared in this experimental scenario were equivalent. We also performed a pairwise comparison using the Wilcoxon post-hoc test, with a confidence interval α = 0.05. According to this analysis, it can be safely inferred that the performance of GMDL was significantly better than that of SGD and PA. However, as in the previous experimental scenario, there is insufficient statistical evidence to infer that the performance obtained by GMDL was significantly better than that of MLP, although its average ranking was the best.

Given the good results obtained by GMDL in online learning scenarios, we also performed experiments in an offline scenario and present the results in the Appendix. Although the scope of this study is online learning, the appendix provides insight into the performance of GMDL in an offline scenario and shows that it also obtains competitive results in this learning setting.


7. Conclusions

In classification problems it is common to obtain several models (or hypotheses) for a given dataset. Each model attempts to best describe the data from a known sampling to create the possibility of making predictions on unseen data. Determining which model best fits the data without overfitting is a problem of great interest in machine learning.

In this study, we proposed a novel multinomial classification method based on the MDL principle. The proposed method, called GMDL, has desirable features, namely, incremental learning and sufficient robustness to prevent overfitting. These characteristics make GMDL a great candidate for real-world, online, and large-scale classification problems. For model selection based on the MDL principle, GMDL obtains the description of the models based on density estimation. These estimates are made using oKDE as an incremental estimator, which allows instances of any type to be handled and makes the classification method incremental and capable of improving its performance over time.

To assess the performance of the proposed method, we conducted a comprehensive evaluation using 16 large, real, public, and well-known datasets, in which we considered two online learning scenarios. In the first scenario, when an instance is misclassified, the classifier immediately receives feedback to be updated with the true label. In the second scenario, when an instance is misclassified, there is a 50% chance of receiving feedback to update the classifier with the true label. The results indicated that GMDL was robust and efficient in these experimental scenarios. It was less negatively affected by class imbalance than PA and SGD, and presented robustness in problems with few instances available for training. The statistical analysis of the results demonstrated that GMDL outperformed SGD and PA in both experimental scenarios. Moreover, it obtained the best average ranking in both scenarios.

From a theoretical perspective, the mathematical formulation used in GMDL would allow new instances, new classes, or new features to be presented over time for updating the predictive model. In future research, we aim to explore this property of the model, which fits well with many real problems that require constant adjustments. Furthermore, we intend to investigate the sensitivity of GMDL to changes in its parameters in more detail. We also intend to evaluate the possibility of merging GMDL with other online approaches in order to obtain a more robust classifier and to apply it to different problems, such as text and image classification.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation, USA, and the financial support provided by the São Paulo Research Foundation (FAPESP; grants #2017/09387-6 and #2018/02146-6).

Appendix. Performance in an offline learning scenario

GMDL was theoretically idealized and mathematically designed to learn and adjust its model in an online fashion. However, with the purpose of providing the reader with an insight into the performance of our proposed method in an offline scenario, we compared its results with those produced using the following established offline learning methods: Gaussian naïve Bayes (GNB) [53], support vector machines (SVM) [54], k-nearest neighbors (KNN) [55], and random forests (RF) [56]. These methods are widely used as baselines in several other studies that address offline learning. We used the implementations of GNB, SVM, RF, and KNN from the scikit-learn library (http://scikit-learn.org/).


Table A.4. Parameters and range of values used in the grid search.

Method | Parameter    | Range
GMDL   | σ̃²           | {2; 5; 10}
RF     | n_estimators | {10; 20; 30; 40; 50; 60; 70; 80; 90; 100}
RF     | criterion    | {"gini"; "entropy"}
SVM    | C            | {0.0001; 0.001; 0.01; 0.1; 1; 10; 100; 1000}
KNN    | K            | {3; 5; 7; 9; 11; 13; 15; 17; 19}

Table A.5. F-measure obtained by each method in the offline learning scenario.

Dataset     | GMDL  | GNB   | RF    | SVM   | KNN
adult       | 1.000 | 1.000 | 1.000 | 1.000 | 0.916
contrac     | 0.477 | 0.468 | 0.489 | 0.484 | 0.445
covertype   | 0.411 | 0.401 | 0.629 | 0.273 | 0.492
fertility   | 0.438 | 0.656 | 0.688 | 0.599 | 0.468
hill-valley | 0.480 | 0.521 | 0.582 | 0.726 | 0.532
ht-sensor   | 0.549 | 0.537 | 0.480 | 0.499 | 0.458
iris        | 0.960 | 0.953 | 0.940 | 0.954 | 0.967
letter      | 0.703 | 0.648 | 0.962 | 0.694 | 0.936
libras      | 0.586 | 0.603 | 0.741 | 0.623 | 0.628
miniboone   | 0.847 | 0.556 | 0.920 | 0.885 | 0.871
skin        | 0.917 | 0.880 | 0.999 | 0.887 | 0.999
susy        | 0.745 | 0.737 | –     | –     | –
wdbc        | 0.922 | 0.928 | 0.960 | 0.962 | 0.970
wine        | 0.952 | 0.957 | 0.962 | 0.984 | 0.958
wine-red    | 0.299 | 0.305 | 0.274 | 0.251 | 0.268
wine-white  | 0.267 | 0.279 | 0.281 | 0.238 | 0.242

As the performance of KNN, SVM, and RF can be highly affected by the choice of parameters, we performed a grid search using five-fold cross-validation to find the best values for their main parameters. We also used grid search to select the parameter σ̃² of GMDL. All parameters are presented in Table A.4. For GMDL, the forgetting factor f was set to 1 because the datasets do not contain temporal information. For SVM, we applied the linear kernel because some datasets have high dimensionality; Hsu and Lin [57] demonstrated that a linear kernel performs better than other kernel functions in high-dimensional problems. In all experiments, we applied Z-score normalization using information from the training examples.

A.1. Results

Table A.5 presents the results obtained by each evaluated method for each dataset. We present the average macro F-measure obtained with five-fold stratified cross-validation. It is clear that RF obtained the best performance in most experiments. However, the other methods also obtained the best macro F-measure in at least two of the datasets evaluated. Moreover, in the experiments with the dataset susy, which contains many samples, RF, SVM, and KNN failed to complete the experiment within 48 h, which was the maximum time limit we set. GMDL and GNB are more suitable for large-scale problems because they have a lower computational cost than RF, SVM, and KNN. In addition, it is possible to notice the influence of the law of large numbers on GMDL: a larger number of samples allows a better estimate of the density of the distributions.

GMDL obtained a low performance in the experiments with the datasets hill-valley and libras. Both datasets have features with high cardinality; thus, the oKDE estimation errors can accumulate when estimating the description length of the class. Moreover, as pointed out by Kristan and Leonardis [47], the computation of the class prototype as a multivariate distribution suffers great smoothing due to the high number of variables.
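A sketch of the parameter selection protocol described above is given below, using scikit-learn's GridSearchCV with grids mirroring Table A.4. The choice of LinearSVC as the linear-kernel SVM, the function name, and the scoring setup are illustrative assumptions, not the paper's exact implementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

search_spaces = {
    "RF": (RandomForestClassifier(),
           {"n_estimators": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
            "criterion": ["gini", "entropy"]}),
    "SVM": (LinearSVC(),
            {"C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}),
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17, 19]}),
}

def tune(X, y):
    """Five-fold grid search, scored by macro F-measure, for each baseline."""
    best = {}
    for name, (estimator, grid) in search_spaces.items():
        search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=5)
        search.fit(X, y)
        best[name] = (search.best_params_, search.best_score_)
    return best
```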

Fig. A.3. Average ranking of each method in the offline learning scenario.

We performed a statistical analysis of the results using the non-parametric Friedman test based on the average ranking of each classification method (Fig. A.3). For a confidence interval α = 0.05, the Friedman test rejected the null hypothesis that all the methods were equivalent. We also performed a pairwise comparison using the Wilcoxon post-hoc test, with a confidence interval α = 0.05. According to this analysis, there is no statistical evidence to state that GNB, RF, SVM, or KNN is superior to GMDL. Although the results obtained by GMDL were not statistically superior to those of the other methods evaluated in this scenario, GMDL has a clear advantage over them because it is naturally incremental, which makes it scalable and suitable for real and dynamic scenarios.

References

[1] P. Domingos, The role of Occam's Razor in knowledge discovery, Data Min. Knowl. Discov. 3 (4) (1999) 409–425.
[2] J. Rissanen, Modeling by shortest data description, Automatica 14 (5) (1978) 465–471.
[3] A.N. Kolmogorov, On tables of random numbers, Sankhya: Indian J. Stat. Ser. A 53 (4) (1963) 369–376.
[4] R.M. Silva, T.C. Alberto, T.A. Almeida, A. Yamakami, Towards filtering undesired short text messages using an online learning approach with semantic indexing, Expert Syst. Appl. 83 (2017) 314–325.
[5] A. Bratko, G.V. Cormack, B. Filipič, T.R. Lynam, B. Zupan, Spam filtering using statistical data compression models, J. Mach. Learn. Res. 7 (12) (2006) 2673–2698.
[6] I.A. Braga, M. Ladeira, Filtragem adaptativa de spam com o princípio minimum description length, in: Anais do XXVIII Congresso da Sociedade Brasileira de Computação (SBC'08), Belém, Brasil, 2008, pp. 11–20.
[7] T.A. Almeida, A. Yamakami, Facing the spammers: A very effective approach to avoid junk e-mails, Expert Syst. Appl. 39 (7) (2012) 6557–6561.
[8] R.M. Silva, T.A. Almeida, A. Yamakami, Towards web spam filtering using a classifier based on the minimum description length principle, in: 15th IEEE International Conference on Machine Learning and Applications (ICMLA'16), IEEE, Anaheim, CA, USA, 2016, pp. 470–475.
[9] R.M. Silva, T.A. Almeida, A. Yamakami, MDLText: An efficient and lightweight text classifier, Knowl.-Based Syst. 118 (2017) 152–164.
[10] J.R. Bertini, L. Zhao, A.A. Lopes, An incremental learning algorithm based on the k-associated graph for non-stationary data classification, Inform. Sci. 246 (2013) 52–68.
[11] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive-aggressive algorithms, J. Mach. Learn. Res. 7 (Mar) (2006) 551–585.
[12] I. Frías-Blanco, J. del Campo-Ávila, G. Ramos-Jiménez, A.C. Carvalho, A. Ortiz-Díaz, R. Morales-Bueno, Online adaptive decision trees based on concentration inequalities, Knowl.-Based Syst. 104 (2016) 179–194.
[13] W. Shu, W. Qian, Y. Xie, Incremental approaches for feature selection from dynamic data with the variation of multiple objects, Knowl.-Based Syst. 163 (2019) 320–331.
[14] Y. Liu, Z. Xu, C. Li, Online semi-supervised support vector machine, Inform. Sci. 439–440 (2018) 125–141.
[15] C. Luo, T. Li, H. Chen, H. Fujita, Z. Yi, Efficient updating of probabilistic approximations with incremental objects, Knowl.-Based Syst. 109 (2016) 71–83.


[16] J. Jorge, R. Paredes, Passive-aggressive online learning with nonlinear embeddings, Pattern Recognit. 79 (2018) 162–171.
[17] L. Wang, H.-B. Ji, Y. Jin, Fuzzy passive–aggressive classification: A robust and efficient algorithm for online classification problems, Inform. Sci. 220 (2013) 46–63, Online Fuzzy Machine Learning and Data Mining.
[18] D. Brzezinski, J. Stefanowski, Combining block-based and online methods in learning ensembles from concept drifting data streams, Inform. Sci. 265 (2014) 50–67.
[19] F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain, Psychol. Rev. 65 (6) (1958) 386–408.
[20] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Y. Lechevallier, G. Saporta (Eds.), Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), Springer, Paris, France, 2010, pp. 177–187.
[21] E.F. Cardoso, R.M. Silva, T.A. Almeida, Towards automatic filtering of fake reviews, Neurocomputing 309 (2018) 106–116, https://doi.org/10.1016/j.neucom.2018.04.074.
[22] K. Crammer, M. Dredze, F. Pereira, Confidence-weighted linear classification for text categorization, J. Mach. Learn. Res. 13 (1) (2012) 1891–1926.
[23] J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length principle, Inform. and Comput. 80 (3) (1989) 227–248.
[24] W. Lam, F. Bacchus, Learning Bayesian belief networks: An approach based on the MDL principle, Comput. Intell. 10 (3) (1994) 269–293.
[25] O.M. Tataw, T. Rakthanmanon, E. Keogh, Clustering of symbols using minimal description length, in: 12th International Conference on Document Analysis and Recognition (ICDAR'13), IEEE, Washington, DC, USA, 2013, pp. 180–184.
[26] J. Sheinvald, B. Dom, W. Niblack, A modeling approach to feature selection, in: 10th International Conference on Pattern Recognition (ICPR'90), Vol. 408, IEEE, Atlantic City, NJ, USA, 1990, pp. 535–539.
[27] A. Bosin, N. Dessì, B. Pes, High-dimensional micro-array data classification using minimum description length and domain expert knowledge, Adv. Appl. Artif. Intell. 4031 (2006) 790–799.
[28] J. Rissanen, A universal prior for integers and estimation by minimum description length, Ann. Statist. 11 (2) (1983) 416–431.
[29] P. Grünwald, A tutorial introduction to the minimum description length principle, Adv. Minim. Descr. Length: Theory Appl. 1 (1) (2005) 23–81.
[30] A. Barron, J. Rissanen, B. Yu, The minimum description length principle in coding and modeling, IEEE Trans. Inform. Theory 44 (6) (1998) 2743–2760.
[31] L.G. Kraft, A Device for Quantizing, Grouping, and Coding Amplitude-Modulated Pulses (Ph.D. thesis), Massachusetts Institute of Technology, 1949, pp. 1–66.
[32] T.M. Cover, J.A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc., 1991, pp. 50–51, Ch. 3.
[33] D.J. MacKay, Information Theory, Inference and Learning Algorithms, Vol. 4, Cambridge University Press, 2003, p. 628.
[34] S.J. Russell, P. Norvig, J.F. Canny, J.M. Malik, D.D. Edwards, Artificial Intelligence: A Modern Approach, Vol. 1, first ed., Prentice Hall, 1995, p. 947.
[35] M. Van Leeuwen, A. Siebes, StreamKrimp: Detecting change in data streams, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML'08), Springer, Antwerp, Belgium, 2008, pp. 672–687.


[36] G. McLachlan, D. Peel, Finite Mixture Models, Vol. 1, John Wiley & Sons, Inc., 2004.
[37] Z. Zivkovic, F. van der Heijden, Recursive unsupervised learning of finite mixture models, IEEE Trans. Pattern Anal. Mach. Intell. 26 (5) (2004) 651–656.
[38] M. Kristan, A. Leonardis, D. Skočaj, Multivariate online kernel density estimation with Gaussian kernels, Pattern Recognit. 44 (10–11) (2011) 2630–2642.
[39] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Vol. 26, first ed., Springer, Boston, MA, 1986.
[40] P. Langley, G.H. John, Estimating continuous distributions in Bayesian classifiers, in: 11th Conference on Uncertainty in Artificial Intelligence (UAI'95), Morgan Kaufmann Publishers Inc., Montreal, Canada, 1995, pp. 399–406.
[41] J. Lu, Y. Yang, G.I. Webb, Incremental discretization for naive-Bayes classifier, in: 2nd International Conference on Advanced Data Mining and Applications (ADMA'06), Springer, Xian, China, 2006, pp. 223–238.
[42] J. Ferreira, D.M. Matos, R. Ribeiro, Fast and extensible online multivariate kernel density estimation, CoRR abs/1606.0 (1) (2016) 1–17.
[43] B.P. Welford, Note on a method for calculating corrected sums of squares and products, Technometrics 4 (3) (1962) 419.
[44] R.F. Ling, Comparison of several algorithms for computing sample means and variances, J. Amer. Statist. Assoc. 69 (348) (1974) 859.
[45] T.F. Chan, G.H. Golub, R.J. LeVeque, Algorithms for computing the sample variance: Analysis and recommendations, Amer. Statist. 37 (3) (1983) 242–247.
[46] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (2014) 3133–3181.
[47] M. Kristan, A. Leonardis, Online discriminative kernel density estimator with Gaussian kernels, Syst. Cybernet. 44 (3) (2014) 355–365.
[48] J.H. Zar, Biostatistical Analysis, fifth ed., Prentice Hall, 2009.
[49] A. Benavoli, G. Corani, F. Mangili, Should we really use post-hoc tests based on mean-ranks? J. Mach. Learn. Res. 17 (2016) 1–10.
[50] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S.S. Keerthi, S. Sundararajan, A dual coordinate descent method for large-scale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning (ICML'08), ACM, Helsinki, Finland, 2008, pp. 408–415.
[51] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 21st International Conference on Machine Learning (ICML'04), ACM, Banff, Alberta, Canada, 2004, pp. 116–123.
[52] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[53] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, first ed., Wiley, 1973, p. 512.
[54] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[55] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, USA, 1986.
[56] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140.
[57] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425.
