
Discriminative Structure Learning of Sum-Product Networks for Data Stream Classification

Zhengya Sun a, Cheng-Lin Liu b,c,d, Jinghao Niu a, Wensheng Zhang a,c

a Precise Perception and Control Research Center, Institute of Automation of Chinese Academy of Sciences, Beijing 100190, China
b National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences, Beijing 100190, China
c School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
d CAS Center for Excellence of Brain Science and Intelligence Technology, Beijing 100190, China


Abstract


Sum-product networks (SPNs) are a deep probabilistic representation that allows for exact and tractable inference. There has been a recent trend towards online SPN structure learning from massive and continuous data streams. However, online structure learning of SPNs has so far been introduced only for generative settings. In this paper, we present an online discriminative approach for learning both the structure and the parameters of SPNs. The basic idea is to keep track of informative and representative examples to capture the trend of time-changing class distributions. Specifically, by estimating the goodness of model fit of data points and dynamically maintaining a certain amount of informative examples over time, we generate new sub-SPNs in a recursive and top-down manner. Meanwhile, an outlier-robust margin-based log-likelihood loss is applied locally to each data point and the parameters of the SPN are updated continuously using most probable explanation (MPE) inference. This leads to a fast yet powerful optimization procedure and improved discrimination capability between the genuine class and rival classes. Empirical results show that the proposed approach achieves better prediction performance than the state-of-the-art online structure learner for SPNs, while promising an order-of-magnitude speedup. Comparison with state-of-the-art stream classifiers further proves the superiority of our approach.


Keywords: Sum-product network, Discriminative structure learning, Data stream classification

Email address: [email protected] (Zhengya Sun)

1. Introduction


Sum-product networks (SPNs) are recently developed deep neural probabilistic models that admit exact inference in time linear in the size of the network [1]. This has aroused a lot of interest because learning usually involves inference as a subroutine, which is expensive or even intractable in classical graphical models, except for models with low treewidth [2, 3]. SPNs have manifested their superiority in dealing with real-world data, such as image completion [1], classification [4], speech [5] and language processing [6]. An SPN consists of a rooted directed acyclic graph with internal nodes corresponding to sums and products, and leaves corresponding to tractable distributions. SPN structure learning generates this graph along with its parameters, with the aim of capturing the latent interactions among observed variables. Most algorithms were designed for learning in a batch optimization scenario [7, 8, 9], where the full dataset is available to be examined iteratively. With the rise of massive streaming data, there has been a recent focus on online structure learning for SPNs. The dynamic and evolving nature of data streams poses great challenges to structure learning algorithms, since it is hard to extract all necessary information from data records in only one pass.

Some online approaches have been proposed to refine the parameters of an SPN with fixed structure [1, 10, 11]. One straightforward way is to modify an iterative parameter optimization algorithm to the online mode by restricting parameter updating to a single iteration [10]. Such algorithms include gradient descent, exponentiated gradient and expectation maximization (EM). They can be further sped up by replacing marginal inference with most probable explanation (MPE) inference and implementing hard training mechanisms [1, 4]. Instead of maximum likelihood, Rashwan et al. [10] proposed a Bayesian moment matching (BMM) algorithm which lends itself to online learning without suffering from local optima. Jaini et al. extended this paradigm from SPNs over categorical data to SPNs over continuous data [11]. While these approaches have proven effective in achieving state-of-the-art results, they rely heavily on a prespecified SPN structure, which is not trivial to obtain.

Some researchers have attempted automated structure learning for SPNs from massive and continuous data streams. In a first attempt, Lee et al. [12] built up clusters based on mini-batch samples, and performed training with a top-down structure learner over the newly generated clusters.


In their model, new child nodes are hierarchically added onto the existing sum nodes, while product nodes do not change after they are created. A related but different approach was developed by Hsu et al. [13], who considered the more general case of SPNs over continuous variables and proposed a bottom-up structure learner that dynamically monitors changes in the correlation coefficients between pairs of variables and modifies the product nodes whenever a correlation is detected. Since the product nodes need to maintain a covariance matrix, which is quadratic in the size of their scope, the algorithm is computationally expensive. These two online approaches learn the structure of SPNs generatively by maximizing the joint distribution of all the variables. However, such generative learning can lead to suboptimal prediction performance, due to the mismatch between the learning objective and the goal of classification.

In this paper, we propose an online approach for discriminatively learning both the structure and the parameters of SPNs. The benefit of the structure update is an improved representation of the streaming data, while the parameter update improves prediction under drift. In particular, our formulation works with continuous SPNs that have Gaussian leaves. The basic idea is to keep track of informative and representative examples over time to capture the trend of time-changing class distributions. We incorporate a vigilance parameter to balance plasticity^1 and stability^2 during online discriminative learning. For each new incoming data point, we estimate the goodness of fit of the SPN structures learned so far, and by dynamically maintaining a certain amount of informative examples, we generate new sub-SPNs in a recursive and top-down manner to enrich the representation. Specifically, the sum nodes are obtained by dynamic clustering over the instances, while the product nodes are obtained by partitioning the variables into correlated subsets. To boost the discrimination capability between the genuine class and the closest rival class, an outlier-robust margin-based log-likelihood loss function is applied to each data point, and the parameters of the SPN are updated continuously using most probable explanation (MPE) inference. In other words, we simply consider the branching paths that traverse the winning child nodes, leading to a fast yet powerful optimization procedure. Empirical results on handwritten digit recognition and stream classification tasks demonstrate that the proposed approach offers appealing performance and efficiency compared with well-developed online SPN learners. In addition, it achieves consistently lower classification errors than state-of-the-art data stream classifiers.

^1 Plasticity is the ability of a system to integrate new information.
^2 Stability is the ability of a system to retain its previously learned knowledge.


The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 provides the basics of SPNs. Section 4 presents the proposed online discriminative structure learning approach. Section 5 reports the experimental results. Section 6 draws concluding remarks.

2. Related Work

Data stream classification is a challenging data mining task because of the dynamic and evolving nature of data. Existing data stream classification approaches can be categorized into three groups: single model classification, ensemble classification and instance-based classification.


• Single model classification approaches strive to update the model by dynamically keeping track of a fixed or adaptive window of incoming instances. For example, the approaches in [14, 15] incrementally update a decision tree with the new data, and the approach in [16] adapts microclusters in the model to the most recent data.


• Ensemble classification approaches divide the data stream into chunks of equal size and use each chunk to build one base classifier. An ensemble of base classifiers is continuously updated so that it represents the most recent concept in the stream. For example, the approaches in [17] and [18] replace one of the existing classifiers in the ensemble as soon as a new classifier is trained, when judged necessary, and classify the unlabeled instance by majority vote in the ensemble.

• Without building a general model, instance-based approaches maintain summaries of historical data in the form of selected instances, which are used to predict the unlabeled instance. Such approaches include IBLStream [19], SyncStream [20], and ILVQ [21].


While these approaches perform very well, their expressive power is limited by their shallow architectures. Motivated by advances in deep learning, research efforts have shifted to multi-layer models for non-stationary input distributions [22, 12, 13]. Among them, sum-product networks have shown promise in learning the changing probability distribution of the data.


Learning SPN structure has gained growing interest in recent years. The first approach in this direction, proposed by Dennis and Ventura [8], consists of two main steps: building a region graph and converting it to an SPN. Specifically, it applies (but is not limited to) the k-means clustering algorithm to detect subsets of similar instances and variables that interact with one another. A limitation of this process is that it ignores the context-specific independences and correlations between variables, which are crucial to SPNs' expressiveness. Also, it cannot optimize the structure and parameters in a unified manner. To remedy these problems, Gens and Domingos proposed a general framework for SPN structure learning [7]. They recursively apply splits on independent sets of variables, and splits on similar sets of instances, to improve both the network structure and the data log-likelihood. Along this line, some variant algorithms use tractable multivariate distributions rather than univariate distributions as leaves, such as arithmetic circuit leaves [23] and Chow-Liu tree leaves [24]. Instead of splitting the data matrix across only one dimension at a time, Adel et al. further reformulated the top-down search problem as finding approximate rank-one submatrices based on singular value decomposition [25]. As an alternative to top-down approaches, Peharz et al. proposed to learn the structure of SPNs in a greedy bottom-up manner by merging simple models over small variable scopes into more complex models [9]. Besides learning a tree structure, Rahman and Gogate considered more general SPNs in the form of directed acyclic graphs [26], which are induced from tree SPNs by merging similar sub-structures. Unfortunately, all these structure learning algorithms for SPNs operate in batch mode, in that the whole training set is accessed in each iteration to maximize certain (regularized) likelihood functions.

Lee et al. introduced an online structure learning procedure for discrete SPNs [12], where the numbers of nodes and edges evolve dynamically based on a small subset of the training set. It can be viewed as a variant of LearnSPN that adaptively augments the child nodes by applying an incremental clustering technique to the new incoming instances. As an extension, Hsu et al. explored an online algorithm for continuous SPNs with Gaussian leaves [13]. It updates the network structure in a bottom-up manner by detecting the correlations between two variables and modifying the corresponding product nodes.

Most online structure learning approaches for SPNs are formulated in the generative setting. As an exception, the approach in [4] learns SPNs discriminatively using a form of backpropagation for computing gradients and exploring variations of inference. However, it relies on a pre-defined SPN structure. Rooshenas and Lowd [27], and Adel et al. [25] proposed discriminative structure learning algorithms, which are, however, batch optimizers assuming that all the training data is available. To the best of our knowledge, there has been no reported attempt at online discriminative structure learning of SPNs for data stream classification.

The proposed SPN-DSC can be viewed as an extension of LearnSPN [12] – the prototypical batch structure learning algorithm for SPNs – to the online setting. Instead of assuming the variables are discrete valued [12], we focus on continuous SPNs with Gaussian leaves, as is the case in [13]. Our approach is more general and can be applied to discrete or mixed data [28] by applying one-hot encoding to the categorical variables. Unlike the algorithms of [12] and [13], which model the joint distribution over all the variables, SPN-DSC learns conditional sub-SPNs that predict the most probable class label. Although Rooshenas and Lowd [27], and Adel et al. [25] have also proposed to discriminatively learn a conditional distribution, these algorithms go through the training instances in batch mode. The random structure learning routine proposed by Peharz et al. [29] could potentially alleviate the tediousness of SPN structure learning in high-dimensional spaces. However, it requires a large number of complete passes over the whole training data, and is not directly applicable to streaming data.

3. Sum-Product Networks


We begin by introducing the notation used throughout this paper. We denote random variables by uppercase letters W, X and Y. We represent the set of values taken by X as val(X), and denote its values by the corresponding lowercase letters, e.g., x is an element of val(X) ⊆ R. Sets of random variables are denoted by boldface letters W and X. For any random variable set X = {X_1, ..., X_D}, we define the set of its possible values as the Cartesian product val(X) = \times_{d=1}^{D} val(X_d), and use the corresponding lowercase boldface letter x for elements of val(X). Given a node N in a directed graph, let ch(N) and pa(N) be the sets of all child and parent nodes of N, respectively. For simplicity, we abbreviate the probability of a variable assignment p(X = x) to p(x), and likewise p(X = x) to p(x) for sets of variables. In the following, we briefly review some background on the representation and structure learning of sum-product networks.


3.1. Representation

A sum-product network can be viewed as a rooted, directed acyclic graph whose internal nodes are either sums or products, and whose leaves correspond to tractable distributions of some random variables in X, including Bernoulli distributions for discrete SPNs and Gaussian distributions for continuous SPNs.


The edges emanating from sum nodes are assigned non-negative weights. The sum and product nodes compute multilinear polynomials of their leaf nodes [1].


Definition 1. (Details in [30, 31]) A sum-product network (SPN) S over a set of random variables X is a tuple (G, w) where G is a connected, rooted and acyclic directed graph, and w is a set of non-negative parameters. The graph G contains three types of nodes: distributions, sums and products.


1. A distribution node (also called input distribution) D_Γ : val(Γ) → [0, ∞] is a distribution function over a subset of random variables Γ ⊆ X, i.e., either a probability mass function (discrete random variables), a probability density function (continuous random variables), or a mixed distribution function.
2. A sum node S computes a weighted sum of its children, i.e., S = \sum_{C \in ch(S)} w_{S,C} C, where w_{S,C} is a non-negative weight associated with the edge S → C.
3. A product node P computes a product over its children, i.e., P = \prod_{C \in ch(P)} C.


The value of an SPN is determined by its root in an upward pass. The scope of any node in an SPN is defined as the set of variables that appear in its descendants. More formally, it can be defined as

sc(N) = \begin{cases} \Gamma, & \text{if } N \text{ is a distribution } D_\Gamma, \\ \bigcup_{C \in ch(N)} sc(C), & \text{otherwise.} \end{cases}   (1)
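To make the upward pass concrete, the following minimal Python sketch (illustrative, not taken from the paper; class and method names are our own) evaluates a valid SPN with Gaussian leaves in log-space: product nodes add their children's log-values, and sum nodes combine them as a log-sum-exp weighted mixture.

import math

class GaussianLeaf:
    """Univariate Gaussian input distribution over one variable index."""
    def __init__(self, var, mean, stddev):
        self.var, self.mean, self.stddev = var, mean, stddev

    def log_value(self, x):
        z = (x[self.var] - self.mean) / self.stddev
        return -0.5 * z * z - math.log(self.stddev * math.sqrt(2.0 * math.pi))

class SumNode:
    """Mixture over children; weights are assumed non-negative and normalized."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

    def log_value(self, x):
        # log-sum-exp of log(w_c) + log C(x) for numerical stability
        terms = [math.log(w) + c.log_value(x) for c, w in zip(self.children, self.weights)]
        m = max(terms)
        return m + math.log(sum(math.exp(t - m) for t in terms))

class ProductNode:
    """Factorization over children with disjoint scopes."""
    def __init__(self, children):
        self.children = children

    def log_value(self, x):
        return sum(c.log_value(x) for c in self.children)

# A tiny SPN over X = (X0, X1): a mixture of two factorized Gaussians.
root = SumNode(
    children=[
        ProductNode([GaussianLeaf(0, 0.0, 1.0), GaussianLeaf(1, 0.0, 1.0)]),
        ProductNode([GaussianLeaf(0, 3.0, 0.5), GaussianLeaf(1, -2.0, 0.5)]),
    ],
    weights=[0.6, 0.4],
)
print(root.log_value([0.1, -0.2]))  # log-density of a complete assignment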


An SPN is said to be valid if it always correctly computes the probability of evidence, i.e., of a partial instantiation of X. A valid SPN performs arbitrary marginalization tasks in time linear in the network size. This is a prominent property, since Bayesian and Markov networks may take exponential time in these cases. Poon and Domingos [1] further established conditions for the validity of an SPN:

Definition 2. (Completeness) An SPN is complete iff all children of the same sum node have the same scope.


Definition 3. (Decomposability) An SPN is decomposable iff all children of the same product node have disjoint scopes.


Decomposability and completeness are sufficient for validity [1]. Note that the SPNs we discuss later are all valid by default. In this context, sum nodes embody probabilistic mixtures over their children's distributions, with the edge weights as mixture coefficients, while product nodes characterize factorizations over independent distributions. Besides, the sub-SPN rooted at each node encodes a joint distribution over its scope.


3.2. Structure Learning

Learning the structure of SPNs is the task of discovering local latent variables and their intricate connections in the data. This allows us to capture complex dependencies among random variables while at the same time ensuring efficient inference. LearnSPN [7] is the first principled structure learning algorithm for SPNs and can be adapted naturally to online learning. It is briefly reviewed below.

LearnSPN takes as input a learning set of vector-valued instances, typically in the form of an instance-by-variable matrix. The basic idea behind LearnSPN is to partition the data matrix in a top-down manner and construct tree-structured SPNs by column splits or row splits at each step. LearnSPN starts by checking whether the variables can be split into mutually independent subsets according to a G-test. For each pair of discrete variables x and y, their G-test is defined as


G(x, y) = 2 \sum_{x} \sum_{y} c(x, y) \cdot \log \frac{c(x, y) \cdot |T|}{c(x)\, c(y)},   (2)


where the summations range over the values of each variable and c(·) counts the occurrences of a setting of a variable pair or singleton [32]. A graph is formed by connecting each pair of variables that are dependent. If such a split is permitted, the recursion is performed on each connected component, and the product of the resulting SPNs is returned. Otherwise, the instances are aggregated into an adaptive number of clusters by hard incremental EM over a naive Bayes mixture model; after recursing on each cluster, the weighted sum of the resulting SPNs is returned. The weights of a sum node's children are proportional to the numbers of assigned instances, and can be smoothed using a Dirichlet prior. LearnSPN terminates when the current submatrix contains only one column or when the number of its rows falls below a certain threshold; all leaf nodes represent univariate distributions.
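As an illustration of the pairwise independence test in Eq. (2), the following Python sketch computes the G statistic for two paired sequences of discrete values; the threshold test and the dependency-graph construction are omitted, and the function and variable names are ours.

from collections import Counter
from math import log

def g_test(xs, ys):
    """G statistic of Eq. (2) for two paired sequences of discrete values."""
    assert len(xs) == len(ys)
    n = len(xs)                            # |T|, the number of instances
    c_xy = Counter(zip(xs, ys))            # joint counts c(x, y)
    c_x, c_y = Counter(xs), Counter(ys)    # marginal counts c(x), c(y)
    g = 0.0
    for (x, y), cxy in c_xy.items():
        g += cxy * log(cxy * n / (c_x[x] * c_y[y]))
    return 2.0 * g

# Example: a strongly dependent pair yields a large G value.
xs = [0, 0, 1, 1, 0, 1, 0, 1]
ys = [0, 0, 1, 1, 0, 1, 0, 0]
print(g_test(xs, ys))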


4. Proposed Algorithm


Consider a data stream consisting of a continuous sequence of labeled instances (x_t, y_t) for t = 1, 2, ..., T, where x_t ∈ R^d denotes a new instance arriving at time t with d-dimensional features, and y_t ∈ {1, ..., L} represents its class label. It is assumed that the learner can access the true label y_t of instance x_t before the arrival of instance x_{t+1}. We present a discriminative structure learning algorithm for SPNs on streaming data. The algorithm incrementally builds up a collection of generative SPNs, one per conditional distribution p(X | Y = l) for l ∈ {1, ..., L}, which are combined by a single sum node and weighted according to the a priori class probabilities p(Y = l). In this paper, we assume for simplicity that the class labels are uniformly distributed. The proposed algorithm is detailed in the following.


4.1. Overview

The proposed algorithm, called SPN-DSC (SPN learner for Data Stream Classification), is composed of two key processes to deal with concept drift: (1) adding new sub-SPNs to the existing SPN and (2) adjusting parts of the existing parameters so as to increase the conditional likelihood. By monitoring the drift in the likelihood of each data point, SPN-DSC adapts to drift continuously without explicit concept drift detection. In this sense, knowledge (e.g., the structure and a set of parameters) is transferred as well as possible to emerging new concepts. When the drift makes the structure insufficient to explain the new data points, a new sub-structure with new parameters is added. Otherwise, the parameters are refined for better prediction under drift. Note that the local model updating we adopt here can offer effective reactions to recurring concepts.

Specifically, we incorporate a vigilance parameter ρ (≥ 0) to control the trade-off between the generation of new sub-SPNs and the adaptation of the SPN structures learned so far. For each new instance x_t with assigned label l, we estimate the goodness of fit of the SPN structures learned so far, i.e., the class-conditional log-likelihood score log p(X = x_t | Y = l). If the log-likelihood score (LLS) is greater than −ρ, we assume the instance can be well represented with the current structure, and only use it for a discriminative gradient-descent update. Otherwise, we temporarily store the instance in a buffer B_l. When there are enough instances in B_l, we initialize the structure increments and merge them into the existing SPN. We then discriminatively train parts of the parameters using a single pass over the instances in B_l. The pseudocode of the overall algorithm is provided in Algorithm 1.


Algorithm 1 SPN-DSC(x_t, y_t, ρ, B, τ, S_{t-1})
Require: (x_t, y_t): a new labeled instance in the data stream; ρ: vigilance parameter; B = {B_1, ..., B_L}: a set of buffers holding temporarily deferred instances; τ: cache size; S_{t-1}: the SPN learned at time t-1
 1: Compute the LLS L(x_t, y_t) for the input (x_t, y_t)
 2: if L(x_t, y_t) < -ρ and y_t = l then
 3:     B_l ← B_l ∪ {x_t}
 4: else
 5:     S_t = parameterUpdate(x_t, y_t, S_{t-1})        ▷ Section 4.3
 6:     return S_t
 7: end if
 8: if |B_l| ≥ τ then
 9:     S̃_0 = structureUpdate(B_l, S_{t-1})             ▷ Section 4.2
10:     for each instance x_i ∈ B_l, i = 1, ..., τ do
11:         S̃_i = parameterUpdate(x_i, l, S̃_{i-1})      ▷ Section 4.3
12:     end for
13:     Set S_t = S̃_τ and B_l = ∅
14:     return S_t
15: end if
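The control flow of Algorithm 1 can be summarized by the Python sketch below. The routines structure_update and parameter_update and the class-conditional log-likelihood lls are placeholders standing for Sections 4.2 and 4.3; the buffer is assumed to be a plain dict mapping each class label to a list of deferred instances. This is a sketch of the logic only, not the authors' Java implementation.

def spn_dsc_step(x_t, y_t, spn, buffers, rho, tau, lls, structure_update, parameter_update):
    """One online step of SPN-DSC (a sketch of Algorithm 1, not the released code)."""
    buf = buffers.setdefault(y_t, [])
    if lls(spn, x_t, y_t) < -rho:
        # Poorly explained instance: defer it for a later structure update.
        buf.append(x_t)
    else:
        # Well explained instance: refine the parameters discriminatively (Section 4.3).
        return parameter_update(spn, x_t, y_t), buffers

    if len(buf) >= tau:
        # Enough informative instances: grow and merge new sub-SPNs (Section 4.2) ...
        spn = structure_update(spn, buf, y_t)
        # ... then make a single discriminative pass over the buffered instances.
        for x_i in buf:
            spn = parameter_update(spn, x_i, y_t)
        buffers[y_t] = []
    return spn, buffers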


4.2. Structure Update

We start with a base node for the discriminative SPN, i.e., a sum node dividing the instances assigned to different labels. After dynamically maintaining τ informative instances for a particular class, we generate new sub-SPNs in a recursive and top-down manner. We then merge them into the existing SPN for a more compact representation.


4.2.1. Adding Sub-SPNs

We construct a tree-structured sub-SPN for the τ buffered instances, alternating between instance splits and variable splits in a top-down manner, as in LearnSPN [7]. If there is a single variable, a leaf node representing a univariate Gaussian distribution is introduced. If there is more than one variable, splits on subsets of similar instances lead to sum nodes, and splits on subsets of dependent variables lead to product nodes.

At each instance splitting step, LearnSPN uses online EM with restarts, and obtains an adaptive number of clusters by introducing an exponential prior. However, deciding this parameter for the time-changing subsets belonging to one specific class is a non-trivial task. An inappropriate setting may cause either over-clustering or under-clustering effects. In addition, in the data stream case, trials with different initial centers are restricted. As a remedy, we follow a backward greedy scheme that successively bisects the instances until splits on the variables happen or the two sub-clusters are close enough. Limiting the number of clusters to two has been validated as a proper choice [24]. This can be seen as a hierarchical divisive clustering process. Our idea is to start with a single cluster Ω_1 comprising all τ instances, and to move an instance from Ω_1 to the new cluster Ω_2 as soon as this movement drastically reduces the error function

E = \sum_{j=1}^{2} \sum_{x_i \in \Omega_j} \| x_i - \mu_j \|^2,   (3)

where \mu_j = \sum_{x_i \in \Omega_j} x_i / n_j is the center of the j-th cluster Ω_j and n_j is the number of instances in Ω_j. Note that the Euclidean norm is used as the distance measure (arguably the most common choice), unless otherwise mentioned. To facilitate our discussion, the new function values when instance x_i is moved from Ω_1 to Ω_2 are formulated as follows [33]:

\tilde{E}_1 = E_1 - \frac{n_1}{n_1 - 1} \| x_i - \mu_1 \|^2,
\tilde{E}_2 = E_2 + \frac{n_2}{n_2 + 1} \| x_i - \mu_2 \|^2.   (4)

In each iteration of the clustering, we check whether moving x_i from Ω_1 to Ω_2 will lead to a lower E, i.e.,

\frac{n_2}{n_2 + 1} \| x_i - \mu_2 \|^2 < \frac{n_1}{n_1 - 1} \| x_i - \mu_1 \|^2.   (5)

We then greedily pick the instance that most reduces the error function. The intention is to make the most significant progress at each step in order to achieve optimal partitions. In this regard, the method can be considered an approximation algorithm for solving (3). The procedure terminates when no instance in Ω_1 satisfies inequality (5). To avoid generating overcomplex networks, the following condition is checked for the two newly formed sub-clusters:

\| \mu_1 - \mu_2 \| \geq \delta.   (6)

If this condition is fulfilled, a sum node, representing a mixture over the different clusters, is created. Otherwise, the variable splitting procedure is invoked. Through trial-and-error tuning in our implementation, the following choice of the parameter is preferred:

\delta = 5.0 \cdot \sqrt{d}.   (7)


The dependency of δ on the dimension d caters for splits performed layer by layer with varying dimensions. It also makes sense that the higher the dimension, the greater the distance between the two centers should be.

At each variable splitting step, we use Fisher's z-transformation to check the independence between continuous variables x and y [34]:

z(x, y) = 0.5 \sqrt{n - 3} \, \ln \frac{1 + r_{xy}}{1 - r_{xy}},   (8)


where n is the number of instances, ln is the natural logarithm, and r_{xy} is the correlation coefficient between variables x and y. If the z-score is less than a certain threshold, the two variables are assumed to be independent. Two subsets of variables are produced by seeking the connected components of the graph expressing the variable correlations. If there is only one connected component, the instance splitting procedure is invoked. Otherwise, a product node, representing groups of independent variables, is returned. When the two splitting procedures both fail, a product node is created and placed over a set of univariate leaf nodes. Note that the sums and products are arranged in alternating layers, which can be achieved by pruning nodes whose parent has the same type while preserving the same distribution.
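The instance-splitting step above can be sketched as follows (NumPy-based, illustrative only). The paper does not specify how the first move into the empty cluster Ω_2 is seeded, so the point farthest from the current center is used here as an assumption; the helper fisher_z mirrors the variable-splitting test of Eq. (8), with the correlation clipped to avoid division by zero.

import numpy as np

def bisect_cluster(X, delta=None):
    """Greedy bisection of an instance set X (n x d), following Eqs. (3)-(7).

    Repeatedly moves the instance whose transfer most reduces E until no move
    satisfies Eq. (5). Returns the two index sets, or None if the resulting
    centers are closer than delta (Eq. (6)), i.e. the split is rejected.
    """
    n, d = X.shape
    if delta is None:
        delta = 5.0 * np.sqrt(d)                 # Eq. (7)
    in1 = np.ones(n, dtype=bool)                 # membership of cluster 1
    while in1.sum() > 1:
        idx1, idx2 = np.where(in1)[0], np.where(~in1)[0]
        n1, n2 = len(idx1), len(idx2)
        mu1 = X[idx1].mean(axis=0)
        d1 = ((X[idx1] - mu1) ** 2).sum(axis=1)
        if n2 == 0:
            # Assumed seeding: start cluster 2 with the farthest point from mu1.
            gain = n1 / (n1 - 1) * d1
        else:
            mu2 = X[idx2].mean(axis=0)
            d2 = ((X[idx1] - mu2) ** 2).sum(axis=1)
            # Eq. (5): a move is allowed only if it strictly reduces the error.
            gain = n1 / (n1 - 1) * d1 - n2 / (n2 + 1) * d2
        best = np.argmax(gain)
        if n2 > 0 and gain[best] <= 0:
            break                                 # no instance satisfies Eq. (5)
        in1[idx1[best]] = False
    idx1, idx2 = np.where(in1)[0], np.where(~in1)[0]
    if len(idx2) == 0:
        return None
    if np.linalg.norm(X[idx1].mean(axis=0) - X[idx2].mean(axis=0)) < delta:
        return None                               # Eq. (6) rejects the split
    return idx1, idx2

def fisher_z(x, y):
    """Fisher z-score of Eq. (8) for two continuous variables."""
    r = np.clip(np.corrcoef(x, y)[0, 1], -0.999999, 0.999999)
    return 0.5 * np.sqrt(len(x) - 3) * np.log((1 + r) / (1 - r))

# Example: split 60 points drawn from two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(20, 1, (30, 2))])
print(bisect_cluster(X))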


4.2.2. Merging Sub-SPNs

As described earlier, incremental representation construction allows quickly learning and adapting to emerging new data. However, monotonically adding sub-SPNs could result in many redundant networks and overfitting. To deal with this problem, we consider merging similar sub-SPNs conditioned on the same class label to produce a more compact representation, which alleviates the problem of overfitting as well.

Our merging process is done in an upward pass from the leaves to the root, with breadth-first search (BFS). Since each leaf node stands for a Gaussian distribution, it keeps track of the empirical mean µ and variance σ². Besides, every non-leaf node keeps track of the count of the instances reaching it. Given the existing and the newly formed sub-SPNs (S_l and S̃_l) for class label l, we first investigate whether the product nodes whose children are leaves can be merged. More concretely, let P and P̃ be such product nodes belonging to S_l and S̃_l, respectively. We regard P and P̃ as merge candidates if they satisfy the following conditions:

• sc(P) = sc(P̃);
• \sum_{D_X \in ch(P)} (\tilde{\mu}_X - \mu_X)^2 / \min(\sigma_X^2, \tilde{\sigma}_X^2) \leq 1.


Intuitively, a pair of candidate product nodes that receive connections from the leaves are combined into a single product node when they have the same scope and their children have mean values lying within the range of their counterparts. Each merge of two product nodes at this layer leads to an update of the count and new leaves. A reasonable merging of two leaf nodes can be carried out in the following way:

\mu'_X = \frac{n \mu_X + \tilde{n} \tilde{\mu}_X}{n + \tilde{n}},
\sigma'_X = \max(\sigma_X, \tilde{\sigma}_X) + \left( 1 - \frac{\max(n, \tilde{n}) - \min(n, \tilde{n})}{n + \tilde{n}} \right) \times \min(\sigma_X, \tilde{\sigma}_X),
n' = n + \tilde{n},


where the subscript X denotes the variable in the leaf node's scope, and n', µ'_X and σ'_X are the new count, mean, and standard deviation after the merge. It makes sense that the mean of the new leaf node shifts towards the mean of the leaf node which is formed by more data points.


A similar consideration is valid for the variance of the new leaf node. After merging the sub-SPNs in the bottom two layers, we proceed to higher layers.


• For the sum layers, we take two nodes as merge candidates if they share at least one common child. This leads to an update of the count, and a new sum node that connects the children of the original two sums.

• For the product layers, we take two nodes with the same scope as merge candidates if they have identical views on the dependency of the variables. Their child nodes share the same scope partitioning, and are merged accordingly.


The merging strategy is schematically shown in Fig. 1. Merging increases the number of instances available near the bottom of the network, which helps reduce the variance of the parameter estimates while having no impact on their bias.
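A minimal sketch of the leaf-level merge test and the leaf parameter combination described above is given below; the Leaf record and its field names are illustrative, not the data structures of the released implementation.

from dataclasses import dataclass

@dataclass
class Leaf:
    var: int      # variable index in the leaf's scope
    mean: float
    std: float
    count: int    # number of instances that reached this leaf

def product_merge_score(children, children_tilde):
    """Left-hand side of the merge test: sum over paired leaves of
    (mu_tilde - mu)^2 / min(sigma^2, sigma_tilde^2)."""
    score = 0.0
    for a, b in zip(sorted(children, key=lambda l: l.var),
                    sorted(children_tilde, key=lambda l: l.var)):
        score += (b.mean - a.mean) ** 2 / min(a.std ** 2, b.std ** 2)
    return score

def merge_leaves(a, b):
    """Combine two Gaussian leaves over the same variable, as in Section 4.2.2."""
    n, m = a.count, b.count
    mean = (n * a.mean + m * b.mean) / (n + m)
    std = max(a.std, b.std) + (1.0 - (max(n, m) - min(n, m)) / (n + m)) * min(a.std, b.std)
    return Leaf(a.var, mean, std, n + m)

# Two candidate product nodes over {X0, X1}; merge only if the score is <= 1.
p  = [Leaf(0, 0.00, 1.0, 120), Leaf(1, 2.00, 0.8, 120)]
pt = [Leaf(0, 0.30, 1.1,  40), Leaf(1, 1.90, 0.7,  40)]
if product_merge_score(p, pt) <= 1.0:
    merged = [merge_leaves(a, b) for a, b in zip(p, pt)]
    print(merged)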


4.3. Parameter Update

The parameters that need to be learned include the weights of each sum node, and the mean and variance of the Gaussian distribution in each leaf node. We estimate these parameters by continuously optimizing a margin-based loss function in an online manner. The discriminant function takes the form f_l(x_t) = log p(x_t | l) + log p(l), measuring how likely x_t belongs to class label l, akin to Bayes' theorem. We drop the term log p(l), based on the assumption of equal prior class probabilities. Let S_l be the sub-SPN representing the LLS for class label l. When the weights at each sum node sum to one and the leaf distributions are normalized, f_l(x_t) = log S_l(x_t), where S_l(x_t) is the bottom-up evaluation of S_l for the assignment x_t.

Let g_k(x_t) = f_k(x_t) − f_j(x_t) be the margin between the genuine class k and the predicted (rival) class j, and let ξ (ξ > 0) be a constant for scaling the margin. The classification error of f(·) is then approximated by the modified Huber loss [35]:

\phi(f(\cdot)) = \begin{cases} -4 \xi g_k(x_t), & \xi g_k(x_t) < -1, \\ (\xi g_k(x_t) - 1)^2, & \xi g_k(x_t) \in [-1, 1], \\ 0, & \xi g_k(x_t) > 1. \end{cases}   (9)

It is noteworthy that the loss penalizes misclassified points with ξ g_k(x_t) < −1 only linearly, which brings tolerance to outliers as well as probability estimates.


Figure 1: Illustration of how to merge the sum layer and the product layer at the bottom. The gray nodes denote the merged ones from the existing sub-SPN S and the newly generated sub-SPN S̃.

Let θ_k be a parameter in S_k and θ_j be a parameter in S_j. Then the partial derivatives of the loss with respect to θ_k and θ_j take the form


\frac{\partial \phi(f)}{\partial \theta_k} = \frac{\partial \phi(f)}{\partial g_k} \frac{\partial \log S_k}{\partial \theta_k}, \qquad
\frac{\partial \phi(f)}{\partial \theta_j} = -\frac{\partial \phi(f)}{\partial g_k} \frac{\partial \log S_j}{\partial \theta_j}.


Note that the gradient of g_k w.r.t. log S_k (or log S_j) simplifies to 1 (or −1). Following [4], we derive the gradient of the CLL using MPE inference, which has been shown to outperform marginal inference in both accuracy and efficiency. Specifically, we convert an SPN S into an MPN (max-product network) M by replacing each sum node with a max node. Thus, we simply consider the branching paths that traverse the winning child nodes; at product nodes, the path branches to all children. We define W and D_X as the sets of weights and leaf distributions traversed by this path. The partial derivatives of the logarithm of an MPN with respect to the weights in M_k take the form


\frac{\partial \log M_k}{\partial w_i} = \frac{\partial \big( \sum_{w_u \in W} \log w_u + \sum_{(\mu_d, \sigma_d^2) \in D_X} \log p(x_d \mid \mu_d, \sigma_d^2) \big)}{\partial w_i} = \frac{\partial \log w_i}{\partial w_i} = \frac{1}{w_i}.


The partial derivatives of the logarithm of an MPN with respect to the parameters of the Gaussian distributions in M_k take the form

\frac{\partial \log M_k}{\partial \mu_o} = \frac{\partial \sum_{(\mu_d, \sigma_d^2) \in D_X} \log p(x_d \mid \mu_d, \sigma_d^2)}{\partial \mu_o} = \frac{\partial}{\partial \mu_o} \Big( -\frac{1}{2}\log(2\pi\sigma_o^2) - \frac{1}{2\sigma_o^2}(x_o - \mu_o)^2 \Big) = \frac{x_o - \mu_o}{\sigma_o^2},

\frac{\partial \log M_k}{\partial \sigma_o^2} = \frac{\partial \sum_{(\mu_d, \sigma_d^2) \in D_X} \log p(x_d \mid \mu_d, \sigma_d^2)}{\partial \sigma_o^2} = \frac{\partial}{\partial \sigma_o^2} \Big( -\frac{1}{2}\log(2\pi\sigma_o^2) - \frac{1}{2\sigma_o^2}(x_o - \mu_o)^2 \Big) = -\frac{1}{2\sigma_o^2} + \frac{(x_o - \mu_o)^2}{2\sigma_o^4}.   (10)

The hard gradient updates are given by

\Delta w_i = \eta_w \frac{\partial \phi(f)}{\partial g_k} \frac{\partial \log M_k}{\partial w_i} = \frac{\eta_w}{w_i} \frac{\partial \phi(f)}{\partial g_k},
\Delta \mu_o = \eta_\mu \frac{\partial \phi(f)}{\partial g_k} \frac{\partial \log M_k}{\partial \mu_o} = \eta_\mu \frac{\partial \phi(f)}{\partial g_k} \frac{x_o - \mu_o}{\sigma_o^2},
\Delta \sigma_o^2 = \eta_\sigma \frac{\partial \phi(f)}{\partial g_k} \frac{\partial \log M_k}{\partial \sigma_o^2} = \eta_\sigma \frac{\partial \phi(f)}{\partial g_k} \Big( -\frac{1}{2\sigma_o^2} + \frac{(x_o - \mu_o)^2}{2\sigma_o^4} \Big),   (11)


where η_w, η_µ and η_σ are user-defined learning rates. Analogously, we can derive the hard gradient updates with respect to the parameters in M_j.
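The hard parameter update of Eqs. (9)-(11) can be sketched as follows. The objects describing the traversed MPE path are illustrative, and plain gradient descent (subtracting the deltas scaled by the margin-loss derivative) is assumed as the sign convention for applying Eq. (11); in practice the sibling weights of each sum node would also be renormalized afterwards (not shown).

class PathWeight:
    """A sum-node edge weight traversed by the MPE path."""
    def __init__(self, w):
        self.w = w

class PathLeaf:
    """A Gaussian leaf traversed by the MPE path, with its observed value."""
    def __init__(self, mean, var, x):
        self.mean, self.var, self.x = mean, var, x

def huber_margin_grad(g, xi):
    """d(phi)/d(g) for the modified Huber loss of Eq. (9)."""
    t = xi * g
    if t < -1.0:
        return -4.0 * xi
    if t <= 1.0:
        return 2.0 * xi * (t - 1.0)
    return 0.0

def hard_step(weights, leaves, g, xi, sign=+1, eta_w=0.01, eta_mu=0.01, eta_s=0.01):
    """Gradient-descent step on the parameters traversed by one MPE path.

    sign = +1 for the genuine class k, -1 for the rival class j, where the
    margin gradient flips sign (Eq. before (10)). Uses Eqs. (10)-(11).
    """
    dphi = sign * huber_margin_grad(g, xi)
    for pw in weights:
        pw.w -= eta_w * dphi / pw.w                        # d log M / d w_i = 1 / w_i
        pw.w = max(pw.w, 1e-6)                             # keep weights positive
    for lf in leaves:
        d_mu = (lf.x - lf.mean) / lf.var                   # Eq. (10), mean
        d_var = -0.5 / lf.var + (lf.x - lf.mean) ** 2 / (2.0 * lf.var ** 2)
        lf.mean -= eta_mu * dphi * d_mu
        lf.var = max(lf.var - eta_s * dphi * d_var, 1e-6)  # keep variance positive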


5. Experiments


We evaluated the performance of the proposed algorithm on the SD3 and SD7 datasets of the NIST Special Database 19 (SD19) [36] and on three popular stream classification datasets: Spam, Electricity and Covtype^3. We compare our algorithm with state-of-the-art online SPN learners and several data stream classification methods, and analyze the effects of the vigilance parameter ρ and the cache size τ.


5.1. Datasets

Handwritten digit data has been widely used in evaluating pattern recognition and machine learning algorithms. We use the NIST Special Database (SD) 19, which contains two datasets, SD3 and SD7. From SD3, we use samples of 400 writers for training and 399 writers for testing; for SD7, we use samples of 100 writers for training and 100 writers for testing. The patterns of each writer are assumed to have the same writing style. The statistics of SD3 and SD7 are listed in Table 1. Our choice of data is similar to that of [37] for adaptive classification.

Table 1: Statistics on handwritten numeral datasets.


Dataset      Writers                    Number of instances
SD3-Train    No.0-No.399 (400)          42969
SD7-Train    No.2100-No.2199 (100)      11585
SD3-Test     No.400-No.799 (399)        42821
SD7-Test     No.2200-No.2299 (100)      11660


Spam Filtering data is a collection of 9324 email messages derived from the Spam Assassin collection^4. Each email is represented by 500 attributes using the Boolean bag-of-words approach.

Electricity data contains 45,312 instances collected from the Australian New South Wales Electricity Market, where prices are set every five minutes according to the demand and supply of the market.

Covtype data contains seven forest cover types for 30 × 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes.

^3 https://moa.cms.waikato.ac.nz/datasets/
^4 https://spamassassin.apache.org/


Table 2: Statistics on stream classification datasets.

Dataset        Instances   Dimensions   Classes
Spam           9324        500          2
Electricity    45,312      8            2
Covtype        581,012     54           7

5.2. Baseline Algorithms

We first compare our proposed algorithm SPN-DSC against well-developed online learning algorithms for SPNs with continuous variables.

• SPN-DSC: We gradually grow a network structure by keeping track of representative examples over time, and continuously update the parameters using discriminative gradient descent with MPE inference.


• oSLRAU [13]^5: This method incrementally constructs a network in a bottom-up fashion by detecting correlations and modifying product nodes to represent them. The parameters are updated based on a running average update procedure.


• oBMM [11]: This is a parameter learning algorithm without any hyperparameters to tune, which extends the online Bayesian moment matching and online EM algorithms from categorical [10] to Gaussian SPNs. We combine it with our structure update procedure for a fair comparison.


We implemented SPN-DSC in Java. The source code of oBMM was obtained from the authors. To investigate the effectiveness of the proposed modifications to LearnSPN, i.e., the introduction of the merging strategies and the discriminative loss, we simplify SPN-DSC to the case in which a single batch is employed; we call this method SPN-DSC-Batch. We also investigate the contribution of the Huber loss by applying the classical conditional log-likelihood (CLL) alone; we call this method SPN-DSC-CLL. Altogether we compare five methods: SPN-DSC, SPN-DSC-Batch, SPN-DSC-CLL, oSLRAU, and oBMM. We measure their quality in terms of average accuracy on the held-out test datasets and their scalability in terms of runtime. They are run on identically configured servers (2.0 GHz, 1 TB RAM, 18,432 KB CPU cache).

^5 The source code is available at https://github.com/whsu/spn.


In addition, we compare SPN-DSC to the incremental prototype learning algorithms ILVQ/NN and ILVQ/STM, which have achieved the current state-of-the-art on the handwritten digit datasets.


• ILVQ/NN [38, 21]: This method adapts to the newest data by updating the two nearest prototypes from the genuine class and the rival class. Two types of prototypes are learned for each class, including style-conscious and style-free prototypes. A test point is then classified based on the nearest prototype without considering style consistency.

• ILVQ/STM [21, 39]: It performs the same learning step as above, while utilizing the style consistency of writer-specific data during prediction.

To extensively evaluate the proposed algorithm SPN-DSC, we then compare its performance to several baseline algorithms for data stream classification.


• Hoeffding Adaptive Tree [15, 40]. It is an adaptation of the decision tree induction algorithm which operates on continuously-changing data streams by dynamically growing an alternate sub-tree to replace the old one when the new one becomes more accurate.


• Weighted Ensemble [17, 40]. The algorithm trains an ensemble of classification models from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected prediction accuracy on the current test data.


• OzaBagAdwin [18, 40]. It is an online bagging framework for tackling concept drift. Two variants of bagging are presented: ASHT Bagging, using adaptive-size Hoeffding trees, and ADWIN Bagging, using a change detector to decide when to discard underperforming ensemble members.

• IBLStreams [19]^6. It is an instance-based learning algorithm for classification on data streams. Two specifically designed editing strategies are used to deal with gradual and abrupt concept drift. Predictions are derived by combining the output values provided by the neighbours.

^6 The source code is available at http://www.uni-marburg.de/fb12/kebi/research/software/iblstreams.


• SyncStream [20]^7. The algorithm identifies the most important prototypes to capture evolving concepts in an implicit way, while sudden concept drift is handled based on two heuristic strategies. The P-Tree data structure is proposed to support efficient online data maintenance.

^7 The source code is available at http://dm.uestc.edu.cn/junming-shao/.


For the stream classification datasets, we adopt the interleaved-test-then-train evaluation method to measure the quality of the five algorithms in addition to SPN-DSC. We consider the following performance metrics: accuracy, precision, recall and runtime.
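For reference, the interleaved test-then-train (prequential) protocol can be sketched in a few lines; predict and partial_fit are illustrative method names, not the API of any specific framework used in the experiments.

def interleaved_test_then_train(stream, model):
    """Prequential evaluation: each instance is first used to test the current
    model and then immediately used to train it (a generic sketch)."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test first ...
            correct += 1
        total += 1
        model.partial_fit(x, y)     # ... then train on the same instance
    return correct / total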


5.3. Prediction Performance Analysis

5.3.1. Hold-out Evaluation

The hold-out method aims at giving an insight into how the model will generalize to new data that was not used in estimating it. We first compare the generalization performance of our algorithm against that of two well-developed online SPN learners with continuous variables, oSLRAU and oBMM. Accuracy and runtime are shown in Table 3 and Table 4; the best results for each dataset are highlighted. Compared to oSLRAU, we see that SPN-DSC speeds up the learning process by almost 20 to 125 times, yet the classification accuracy does not suffer. For all four datasets, SPN-DSC consistently yields higher accuracy than oSLRAU, rising around 2.8 percent on average. Compared to oBMM, the proposed SPN-DSC method also shows clearly superior performance and achieves an improvement of more than 4.0 percent on the SD7-Train/SD3-Test dataset. Nevertheless, the gap in runtime seems less significant. This suggests that the parameter update component accounts for only a tiny fraction of the runtime, but leads to non-trivial accuracy benefits.

Next we compare SPN-DSC to its variants SPN-DSC-Batch and SPN-DSC-CLL. We observe that SPN-DSC, which processes a mini-batch of observations at a time, often leads to smaller or comparable error rates compared to its non-incremental counterpart SPN-DSC-Batch. This demonstrates that the introduced discriminative loss and merging strategies can guide the incremental learning process towards more accurate classification models, albeit at an increased computational cost in many cases. Note that SPN-DSC sometimes runs faster than SPN-DSC-Batch (e.g., on the SD3-Train/SD7-Test dataset), with a slight loss in accuracy. This is because SPN-DSC applies a vigilance parameter for collecting representative examples.



When only a small fraction of instances are eligible for learning the structure, the computational effort is reduced accordingly. Comparing SPN-DSC and SPN-DSC-CLL, we observe that SPN-DSC achieves consistently lower classification errors. This suggests that the proposed Huber loss is more robust to outliers than the classical conditional log-likelihood alone. Examining the runtime of SPN-DSC-CLL, we find that it requires a longer runtime than SPN-DSC. The reason lies in how the discriminant function is penalized. For instances that are well classified, the discriminant function in SPN-DSC incurs no penalty, thus reducing the parameter update frequency and making the algorithm more time-efficient.


Table 3: Comparison of accuracy on the handwritten digit datasets.

Dataset          SD3-Train/SD3-Test   SD3-Train/SD7-Test   SD7-Train/SD3-Test   SD7-Train/SD7-Test
SPN-DSC          99.05%               95.63%               97.33%               97.23%
SPN-DSC-Batch    99.02%               95.70%               97.09%               97.21%
SPN-DSC-CLL      98.91%               95.21%               96.67%               96.95%
oSLRAU           97.01%               93.01%               93.68%               94.38%
oBMM             97.93%               92.58%               93.30%               94.73%


Table 4: Comparison of runtime on the handwritten digit datasets.

Dataset          SD3-Train/SD3-Test   SD3-Train/SD7-Test   SD7-Train/SD3-Test   SD7-Train/SD7-Test
SPN-DSC          3116.926             502.143              398.131              441.983
SPN-DSC-Batch    2589.628             2593.344             256.389              255.000
SPN-DSC-CLL      4635.175             673.926              562.676              574.196
oSLRAU           61158.13             62290.43             26383.03             25447.44
oBMM             4620.113             4597.620             467.111              476.046


When comparing our predictor to current state-of-the-art prototype classifiers for the online character recognition task, SPN-DSC consistently outperforms the two baselines in terms of accuracy, as shown in Fig. 2, thus setting a new state-of-the-art. It is worth noting that ILVQ/STM is a style-aware method, designed under the assumption that the patterns of each writer share the same writing style. This suggests that SPN-DSC's ability to track representative examples that capture the trend of time-changing class distributions is an advantage over the local style consistency utilized by STM.



Figure 2: Comparison of SPN-DSC with state-of-the-art prototype classifiers.


5.3.2. Interleaved Test-then-Train Evaluation

The interleaved test-then-train method is designed specifically for stream settings, in the sense that each instance is first used to test the model and then the same instance is used to train it. We report the performance with respect to the benchmark stream classification algorithms on the three real data streams described above. The evaluation metrics include accuracy, precision, recall and computation time. The results are summarized in Table 5, with the former three metrics shown as percentages. It is evident that SPN-DSC outperforms the baseline techniques in overall classification accuracy, precision and recall. For example, SyncStream's accuracy on Electricity is 84.59%, the best result among the baselines, while SPN-DSC raises the accuracy to 95.35%, a significant improvement of more than 10%. Besides, SPN-DSC also achieves much higher precision and recall. These results suggest the effectiveness of SPN-DSC, a deep neural probabilistic model, for online learning with challenging non-stationary data distributions.

Regarding the computation time, we see that Hoeffding Adaptive Tree is the fastest algorithm thanks to the Hoeffding bound, yet it compromises classification quality. SPN-DSC, on the other hand, is the most time-consuming algorithm, with the exception of having lower time costs than IBLStream on the Spam dataset. The difference in computation time is most obvious on the largest dataset (Covtype), where SPN-DSC is two orders of magnitude slower than SyncStream, Weighted Ensemble and OzaBagAdwin, which have comparable runtimes. Nevertheless, SPN-DSC yields the best prediction performance across all the datasets, in the sense of trading runtime for accuracy. We expect less runtime overhead from devising more efficient merging strategies for SPN-DSC; however, this is left for future work.


We also compare SPN-DSC to the prototype-based algorithm ILVQ/NN, which has achieved the current state-of-the-art for the online character recognition task. Results are shown in Fig. 3. SPN-DSC clearly dominates ILVQ/NN in accuracy, further validating its advantages. Recall that ILVQ/STM requires knowing in advance which instances belong to the same writer during prediction, and so we do not evaluate it using the interleaved test-then-train method.

Table 5: Comparison of performance metrics on the stream classification datasets.

Dataset       Methods                   Accuracy   Precision   Recall    Runtime (s)
Spam          SPN-DSC                   99.39%     99.34%      99.05%    121.294
              SyncStream                97.19%     95.90%      96.65%    29.780
              IBLStream                 93.70%     90.70%      93.72%    702.632
              Hoeffding Adaptive Tree   90.71%     87.17%      89.35%    2.252
              Weighted Ensemble         86.29%     81.39%      81.76%    13.000
              OzaBagAdwin               91.08%     87.65%      89.73%    10.848
Electricity   SPN-DSC                   95.35%     95.42%      95.07%    22.710
              SyncStream                84.59%     84.25%      84.19%    3.280
              IBLStream                 76.88%     76.48%      75.84%    7.512
              Hoeffding Adaptive Tree   83.98%     84.09%      82.96%    0.750
              Weighted Ensemble         70.92%     70.24%      70.22%    3.920
              OzaBagAdwin               83.97%     83.99%      83.02%    3.810
Covtype       SPN-DSC                   97.78%     93.91%      93.61%    36697.281
              SyncStream                94.38%     89.15%      89.80%    226.331
              IBLStream                 91.97%     86.20%      85.73%    3005.412
              Hoeffding Adaptive Tree   80.87%     70.85%      71.73%    31.692
              Weighted Ensemble         80.33%     74.76%      66.90%    365.582
              OzaBagAdwin               83.83%     78.48%      77.22%    176.000

5.4. Sensitivity Analysis

We examine how sensitive the classification performance is to parameter variations. Fig. 4 plots the curves of accuracy and runtime versus the vigilance parameter ρ. We observe that the accuracy is quite stable (the variations are less than 0.44%), while the computation time drops sharply once ρ exceeds a certain threshold. This can be attributed to the decreased effort needed for structure updates. Fig. 5 plots the curves of accuracy and runtime versus the cache size τ.



Figure 3: Comparison of SPN-DSC with state-of-the-art prototype classifiers.


Similarly, the variations of accuracy remain slim. The only exception to this trend is when the cache size is below 35 for the SD7-Train/SD7-Test dataset. This implies that too frequent updates of the structure incur a performance loss due to model overfitting. Besides, the computation time fluctuates with increasing cache size, and exhibits an overall downward trend on large training sets like SD3-Train.

Next we examine how the structure quality is affected by the proposed learning scheme and the employed merging criterion. To do this, we measure the number of nodes of the best models learned by SPN-DSC and SPN-DSC w/o merging while varying the same hyperparameters as above. Fig. 6 plots the curves of the number of nodes versus the vigilance parameter ρ. As can be seen, the introduction of the merging strategies always reduces the number of nodes and pushes the network size towards that of SPN-DSC-Batch, except when the vigilance parameter is increased too much. It is clear from the trend that the number of nodes for both variants keeps decreasing as the vigilance parameter gets bigger, and their gap diminishes after a certain point. This is because increasing the vigilance parameter essentially amounts to decreasing the number of instances used to learn the structure, thus making structure updates less frequent.

Fig. 7 plots the curves of the number of nodes versus the cache size τ. Similar characteristics are observed. The employed merging criterion always makes the networks more compact. It is worth noting that as the cache size increases, the number of nodes for both variants becomes smaller, except for SPN-DSC on SD7-Train/SD7-Test and SD3-Train/SD7-Test.


Figure 4: The trajectories of accuracy and runtime versus vigilance parameter ρ.


Figure 5: The trajectories of accuracy and runtime versus cache size τ .


Figure 6: The trajectories of structure quality versus vigilance parameter ρ.

Figure 7: The trajectories of structure quality versus cache size τ.

The decrement occurs because a larger cache size means more instances for generating sub-SPNs, which leads to less frequent updates. On the other hand, as merging helps yield smaller networks, the number of nodes for SPN-DSC fluctuates on some datasets.

6. Conclusion


We proposed SPN-DSC, a novel SPN-based classification algorithm for concept-drifting data streams, with the ability of online structure and discriminative parameter learning. SPN-DSC keeps representative examples that characterize the time-changing class distributions to enrich the network representation. A vigilance parameter is used to trade off between the adaptation of the already learned structure (i.e., parameter update) and the generation of new sub-structures (i.e., structure update).



Specifically, if a new mini-batch of samples deviates significantly from the SPN's current distribution estimate, new sub-SPNs are generated and added to the class-conditional SPNs. For parameter learning, we use a modified Huber loss, a margin-based log-likelihood loss that is robust to outliers, to fine-tune the parameters of the SPN. The benefit of the structure update is an improved representation of the streaming data, while the parameter update improves prediction under drift. Experimental results on different types of data show that the proposed approach outperforms the state-of-the-art online structure learner for SPNs, and yields higher classification performance than state-of-the-art data stream classifiers.

As the merging strategy of sub-SPNs may influence space and computation efficiency, a special emphasis for future work is placed on relaxing the product node merging requirement, which is strictly based on the dependency of the variables. Another focus concerns strategies dealing with feature evolution and concept evolution. Feature evolution involves a feature space representation that changes over time in the data stream, and thus requires dynamic feature space conversion techniques. The latter involves new classes evolving in the stream, such that adaptive novel class detection techniques are required before new class-conditional structures can be generated accordingly.

7. Acknowledgments


We wish to thank the anonymous referees for their careful reading and valuable comments. This research work was supported by the National Key R&D Program of China (No. 2017YFC0803700), the National Natural Science Foundation of China (No. 61876183, 61721004, 61772525 and U1636220) and the Natural Science Foundation of Beijing Municipality (No. 4172063).



Conflict of Interest

The authors declare that they have no conflict of interest related to this work.


We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.