Probability density function estimation with the frequency polygon transform

Ezequiel López-Rubio* and José Muñoz-Pérez
Department of Computer Languages and Computer Science, University of Málaga, Bulevar Louis Pasteur, 35, 29071 Málaga, Spain

Article history: Received 11 April 2014; Received in revised form 16 October 2014; Accepted 3 December 2014; Available online 12 December 2014.

Keywords: Probability density function estimation; Nonparametric estimation; Multivariate frequency polygon; Computer vision; Object tracking.

Abstract

Most current nonparametric approaches to probability density function estimation are based on the kernel density estimator, also known as the Parzen window estimator. A usual alternative is the multivariate histogram, which features a low computational complexity. Multivariate frequency polygons have often been neglected, even though they share many of the advantages of histograms while, unlike histograms, being continuous. Here we build on our previous work on histograms in order to propose a new probability density estimator which is based on averaging multivariate frequency polygons. The convergence of the estimator is formally proved. Experiments are carried out with synthetic and real machine learning datasets. Finally, image denoising and object tracking applications are also considered. © 2014 Elsevier Inc. All rights reserved.

1. Introduction

Probabilistic approaches to pattern recognition and machine learning often involve the estimation of the underlying probability density function (pdf) for a given dataset. Density estimators have been used for many years by statisticians, scientists in general and engineers as tools to draw inferences from data in the social, physical and computer sciences [48,56,25]. From the theoretical point of view, density estimation is a complex problem, since it is known that all nonparametric estimators of the pdf are biased [49].

As said before, the approximation of the underlying pdf of a given dataset arises in many fields. Bayesian decision theory relies on the accurate estimation of the class pdfs [66,8]. Risk assessment and fault detection methods require the estimation of the pdf for historical data in order to predict events [1]. Bayesian image filtering can use pdf estimation to compute the likelihood of a possible original image, given the observed image [14]. Outlier detection and feature selection can be related to the ratio of two pdfs [59,58]. Blind signal detection greatly benefits from the estimation of the pdf of the noise when it is non-Gaussian [43]. The estimation of the pdf of pixel values is of paramount importance to some methods for image denoising [46] and medical image segmentation [57]. Finally, the mean shift algorithm finds modes of estimated pdfs. It is in widespread use for object tracking [9,5,13,36,37] and image segmentation [20,24,38,60,67] applications.

Here we develop a continuous pdf estimator for multivariate data with a reduced computational complexity and a high estimation accuracy. It is based on our previous work on multivariate histograms [39], and it addresses the issues which arise when they are replaced by frequency polygons. These issues include the increased computational complexity with respect to


histograms due to the use of more than one bin to estimate the pdf at each test point, the need for a fast procedure to locate these bins, and the fact that the optimal bin widths for histograms and frequency polygons are different. Our proposal yields a smoother and more accurate estimation, and we prove its convergence to the true pdf. Moreover, a mode finding algorithm based on this estimator is proposed as an alternative to the well known mean shift algorithm. This proposal is much faster than the algorithm that was proposed for multivariate histograms and it has fewer tunable parameters, so it is easier to use in applications.

The structure of this paper is as follows. An overview of related work is given in Section 2. After that, Section 3 provides a gentle introduction to multivariate frequency polygons, so that the context of our proposal is outlined. The proposed pdf estimator and the mode finding algorithm are defined in Section 4. The most important properties of the proposal and the main results of the experiments are discussed in Section 5. Experiments are reported in Section 6. Finally, Section 7 is devoted to conclusions.

2. Related work

Parametric estimators based on mixture models can be used to approximate a pdf when the dataset is known to contain some clusters, so that each mixture component models one of these clusters [53]. Here we are interested in the remaining situations, where nonparametric approaches are more suitable. Kernel estimators are by far the most popular [26]. Their fundamental tunable parameter is the kernel bandwidth, whose selection has been extensively studied [35,10,4,2]. They share a computational complexity issue, since in principle all the training samples are retained in the model so as to place a kernel on each sample. This means that the time complexity grows linearly with the number of training samples N and with the number of test samples M, so that the estimator is O(NM). Fortunately optimization techniques are available, and among them the Fast Gauss Transform or FGT [22], and its enhancement the Improved Fast Gauss Transform or IFGT [65,42], have become a standard. Furthermore, specialized architectures based on field-programmable gate arrays (FPGAs) have been built to achieve real-time pdf estimation for particularly complex problems [16]. Another strategy to reduce the computational load of the kernel estimator is the use of fast nearest neighbors algorithms based on binary trees, such as kd-trees [6,21] and anchors hierarchies [44]. All of these optimizations produce approximations to the original kernel estimator, so the accuracy of the approximation is an issue to be taken into account.

Multivariate histograms have a fundamental advantage over the kernel estimator, namely their summarizing property. That is, once the training samples have been tallied according to the histogram bins, the complexity of the test phase no longer depends on the size of the training set, but only on the number of nonempty bins. This comes at the price that the obtained pdf estimator is not continuous, which makes histograms unsuitable for many applications, so that their use is focused on those situations where real time operation is critical. Sparse coding of object appearance is a typical example from computer vision [63,17]. The choice of the bin width plays a role similar to that of the bandwidth of the kernel estimator [31,15].
Histograms are piecewise constant (zeroth order) estimators, whereas frequency polygons are piecewise linear (first order) estimators, so that the discontinuities of the histograms are not present [52,29]. Frequency polygons have gone almost unnoticed by practitioners in the above mentioned fields, even though their accuracy is much higher than that of histograms, while they share their summarizing property [55].

3. Motivation

Next a simple example is used in order to highlight the differences between multivariate histograms and frequency polygons. Let us consider the uniform distribution over a circle, from which N = 1000 training samples are drawn. For a bin width h = 0.07, the training samples are binned into squares of size 0.07 × 0.07 (Fig. 1). At this point two alternatives are possible:

Fig. 1. Example training dataset with N = 1000 samples plotted as dots. The lines are the bin boundaries for bin width h = 0.07.

- Each bin count is used to compute a constant probability density estimate for its bin, i.e. a multivariate histogram.
- Each bin count is used to compute a probability density estimate for the center of its bin, and then a piecewise linear estimation of the pdf is built, i.e. a multivariate frequency polygon.

In this paper two ways to build frequency polygons are considered, which are depicted in Fig. 2: a linear blend over squares (Fig. 2a), and a linear interpolation over simplices (Fig. 2b). In both cases the vertices of the polygons, i.e. the mesh points, are the bin centers of the binning. This is the reason why the lines in Fig. 2 are shifted by h/2 = 0.035 with respect to those in Fig. 1. It is also worth noting that for this two-dimensional example, the simplices are triangles (Fig. 2b).

The estimated pdfs for the three described methods are shown in Fig. 3. We have used the Morgenstemning color map for improved contrast [19]. Three key features of the methods are:

- The histogram produces a discontinuous, piecewise constant estimated pdf (Fig. 3a). On the other hand, the frequency polygons yield continuous, piecewise linear estimated pdfs (Fig. 3b and c).
- The underlying grids of the two frequency polygon construction methods are different: squares for the linear blend (Figs. 2a and 3b), and triangles for the linear interpolation over simplices (Figs. 2b and 3c).
- For a given dataset the frequency polygons attain a better approximation to the true pdf than the histogram. This is further discussed in Section 5.

Fig. 2. Polygon meshes used to build frequency polygons: (a) linear blend method, and (b) simplex method.

Fig. 3. Estimated probability density functions: (a) histogram, (b) frequency polygon by linear blend, (c) frequency polygon by linear interpolation over simplices, (d) true pdf, and (e) probability density key.

In our earlier work [39], it was shown that the performance of multivariate histograms can be greatly improved by considering the average of several of them, where the input samples for each histogram are transformed by randomly chosen affine transforms. In the next section this idea is extended to multivariate frequency polygons.

4. Methodology

In this section the proposed methods for pdf estimation and mode finding are defined. Section 4.1 presents the density estimator, and its convergence to the true pdf is proved in Section 4.2. Finally, Section 4.3 explains the algorithm to search for a mode.

4.1. Probability density estimation

Following the strategy in [39], the estimator is defined as an average of many frequency polygons, plus a term which takes care of those regions of the input space where no training samples are present. The Frequency Polygon Transform (FPT) is defined as follows:

$$\mathrm{FPT}(\mathbf{x}) = p_{ZeroDens}\, p(\mathbf{x} \mid ZeroDens) + \left(1 - p_{ZeroDens}\right) \int p(\mathbf{x} \mid \mathbf{A}, \mathbf{b})\, p(\mathbf{A}, \mathbf{b})\, d\mathbf{A}\, d\mathbf{b} \tag{1}$$

where:

- p_ZeroDens ∈ [0, 1] is the a priori probability of finding a test sample in the outer region where the frequency polygons yield zero density, i.e. the probability that x is not properly modeled by the multivariate frequency polygons.
- p(x | ZeroDens) is a multivariate continuous pdf with support ℝ^D which models the outer test samples.
- p(A, b) is the a priori probability of the affine transform y = Ax + b.
- p(x | A, b) is the multivariate frequency polygon pdf corresponding to the affine transform defined by A and b.

The integral in (1) cannot be evaluated analytically. It can be approximated by Monte Carlo integration, which leads to the Discrete Frequency Polygon Transform (DFPT):

$$\mathrm{DFPT}(\mathbf{x}) = p_{ZeroDens}\, p(\mathbf{x} \mid ZeroDens) + \frac{1 - p_{ZeroDens}}{H} \sum_{i=1}^{H} p(\mathbf{x} \mid \mathbf{A}_i, \mathbf{b}_i) \tag{2}$$

where H is the number of frequency polygons, and A_i, b_i define the transformation associated to the i-th frequency polygon. Next the details of definitions (1) and (2) are given. It must be ensured that the p_ZeroDens p(x | ZeroDens) term in the right hand side of (2) tends to zero as the number of training samples grows. This is required to make the estimator converge to the true pdf, and can be achieved by letting the probability p_ZeroDens tend to zero as N → ∞:

$$p_{ZeroDens} = \frac{1}{N+1} \tag{3}$$

which means that the a priori probability of finding a sample where the frequency polygons return zero density is the weight which would correspond to an additional (N+1)-th training sample. The DFPT would still converge to the true pdf if (3) were substituted by any function that tends to zero as N → ∞. However, the performance for finite N would vary, since this choice defines the relative weight of the p_ZeroDens p(x | ZeroDens) term of the right hand side of (2). In the absence of further information, we may choose a multivariate Gaussian as the continuous pdf which models these samples, with mean vector and covariance matrix computed from the overall set of training data:

$$p(\mathbf{x} \mid ZeroDens) = (2\pi)^{-D/2} \det(\mathbf{C})^{-1/2} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \tag{4}$$

$$\boldsymbol{\mu} = \frac{1}{N} \sum_{j=1}^{N} \mathbf{x}_j \tag{5}$$

$$\mathbf{C} = \frac{1}{N-1} \sum_{j=1}^{N} \left( \mathbf{x}_j - \boldsymbol{\mu} \right) \left( \mathbf{x}_j - \boldsymbol{\mu} \right)^T \tag{6}$$

Given the i-th affine transform y = Ax + b, the frequency polygon probability p(x | A, b) is defined by considering a bin width h = 1 on the transformed space. It must be highlighted that using h ≠ 1 is never necessary, because the same effect is obtained by scaling the transformation matrix A. The bin indices of the transformed sample y ∈ ℝ^D are obtained as an index vector k ∈ ℤ^D:

$$\mathbf{k} = (k_1, \ldots, k_D)^T = \mathrm{floor}(\mathbf{y}) \tag{7}$$

where floor rounds each vector component towards −∞. The D-dimensional volume of the frequency polygon bins in the input space is given by:

$$V = \left( \det \mathbf{A} \right)^{-1} = \left( \prod_{s=1}^{D} \lambda_s \right)^{-1} \tag{8}$$

where the λ_s are the eigenvalues of A. Only positive definite matrices A are considered in this paper, so λ_s > 0.

There are two methods to define a frequency polygon probability density, namely the Linear Blend Frequency Polygon (LBFP) and the Simplex Frequency Polygon (SFP), as seen in [55], page 107, and exemplified in Section 3. The LBFP method divides the transformed space into hypercubes, and a linear blend is carried out in each hypercube. In our case, the hypercubes with integer vertex coordinates are considered. The associated probability density reads:

$$p_{LBFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b}) = \frac{1}{NV} \sum_{(r_1, \ldots, r_D) \in \{0,1\}^D} c_{r_1, \ldots, r_D}\, m_{k_1 + r_1, \ldots, k_D + r_D} \tag{9}$$

where the linear blending coefficients c_{r_1,...,r_D} are:

$$c_{r_1, \ldots, r_D} = \prod_{s=1}^{D} u_s^{r_s} \left( 1 - u_s \right)^{1 - r_s} \tag{10}$$

$$u_s = y_s - \mathrm{floor}(y_s) \tag{11}$$

On the other hand, the bin count m_{z_1,...,z_D} ∈ ℕ for the index vector z = (z_1, ..., z_D)^T ∈ ℤ^D is given by:

$$m_{z_1, \ldots, z_D} = \sum_{j=1}^{N} I\left( \mathbf{z} = \mathrm{round}(\mathbf{y}_j) \right) \tag{12}$$

$$\mathbf{y}_j = \mathbf{A} \mathbf{x}_j + \mathbf{b} \tag{13}$$

Please note that round rounds each vector component towards the nearest integer, and I stands for the indicator function. It must be highlighted that round is used instead of floor in (12) because the centers of the bins are the vertices of the hypercube to which the transformed test sample y belongs. This is depicted in Figs. 1 and 2.

The second method to define a probability density function divides the transformed space into simplices, and then a linear interpolation based on barycentric coordinates is carried out in each simplex. There are multiple ways to split ℝ^D into simplices; here the Kuhn triangulation is considered [32]. Other possibilities include Sallee's corner slicing [50] and middle cut [51] triangulations, and Haiman's triangulation [23]. Kuhn's approach has been used here because it does not rely on the recursive triangulation of smaller polytopes or hypercubes, which simplifies the implementation. Kuhn's triangulation first divides the space into hypercubes, and then each hypercube is split into D! simplices. Like in the linear blend case, the hypercubes with integer vertex coordinates are considered here. For a hypercube with opposite vertices k = (k_1, ..., k_D)^T ∈ ℤ^D and k′ = (k_1 + 1, ..., k_D + 1)^T ∈ ℤ^D, each shortest path over the edges of the hypercube from k to k′ is associated to one of the D! simplices; the simplex is obtained as the convex hull of the considered shortest path. It must be remembered that each simplex has D + 1 vertices. Given the D + 1 vertices of a simplex in the transformed space y_0, y_1, ..., y_D, the barycentric coordinates w_0, w_1, ..., w_D of the simplex for a transformed test point y are given by:

$$\begin{pmatrix} w_1 \\ \vdots \\ w_D \end{pmatrix} = \mathbf{T}^{-1} \left( \mathbf{y} - \mathbf{y}_0 \right) \tag{14}$$

$$w_0 = 1 - \sum_{s=1}^{D} w_s \tag{15}$$

$$\forall i, j \in \{1, \ldots, D\}: \quad t_{ij} = y_{ji} - y_{0i} \tag{16}$$

where y_{ji} is the i-th component of y_j. Please note that:

$$\sum_{s=0}^{D} w_s = 1 \tag{17}$$

On the other hand, Eq. (14) is rearranged as a linear system of equations for numerical stability:

$$\mathbf{T} \begin{pmatrix} w_1 \\ \vdots \\ w_D \end{pmatrix} = \mathbf{y} - \mathbf{y}_0 \tag{18}$$

Then the linearly interpolated probability density function at test point x is given by:

$$p_{SFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b}) = \frac{1}{NV} \sum_{s=0}^{D} w_s\, m_{y_{s1}, \ldots, y_{sD}} \tag{19}$$

In order to select the a priori probabilities p(A, b), a uniform prior is used over suitable sets of matrices A and vectors b, as proposed in [39]. The bin width on the transformed space is h = 1, so we can restrict b to lie in the hypercube [0, 1]^D; otherwise the resulting frequency polygon would be identical to one with b ∈ [0, 1]^D. Consequently b is drawn from the uniform distribution over the hypercube [0, 1]^D. The type of matrices A that will be considered is the set of matrices that can be decomposed as the product of a rotation matrix U and a diagonal scaling matrix Λ:

$$\mathbf{A} = \mathbf{U} \boldsymbol{\Lambda} \tag{20}$$

$$\det(\mathbf{U}) = 1 \tag{21}$$

$$\det(\mathbf{A}) = \det(\boldsymbol{\Lambda}) = \prod_{s=1}^{D} \lambda_s \tag{22}$$

where λ_s is the s-th diagonal element of Λ. This notation is consistent with Eq. (8). The Haar probability measure is the uniform distribution over the set of unitary matrices, so a random rotation matrix U is drawn from it by means of the algorithm proposed in [41]. First of all, a random matrix of size D × D is generated whose entries are drawn from the standard normal univariate distribution. Then the QR decomposition of that matrix is computed. After that, a diagonal matrix is built with the signs (either −1 or +1) of the diagonal of the R matrix resulting from the decomposition. Finally, the Q matrix from the decomposition is multiplied by this diagonal matrix to yield a unitary matrix. If that matrix has determinant 1, then it is U; otherwise the sign of one of its columns is changed to yield U (a code sketch of this sampling procedure is given after Eq. (23)).

The elements of the scaling matrix Λ are scale parameters. This means that the Jeffreys prior for scale parameters can be used [28]. This is done by drawing log(λ_s) from the uniform distribution over a given interval of real numbers [log(λ_min), log(λ_max)]. The bin width in the s-th transformed dimension, measured on the input space, is linked to the s-th scale parameter by

$$h_s = \lambda_s^{-1} \tag{23}$$
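The sampling procedure just described can be sketched as follows; λ_min and λ_max are assumed to be precomputed from Eqs. (26)-(28) below, and the function name is ours:

```python
import numpy as np

def sample_transform(D, lam_min, lam_max, rng=np.random.default_rng()):
    # Haar-distributed rotation via the QR-based algorithm of [41].
    G = rng.standard_normal((D, D))
    Q, R = np.linalg.qr(G)
    Q = Q @ np.diag(np.sign(np.diag(R)))   # sign fix makes Q Haar-distributed
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]                 # flip one column so that det(U) = +1
    # Jeffreys prior on the scales: log(lambda_s) uniform on [log lam_min, log lam_max].
    lam = np.exp(rng.uniform(np.log(lam_min), np.log(lam_max), size=D))
    A = Q @ np.diag(lam)                   # Eq. (20)
    b = rng.uniform(0.0, 1.0, size=D)      # uniform offset on [0, 1]^D
    return A, b
```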

Eq. (23) implies that we can use the theory of multivariate frequency polygons to obtain tentative values for λ_min and λ_max. The optimal bin width for multivariate frequency polygons is studied in [55], page 90, and the following bin width is recommended:

$$\hat{h} = 2 \sigma N^{-\frac{1}{4+D}} \tag{24}$$

where σ is the standard deviation in the relevant dimension. For our purposes it is more practical to compute a single value of σ which comprises the information from all the dimensions of the input space:

$$\sigma = \sqrt{\frac{1}{D} \operatorname{trace}(\mathbf{C})} \tag{25}$$

Next the recommended bin width is inverted to obtain the recommended scale parameter:

$$\hat{\lambda} = \frac{1}{\hat{h}} = \frac{1}{2\sigma} N^{\frac{1}{4+D}} \tag{26}$$

This tentative value cannot be expected to fit all possible datasets. Hence we take:

  log ðkmin Þ ¼ hmin þ log k^   log ðkmax Þ ¼ hmax þ log ^k

ð27Þ ð28Þ

where θ_min, θ_max ∈ ℝ are tunable parameters, with θ_min < θ_max.

The gradient of the proposed estimator can be obtained analytically. Appendix A contains the derivations of the gradient for the LBFP and SFP variants of the DFPT.

4.2. Convergence proof

In this subsection the convergence of the DFPT to the true pdf as N → ∞ is proved. The proof has two parts. The first one shows that each multivariate frequency polygon converges to the true pdf, where it is assumed that the associated matrix A_N and vector b_N are obtained according to the procedure detailed in Section 4.1. The second and last part proves the same result for the DFPT.

Theorem 1. The pdf of a multivariate frequency polygon chosen according to the criteria explained in Section 4.1 converges to the true pdf as the number of training samples tends to infinity. That is,


$$\lim_{N \to \infty} p(\mathbf{x} \mid \mathbf{A}_N, \mathbf{b}_N) = p(\mathbf{x}) \tag{29}$$

Proof. The convergence of a multivariate frequency polygon to the true pdf is guaranteed whenever the bin widths h_{Nk} are functions of N such that the following two conditions hold [54,55,27]:

$$\lim_{N \to \infty} \max_{k \in \{1, \ldots, D\}} h_{Nk} = 0 \tag{30}$$

$$\lim_{N \to \infty} N v_N = \infty \tag{31}$$

where v_N denotes the D-dimensional bin volume (8) for N training samples.

Let hmin ; hmax 2 R be the values of the tunable parameters. We must show that both (30) and (31) are satisfied. From (27) it can be inferred that:

kNk P ^k exp ðhmin Þ

ð32Þ

Then from (23) and (26),

$$h_{Nk} \le \frac{\hat{h}}{\exp(\theta_{min})} \tag{33}$$

Now from (24) and (25) we obtain:

$$h_{Nk} \le \frac{2 N^{-\frac{1}{4+D}} \sqrt{\frac{1}{D} \operatorname{trace}(\mathbf{C})}}{\exp(\theta_{min})} \tag{34}$$

Since this holds for all k ∈ {1, ..., D}, it is also true that

$$\max_{k \in \{1, \ldots, D\}} h_{Nk} \le \frac{2 N^{-\frac{1}{4+D}} \sqrt{\frac{1}{D} \operatorname{trace}(\mathbf{C})}}{\exp(\theta_{min})} \tag{35}$$

and then (30) follows. Next, starting from (28), an analogous reasoning yields:

$$h_{Nk} \ge \frac{2 N^{-\frac{1}{4+D}} \sqrt{\frac{1}{D} \operatorname{trace}(\mathbf{C})}}{\exp(\theta_{max})} \tag{36}$$

Then, from (8) and (23) we obtain:

$$v_N \ge \left( \frac{2 \sqrt{\frac{1}{D} \operatorname{trace}(\mathbf{C})}}{\exp(\theta_{max})} \right)^D N^{-\frac{D}{4+D}} \tag{37}$$

Or equivalently,

$$N v_N \ge \left( \frac{2 \sqrt{\frac{1}{D} \operatorname{trace}(\mathbf{C})}}{\exp(\theta_{max})} \right)^D N^{\frac{4}{4+D}} \tag{38}$$

so that (31) follows. □

Theorem 2. The DFPT converges to the true pdf as the number of training samples tends to infinity. That is,

$$\lim_{N \to \infty} \mathrm{DFPT}(\mathbf{x}) = p(\mathbf{x}) \tag{39}$$

Proof. From (3) it can be inferred that the p_ZeroDens p(x | ZeroDens) term in the right hand side of (2) tends to zero as N → ∞. Hence only the second term matters as N → ∞. Theorem 1 implies that all the terms of the summation on i tend to p(x) as N → ∞, so the theorem follows. □

4.3. Mode finding

Many applications of pdf estimation require a procedure to find a local maximum of the density. The mean shift algorithm is an algorithm of choice for kernel density estimators [18,11]. Starting from a test point, estimates of the mode are obtained iteratively by setting the new estimate to the weighted average of the training samples close to the old estimate, where


the weights are given by the density kernels. While this has proved to be an effective procedure, problems might appear in those regions where the input density is low, so that the weighted average is less reliable. The rationale behind the DFPT is that an average of a certain number of randomly built frequency polygons can attain a good pdf estimation performance. Moreover, the number of bins which are combined to yield the density estimation of each frequency polygon can help to overcome the issues associated to low density regions. The mode finding procedure that we propose here is inspired by this observation.

Let x(0) be a test sample, such that we would like to find the closest local maximum (mode) of the estimated pdf. An iterative procedure is considered, where we move from x(t) to x(t+1) at time step t. This is in line with the well known mean shift algorithm [11,18]. For the i-th frequency polygon, the hypercube (LBFP version) or simplex (SFP version) which contains x(t) is found, along with its set of vertices V_i(t). Then the vertex with the maximum training sample count is found:

$$\bar{\mathbf{z}}_i(t) = \arg\max_{\mathbf{z} \in V_i(t)} m_{z_1, \ldots, z_D} \tag{40}$$

where m_{z_1,...,z_D} ∈ ℕ is the bin count for the index vector z = (z_1, ..., z_D)^T ∈ ℤ^D, and z̄_i(t) ∈ ℤ^D is the index vector of the vertex of interest. Next the obtained vertex is backprojected onto the input space:

$$\bar{\mathbf{x}}_i(t) = \mathbf{A}_i^{-1} \left( \bar{\mathbf{z}}_i(t) - \mathbf{b}_i \right) \tag{41}$$

This means that x̄_i(t) ∈ ℝ^D is a point of maximum density in the input space, according to the i-th frequency polygon. Consequently, we may average the points coming from all frequency polygons to obtain the next estimation of the mode:

$$\mathbf{x}(t+1) = \frac{1}{H} \sum_{i=1}^{H} \bar{\mathbf{x}}_i(t) \tag{42}$$

It is worth noting that no linear blending or linear interpolation is necessary to evaluate (40), which speeds up the process. The algorithm can be summarized as follows:

1. At time step t = 0, take x(t) to be the test sample.
2. For each frequency polygon i ∈ {1, ..., H}, find the hypercube (LBFP version) or simplex (SFP version) which contains x(t) and use (40) to determine the vertex z̄_i(t) ∈ ℤ^D with the maximum training sample count.
3. For each frequency polygon i ∈ {1, ..., H}, backproject z̄_i(t) onto the input space by (41) to yield x̄_i(t) ∈ ℝ^D.
4. Average the backprojections x̄_i(t) coming from all frequency polygons i ∈ {1, ..., H} by (42) to obtain the next estimation of the mode x(t+1).
5. If convergence has been reached, i.e. x(t+1) = x(t), or a maximum time step count has been reached, then halt. Otherwise go to step 2.

In order to illustrate the algorithm, an example is provided next. The isotropic Gaussian distribution with D = 2, unit standard deviation and zero mean has been considered, and N = 100,000 samples have been drawn from it. This is a unimodal distribution, and the mode is the coordinate origin. Consequently, it is expected that the algorithm converges to the coordinate origin from any starting point. The LBFP version of the procedure has been chosen, with H = 50 frequency polygons and θ_min = −1, θ_max = 0. A test point x(0) = (−1.1, −1.1)^T has been selected, and the algorithm has been run until t = 6. The result is depicted in Fig. 4. It can be seen that the sequence of estimations of the mode (illustrated with black arrows) converges to the mode of the distribution. Fig. 4a shows that at each time step t the 50 backprojections x̄_i(t) (hollow markers) form a cluster whose center is the next estimation of the mode x(t+1) (solid markers). Fig. 4b focuses on one of the 50 frequency polygons, and shows how the backprojections x̄_i(t) for that frequency polygon (hollow markers) always lie on a vertex of the hypercube lattice (gray lines) defined by the frequency polygon. Please note that for D = 2 the hypercubes are squares, and that the affine transform A_i x + b_i associated to the considered frequency polygon causes the squares to backproject to the input space as parallelograms.

5. Discussion

Here we discuss some important features of our proposal:

- The time complexity of the method is O(D² H N) for the training phase. For the test phase we have O(2^D H M) for the LBFP version, and O(D³ H M) for the SFP version, which accounts for the differences in the CPU time reported in Sections 6.2 and 6.3. The D² factor of the training phase comes from the matrix-vector multiply Ax. The 2^D factor of the LBFP is due to the use of the 2^D vertices of a hypercube as mesh points to compute the linear blend (9). Finally, the solution of a linear system of equations of size D produces the D³ factor of the SFP (18). These values indicate that our proposal is best suited for moderate input dimension D. On the other hand, it is well suited for very large training sets


Fig. 4. Example of the mode finding algorithm: (a) sequence of estimations of the mode x(t), along with the backprojections x̄_i(t) for all frequency polygons i ∈ {1, ..., 50}; (b) sequence of estimations of the mode x(t), along with the backprojections x̄_i(t) for one frequency polygon and the sides of the backprojected hypercubes (squares for D = 2). The estimations of the mode x(t) are shown with solid markers, and the backprojections x̄_i(t) are shown as hollow markers. The time sequence of the algorithm is depicted with black arrows, and the sides of the hypercubes are shown as gray lines.

(high N), since the training time is linear in N and the test time does not depend on N. This is possible due to the summarizing property of the binning process. These considerations are confirmed by the experiments of Sections 6.3 and 6.5, where our proposal attains good computational complexity results for real datasets with large N and moderate D.
- The space complexity is the same as that of the DHT: O(H · f(D)). This is because the binning structure is the same as that of the DHT, and is confirmed by the peak memory use results reported in Section 6.2. The dependence on D cannot be expressed analytically, because the number of nonempty bins depends on the structure of the input distribution. Please note that only nonempty bins require memory allocation, which explains the low memory requirements of our proposal across all the experiments.
- It has been proved in Section 4.2 that the pdf estimator converges to the true pdf as N → ∞. The proof derives from the convergence property of multivariate frequency polygons. The experiments confirm that good convergence rates are attained for practical values of N.
- From a theoretical point of view, the original (naive) kernel density estimator has a convergence rate of O(N^{−4/5}) to the true pdf, which is better than the O(N^{−2/3}) convergence rate of multivariate histograms [55]. On the other hand, the frequency polygon has the same rate O(N^{−4/5}) as the kernel density estimator [52]. This accounts for the results where the DFPT outperforms the DHT (Section 6). It must also be highlighted that in practice the naive kernel density estimator is not applied exactly but approximated to speed up the process, so its theoretical convergence rate does not apply in many cases.
- The proposed mode finding algorithm of Section 4.3 is tested in Section 6.5. The results demonstrate its accuracy and its low computational complexity. It must be noted that the mode finding algorithm has complexity O(D² H (N + M)). The O(2^D H M) and O(D³ H M) complexities of the pdf estimation are no longer applicable for mode finding, because the proposed algorithm does not carry out any interpolation.
- Overall it can be said that the accuracy of LBFP is the best among the compared methods, and SFP is the second best. In terms of computational complexity the roles are reversed, and SFP runs faster than LBFP due to the differences in the complexity of the test phase: a 2^D factor for LBFP versus a D³ factor for SFP. The CPU time requirements are reasonable for moderate D, and the memory use is low in all situations. Hence, our proposal stands as an alternative to previous nonparametric pdf estimation methods.

Next the main differences of the DFPT with respect to the DHT are listed:

- Frequency polygons are used rather than histograms. This leads to a continuous pdf estimator, while the DHT estimator is discontinuous. Moreover, the convergence order is enhanced from O(N^{−2/3}) for the DHT to O(N^{−4/5}) for the DFPT.
- Two frequency polygon construction methods are considered, namely the Linear Blend Frequency Polygon (LBFP, based on hypercubes) and the Simplex Frequency Polygon (SFP, based on simplices). Both integrate information from several bins in order to estimate the density at a given test point: 2^D bins for LBFP and D + 1 bins for SFP. In contrast to this, each histogram of the DHT only uses one bin to estimate the density at a given test point.
- The recommended bin width ĥ and the recommended scale parameter λ̂ for DFPT and DHT are different.
- The gradient of the estimated pdf is obtained analytically for both LBFP and SFP. The DHT does not have any gradient because each histogram is a piecewise constant function.
- A completely new mode finding algorithm is proposed, which is much faster than the algorithm that was proposed for the DHT. In addition to this, the new algorithm has fewer tunable parameters, so it is easier to use in applications.


Fig. 5. Synthetic distributions with D = 2: (a) Ball, (b) Ellipse, (c) Hexagon, (d) Star, (e) TwoShapes, and (f) X. Darker tones mean higher probability density.

6. Experimental results

Next the results of the experiments that have been designed to test our proposal are reported.¹ Five sets of experiments have been carried out: low dimensional synthetic data (Section 6.1), experiments about the input dimensionality (Section 6.2), machine learning data (Section 6.3), and two applications of the mode finding algorithm proposed in Section 4.3, namely image denoising (Section 6.4) and object tracking (Section 6.5). For the comparisons we have selected five competing methods:

¹ The source code and several demos of our proposal are available at http://www.lcc.uma.es/%7Eezeqlr/dfpt/dfpt.html.

- Our previous estimator, the Discrete Histogram Transform, which we denote DHT [39]. It averages the estimations from several multivariate histograms. Each histogram is computed on a transformed space which is obtained from a randomly chosen affine transform. The resulting estimator is not continuous.
- The kd-tree binary tree algorithm, which we denote KDT [6,21]. KDT builds trees where each node represents a set of samples by means of a D-dimensional bounding box which contains them all. Non-leaf nodes have two child nodes which are obtained by splitting on the widest dimension of the bounding box of the parent node. The boxes of the children can have different sizes, so that the resolution of the tree adapts to the dataset.
- The anchors hierarchy binary tree algorithm, which we denote ANC [44]. It maintains a set of anchors, each one with a pivot point. Each sample is assigned to the anchor of the closest pivot point. At each step of the algorithm, a new anchor is created so that a sufficiently large set of samples is assigned to it. When all the anchors have been created, a binary tree is built to store them.
- The Fast Gauss Transform, which we denote FGT [22]. FGT assumes that the pdf is to be estimated by placing a Gaussian kernel at each sample, i.e. it is an optimization of the kernel density estimator. Then it uses two mathematical expansions of the sum of the Gaussian kernels in order to yield a complexity of O(2^D (M + N)), which is adequate for low values of D.
- The Improved Fast Gauss Transform, which we denote IFGT [65]. It is an enhancement of FGT which uses a different expansion and a tree data structure so as to speed up the process. It attains O(M + N) complexity in many cases, although the performance depends on the structure of the dataset. This is because a clustering of the dataset is necessary in order to build the tree structure, and the quality of the clustering impacts the results.

The implementations that we have used are freely available as Matlab scripts with additional C/C++ code for the most time consuming modules; the DFPT has been implemented the same way. The implementations for KDT and ANC come from [33], and those for FGT and IFGT from [45]. The implementation of DHT is available at [40]. For the approximate kernel methods (KDT, ANC, FGT and IFGT), we have set their relative and absolute error limits to 10⁻³, which is the default for all the implementations. The reported CPU times correspond to a single core of a 3 GHz CPU with 64 bit architecture, and they include both the training and test procedures unless otherwise noted.

6.1. Low dimensional synthetic data

Several low dimensional synthetic distributions have been considered to test the accuracy of the pdf estimators. Six uniform and non-uniform distributions with D = 2 have been chosen; they are depicted in Fig. 5. Four distributions with D = 3 have also been selected:


- Circle. The values of x₁ and x₂ are drawn from the uniform distribution over the circle with center at the coordinate origin and unit radius. The value of x₃ is drawn from a univariate Gaussian with zero mean and a standard deviation of 0.05.
- Square. The values of x₁ and x₂ are drawn from the uniform distribution over the square with vertices at (0,0,0), (1,0,0), (0,1,0) and (1,1,0). The value of x₃ is drawn from a univariate Gaussian with zero mean and a standard deviation of 0.05.
- Segment. The value of x₁ is drawn from the uniform distribution over the segment with vertices at (0,0,0) and (1,0,0). The values of x₂ and x₃ are drawn from a bivariate Gaussian with zero mean and diagonal covariance matrix C = 0.05 I.
- Sphere. The values of x₁, x₂ and x₃ are drawn from a trivariate Gaussian with zero mean and diagonal covariance matrix C = 0.05 I. The samples outside the sphere with center at the coordinate origin and unit radius are discarded. That is, the support of the pdf is the previously mentioned sphere.

The kernel bandwidth must be optimized for all of the kernel methods (KDT, ANC, FGT and IFGT). The DHT and the DFPT require the optimization of the tunable parameters θ_min, θ_max ∈ ℝ. All these optimizations have been done with the Nelder–Mead method [47], with a limit of 100 function evaluations. The parameter optimization has been run with 1,000 samples as the training set and a disjoint validation set of another 1,000 samples, so that the optimization stage does not take too long. The objective function to be minimized has been the cross validated Average Negative Log-Likelihood (ANLL, lower is better):

$$\mathrm{ANLL} = -\frac{1}{M} \sum_{j=1}^{M} \log \tilde{p}(\mathbf{x}_j) \tag{43}$$

where p̃(x_j) stands for the estimated probability density for the test sample x_j. These parameter optimizations have been done for each dataset and method. Once the optimal parameter values have been determined, disjoint training and test sets with N = M = 100,000 samples have been used to obtain the pdf estimation performance results. Since the distributions are synthetic, it is possible to use the Mean Squared Error as the performance measure:

$$\mathrm{MSE} = \frac{1}{M} \sum_{j=1}^{M} \left( p(\mathbf{x}_j) - \tilde{p}(\mathbf{x}_j) \right)^2 \tag{44}$$

where p(x_j) stands for the true probability density for the test sample x_j. We have chosen MSE over ANLL to measure the estimation performance because ANLL does not have the reference of the true pdf, so it is less precise when it comes to evaluating density estimators. On the other hand, the ANLL is the right choice for parameter optimization, since in a practical setting the true pdf is not available to tune the parameters.

Ten runs have been carried out for each dataset and method, each with different randomly generated data. The results for MSE, CPU time and peak memory use are reported in Tables 1–3, respectively. It can be noticed that our proposals LBFP and SFP achieve the best pdf estimation accuracy, followed by DHT. The fastest approach is IFGT, and the smallest memory requirements are those of FGT. However, LBFP and SFP attain competitive results in terms of time and space complexity. The KDT and ANC approaches stand behind in all three criteria.

Graphical depictions of the density estimations are shown for the Hexagon and Square datasets in Figs. 6 and 7, respectively. For the 3D Square dataset, the plane x₁ = 0.5 has been chosen to obtain the pictures. The plots show that the kernel based methods (KDT, ANC, FGT and IFGT) suffer from an excessive spread of the probability mass, which leads to underestimation of the pdf in the higher density regions. On the other hand, LBFP, SFP and DHT concentrate the probability mass so as to match the true pdf more closely, as confirmed by the quantitative results in Table 1. The main difference among the three methods is that LBFP and SFP yield smoother estimations than DHT. This is due to the advantages of continuous first order approximation over discontinuous zeroth order approximation which were outlined in Section 3.
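For completeness, the two performance measures (43)-(44) are straightforward to compute; a minimal sketch, with the density arrays as assumed inputs:

```python
import numpy as np

def anll(est_dens):                 # Eq. (43): lower is better
    return -np.mean(np.log(est_dens))

def mse(true_dens, est_dens):       # Eq. (44): requires the true pdf, so synthetic data only
    return np.mean((np.asarray(true_dens) - np.asarray(est_dens)) ** 2)
```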

Table 1. Mean Squared Error (MSE) for the low dimensional synthetic datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

                 LBFP          SFP           DHT           KDT           ANC           FGT           IFGT
Ball             .107 (.004)   .108 (.003)   .117 (.011)   .179 (.012)   .179 (.012)   .181 (.012)   .181 (.013)
Ellipse          12.1 (.14)    12.1 (.28)    12.3 (.36)    25.4 (1.40)   25.2 (1.44)   24.9 (.81)    24.9 (.81)
Hexagon          .086 (.004)   .086 (.006)   .075 (.011)   .191 (.011)   .191 (.010)   .195 (.010)   .196 (.010)
Star             .020 (.001)   .020 (.001)   .024 (.004)   .031 (.004)   .031 (.004)   .032 (.004)   .032 (.004)
TwoShapes        .121 (.003)   .121 (.003)   .131 (.012)   .206 (.011)   .205 (.012)   .210 (.011)   .211 (.011)
X                .136 (.004)   .137 (.005)   .144 (.013)   .233 (.011)   .231 (.013)   .234 (.012)   .234 (.012)
Circle           .082 (.004)   .083 (.004)   .087 (.006)   .286 (.021)   .286 (.021)   .181 (.020)   .197 (.030)
Square           .688 (.011)   .692 (.021)   .707 (.036)   2.03 (.101)   2.04 (.098)   1.88 (.131)   2.09 (.399)
Segment          12.0 (.686)   12.9 (1.84)   11.9 (.872)   56.5 (6.76)   56.4 (5.47)   79.4 (18.5)   87.8 (21.0)
Sphere (×10⁻³)   .938 (.032)   .935 (.019)   1.14 (.070)   1.31 (.072)   1.31 (.070)   1.34 (.072)   1.35 (.073)
Rank sums        14            20            26            51            49            55            65

Table 2. CPU times (in minutes) for the low dimensional synthetic datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

            LBFP        SFP         DHT         KDT          ANC           FGT          IFGT
Ball        0.5 (0.0)   0.6 (0.0)   0.3 (0.0)   13.8 (2.6)   21.9 (5.5)    11.3 (0.3)   0.1 (0.0)
Ellipse     0.5 (0.0)   0.6 (0.0)   0.3 (0.0)   4.0 (0.4)    6.0 (1.2)     32.1 (0.6)   0.1 (0.0)
Hexagon     0.5 (0.0)   0.7 (0.1)   0.3 (0.1)   12.4 (1.3)   17.6 (3.8)    11.4 (0.3)   0.1 (0.0)
Star        0.5 (0.0)   0.6 (0.0)   0.3 (0.0)   14.4 (2.5)   24.2 (8.2)    11.4 (0.3)   0.1 (0.0)
TwoShapes   0.5 (0.0)   0.7 (0.0)   0.3 (0.0)   8.6 (1.5)    13.8 (3.4)    18.1 (1.7)   0.1 (0.0)
X           0.5 (0.0)   0.6 (0.0)   0.3 (0.0)   9.6 (1.6)    14.1 (3.2)    16.9 (1.3)   0.1 (0.0)
Circle      3.6 (0.2)   2.3 (0.2)   0.9 (0.1)   15.0 (3.0)   22.8 (6.7)    11.9 (0.7)   0.7 (0.0)
Square      3.2 (0.2)   2.1 (0.1)   0.8 (0.1)   28.0 (4.5)   37.6 (10.0)   11.9 (0.6)   0.7 (0.0)
Segment     2.3 (0.2)   1.6 (0.1)   0.6 (0.1)   41.6 (8.5)   66.5 (20.6)   11.8 (0.6)   0.7 (0.0)
Sphere      3.6 (0.5)   2.2 (0.1)   0.9 (0.1)   26.6 (4.3)   49.4 (13.3)   11.9 (0.6)   0.7 (0.0)
Rank sums   34          36          19          57           67            56           11

Table 3. Peak memory use (in GBytes) for the low dimensional synthetic datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

            LBFP        SFP         DHT         KDT           ANC            FGT         IFGT
Ball        .17 (.01)   .20 (.01)   .19 (.00)   7.66 (.59)    15.49 (.60)    .17 (.00)   .19 (.01)
Ellipse     .17 (.00)   .18 (.00)   .19 (.00)   1.95 (.07)    2.22 (.09)     .16 (.01)   .15 (.00)
Hexagon     .17 (.01)   .19 (.00)   .18 (.01)   6.11 (.57)    12.82 (.63)    .17 (.00)   .17 (.00)
Star        .18 (.00)   .18 (.00)   .18 (.00)   9.35 (1.57)   1.25 (1.69)    .17 (.01)   .20 (.00)
TwoShapes   .19 (.01)   .19 (.00)   .18 (.00)   4.60 (.12)    5.28 (.15)     .17 (.00)   .17 (.00)
X           .18 (.00)   .18 (.01)   .17 (.00)   5.00 (.00)    5.94 (.43)     .18 (.00)   .17 (.00)
Circle      .20 (.03)   .20 (.02)   .20 (.03)   4.28 (.20)    4.88 (.18)     .19 (.02)   .18 (.02)
Square      .20 (.04)   .20 (.00)   .21 (.03)   6.79 (.15)    13.01 (2.04)   .18 (.00)   .21 (.00)
Segment     .21 (.00)   .23 (.00)   .21 (.00)   14.07 (.90)   14.96 (.77)    .20 (.00)   .22 (.00)
Sphere      .21 (.01)   .20 (.00)   .20 (.00)   1.70 (.66)    12.71 (.90)    .19 (.00)   .20 (.00)
Rank sums   30          40          36          60            70             16          28

Fig. 6. Results for the Hexagon dataset: (a) true pdf, (b) LBFP, (c) SFP, (d) DHT, (e) KDT, (f) ANC, (g) FGT, (h) IFGT, and (i) probability density key.

6.2. Input dimensionality

A set of experiments has been designed to assess the behavior of the methods depending on the input dimension D. For values D ∈ {2, 3, 4, 5, 6}, the following three synthetic datasets have been considered:


Fig. 7. Results for the Square dataset: (a) true pdf, (b) LBFP, (c) SFP, (d) DHT, (e) KDT, (f) ANC, (g) FGT, (h) IFGT, and (i) probability density key.

- UniformSphere. The uniform distribution over the D-dimensional hypersphere with unit radius.
- SphericalGauss. The D-dimensional Gaussian distribution with zero mean and unit covariance matrix.
- ElongatedGauss. The D-dimensional Gaussian distribution with zero mean and a diagonal covariance matrix C with diagonal elements given by:

$$c_{ss} = \begin{cases} D^2 & \text{if } s = 1 \\ 1 & \text{if } s > 1 \end{cases} \tag{45}$$

where s ∈ {1, ..., D} stands for the row index of the matrix C. The three distributions are shown for the D = 2 case in Fig. 8.

The experimental setup has been the same as in Section 6.1. The results in terms of MSE, overall CPU time and peak memory use are reported in Fig. 9. As in Section 6.1, LBFP, SFP and DHT exhibit the best estimation accuracy. Among these three methods DHT is the fastest, SFP follows and LBFP is the slowest. On the other hand, the three of them have low memory requirements. The KDT and ANC methods have high computational requirements in terms of time and space. Finally, FGT exhibits a nearly constant time complexity with respect to D, although its accuracy is very poor.

Additional data concerning the separate training and test CPU times of LBFP, SFP and DHT are shown in Fig. 10. It must be noted that for FGT and IFGT this split does not make sense because both methods process the training and test sets jointly, while for KDT and ANC the timing data cannot be split into training and test because the considered implementations are

Fig. 8. Synthetic distributions considered in the input dimensionality experiments for the D = 2 case: (a) UniformSphere, (b) SphericalGauss, and (c) ElongatedGauss. Darker tones mean higher probability density.

Fig. 9. Results for the input dimensionality experiments. From top to bottom: UnitSphere, SphericalGauss, ElongatedGauss, and key. Overall CPU times (second column) are in seconds, and peak memory uses (third column) are in KBytes. Please note that all the plots are in log scale.

monolithic, i.e. a single function does the whole job. As can be seen, the training times are fairly similar, because the training phase for the three methods consists of the computation of the bin counts. However the test times exhibit important differences, since each histogram of the DHT (the fastest of the three) only uses one bin to estimate the density at a given test point, while SFP uses D + 1 bins and LBFP (the slowest of the three) uses 2^D bins. This dependency on D is the reason that the differences increase as the dimensionality grows.

6.3. Machine learning data

Density estimation experiments with real data have been run with machine learning datasets from the UCI Repository of Machine Learning Databases [3]. Seven datasets have been used: CorelColorHistogram, CorelColorMoments, CorelCoocTexture, CorelLayoutHistogram, CoverType, MilleniumSimulation, and MiniBooNE, which we will denote CCH, CCM, CCT, CLH, CT,


Fig. 10. Training CPU times (first row) and test CPU times (second row) for the input dimensionality experiments. From left to right: UnitSphere, SphericalGauss and ElongatedGauss. The key is shown at the bottom (third row). CPU times are in seconds. Please note that all the plots are in log scale.

MS and MBN, respectively. The dimensionality of the datasets has been reduced to D = 4 by Principal Components Analysis (PCA) in order to enable all the competing methods to produce good results. The parameter optimization method has been the same as that of Section 6.1, with the exception that 1% of the available data has been used as the training set and another disjoint 1% as the validation set. After the parameters have been optimized, we have used the 10-fold cross-validated ANLL as the performance measure because for these real datasets the true pdf is not available, which rules out the use of the MSE.

The results for ANLL, overall CPU time and peak memory use are reported in Tables 4–6, respectively. Our LBFP proposal attains the best pdf estimation accuracy results, followed by DHT and SFP. This is in line with the results reported in Sections 6.1 and 6.2. In terms of CPU time needs, the SFP version attains good results, although the LBFP version is slower. As seen in Tables 7 and 8, this is due to the differences in the test CPU times, as explained in Section 6.2. Finally, all the compared methods have similar memory requirements, except KDT and ANC, which exhibit a much larger memory consumption.

Table 4. Average Negative Log Likelihood (ANLL) for the UCI machine learning datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

            LBFP          SFP            DHT            KDT          ANC           FGT           IFGT
CCH         4.45 (.01)    4.46 (.02)     4.42 (.01)     4.70 (.02)   4.70 (.02)    4.74 (.02)    4.80 (.07)
CCM         5.35 (.02)    5.36 (.02)     5.36 (.02)     5.49 (.02)   5.49 (.02)    5.52 (.03)    5.66 (.11)
CCT         4.81 (.05)    4.87 (.05)     4.88 (.07)     4.87 (.02)   4.98 (.16)    5.79 (.14)    7.49 (.60)
CLH         3.85 (.02)    3.86 (.02)     3.82 (.02)     4.23 (.02)   4.23 (.01)    4.32 (.01)    4.38 (.04)
CT          10.11 (.34)   10.92 (1.46)   12.08 (1.02)   7.15 (.15)   6.76 (.35)    5.84 (.02)    5.96 (.03)
MS          5.54 (.04)    5.55 (.05)     5.75 (.07)     5.55 (.05)   5.47 (.03)    5.27 (.04)    5.70 (0.25)
MBN         6.74 (.05)    4.30 (1.01)    5.45 (1.35)    2.09 (.16)   0.25 (1.43)   2.30 (2.32)   6.90 (2.98)
Rank sums   15            25             24             29           28            32            43

Table 5. Overall CPU times (in minutes) for the UCI machine learning datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

            LBFP           SFP           DHT           KDT             ANC              FGT            IFGT
CCH         11.08 (0.74)   3.92 (0.33)   1.44 (0.07)   7.81 (1.97)     10.95 (3.54)     2.79 (0.08)    1.70 (0.08)
CCM         10.15 (1.42)   3.48 (0.48)   1.23 (0.18)   17.87 (5.45)    24.56 (7.84)     2.79 (0.09)    1.72 (0.26)
CCT         8.41 (0.83)    2.91 (0.42)   1.06 (0.10)   15.81 (3.95)    31.26 (13.00)    2.82 (0.06)    0.30 (0.09)
CLH         12.07 (0.60)   4.33 (0.25)   1.36 (0.06)   5.18 (1.16)     6.57 (1.96)      2.69 (0.08)    1.38 (0.05)
CT          29.57 (1.81)   9.82 (1.93)   3.09 (0.08)   0.72 (0.09)     1.44 (1.07)      12.96 (1.02)   1.60 (0.07)
MS          27.54 (2.87)   8.95 (0.79)   2.16 (0.17)   62.84 (14.48)   108.41 (33.76)   12.56 (0.58)   2.55 (0.55)
MBN         5.64 (0.47)    4.11 (0.83)   0.98 (0.34)   9.41 (3.43)     63.22 (40.82)    7.99 (0.58)    0.21 (0.10)
Rank sums   40             27            12            35              42               27             13

Table 6. Peak memory use (in GBytes) for the UCI machine learning datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

            LBFP         SFP          DHT          KDT           ANC             FGT          IFGT
CCH         .23 (.00)    .21 (.00)    .20 (.00)    4.97 (.10)    7.27 (.13)      .18 (.00)    .26 (.01)
CCM         .19 (.01)    .08 (.01)    .22 (.00)    11.24 (.00)   23.95 (.00)     .14 (.00)    .10 (.01)
CCT         .14 (.01)    .16 (.00)    .19 (.01)    9.36 (.00)    26.64 (8.70)    .05 (.00)    .07 (.00)
CLH         .25 (.00)    .22 (.00)    .26 (.02)    2.94 (.03)    4.16 (.05)      .19 (.00)    .26 (.01)
CT          .52 (.01)    .52 (.00)    .34 (.00)    .69 (.04)     1.28 (.48)      .33 (.00)    .31 (.00)
MS          3.36 (.00)   3.36 (.00)   3.35 (.00)   26.70 (.34)   34.74 (1.65)    3.32 (.00)   3.37 (.00)
MBN         .36 (.00)    .34 (.02)    .37 (.00)    5.70 (.05)    84.37 (40.59)   .25 (.00)    .27 (.00)
Rank sums   26           21           27           42            49              10           21

Table 7. Training CPU times (in minutes) for the UCI machine learning datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

            LBFP          SFP           DHT
CCH         0.94 (0.05)   0.88 (0.06)   0.90 (0.04)
CCM         0.86 (0.09)   0.77 (0.09)   0.77 (0.11)
CCT         0.70 (0.05)   0.63 (0.08)   0.66 (0.06)
CLH         0.94 (0.03)   0.92 (0.03)   0.86 (0.03)
CT          1.35 (0.10)   1.32 (0.23)   1.57 (0.04)
MS          1.11 (0.12)   1.05 (0.08)   0.95 (0.07)
MBN         0.56 (0.02)   0.79 (0.12)   0.66 (0.17)
Rank sums   18            12            12

Table 8. Test CPU times (in minutes) for the UCI machine learning datasets. Standard deviations are shown in parentheses. Best results are marked in bold in the original.

            LBFP           SFP           DHT
CCH         10.13 (0.68)   3.02 (0.26)   0.53 (0.02)
CCM         9.28 (1.33)    2.69 (0.39)   0.45 (0.07)
CCT         7.70 (0.78)    2.27 (0.33)   0.39 (0.04)
CLH         11.12 (0.56)   3.40 (0.22)   0.50 (0.02)
CT          28.21 (1.71)   8.48 (1.69)   1.50 (0.03)
MS          26.42 (2.75)   7.89 (0.71)   1.20 (0.09)
MBN         5.06 (0.44)    3.30 (0.71)   0.31 (0.16)
Rank sums   21             14            7

6.4. Image denoising

Mode finding algorithms can be used to enhance images corrupted by noise. The mean shift algorithm, which places a cluster center at each local maximum (mode) of the pdf, is a common tool for this purpose [11,34]. First of all, a feature vector is defined which comprises the intensity of the pixel and a scaled version of its coordinates, where the scale factor is a tunable parameter of the algorithm. Then the mode finding algorithm is run, and a mode vector is determined. Finally, the intensity of the restored pixel is the corresponding component of the mode vector. For this set of experiments, the mean shift algorithm is used in combination with KDT, ANC, FGT and IFGT, while the DHT is combined with the mode finding procedure proposed in [39]. Finally, the mode finding algorithm of Section 4.3 is used with our proposals LBFP and SFP.

We have chosen five well known benchmark images from the USC-SIPI Image Database [62], depicted in Fig. 11. We have scaled them down from 512 × 512 to 256 × 256 pixels in order to speed up the procedure. The original RGB color space has been used, with intensity values in the interval [0, 1]. Each RGB color channel has been processed separately, which implies that D = 3, since the feature vectors comprise the intensity for a single color channel and the scaled horizontal and vertical coordinates of the pixels.


Fig. 11. Benchmark color images: (a) Baboon, (b) F16, (c) Lake, (d) Lenna, and (e) Peppers.

Table 9. Peak Signal to Noise Ratio (PSNR) for the image denoising experiment. Best results are marked in bold in the original.

            LBFP      SFP       DHT       KDT       ANC       FGT       IFGT
Baboon      32.3795   32.2794   30.7900   29.3847   29.3851   29.0022   29.0039
F16         34.6527   34.7227   31.0993   32.8254   32.8249   32.3210   32.3221
Lake        33.7038   33.7555   30.7910   31.7205   31.7205   30.0353   30.0363
Lenna       34.6318   34.7507   31.2510   34.9214   34.9217   32.5198   32.5204
Peppers     34.6684   34.8140   30.9798   34.6851   34.6855   31.8055   31.8065
Rank sums   13        8         29        16        15        32        27

Table 10. Overall CPU times (in minutes) for the image denoising experiment. Best results are marked in bold in the original.

            LBFP     SFP      DHT       KDT      ANC      FGT      IFGT
Baboon      5.3043   3.3333   29.1992   4.5444   5.4323   2.1660   5.4824
F16         5.2981   3.1121   28.0944   4.6390   5.4418   2.2589   5.4403
Lake        5.3037   3.2216   28.1716   4.6388   5.5582   2.1712   4.8700
Lenna       5.1671   3.1990   27.4562   4.5079   5.4591   2.1665   5.0594
Peppers     5.2626   3.0624   27.8320   4.5102   5.3821   2.1651   4.6466
Rank sums   23       10       35        15       29       5        23

Table 11. Peak memory use (in MBytes) for the image denoising experiment. Best results are marked in bold in the original.

            LBFP      SFP       DHT       KDT      ANC      FGT      IFGT
Baboon      8.8906    10.5781   25.3164   3.5781   2.7930   0.5703   2.5273
F16         10.7422   11.5742   27.9023   3.5703   3.5391   0.6328   2.5938
Lake        10.0195   11.0469   24.6211   3.5586   3.6172   2.1172   2.7656
Lenna       12.7891   9.8125    24.1172   3.5703   3.9492   2.1250   2.2891
Peppers     10.0508   10.4805   28.3320   3.5625   2.7188   1.9375   2.7695
Rank sums   26        29        35        18       16       5        11

Spherical Gaussian noise with standard deviation 0.05 has been added to the original images. The performance of the methods has been evaluated by the Peak Signal to Noise Ratio (PSNR, higher is better):

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{1}{\frac{1}{3M} \sum_{m=1}^{M} \left\| \bar{\mathbf{x}}^{(m)} - \tilde{\mathbf{x}}^{(m)} \right\|^{2}} \right) \qquad (46)$$

where $M$ is the number of pixels in the image, $\bar{\mathbf{x}}^{(m)}$ is the original (uncorrupted) RGB value of the $m$-th pixel, and $\tilde{\mathbf{x}}^{(m)}$ is the restored RGB value of the $m$-th pixel. For each image and competing method, the scale factor and the method parameters have been optimized, and the results corresponding to the best (highest) PSNR are reported. The quantitative results of the experiment are given in Tables 9–11. It can be seen that LBFP and SFP remove the largest amount of noise, followed by ANC and KDT. The memory usage of all methods is quite small, although LBFP, SFP and DHT require somewhat more space. In terms of CPU time, all methods obtain similar results except DHT, which is considerably slower because its optimal configuration creates many histogram bins. The qualitative results for the F16 image are shown in Fig. 12. LBFP and SFP remove more noise than DHT, while KDT and ANC produce oversmoothing. On the other hand, FGT and IFGT reduce the color range excessively because they find modes which are too far from the starting test points.
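For completeness, Eq. (46) can be computed as follows; a minimal sketch, assuming the images are stored as H × W × 3 arrays with values in [0, 1]:

```python
import numpy as np

def psnr(original, restored):
    """Peak Signal to Noise Ratio of Eq. (46) for RGB images in [0, 1]."""
    diff = np.asarray(original, dtype=float) - np.asarray(restored, dtype=float)
    m = diff.shape[0] * diff.shape[1]        # number of pixels M
    mse = np.sum(diff ** 2) / (3.0 * m)      # (1/(3M)) * sum of squared RGB errors
    return 10.0 * np.log10(1.0 / mse)
```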


Fig. 12. Image denoising results for the F16 benchmark image: (a) noiseless, (b) noisy, (c) LBFP, (d) SFP, (e) DHT, (f) KDT, (g) ANC, (h) FGT, and (i) IFGT.

6.5. Object tracking

For the last set of experiments we have chosen a very popular application of pdf estimators, namely object tracking. This is achieved through the mean shift algorithm, which was outlined in Section 6.4. A pdf is built which reflects the probability that a pixel in a frame is the center of the tracked object. Then the mean shift algorithm is used to find the pixel with the maximum probability density, which is taken to be the position of the object in the frame. This procedure has become a standard approach to this fundamental computer vision problem [64,5,12,61,68]. The Bonn Benchmark on Tracking (BoBoT) has been chosen as the testbed [30]. It consists of 12 videos with ground truth, which are publicly available (http://www.iai.uni-bonn.de/%7Ekleind/tracking/). Example frames are depicted in Fig. 13. In order to build the pdf, the frames have been converted from the original RGB color space to HSV. The color information of the tracked object in the initial frame is used to build a 2D reference histogram of the H and S color channels, with 30 bins for the H channel and 32 bins for the S channel. After that, for each frame and possible location of the tracked object, the local histogram is computed. Each local histogram is compared to the reference histogram by the Bhattacharyya distance [7], and finally a sample is set at the possible location of the object with a weight inversely proportional to the Bhattacharyya distance (a sketch of this weighting step is given after Eq. (47)). Please note that these histograms are not those built by the DHT method. Also, it must be highlighted that each sample has an associated weight, which was not the case for the experiments in Sections 6.1, 6.2, 6.3 and 6.4. In order to find the local maximum (mode) of the pdf, the mean shift algorithm has been used for the kernel estimators (KDT, ANC, FGT and IFGT). For the DHT approach, the procedure described in [39] has been employed. Finally, for our LBFP and SFP proposals the mode finding algorithm of Section 4.3 has been used. The parameter optimization has been carried out by taking the first 100 frames of each sequence as the training set. Then the remaining frames have been used to test the competing approaches with their corresponding optimized parameter values. Since the ground truth is available at each time step t, the tracking Mean Squared Error between the estimated mode and the true location of the object has been chosen as the tracking performance measure:

$$\mathrm{MSE}_{\mathrm{Track}} = \frac{1}{T} \sum_{t=1}^{T} \left\| \bar{\mathbf{x}}^{(t)} - \tilde{\mathbf{x}}^{(t)} \right\|^{2} \qquad (47)$$

where $\bar{\mathbf{x}}^{(t)}$ and $\tilde{\mathbf{x}}^{(t)}$ stand for the true and estimated locations of the tracked object at time step $t$, respectively, and $T$ is the number of time steps (video frames) of the test section of the video sequence.
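To make the histogram weighting step described above concrete, a minimal sketch follows. The 30 × 32 H–S histogram sizes match the text; the normalization details, the √(1 − BC) form of the Bhattacharyya distance and the regularization constant are assumptions for illustration:

```python
import numpy as np

def hs_histogram(hsv_patch, bins=(30, 32)):
    """Normalized 2D histogram of the H and S channels of an HSV patch."""
    h = hsv_patch[..., 0].ravel()
    s = hsv_patch[..., 1].ravel()
    hist, _, _ = np.histogram2d(h, s, bins=bins, range=[[0, 1], [0, 1]])
    return hist / max(hist.sum(), 1e-12)

def bhattacharyya_distance(p, q):
    """sqrt(1 - BC) distance between two normalized histograms (BC is the
    Bhattacharyya coefficient); one common variant of the distance in [7]."""
    bc = np.sum(np.sqrt(p * q))
    return np.sqrt(max(1.0 - bc, 0.0))

def sample_weight(local_hist, reference_hist, eps=1e-6):
    """Weight inversely proportional to the Bhattacharyya distance."""
    return 1.0 / (bhattacharyya_distance(local_hist, reference_hist) + eps)
```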


Fig. 13. Examples of frames for the object tracking application. From left to right, first frames of sequences A, B and D.

Table 12
Tracking Mean Squared Error (MSE_Track) results for the object tracking experiments. Best results are marked in bold.

       A       B        C       D     E      F       G      H      I       J       K      L       Rank sums
LBFP   5879.7  2730.2   1249.5  35.3  160.7  1649.3  441.9  18.5   681.1   1932.3  149.1  2302.4  40
SFP    5710.1  2740.3   1234.8  47.3  159.7  2018.4  523.6  7.0    681.5   1352.8  60.9   6191.1  45
DHT    5514.0  9562.8   1436.0  93.6  201.9  3035.8  504.1  662.1  782.8   2261.1  249.9  3337.1  62
KDT    5107.2  11396.8  1911.4  14.7  140.7  1905.2  246.1  2.7    1247.4  1889.0  579.9  7170.4  50
ANC    5106.9  11395.8  2149.2  14.7  140.7  1904.3  236.0  2.7    1236.6  1962.3  597.5  7613.1  49
FGT    4392.6  7702.4   1236.9  12.7  218.0  1728.9  527.4  3.8    661.2   2457.4  691.3  3026.2  41
IFGT   4393.6  7752.1   1237.4  12.7  218.1  1728.9  527.7  3.8    661.2   2458.2  691.2  3026.4  49

Table 13
Frames per second results for the object tracking experiments (higher is better). Best results are marked in bold.

       A      B      C      D      E      F      G      H      I      J      K      L      Rank sums
LBFP   22.26  20.83  22.02  9.75   28.00  24.63  19.45  11.87  20.70  26.10  9.62   12.55  64
SFP    23.03  22.71  23.20  22.33  28.89  25.05  17.44  13.56  21.72  12.64  12.08  11.41  71
DHT    1.46   1.50   1.45   1.45   1.84   1.74   1.40   1.62   1.45   1.80   1.32   1.27   34
KDT    3.61   6.09   2.62   1.75   4.90   3.44   2.70   2.40   2.56   2.53   1.88   2.52   48
ANC    1.21   2.19   1.00   0.82   1.94   1.39   0.96   0.87   0.99   0.95   0.65   0.88   26
FGT    35.10  24.87  35.17  17.06  24.40  34.74  29.34  19.14  28.66  36.29  17.75  17.02  81
IFGT   0.75   0.14   0.22   0.05   0.06   0.84   0.69   0.05   0.71   0.43   0.04   0.04   12

The speed of the estimation is also of paramount importance in this real-time application, so the number of frames per second (FPS), i.e. the inverse of the CPU time in seconds, has also been measured. The quantitative results in terms of MSE_Track, FPS and peak memory use are reported in Tables 12–14, respectively. The tracking accuracy results (MSE_Track) are fairly similar for all the methods, with a slight advantage for FGT, IFGT, LBFP and SFP, while DHT is the least accurate. The accuracy varies significantly across sequences, since some of them are more complex than others; the most complex scenes are those where the reference histogram picks up some features of the background, so that the methods cannot track the object properly. The memory usage is almost the same for all methods because the main overhead of the tracking system is not in the mode finding algorithm. However, the differences in time requirements are very significant. The speed of FGT, LBFP and SFP is adequate for real-time operation, but the other methods are considerably slower. It must be highlighted that IFGT is extremely slow, which rules it out for a practical application even though it attains values of MSE_Track nearly equal to those of FGT. Consequently, FGT, LBFP and SFP are the only compared methods which are well suited for this task.

Table 14
Peak memory use results (in MBytes) for the object tracking experiments. Best results are marked in bold.

       A      B      C      D      E      F      G      H      I      J      K      L       Rank sums
LBFP   546.3  605.8  365.4  809.6  252.1  371.9  709.7  385.6  876.1  374.6  992.7  1207.2  62
SFP    556.4  625.1  404.9  838.5  275.6  406.6  736.8  391.5  860.5  383.2  973.4  1222.2  76
DHT    543.7  603.6  362.7  798.0  274.4  400.3  748.9  388.1  858.6  359.3  982.9  1231.0  59
KDT    562.7  627.6  357.7  798.3  248.8  360.9  699.9  361.6  823.5  341.3  939.6  1184.3  35
ANC    568.9  602.5  361.2  803.9  244.3  364.0  701.4  358.9  822.3  334.5  948.5  1194.3  35
FGT    541.7  606.0  362.2  802.1  241.8  364.4  701.9  362.8  833.4  351.5  932.8  1170.5  34
IFGT   542.7  622.9  357.7  824.8  262.4  361.6  696.7  383.6  821.5  328.5  956.4  1187.0  35

Fig. 14. Tracking results for the last frames of some sequences: (a) sequence B, (b) sequence E, (c) sequence H, and (d) sequence K. The estimated locations of the tracked objects are shown as circles, with the color key of Fig. 9j. The ground truth (GT) is shown in black. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Some qualitative results are depicted in Fig. 14. They correspond to the last frames of some sequences. The full results can be downloaded from the Internet (http://www.lcc.uma.es/%7Eezeqlr/dfpt/dfpt.html). It can be noticed that the videos range from easy (sequence H) to hard (sequence B). Sequence H is interesting because DHT is the only method which loses track of the object, as can also be seen in Table 12. The qualitative assessment of the results agrees with the quantitative measurements, i.e. FGT, LBFP and SFP are the best performers, while DHT is the worst.

7. Conclusions

A new probability density function estimator has been proposed, which is based on the averaging of several multivariate frequency polygons. It provides a smooth and accurate approximation to the underlying density, while its computational requirements are reasonable. It overcomes the limitations of multivariate histograms, which are not continuous and whose convergence rate is slower. Experimental results with synthetic and real datasets indicate that our proposal performs better than a selection of state-of-the-art nonparametric estimators. Furthermore, the suitability of the approach for mode finding has been demonstrated on image filtering and object tracking applications.

Future research lines include the development of optimized data structures to store the sample counts of each frequency polygon. In addition to this, structured matrices could be considered for the affine transforms, in particular when the input dimension D is large. Finally, frequency polygon construction methods other than LBFP and SFP could be considered, provided that they feature efficiency and accuracy.

Acknowledgments

This work is partially supported by the Ministry of Economy and Competitiveness of Spain under grant TIN2011-24141, project name Detection of anomalous activities in video sequences by self-organizing neural systems. It is also partially supported by the Autonomous Government of Andalusia (Spain) under projects TIC-6213, project name Development of Self-Organizing Neural Networks for Information Technologies; and TIC-657, project name Self-organizing systems and robust estimators for video surveillance. All of them include funds from the European Regional Development Fund (ERDF). The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga.

Appendix A. Gradient of the density estimator

First the gradient for the LBFP version of the DFPT is computed:

$$\frac{\partial c_{r_1,\ldots,r_D}}{\partial y_q} = (-1)^{r_q+1} \prod_{s \in \{1,\ldots,D\},\, s \neq q} u_s^{r_s} \left(1 - u_s\right)^{1-r_s} \qquad (A.1)$$

$$\frac{\partial p_{LBFP}}{\partial y_q} = \frac{1}{NV} \sum_{(r_1,\ldots,r_D) \in \{0,1\}^{D}} m_{k_1+r_1,\ldots,k_D+r_D} \, \frac{\partial c_{r_1,\ldots,r_D}}{\partial y_q} \qquad (A.2)$$

$$\frac{\partial p_{LBFP}}{\partial \mathbf{x}} = \frac{\partial p_{LBFP}}{\partial \mathbf{y}} \, \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial p_{LBFP}}{\partial \mathbf{y}} \, \mathbf{A} \qquad (A.3)$$

Secondly the gradient for the SFP version is found. Let us consider the SFP density:

$$p_{SFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b}) = \frac{1}{NV} \sum_{s=0}^{D} w_s \, m_{y_{s1},\ldots,y_{sD}} = \frac{1}{NV} \left( w_0 \, m_{y_{01},\ldots,y_{0D}} + \begin{pmatrix} w_1 \\ \vdots \\ w_D \end{pmatrix}^{T} \begin{pmatrix} m_{y_{11},\ldots,y_{1D}} \\ \vdots \\ m_{y_{D1},\ldots,y_{DD}} \end{pmatrix} \right) \qquad (A.4)$$

Now we note:

$$\mathbf{U} = \begin{pmatrix} w_1 \\ \vdots \\ w_D \end{pmatrix} \qquad (A.5)$$

$$\mathbf{m} = \begin{pmatrix} m_{y_{11},\ldots,y_{1D}} \\ \vdots \\ m_{y_{D1},\ldots,y_{DD}} \end{pmatrix} \qquad (A.6)$$

So that we may rewrite (A.4) as:

$$p_{SFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b}) = \frac{1}{NV} \left( w_0 \, m_{y_{01},\ldots,y_{0D}} + \mathbf{U}^{T} \mathbf{m} \right) \qquad (A.7)$$

Now from (15) we can write:

$$p_{SFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b}) = \frac{1}{NV} \left( \left( 1 - \sum_{s=1}^{D} w_s \right) m_{y_{01},\ldots,y_{0D}} + \mathbf{U}^{T} \mathbf{m} \right) = \frac{1}{NV} \left( m_{y_{01},\ldots,y_{0D}} + \mathbf{U}^{T} \mathbf{m} - \mathbf{U}^{T} \mathbf{m}_{0} \right) = \frac{1}{NV} \left( m_{y_{01},\ldots,y_{0D}} + \mathbf{U}^{T} \left( \mathbf{m} - \mathbf{m}_{0} \right) \right) \qquad (A.8)$$

where

$$\mathbf{m}_{0} = m_{y_{01},\ldots,y_{0D}} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \qquad (A.9)$$

Next, from (14),

$$p_{SFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b}) = \frac{1}{NV} \left( m_{y_{01},\ldots,y_{0D}} + \left( \mathbf{m}^{T} - \mathbf{m}_{0}^{T} \right) \mathbf{T}^{-1} \left( \mathbf{y} - \mathbf{y}_{0} \right) \right) \qquad (A.10)$$


So the gradient of $p_{SFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b})$ with respect to $\mathbf{y}$ reads:

$$\frac{\partial p_{SFP}}{\partial \mathbf{y}} = \frac{1}{NV} \left( \mathbf{m}^{T} - \mathbf{m}_{0}^{T} \right) \mathbf{T}^{-1} \qquad (A.11)$$

From the chain rule we know that:

$$\frac{\partial p_{SFP}}{\partial \mathbf{x}} = \frac{\partial p_{SFP}}{\partial \mathbf{y}} \, \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \qquad (A.12)$$

Consequently the gradient of $p_{SFP}(\mathbf{x} \mid \mathbf{A}, \mathbf{b})$ with respect to $\mathbf{x}$ is given by:

$$\frac{\partial p_{SFP}}{\partial \mathbf{x}} = \frac{1}{NV} \left( \mathbf{m}^{T} - \mathbf{m}_{0}^{T} \right) \mathbf{T}^{-1} \mathbf{A} \qquad (A.13)$$
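As a numerical sanity check of the chain rule used in (A.3) and (A.12)–(A.13), the following sketch verifies on a toy smooth function of y = Ax + b that the row gradient with respect to x equals the gradient with respect to y multiplied by A. It is only an illustration under these assumptions; it does not reproduce the LBFP or SFP interpolation itself:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
A = rng.standard_normal((D, D))
b = rng.standard_normal(D)

def p_of_y(y):
    """Toy smooth stand-in for a density expressed in y-coordinates."""
    return np.exp(-0.5 * np.dot(y, y))

def num_grad(f, z, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = h
        g[i] = (f(z + e) - f(z - e)) / (2 * h)
    return g

x = rng.standard_normal(D)
y = A @ x + b
grad_x = num_grad(lambda x_: p_of_y(A @ x_ + b), x)   # d p / d x
grad_y = num_grad(p_of_y, y)                          # d p / d y
# Chain rule of (A.3)/(A.12): the row gradient w.r.t. x equals (d p / d y) A.
assert np.allclose(grad_x, grad_y @ A, atol=1e-6)
```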

References

[1] T. Ahooyi, M. Soroush, J. Arbogast, W. Seider, U. Oktem, Maximum-likelihood maximum-entropy constrained probability density function estimation for prediction of rare events, AIChE J. 60 (3) (2014) 1013–1026.
[2] W. Andrzejewski, A. Gramacki, J. Gramacki, Graphics processing units in acceleration of bandwidth selection for kernel density estimation, Int. J. Appl. Math. Comput. Sci. 23 (4) (2013) 869–885.
[3] A. Asuncion, D. Newman, UCI Machine Learning Repository.
[4] D. Bashtannyk, R. Hyndman, Bandwidth selection for kernel conditional density estimation, Comput. Statist. Data Anal. 36 (3) (2001) 279–298.
[5] T. Baum, I. Izhaki, E. Rivlin, G. Katzir, Active tracking and pursuit under different levels of occlusion: a two-layer approach, Mach. Vis. Appl. 25 (1) (2014) 173–184.
[6] J.L. Bentley, Multidimensional binary search trees used for associative searching, Commun. ACM 18 (9) (1975) 509–517.
[7] A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc. 35 (1943) 99–109.
[8] R. Cappelli, D. Maltoni, On the spatial distribution of fingerprint singularities, IEEE Trans. Pattern Anal. Mach. Intell. 31 (4) (2009) 742–748.
[9] Z. Chen, L. Zheng, Y. Chen, Y. Zhang, 2D hand tracking based on flocking with obstacle avoidance, Int. J. Adv. Rob. Syst. 11 (1) (2014).
[10] S. Chiu, An automatic bandwidth selector for kernel density estimation, Biometrika 79 (4) (1992) 771–782.
[11] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[12] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (2003) 564–577.
[13] F. Cordelières, V. Petit, M. Kumasaka, O. Debeir, V. Letort, S. Gallagher, L. Larue, Automated cell tracking and analysis in phase-contrast videos (iTrack4U): development of Java software based on combined mean-shift processes, PLoS ONE 8 (11) (2013).
[14] J. de la Rosa, J. Villa-Hernandez, E. De la Rosa, E. Gonzalez-Ramirez, O. Gutierrez, N. Escalante, R. Ivanov, G. Fleury, MAP entropy estimation: applications in robust image filtering, J. Eur. Opt. Soc. 8 (2013) 13047.
[15] L. Devroye, G. Lugosi, Bin width selection in multivariate histograms by the combinatorial method, Test 13 (1) (2004) 129–145.
[16] S. Fahmy, A. Mohan, Architecture for real-time nonparametric probability density function estimation, IEEE Trans. Very Large Scale Integr. VLSI Syst. 21 (5) (2013) 910–920.
[17] S. Fanello, I. Gori, G. Metta, F. Odone, Keep it simple and sparse: real-time action recognition, J. Mach. Learn. Res. 14 (2013) 2617–2640.
[18] K. Fukunaga, L.D. Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Trans. Inf. Theory 21 (1) (1975) 32–40.
[19] M. Geissbuehler, T. Lasser, How to display data by color schemes compatible with red-green color perception deficiencies, Opt. Express 21 (8) (2013) 9862–9874.
[20] P. Ghamisi, M. Couceiro, M. Fauvel, J. Benediktsson, Integration of segmentation techniques for classification of hyperspectral images, IEEE Geosci. Remote Sens. Lett. 11 (1) (2014) 342–346.
[21] A.G. Gray, A.W. Moore, N-body problems in statistical learning, Adv. Neural Inf. Process. Syst. 4 (2000) 521–527.
[22] L. Greengard, J. Strain, The fast Gauss transform, SIAM J. Sci. Statist. Comput. 12 (1991) 79–94.
[23] M. Haiman, A simple and relatively efficient triangulation of the n-cube, Discrete Comput. Geomet. 6 (1991) 287–289.
[24] M. Hanmandlu, O. Verma, S. Susan, V. Madasu, Color segmentation by fuzzy co-clustering of chrominance color features, Neurocomputing 120 (2013) 235–249.
[25] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, second ed., Springer, 2009.
[26] N.-B. Heidenreich, A. Schindler, S. Sperlich, Bandwidth selection for kernel density estimation: a review of fully automatic selectors, AStA Adv. Statist. Anal. 97 (4) (2013) 403–433.
[27] N.L. Hjort, On Frequency Polygons and Average Shifted Histograms in Higher Dimensions, Tech. Rep. 22, Stanford University, 1986.
[28] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. R. Soc. London, Ser. A, Math. Phys. Sci. 186 (1007) (1946) 453–461.
[29] M. Jones, M. Samiuddin, A. Al-Harbey, T. Maatouk, The edge frequency polygon, Biometrika 85 (1) (1998) 235–239.
[30] D.A. Klein, D. Schulz, S. Frintrop, A.B. Cremers, Adaptive real-time video-tracking for arbitrary objects, in: IEEE International Conference on Intelligent Robots and Systems (IROS), October 2010, pp. 772–777.
[31] J. Klemelä, Multivariate histograms with data-dependent partitions, Statistica Sinica 19 (1) (2009) 159–176.
[32] H. Kuhn, Some combinatorial lemmas in topology, IBM J. Res. Dev. 4 (1960) 518–524.
[33] D. Lang, M. Klaas, F. Hamze, A. Lee, N-Body Methods Code and Matlab Binaries, 2006.
[34] F. Lang, J. Yang, D. Li, L. Shi, J. Wei, Mean-shift-based speckle filtering of polarimetric SAR data, IEEE Trans. Geosci. Remote Sens. 52 (7) (2014) 4440–4454.
[35] J. Leiva-Murillo, A. Artés-Rodríguez, Algorithms for maximum-likelihood bandwidth selection in kernel density estimators, Pattern Recogn. Lett. 33 (13) (2012) 1717–1724.
[36] T. Li, S. Sun, T. Sattar, J. Corchado, Fight sample degeneracy and impoverishment in particle filters: a review of intelligent approaches, Expert Syst. Appl. 41 (8) (2014) 3944–3954.
[37] B. Liu, J. Huang, C. Kulikowski, L. Yang, Robust visual tracking using local sparse appearance model and k-selection, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2968–2981.
[38] D. Liu, M. Cong, Y. Du, C. De Silva, GMM-based visual attention for target selection of indoor robotic tasks, Ind. Robot 40 (6) (2013) 583–596.
[39] E. López-Rubio, A histogram transform for probability density function estimation, IEEE Trans. Pattern Anal. Mach. Intell. 36 (4) (2014) 644–656.
[40] E. López-Rubio, A Histogram Transform for Probability Density Function Estimation: Source Code and Demos, 2014.
[41] F. Mezzadri, How to generate random matrices from the classical compact groups, Notices of the AMS 54 (5) (2007) 592–604.
[42] R. Mittelman, E.L. Miller, Nonlinear filtering using a new proposal distribution and the improved fast Gauss transform with tighter performance bounds, IEEE Trans. Signal Process. 56 (12) (2008) 5746–5757.
[43] S. Mohammad Saberali, H. Amindavar, J. Ritcey, Blind detection in symmetric non-Gaussian noise with unknown PDF using maximum entropy method with moment generating function constraints, Signal Process. 90 (3) (2010) 891–899.
[44] A. Moore, The anchors hierarchy: using the triangle inequality to survive high-dimensional data, in: Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, AAAI Press, 2000, pp. 397–405.
[45] V.I. Morariu, V.C. Raykar, C. Yang, FIGTree Library, 2007.
[46] G. Moser, J. Zerubia, S. Serpico, SAR amplitude probability density function estimation based on a generalized Gaussian model, IEEE Trans. Image Process. 15 (6) (2006) 1429–1442.
[47] J.A. Nelder, R. Mead, A simplex method for function minimization, Comput. J. 7 (1965) 308–313.
[48] B. Prakasa Rao, Nonparametric Functional Estimation, Academic Press, 1983.
[49] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, Ann. Math. Statist. 27 (3) (1956) 832–837.
[50] J.F. Sallee, A triangulation of the n-cube, Discrete Math. 40 (1982) 81–86.
[51] J.F. Sallee, The middle-cut triangulations of the n-cube, SIAM J. Algebraic Discrete Methods 5 (1984) 407–419.
[52] D. Scott, Frequency polygons: theory and application, J. Am. Statist. Assoc. 80 (1985) 348–354.
[53] D. Scott, S. Sain, Multi-dimensional density estimation, in: Data Mining and Computational Statistics, Handbook of Statistics, vol. 24, Elsevier, Amsterdam, 2005, pp. 229–261.
[54] D.W. Scott, Averaged shifted histograms: effective nonparametric density estimators in several dimensions, Ann. Statist. 13 (3) (1985) 1024–1040.
[55] D.W. Scott, Multivariate Density Estimation, Wiley, 1992.
[56] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, 1986.
[57] T. Song, C. Gasparovic, N. Andreasen, J. Bockholt, M. Jamshidi, R. Lee, M. Huang, A hybrid tissue segmentation approach for brain MR images, Med. Biol. Eng. Comput. 44 (3) (2006) 242–249.
[58] M. Sugiyama, M. Kawanabe, P. Chui, Dimensionality reduction for density ratio estimation in high-dimensional spaces, Neural Networks 23 (1) (2010) 44–59.
[59] M. Sugiyama, M. Yamada, P. von Bünau, T. Suzuki, T. Kanamori, M. Kawanabe, Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search, Neural Networks 24 (2) (2011) 183–198.
[60] B. Wang, L. Wan, T.-D. Zhang, An underwater laser image segmentation algorithm based on pulse coupled neural network and morphology, Adv. Intell. Syst. Comput. 215 (2014) 437–449.
[61] J. Wang, Y. Yagi, Integrating color and shape-texture features for adaptive real-time object tracking, IEEE Trans. Image Process. 17 (2) (2008) 235–240.
[62] A. Weber, USC-SIPI Image Database, 2010.
[63] C. Xie, J. Tan, P. Chen, J. Zhang, L. He, Collaborative object tracking model with local sparse representation, J. Vis. Commun. Image Represent. 25 (2) (2014) 423–434.
[64] B. Xu, X. Shen, F. Ding, A VSS algorithm based on multiple features for object tracking, J. Software 8 (12) (2013) 3029–3034.
[65] C. Yang, R. Duraiswami, N.A. Gumerov, L. Davis, Improved fast Gauss transform and efficient kernel density estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2003, pp. 464–471.
[66] L.-J. Zhao, T.-Y. Chai, D.-C. Yuan, X.-K. Diao, Probabilistic partial least square based extreme learning machine to enhance reliability of operating conditions recognition, J. Zhejiang Univ. (Eng. Sci.) 47 (10) (2013) 1747–1752.
[67] Y. Zhong, B. Zhao, L. Zhang, Multiagent object-based classifier for high spatial resolution imagery, IEEE Trans. Geosci. Remote Sens. 52 (2) (2014) 841–847.
[68] H. Zhou, Y. Yuan, C. Shi, Object tracking using SIFT features and mean shift, Comput. Vis. Image Underst. 113 (3) (2009) 345–352.