Gaussian process variance reduction by location selection

Pattern Recognition Letters 125 (2019) 727–734

Lorenzo Bottarelli a,∗, Marco Loog b

a Department of Computer Science, University of Verona, Strada le Grazie 15, Verona 37134, Italy
b Pattern Recognition Laboratory, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, the Netherlands

Article history: Received 11 February 2019; Revised 15 May 2019; Accepted 14 July 2019; Available online 15 July 2019

MSC: 41A05; 41A10; 65D05; 65D17

Abstract

A key issue in Gaussian process modeling is to decide on the locations where measurements are going to be taken. A good set of observations will provide a better model. The current state of the art selects such a set so as to minimize the posterior variance of the Gaussian process by exploiting submodularity. We propose two techniques, a gradient descent procedure and a heuristic algorithm, to iteratively improve an initial set of observations so as to minimize the posterior variance directly. Results show the clear improvements that can be obtained under different settings.

Keywords: Gaussian process; Variance reduction; Gradient descent; Sampling

1. Introduction

In many analyses we are dealing with a spatial phenomenon of interest that is modeled using Gaussian processes (GPs) [9,12]. When tackling the analysis of spatial phenomena in a data-driven manner, a key issue, before setting up the actual model, is to decide on the locations where measurements are going to be taken. The better our choice of locations, the better the Gaussian process will approximate the true underlying functional relationship, or the fewer measurements we need to build a model that provides a prespecified level of performance.

Recent research in the context of optimizing sampling locations has focused on selecting a set of measurement locations so as to minimize the posterior variance of the Gaussian process [7]. This selection of measurement locations is basically performed through greedy procedures. In particular, submodularity, an intuitive diminishing-returns property, is often exploited [5,6,11]. Although submodular objective functions allow for a greedy optimization with approximation guarantees [10], these guarantees can be rather loose. Moreover, submodularity is inherently discrete, since the sensing points can only be selected from a given set of possible locations.

✩ Editor: Prof. G. Sanniti di Baja.
∗ Corresponding author. E-mail address: [email protected] (L. Bottarelli).


Though a very fine grid allows submodular techniques to optimize the sensing points almost continuously, this comes at the cost of heavy computation. The main goal of this work is to improve upon the above two points. It offers a significantly extended and fully revised version of [2], in which we propose a new technique to tackle the same problem. Specifically, compared to [2]: 1) we present a novel fast iterative heuristic to optimize sensing locations by minimizing the GP posterior variance; 2) we prove the approximation of the heuristic with respect to the real objective function, showing its connection with our previous contribution, i.e., the technique based on gradient descent; 3) we perform a more extensive empirical evaluation under different conditions by varying the hyperparameters, the number of sampling points K, the method of initialization, and the dimensionality of the dataset.

The remainder is organized as follows. Section 2 provides the required background and problem definition. Sections 3 and 4 present our GD and heuristic algorithms. Section 5 provides a detailed description of the experimental settings and presents the results. Section 6 provides a discussion and conclusions.


2. Background

2.1. Gaussian processes

Gaussian processes are a widely used tool in machine learning [9,12]. They provide a way to model an unknown function f. A GP is a flexible and non-parametric tool that defines a prior distribution over functions, which can be converted into a posterior distribution over functions once we have observed data. A GP is completely defined by its mean and kernel function (also called covariance function), which encodes the smoothness properties of the modeled function f. Throughout this work, we use the standard Gaussian kernel:



k(a, b) = \sigma_f^2 \exp\left( -\frac{(a - b)^\top (a - b)}{2 l^2} \right)    (1)

with hyperparameters l and σ_f. The former describes the smoothness property, while the latter is the standard deviation of the true underlying function. Now, let {(μ_i, y_i) | i = 1, ..., K} be a training set, i.e., a set of K measurements {y_1, y_2, ..., y_K} taken at locations K = {μ_1, μ_2, ..., μ_K}. Gaussian processes allow such observations to be modeled as noisy measurements. As such, we assume that y_i = f(μ_i) + ε, where ε ∼ N(0, σ_n²), i.e., observations with additive independent identically distributed Gaussian noise ε with variance σ_n².
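For concreteness, the kernel of Eq. (1) can be evaluated with a few lines of NumPy; the vectorized form, the default hyperparameter values, and the toy locations below are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_f=1.0, l=10.0):
    """Gaussian (squared-exponential) kernel of Eq. (1) between two point sets.

    A has shape (n, d), B has shape (m, d); an (n, m) kernel matrix is returned.
    """
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * l ** 2))

# Example: kernel matrix between three 1-D locations.
locations = np.array([[10.0], [25.0], [50.0]])
print(gaussian_kernel(locations, locations))
```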

The posterior mean and variance over f for a test point x_* can therefore be computed as follows [9,12]:

\bar{f}(x_*) = k_*^\top (K + \sigma_n^2 I)^{-1} y    (2)

\sigma^2(x_*) = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*    (3)

where k_* = [k(μ_1, x_*), ..., k(μ_K, x_*)]^⊤ and K = [k(μ_i, μ_j)]_{μ_i, μ_j ∈ K}. Thus, we can compute the GP, updating our knowledge about the unknown function f based on information acquired through observations. Finally, note that the variance computed using Eq. (3) does not depend on the observed values y_i but only on the set of locations K. This is an important property of GPs and will play a significant role throughout our contributions.
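To make Eqs. (2) and (3) concrete, the following NumPy sketch computes the posterior mean and variance for a toy 1-D problem; the helper names, hyperparameter values and data are assumptions for illustration only.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_f=1.0, l=10.0):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * l ** 2))

def gp_posterior(X_test, mu, y, sigma_f=1.0, sigma_n=0.1, l=10.0):
    """Posterior mean (Eq. 2) and variance (Eq. 3) at X_test given samples at mu with values y."""
    K = gaussian_kernel(mu, mu, sigma_f, l) + sigma_n ** 2 * np.eye(len(mu))
    k_star = gaussian_kernel(mu, X_test, sigma_f, l)                          # shape (K, |X_test|)
    mean = k_star.T @ np.linalg.solve(K, y)                                   # Eq. (2)
    var = sigma_f ** 2 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)  # Eq. (3)
    return mean, var

# Toy example: two noisy observations in a 1-D domain. Note that `var` depends
# on the sampling locations `mu` only, not on the observed values `y`.
mu = np.array([[25.0], [50.0]])
y = np.array([0.3, -0.1])
X_test = np.linspace(0.0, 100.0, 5)[:, None]
print(gp_posterior(X_test, mu, y))
```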

2.2. Problem definition: optimizing sample locations

Given a continuous domain D ⊂ R^d, we want to select a set of K points where to perform measurements in order to minimize the total posterior variance of the Gaussian process. Specifically, we want to select a set of K measurements taken at locations K = {μ_1, μ_2, ..., μ_K} ∈ D such that we minimize the following objective function:

J(\{\mu_1, \mu_2, \ldots, \mu_K\}) = \sum_{x_i \in X} \sigma^2(x_i)    (4)

where σ²(x_i) is the Gaussian process variance computed at point x_i as defined by Eq. (3) and X is a set of points where we test the Gaussian process. The variance computed with Eq. (3) at locations x_i ∈ X depends on the set of sampling points K, as explained in Section 2.1. J({μ_1, μ_2, ..., μ_K}) is a multi-dimensional function that represents the total posterior variance of the GP.
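A direct way to evaluate the objective of Eq. (4) is to sum the posterior variance of Eq. (3) over the test grid X; the sketch below does exactly that, with the grid and hyperparameters chosen arbitrarily for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_f=1.0, l=10.0):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * l ** 2))

def total_posterior_variance(S, X, sigma_f=1.0, sigma_n=0.1, l=10.0):
    """Objective J of Eq. (4): sum of the GP posterior variance (Eq. 3) over the grid X."""
    K = gaussian_kernel(S, S, sigma_f, l) + sigma_n ** 2 * np.eye(len(S))
    k_star = gaussian_kernel(S, X, sigma_f, l)
    var = sigma_f ** 2 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return var.sum()

# Two sampling points in a 1-D domain, evaluated on a 1000-point grid.
X = np.linspace(0.0, 100.0, 1000)[:, None]
print(total_posterior_variance(np.array([[25.0], [50.0]]), X))
```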

2.3. Submodular functions

A set function is a function which takes as input a set of elements. Particular classes of set functions turn out to be what is referred to as submodular [10]. A function F is submodular if and only if for all A ⊆ B ⊆ X and x ∈ X \ B it holds that:

F(A \cup \{x\}) - F(A) \geq F(B \cup \{x\}) - F(B).

This definition captures a concept known as the diminishing-returns property, which allows for a simple greedy forward-selection algorithm with an optimality bound guarantee [10]. This is widely exploited in the sensing optimization literature [4–6,11,13]. The posterior variance of a GP given a set of observations belongs to this class of submodular functions, as shown in [3]. Therefore, adding a new observation reduces the variance in x more if we have made few observations so far, and less if we have already made many observations. Although submodular objective functions allow for an efficient optimization, the solution is inherently discrete, since we can select the sensing points only within a given set of feasible locations. Moreover, as we will see in our experiments, the optimality guarantees can be rather loose and allow for significant improvements.
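For reference, a minimal sketch of the greedy forward-selection baseline discussed above: at each step it adds, from a discrete candidate set, the location that most reduces the total posterior variance of Eq. (4). The candidate grid, hyperparameters and helper names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_f=1.0, l=10.0):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * l ** 2))

def total_posterior_variance(S, X, sigma_f=1.0, sigma_n=0.1, l=10.0):
    K = gaussian_kernel(S, S, sigma_f, l) + sigma_n ** 2 * np.eye(len(S))
    k_star = gaussian_kernel(S, X, sigma_f, l)
    return (sigma_f ** 2 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)).sum()

def greedy_selection(candidates, X, n_points, **hp):
    """Greedily pick n_points locations from the discrete candidate set."""
    selected, remaining = [], list(range(len(candidates)))
    for _ in range(n_points):
        scores = [total_posterior_variance(np.array(selected + [candidates[i]]), X, **hp)
                  for i in remaining]
        best = remaining[int(np.argmin(scores))]      # candidate with the largest variance reduction
        selected.append(candidates[best])
        remaining.remove(best)
    return np.array(selected)

X = np.linspace(0.0, 100.0, 1000)[:, None]
candidates = list(np.linspace(0.0, 100.0, 21)[:, None])   # coarse candidate grid
print(greedy_selection(candidates, X, n_points=3))
```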

3. Gradient Descent technique

We propose to rely on the more straightforward approach of Gradient Descent (GD) to optimize the sensing locations in a continuous manner. Specifically, starting from an initial configuration of measurement points in the domain, we perform a GD procedure to (locally) minimize the total posterior variance of the Gaussian process. The gradient of the total posterior variance in Eq. (4) is used to iteratively re-adapt the locations of the measurement points across the domain. The submodular approach does not consider all these points simultaneously, and this is what gives our Algorithm 1 an advantage.

Algorithm 1 Gradient Descent procedure.
input: set of initial sampling locations K_0 = {μ^0_1, μ^0_2, ..., μ^0_K}, set X, convergence factor cf
1: Initialization
2: while not converged do
3:     i ← i + 1; step ← step + 1; improved ← false
4:     while not improved and not converged do
5:         K_i ← K_{i−1} − ∇J(K_{i−1}) / step
6:         if J(K_i) < J(K_{i−1}) then
7:             improved ← true
8:         else
9:             step ← step + 1; K_i ← K_{i−1}
10:        end if
11:        Check convergence using cf
12:    end while
13: end while
14: return K_i

As inputs, the procedure takes the set of initial sampling points K_0, which can be initialized using different choices. The input X represents the set of locations where we want to evaluate our GP in order to compute the posterior variance using Eq. (3). The remaining input (cf) is used to determine the convergence of the procedure (see Section 3.1). The output of the procedure is the final set K_i of sampling locations after i iterations of the algorithm. After initializing the procedure with an initial set of sampling locations K_0, the main loop (lines 2–13) iterates until convergence. It is made up of two main components: i) the GD iterative step that allows us to minimize the objective function (lines 4–12) and ii) the check of convergence (line 11).
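The sketch below mirrors Algorithm 1 on a 1-D toy problem. A finite-difference gradient stands in for ∇J of Eq. (4), and the initial locations, hyperparameters and iteration cap are assumptions used only for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_f=1.0, l=10.0):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * l ** 2))

def J(S, X, sigma_f=1.0, sigma_n=0.1, l=10.0):
    """Objective of Eq. (4): total posterior variance over the grid X."""
    K = gaussian_kernel(S, S, sigma_f, l) + sigma_n ** 2 * np.eye(len(S))
    k_star = gaussian_kernel(S, X, sigma_f, l)
    return (sigma_f ** 2 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)).sum()

def grad_J(S, X, eps=1e-4):
    """Central finite-difference approximation of the gradient of J w.r.t. the locations."""
    g = np.zeros_like(S)
    for idx in np.ndindex(S.shape):
        Sp, Sm = S.copy(), S.copy()
        Sp[idx] += eps
        Sm[idx] -= eps
        g[idx] = (J(Sp, X) - J(Sm, X)) / (2 * eps)
    return g

def gradient_descent(S0, X, cf=1e-3, max_iter=200):
    S, step = S0.astype(float), 0
    max_d = X.max() - X.min()                     # dataset diameter (1-D grid)
    for _ in range(max_iter):
        step += 1
        while True:                               # lines 4-12: shrink the displacement until J improves
            S_new = S - grad_J(S, X) / step
            if J(S_new, X) < J(S, X):
                break
            step += 1
            if step > 1e6:                        # safeguard: no improving step found
                return S
        if np.abs(S_new - S).max() < cf * max_d:  # line 11: convergence check with cf
            return S_new
        S = S_new
    return S

X = np.linspace(0.0, 100.0, 500)[:, None]
S0 = np.array([[20.0], [30.0]])                   # initial sampling locations K_0
print(gradient_descent(S0, X))
```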

3.1. GD iterative step and convergence

We want to guarantee that our algorithm improves the solution at every iteration. A simple method is to check whether the current step improves upon the current solution or not. If it does not,


we roll back to the previous solution K_{i−1} and recompute a smaller displacement (line 9). To this aim we make use of the additional variable step. This variable is used to compute the size of the displacement in line 5. The step is increased at each iteration of the algorithm at least once (in line 3) and, potentially, an additional number of times (line 9) to guarantee that at each iteration we obtain an improvement.

The constant cf in Algorithm 1 specifies the lowest percentage (with respect to the dataset diameter) of displacement that the points we are adapting should move. At the beginning of the procedure (line 1) we compute the diameter of the dataset, maxD. We then check the convergence (line 11): when all the points in K have a displacement lower than cf · maxD we consider the procedure terminated. The cf parameter acts as a trade-off between the precision of the solution and the computation (number of iterations) required to converge.

4. The heuristic

For any GD procedure some mechanism that controls the learning rate is required, and the convergence of such a procedure usually requires significant computational effort. Here we propose a different technique to optimize the objective in Eq. (4) that is inspired by Lloyd's algorithm for soft K-means [8]. The basic idea is to iteratively re-adapt a set of cluster centroids by computing soft indicator variables z_{i,j}, which represent the degree to which point j belongs to cluster i, and subsequently update the centroids of the clusters [1]. In our case, a centroid μ represents a sampling location; hence the indicator variable z_{i,j} will represent the degree to which the variance at a location x_j is reduced by sampling at position μ_i. The pseudo code of our technique is listed in Algorithm 2. Note that our heuristic has been specifically designed for the Gaussian kernel (Eq. (1)).

Algorithm 2 Pseudo code of our heuristic.
input: set of K initial sampling locations K^0 = {μ^0_1, μ^0_2, ..., μ^0_K}, convergence criteria
1: t ← 0
2: repeat
3:     for i = 1 to K do
4:         // compute GP variance given sampling locations K^t \ μ^t_i
5:         Σ ← GP | K^t \ μ^t_i
6:         // compute indicator variables
7:         for j = 1 to |X| do
8:             z_{i,j} = Σ(x_j) · (σ_f² / (σ_f² + σ_n²)) · exp(−‖x_j − μ^t_i‖² / l²)
9:         end for
10:        // update sampling location
11:        μ^{t+1}_i = (Σ_j z_{i,j} x_j) / (Σ_j z_{i,j})
12:    end for
13:    t ← t + 1
14: until convergence criteria are satisfied
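The following NumPy sketch mirrors Algorithm 2 on a 1-D toy problem; the hyperparameters, the grid, and the simple movement-threshold stopping rule are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_f=1.0, l=10.0):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * l ** 2))

def posterior_variance(S, X, sigma_f=1.0, sigma_n=0.1, l=10.0):
    """GP posterior variance (Eq. 3) on the grid X given sampling locations S."""
    if len(S) == 0:
        return np.full(len(X), sigma_f ** 2)
    K = gaussian_kernel(S, S, sigma_f, l) + sigma_n ** 2 * np.eye(len(S))
    k_star = gaussian_kernel(S, X, sigma_f, l)
    return sigma_f ** 2 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)

def heuristic(S0, X, sigma_f=1.0, sigma_n=0.1, l=10.0, tol=1e-3, max_iter=100):
    S = S0.astype(float)
    a = sigma_f ** 2 / (sigma_f ** 2 + sigma_n ** 2)
    for _ in range(max_iter):
        S_new = S.copy()
        for i in range(len(S)):
            # line 5: variance given the time-t locations with mu_i excluded
            Sigma = posterior_variance(np.delete(S, i, axis=0), X, sigma_f, sigma_n, l)
            # line 8: soft indicator variables z_{i,j}
            z = Sigma * a * np.exp(-((X - S[i]) ** 2).sum(axis=1) / l ** 2)
            # line 11: weighted-mean update of the sampling location
            S_new[i] = (z[:, None] * X).sum(axis=0) / z.sum()
        converged = np.abs(S_new - S).max() < tol * (X.max() - X.min())
        S = S_new
        if converged:
            break
    return S

X = np.linspace(0.0, 100.0, 1000)[:, None]
print(heuristic(np.array([[20.0], [30.0]]), X))
```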

During iteration t, for each sampling point i (for loop in lines 3–12), the iterative step of the heuristic is composed of three components: 1) Line 5: compute Σ, which represents the GP variance (using Eq. (3)) at every location x_j ∈ X, given K^t \ μ^t_i, the set of sampling locations at time t with μ^t_i excluded. 2) Line 8: compute the soft indicator variables z_{i,j} for each x_j ∈ X. 3) Line 11: compute the location of sampling point μ^{t+1}_i for the next iteration t + 1.

We can think of the algorithm as follows: an indicator variable z_{i,j} (computed as in line 8) represents the attractive force that a domain location x_j ∈ X exerts on sampling point μ_i. As mentioned, z_{i,j} represents the degree to which the variance at point x_j is reduced by sampling at position μ_i. We compute this force by taking into account how the variance of the Gaussian process is already affected by the presence of the other sampling locations K^t \ μ^t_i. The choice of (σ_f² / (σ_f² + σ_n²)) exp(−‖x_j − μ^t_i‖² / l²) as scalar factor for the variance Σ(x_j) at location x_j is suggested by the following theorem.

Theorem 1. Using a Gaussian kernel, in a GP with a single sampling point in a generic location μ, the variance of this GP at any location x is equivalent to the multiplication of the variance of the Gaussian process without any sampling point and 1 − f(x), where f(x) is a Gaussian function. Specifically:

\sigma^2(x) = \sigma_f^2 \left( 1 - f(x) \right)    (5)

with f(x) = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_n^2} e^{-\frac{\|x - \mu\|^2}{l^2}} and any hyperparameters σ_f, σ_n and l used in the GP with the Gaussian kernel.

Proof. Given Eq. (3) for the variance of a Gaussian process and Eq. (1) of the Gaussian kernel, for a single sampling point in a generic location μ we obtain:

\sigma^2(x) = k(x, x) - k(x, \mu) \left( k(\mu, \mu) + \sigma_n^2 \right)^{-1} k(x, \mu)
            = k(x, x) - \frac{k(x, \mu)^2}{k(\mu, \mu) + \sigma_n^2}
            = \sigma_f^2 - \frac{\left( \sigma_f^2 e^{-\frac{\|x - \mu\|^2}{2 l^2}} \right)^2}{\sigma_f^2 + \sigma_n^2}
            = \sigma_f^2 - \frac{\sigma_f^4 e^{-\frac{\|x - \mu\|^2}{l^2}}}{\sigma_f^2 + \sigma_n^2}
            = \sigma_f^2 \left( 1 - \frac{\sigma_f^2}{\sigma_f^2 + \sigma_n^2} e^{-\frac{\|x - \mu\|^2}{l^2}} \right)  □
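Theorem 1 is easy to check numerically: the sketch below compares Eq. (3) with a single sampling point against the closed form of Eq. (5), for arbitrarily chosen hyperparameters.

```python
import numpy as np

sigma_f, sigma_n, l = 1.3, 0.4, 8.0
mu = 40.0
x = np.linspace(0.0, 100.0, 1001)

# Posterior variance from Eq. (3) with a single sampling point at mu.
k_star = sigma_f ** 2 * np.exp(-(x - mu) ** 2 / (2 * l ** 2))
var_eq3 = sigma_f ** 2 - k_star ** 2 / (sigma_f ** 2 + sigma_n ** 2)

# Closed form of Theorem 1 / Eq. (5).
a = sigma_f ** 2 / (sigma_f ** 2 + sigma_n ** 2)
var_thm1 = sigma_f ** 2 * (1 - a * np.exp(-(x - mu) ** 2 / l ** 2))

print(np.max(np.abs(var_eq3 - var_thm1)))   # agrees up to floating-point rounding
```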

Now, the indicator variable z_{i,j} represents an attractive force proportional to the amount of variance reduction obtainable at location x_j by sampling at position μ_i. Consequently, the update step of the sampling locations (line 11) represents the sum of the contributions of all the attractive forces from the domain locations x_j ∈ X.

The main loop of the procedure (lines 2–14) iterates until some convergence criteria are met. Specifically, one of our conditions is that the procedure has to always improve the solution at each iteration. The second condition is based on a convergence factor cf that is used in the same way as presented for the GD procedure (Section 3.1). Specifically, this parameter is intended as a threshold to determine whether the procedure has to terminate or not: cf specifies the lowest percentage (with respect to the dataset diameter) of displacement that any of the points we are adapting can move. As described in Section 3.1, the cf parameter acts as a trade-off between the precision of the solution and the computation (number of iterations) required to converge.

4.1. Approximation performed by our heuristic

Given K sampling locations in a d-dimensional space, Eq. (4) is Kd-dimensional, where each dimension of the objective function represents the position of a sampling point in one of the dimensions of the domain. Fig. 1a shows a bi-dimensional example objective function for the problem of adapting two sampling points in a one-dimensional domain (see Fig. 1b).


Fig. 1. (a) Example of the objective function given 2 sampling points in a 1-D domain. (b) Plot of the variance of the Gaussian process with sampling points K = {μ1 = 25, μ2 = 50}. (c) Section of the objective function given μ2 = 50 and varying μ1 . (d) Section of the objective function given μ1 = 25 and varying μ2 .

Computing the derivative for a specific dimension of the objective function corresponds to computing the direction in which a single sampling point μ_i has to move along a single dimension of the domain, given the current configuration of the other sampling points K^t \ μ^t_i. For a graphical representation of this concept, let us take a look at the images in Fig. 1. The current configuration of sampling points is K = {μ_1 = 25, μ_2 = 50}. The value of the objective function (Fig. 1a) at the point J(25, 50) represents the sum of the variance (Fig. 1b) along the entire domain. Computing the derivative at J(25, 50) along the first dimension of the objective function corresponds to computing the derivative of the section of the objective function given the second dimension, i.e., μ_2 = 50 (Fig. 1c). Respectively, computing the derivative along the second dimension of the objective function corresponds to computing the derivative of the section of J given the first dimension, i.e., μ_1 = 25 (Fig. 1d).

Our heuristic iterates in a similar way: at each iteration, the point in the objective function that represents the current configuration of sampling locations K^t moves along each axis in the same direction as GD would do on an approximated objective function:

Theorem 2. Given that Σ > 0, the direction of movement of a sampling point μ is in the same direction as gradient descent would compute in the location μ on the objective function Σ ∗ (1 − (σ_f² / (σ_f² + σ_n²)) e^{−(x−μ)²/l²}). That is:

\mathrm{sign}\left( \mu - \frac{\int \Sigma(x) \frac{\sigma_f^2}{\sigma_f^2 + \sigma_n^2} e^{-\frac{(x-\mu)^2}{l^2}} x \, dx}{\int \Sigma(x) \frac{\sigma_f^2}{\sigma_f^2 + \sigma_n^2} e^{-\frac{(x-\mu)^2}{l^2}} \, dx} \right) = \mathrm{sign}\left( \frac{d}{dx} \left[ \Sigma * \left( 1 - \frac{\sigma_f^2}{\sigma_f^2 + \sigma_n^2} e^{-\frac{(x-\mu)^2}{l^2}} \right) \right] (\mu) \right)    (6)

In the theorem, the left side of the equation represents the direction of movement of a sampling point μ along an axis as computed by our heuristic (line 11). This movement corresponds to the movement that gradient descent would compute in the location μ of the function:

\left[ \Sigma * \left( 1 - \frac{\sigma_f^2}{\sigma_f^2 + \sigma_n^2} e^{-\frac{(x-\mu)^2}{l^2}} \right) \right] (\mu)    (7)

Proof. To simplify the notation, let a = σ_f² / (σ_f² + σ_n²). Then:

\mathrm{sign}\left( \frac{d}{dx} \left[ \Sigma * \left( 1 - a e^{-\frac{(x-\mu)^2}{l^2}} \right) \right] (\mu) \right)
= \mathrm{sign}\left( \left[ \Sigma * \frac{2 a x}{l^2} e^{-\frac{x^2}{l^2}} \right] (\mu) \right)
= \mathrm{sign}\left( \int \Sigma(\tau) \frac{2 a (\mu - \tau)}{l^2} e^{-\frac{(\mu - \tau)^2}{l^2}} \, d\tau \right)
= \mathrm{sign}\left( \mu \int \Sigma(\tau) a e^{-\frac{(\mu - \tau)^2}{l^2}} \, d\tau - \int \Sigma(\tau) a e^{-\frac{(\mu - \tau)^2}{l^2}} \tau \, d\tau \right)
= \mathrm{sign}\left( \mu - \frac{\int \Sigma(\tau) a e^{-\frac{(\mu - \tau)^2}{l^2}} \tau \, d\tau}{\int \Sigma(\tau) a e^{-\frac{(\mu - \tau)^2}{l^2}} \, d\tau} \right)  □

where the positive factor 2/l² and the positive denominator (since Σ > 0) do not change the sign.

Theorem 1 showed that the variance at any location x of a GP with a single sampling point in a generic location μ is equivalent to the multiplication of the variance of the GP without any sampling point and 1 − (σ_f² / (σ_f² + σ_n²)) e^{−(x−μ)²/l²}. By definition of the convolution, Eq. (7) computes the integral of the multiplication between Σ and 1 − (σ_f² / (σ_f² + σ_n²)) e^{−(x−μ)²/l²} by shifting one over the other. This represents an approximation of the section of the objective function J(μ | K \ μ), that is, the section of the objective function varying μ given the other sampling points K \ μ. Looking at Fig. 1, we are approximating the section J(μ_1 | μ_2 = 50) by computing Eq. (7), where Σ represents the variance of the GP given μ_2.
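The agreement stated in Theorem 2 can also be checked numerically on the 1-D setting of Fig. 1: with μ_2 fixed at 50, the sketch below compares the direction of the heuristic update of μ_1 with the descent direction on the approximated objective of Eq. (7). The grid and hyperparameters are assumed for illustration.

```python
import numpy as np

sigma_f, sigma_n, l = 1.0, 0.3, 10.0
a = sigma_f ** 2 / (sigma_f ** 2 + sigma_n ** 2)
x = np.linspace(0.0, 100.0, 2001)
mu2 = 50.0

# Variance of the GP given only mu2 (closed form from Theorem 1).
Sigma = sigma_f ** 2 * (1 - a * np.exp(-(x - mu2) ** 2 / l ** 2))

def approx_objective(mu):
    """Discretized Eq. (7): Sigma multiplied by 1 - a*exp(-(x - mu)^2 / l^2) and summed."""
    return np.sum(Sigma * (1 - a * np.exp(-(x - mu) ** 2 / l ** 2)))

for mu1 in [20.0, 40.0, 60.0, 80.0]:
    # Heuristic update (line 11 of Algorithm 2): weighted mean of the grid locations.
    z = Sigma * a * np.exp(-(x - mu1) ** 2 / l ** 2)
    heuristic_dir = np.sign(np.sum(z * x) / np.sum(z) - mu1)
    # Gradient-descent direction on the approximated objective (minus the slope).
    h = 1e-3
    gd_dir = np.sign(-(approx_objective(mu1 + h) - approx_objective(mu1 - h)) / (2 * h))
    print(mu1, heuristic_dir, gd_dir)           # the two signs coincide
```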

5. Empirical evaluation

We generated cubic datasets with domains from 1 up to 5 dimensions, with equally spaced domain points X on a grid. The cardinality of the grid |X| on which we evaluate the Gaussian process has been adapted to be at least 1000 points. For different GPs, we varied the hyperparameters of the Gaussian kernel considerably. Specifically, we used 20 different length-scales l and 15 different σ_f. We assume that predictions are made using noisy observations, hence we also used 10 different σ_n. As we can observe in Eqs. (1) and (3), these hyperparameters are fundamental in determining the variance of a GP. We have tested the procedures by adapting different numbers K of measurement points, varying from 2 to 7. The case of a single point has been excluded since the submodular greedy technique is optimal by definition.

Starting locations are required to initialize both our algorithms, hence in our experiments we initialized them using the submodular procedure [5,6,11] and measured the magnitude of the possible improvements. The convergence factor cf is set to 1/1000. All in all, we performed 90,000 different experiments for our algorithms to characterize and study the improvement obtainable with respect to the widely used submodular technique. Notice that the numeric value of the posterior variance of a Gaussian process is directly dependent on the size of the domain and the hyperparameter setting. As a consequence, in our results we report the percentage of improvement. In addition, we performed the same set of experiments by initializing the points randomly. This allows us to study the average improvement without the use of the submodular initialization. Finally, we selected a subset of the hyperparameters and datasets to perform a test with repeated random initializations.
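As an indication of how such grids can be built, the snippet below generates an equally spaced d-dimensional grid with at least 1000 points; the unit-cube bounds are an assumed placeholder for the actual domains.

```python
import itertools
import numpy as np

def make_grid(d, min_points=1000, low=0.0, high=1.0):
    """Equally spaced grid on [low, high]^d with at least `min_points` points."""
    per_axis = int(np.ceil(min_points ** (1.0 / d)))
    axis = np.linspace(low, high, per_axis)
    return np.array(list(itertools.product(axis, repeat=d)))

for d in range(1, 6):
    print(d, make_grid(d).shape)   # e.g. (1000, 1), (1024, 2), (1000, 3), ...
```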


Fig. 2. Results as a function of the hyperparameters: (a) and (b) refer to the gradient descent technique, whereas (c) and (d) refer to the heuristic algorithm. In all these plots the horizontal axis represents variations in the standard deviation σ_f and the vertical axis variations in the length-scale l. Colors represent the % of variance reduction of our techniques with respect to the submodular greedy solution. These results refer to the adaptation of 5 points in the 2-dimensional dataset for a fixed σ_n. Specifically, in the right images σ_n is about three times higher than in the left ones.

Table 1. Average % gain of our gradient descent and heuristic algorithms with respect to the submodular greedy solution.

        Gradient descent                                Heuristic
        K=2     K=3     K=4     K=5     K=6     K=7     K=2     K=3     K=4     K=5     K=6     K=7
1-D   32.83   18.23   17.60   17.05   14.76    8.50   24.02   12.32   14.55   14.95   14.58    6.87
2-D    4.14   16.93   19.68    9.16   13.66   14.48    1.76    5.58    7.57    4.81    4.80    5.65
3-D    1.03    2.83    8.79    8.00   10.55    8.20    0.16    0.43    1.48    0.86    0.93    1.54
4-D    0.32    0.97    1.94    5.06    3.46    4.91    0.01    0.03    0.16    0.50    0.11    0.17
5-D    0.01    0.60    1.07    1.67    3.89    2.20    0.00    0.03    0.19    0.43    1.01    0.23

Table 2. Maximum % gain of our gradient descent and heuristic algorithms with respect to the submodular greedy solution.

        Gradient descent                                Heuristic
        K=2     K=3     K=4     K=5     K=6     K=7     K=2     K=3     K=4     K=5     K=6     K=7
1-D   59.91   86.77   89.80   89.15   71.57   71.72   69.59   82.38   80.84   75.82   86.43   89.13
2-D   21.11   60.27   54.91   33.36   76.67   72.27    8.44   27.13   46.69   43.43   52.72   60.15
3-D    6.23   15.84   52.09   29.91   41.23   31.00    1.15    3.23    8.98    8.85   11.32   14.69
4-D    6.59   11.48   12.17   31.11   20.73   22.57    0.10    0.25    1.03    4.67    1.18    7.01
5-D    2.99    8.80    8.23   17.54   40.09   22.64    0.08    0.52    2.24    5.00   14.22    6.82

The experiments have been performed on the same machine, equipped with an AMD FX 6300 processor and 16 GB of RAM. This allows us to compare the performance in terms of computation time.

5.1. Results as a function of the hyperparameters

Fig. 2 gives the % of improvement that our techniques obtain compared to the submodular solution when varying the hyperparameters. Independently of σ_f and σ_n, when we use very small length-scales (top rows of the four pictures in Fig. 2), the advantages we can obtain with gradient descent and the heuristic are negligible. The reason is that with small length-scales the contribution in variance reduction given by an observation is mostly concentrated in a very narrow region. Consider that we are trying to estimate where to make two observations: as long as they are a little separated from one another, we already obtain most of the possible variance reduction. When the length-scale of the kernel becomes bigger, the reduction in variance given by a measurement point has an effect on a larger portion of the domain; hence the locations where the measurements are taken affect the total amount of posterior variance reduction. In this case we observe that the sampling locations selected by our algorithms obtain an advantage with respect to the submodular greedy technique.

When the length-scale becomes bigger, we notice that the σ_f and σ_n parameters affect the results differently. Consider for instance pictures 2a and 2c. We can observe that for small values of σ_f we obtain a small advantage, and vice versa. These results are shifted to the right when the σ_n parameter increases (pictures 2b and 2d). This shows that the ratio σ_f/σ_n affects the quality of the results: the higher the ratio, the higher the improvements we can obtain with our techniques. These observations hold both for the GD and the heuristic approaches.

Fig. 2 shows that the advantage obtainable by our techniques follows the same pattern. However, if we compare images 2a and 2b (which refer to gradient descent) with 2c and 2d (which refer to the heuristic), we notice that the improvements offered by the latter are confined to a smaller area, i.e., to a smaller portion of hyperparameter combinations. A likely explanation for this is the approximate nature of our heuristic, compared to the true GD procedure.

5.2. Varying the number of points and dimensionality

In Tables 1 and 2 we report, respectively, the average and maximum percentage gain of variance reduction that our procedures obtain with respect to the total posterior variance of the Gaussian process when the measurement locations are selected by the submodular greedy technique. Specifically, each entry of the tables reflects the improvement obtained for a specific combination of number of points and dimensionality of the domain.

Table 3. Average % gain of our gradient descent and heuristic algorithms with respect to a single random configuration.

        Gradient descent                                Heuristic
        K=2     K=3     K=4     K=5     K=6     K=7     K=2     K=3     K=4     K=5     K=6     K=7
1-D   38.81   44.97   45.57   46.62   47.14   46.59   25.43   22.65   23.23   23.90   24.44   25.18
2-D   19.69   34.98   36.39   35.80   36.96   38.59   15.83   15.31   14.15   12.77   13.22   13.79
3-D   18.32   17.95   32.27   30.13   30.87   30.73   15.57    9.83   12.52    9.79    8.23    7.99
4-D   17.16   14.61   16.92   30.32   27.37   25.94   14.85    9.42    8.00   10.71    9.21    7.71
5-D   15.90   13.43   12.92   15.92   28.02   25.11   14.27    9.27    7.05    7.16    9.65    8.74

Table 4. Maximum % gain of our gradient descent and heuristic algorithms with respect to a single random configuration.

        Gradient descent                                Heuristic
        K=2     K=3     K=4     K=5     K=6     K=7     K=2     K=3     K=4     K=5     K=6     K=7
1-D   99.38   99.33   99.59   99.78   99.74   99.59   97.54   98.15   99.19   99.80   99.78   99.95
2-D   78.28   99.09   97.36   96.85   94.41   96.48   74.91   94.43   87.57   84.49   85.30   89.82
3-D   69.95   81.10   98.44   96.63   94.12   88.85   65.43   64.66   88.37   75.29   55.94   67.26
4-D   62.85   66.11   76.19   96.71   94.23   94.37   64.93   58.35   61.02   90.83   74.40   70.95
5-D   59.92   58.77   62.33   75.27   95.58   97.10   58.81   53.68   47.70   54.30   83.79   75.78

Table 5. Maximum % gain of gradient descent starting from a random configuration with respect to gradient descent starting from the submodular greedy solution.

        Gradient descent                                Heuristic
        K=2     K=3     K=4     K=5     K=6     K=7     K=2     K=3     K=4     K=5     K=6     K=7
1-D   43.37   75.97   74.02   39.10   53.18   36.90   68.81   91.85   94.42   89.69   51.40   74.83
2-D   14.15   34.63   31.88   35.27   52.05   52.13   22.34   48.10   49.40   35.12   67.00   68.31
3-D    9.68   15.80   30.15   16.44   35.86   21.94    9.38   19.08   31.85   20.79   12.46    7.59
4-D    4.90    7.74   14.09   26.64   15.33   15.28    8.22    9.91   10.28   15.38    8.24    3.62
5-D    1.23    7.00    7.03    7.20   26.68   21.44    0.40    5.69   10.36    9.39    9.84    3.72

In more detail, Table 1 represents the average % gain of our gradient descent and heuristic algorithms with respect to the submodular greedy solution. Both the gradient descent and the heuristic procedures significantly improve the submodular solutions for small dimensionality and number of points. Table 2 instead represents the maximum % improvement that our algorithms can obtain with respect to the submodular greedy solution for a specific combination of dimensionality of the domain and number of measurement points. Also in this case, both GD and the heuristic produce better results for small dimensionality and number of points.

Generally, the results obtained with both techniques follow the same trend. If we compare the average improvements in Table 1, we notice that the advantages of the heuristic are lower than those of GD, for the reasons explained in Section 4.1. However, if we compare the maximum improvements (Table 2), we notice that in some cases (e.g., 2 points in 1-D) the heuristic achieves better results than GD, meaning that in some restricted cases the heuristic converges to a better solution even with the same convergence conditions.

5.3. Random initialization

Tables 3 and 4 report, respectively, the average and maximum improvements achieved using random initializations. These results represent the gain in terms of percentage of variance reduction with respect to the variance of the GP with the measurement points in random locations. As we expected, since a random placement of points can represent a very poor solution compared to the submodular greedy procedure, the results show much bigger improvements. A more interesting point of view is offered in

Table 5. Here we compare the total posterior variance of the Gaussian process obtained by using our techniques with a random initialization of points with the total posterior variance obtained by using them starting from the submodular greedy solution. Specifically, Table 5 shows the maximum improvements encountered among all the 3000 combinations of hyperparameters. Since the randomly generated sampling locations can vary considerably, this is reflected in the quality of the results. However, what these results show is that in some cases our algorithms are able to obtain better results with a random initialization rather than using a submodular greedy procedure to select the starting configuration of sampling points.

Tables 3–5 present results considering only a single randomly generated initialization of sampling points per instance (i.e., per combination of hyperparameters, dimensionality of the domain and number of sampling points). Since the selection of the initial configuration of sampling points can vary considerably along the domain, we also performed a more detailed test on a small subset of cases. In Fig. 3 we show results varying the hyperparameters l and σ_f and fixing a specific σ_n in two specific cases. In particular, we selected the 2-D dataset and ran our algorithms by randomly initializing two sampling locations 100 times; the same was done for the case of six points in the 3-D dataset. We observe that by performing multiple random initializations, on average we obtain a spectrum of improvements similar to what was previously shown in Fig. 2.

5.4. Computation time comparison

The main advantage offered by the heuristic with respect to GD is a faster computation time.


Table 6. Computation times in seconds of gradient descent and the heuristic (both with submodular initialization). Each entry represents the average over the 3000 hyperparameter combinations for a specific dimensionality of the domain and number of sampling locations.

        Gradient descent                                Heuristic
        K=2     K=3     K=4     K=5     K=6     K=7     K=2     K=3     K=4     K=5     K=6     K=7
1-D    1.12    1.70    1.87    2.55    2.53    3.13    0.13    0.05    0.17    0.19    0.21    0.13
2-D    2.13    2.81    6.53    4.08    7.27   10.92    0.24    0.29    0.40    0.20    0.31    0.66
3-D    2.05    2.84    4.27    7.15   11.07   12.63    0.13    0.21    0.77    0.83    0.61    0.86
4-D    3.00    6.10    7.88    9.46   18.27   17.32    0.14    0.24    0.69    0.92    0.96    0.89
5-D    3.15    7.29   14.96   20.95   20.59   25.76    0.03    0.10    0.31    0.40    0.32    0.60

Table 7. Percentage of time and percentage of improvement in variance reduction of the heuristic with respect to gradient descent.

        % of time                                       % improvement
        K=2     K=3     K=4     K=5     K=6     K=7     K=2     K=3     K=4     K=5     K=6     K=7
1-D    11.6     2.8     9.3     7.6     8.3     4.3    73.1    67.6    82.7    87.7    98.8    80.8
2-D    11.2    10.4     6.2     4.9     4.3     6.0    42.5    33.0    38.5    52.5    35.1    39.0
3-D     6.4     7.5    18.1    11.6     5.5     6.8    15.5    15.1    16.8    10.7     8.8    18.7
4-D     4.5     4.0     8.8     9.7     5.2     5.1     4.2     2.6     8.3    10.0     3.3     3.5
5-D     0.8     1.3     2.0     1.9     1.5     2.3     0.5     5.7    17.3    26.0    25.9    10.3

In Table 6 we report this information for the two procedures. As we can observe, the computation times of the heuristic algorithm constitute a fraction of those of the GD procedure. Given that the heuristic is much faster but obtains a smaller advantage with respect to GD, a more interesting comparison is offered in Table 7. As we can observe, in most cases the heuristic is able to obtain a high percentage of the improvement with a fraction of the time.

For example, if we look at the case of two points in a one-dimensional domain, we observe that with only 11.6% of the computation time it is able to obtain 73.1% of the variance improvement.

Fig. 3. Average gain over 100 randomly initialized executions of: (a) and (b) GD, (c) and (d) the heuristic. The left plots refer to the case with 2 points in the 2-D dataset and the right plots to the case with 6 points in the 3-D dataset.


Although the results reported in this section and in the previous one refer to experiments that vary considerably from one another (given the wide range of hyperparameters and datasets), we can give an even more concise description. If we average the computation times in Table 6, we see that the heuristic takes 0.40 seconds per instance whereas gradient descent takes 8.05. Now, if we average the improvement results in Table 1 for the heuristic and gradient descent, we obtain 4.19 and 8.42, respectively. We can therefore conclude by observing that the heuristic obtains 50% of the improvement that gradient descent can achieve with only 5% of the computation time, thus constituting an excellent compromise between quality of the results and computation time required.

6. Discussion and conclusions

In this paper we proposed a gradient descent procedure and an iterative heuristic algorithm to minimize the posterior variance of a Gaussian process. The algorithms have been analyzed under different settings in order to show the advantages and the generality of the approaches. Specifically, we performed our empirical evaluation by varying the hyperparameters of the kernel, the number of sampling points and the dimensionality of the domain. Results show that in many cases it is possible to obtain a significant improvement with respect to a random initialization or the well-known submodular greedy procedure. Although with a random initialization the performance can vary considerably, in some cases it is possible to obtain better solutions than with a submodular greedy initialization. A comparison between gradient descent and the heuristic shows that, although its improvements are lower, the heuristic still notably improves upon the submodular greedy approach with a fraction of the computation time. As a consequence, the heuristic should be able to scale better to applications that require large instances. In general, however, we showed that with both techniques the variance reduction can be widely improved if we can optimize the sensing locations in a domain of interest that is continuous rather than discrete.

Conflict of interest

None.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.patrec.2019.07.013.

References

[1] C. Bauckhage, Lecture Notes on Data Science: Soft K-means Clustering.
[2] L. Bottarelli, M. Loog, Gradient descent for Gaussian processes variance reduction, in: X. Bai, E.R. Hancock, T.K. Ho, R.C. Wilson, B. Biggio, A. Robles-Kelly (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, Springer International Publishing, Cham, 2018, pp. 160–169.
[3] A. Das, D. Kempe, Algorithms for subset selection in linear regression, in: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, ACM, 2008, pp. 45–54.
[4] A. Krause, D. Golovin, Submodular Function Maximization, 2014.
[5] A. Krause, C. Guestrin, Near-optimal observation selection using submodular functions, in: National Conference on Artificial Intelligence (AAAI), Nectar Track, 2007.
[6] A. Krause, C. Guestrin, A. Gupta, J. Kleinberg, Robust sensor placements at informative and communication-efficient locations, ACM Trans. Sens. Netw. 7 (4) (2011) 31:1–31:33.
[7] A. Krause, H.B. McMahan, C. Guestrin, A. Gupta, Robust submodular observation selection, J. Mach. Learn. Res. 9 (Dec) (2008) 2761–2801.
[8] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137.
[9] K.P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
[10] G.L. Nemhauser, L.A. Wolsey, M.L. Fisher, An analysis of approximations for maximizing submodular set functions—I, Math. Program. 14 (1) (1978) 265–294.
[11] T. Powers, J. Bilmes, D.W. Krout, L. Atlas, Constrained robust submodular sensor selection with applications to multistatic sonar arrays, in: 2016 19th International Conference on Information Fusion (FUSION), 2016, pp. 2179–2185.
[12] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, USA, 2006.
[13] V. Tzoumas, Resilient Submodular Maximization for Control and Sensing, Ph.D. thesis, University of Pennsylvania, 2018.