Exact methods for variable selection in principal component analysis: Guide functions and pre-selection

Computational Statistics and Data Analysis 57 (2013) 95–111
journal homepage: www.elsevier.com/locate/csda

Joaquín Pacheco ∗, Silvia Casado, Santiago Porras
Departamento de Economía Aplicada, Universidad de Burgos, Spain

Article history: Received 16 November 2011; received in revised form 5 June 2012; accepted 13 June 2012; available online 18 June 2012.

Keywords: PCA; Variable selection; Branch & Bound methods; Guide functions; Filters

Abstract

A variable selection problem is analysed for use in Principal Component Analysis (PCA). In this case, the set of original variables is divided into disjoint groups. The problem resides in the selection of variables, with the restriction that the selected set should contain at least one variable from each group. The objective function under consideration is the sum of the first eigenvalues of the correlation matrix of the subset of selected variables. This problem, with no known prior references, has two further difficulties in addition to that of the variable selection problem itself: the evaluation of the objective function and the restriction that the subset of selected variables should contain elements from all of the groups. Two Branch & Bound methods are proposed to obtain exact solutions; they incorporate two strategies: the first is the use of ''fast'' guide functions as alternatives to the objective function; the second is the pre-selection of variables that help to comply with the latter restriction. The computational tests show that both strategies are very efficient and achieve significant reductions in calculation times.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Data sets with large numbers of variables are processed in many disciplines, such as economics, sociology, engineering, medicine and biology, among others. The researcher has to process a great deal of data classified by a large number of variables, which are often difficult to summarize or interpret. One useful approach involves the reduction of data dimensionality, while trying to preserve as much of the original information as possible. A common way of doing this is through Principal Component Analysis (PCA).

PCA is widely applied in data mining to investigate data structures. Its purpose is to construct components, each of which contains a maximal amount of the variation in the data that is unexplained by the other components. Standard results guarantee that retaining the k Principal Components (PCs) with the largest associated variance produces the k-subset of linear combinations of the n original variables which, according to various criteria, represents the best approximation of the original variables (see, for example, Jolliffe, 2002). The user therefore expects that the information in the data can be summarized into a few principal components. Once the principal components have been determined, all further analysis can be carried out on them instead of on the original data, as they summarize the relevant information.



Correspondence to: Fac C. Económicas y Empresariales, Plaza Infanta Elena s/n, BURGOS 09001, Spain. Tel.: +34 947 25 90 21; fax: +34 947 25 80 13. E-mail addresses: [email protected] (J. Pacheco), [email protected] (S. Casado), [email protected] (S. Porras).

0167-9473/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2012.06.014


Thus, PCA is frequently considered the first step of a statistical data analysis that aims at data compression: decreasing the dimensionality of the data while losing as little information as possible. However, while PCA is highly effective at reduced-dimensional representation, it does not provide a real reduction of dimensionality in terms of the original variables, since all n original variables are required to define a single Principal Component (PC). As McCabe (1984) has stated, ''interpretation of the results and possible subsequent data collection and analysis still involve all of the variables''. Moreover, Cadima and Jolliffe (1995, 2001) have shown that a PC can provide a misleading measure of variable importance, in terms of preserving information, because it is based on the assumption that the resultant of a linear combination (PC) is dominated by the vectors (variables) with large magnitude coefficients in that linear combination (high PC loadings). This assumption ignores the influence of the magnitude of each vector (the standard deviation of each variable) and the relative positions of the vectors (the pattern of correlations between the variables).

One way of achieving a simple interpretation is to reduce the number of variables, that is, to look for a subset of the n variables that approximates, as far as possible, the k retained PCs. We consider the combinatorial problem of identifying, for an arbitrary integer p (k ≤ p ≤ n), a p-variable subset which is optimal with respect to a given criterion. In most cases, the inclusion of all the variables in a statistical analysis is, at best, unnecessary and, at worst, a serious impediment to the correct interpretation of the data.

From a computational point of view, variable selection is a Nondeterministic Polynomial Time-Hard (NP-Hard) problem (Kohavi, 1995; Cotta et al., 2004): no known efficient algorithm guarantees finding the optimum solution, which means that, when the size of the problem is large, finding an optimum solution is, in practice, unfeasible. Two different methodological approaches have been developed for variable selection problems: optimal or exact (enumerative) techniques, which are able to guarantee an optimal solution but are only applicable to small-sized sets; and heuristic techniques, which are able to find good solutions (although unable to guarantee the optimum) within a reasonable amount of time.

The problem of selecting a subset of variables from a larger candidate pool arises in areas such as multiple linear regression (Furnival and Wilson, 1974; Miller, 2002; Gatu and Kontoghiorghes, 2006; Hofmann et al., 2007; Gatu et al., 2007), logistic regression (Pacheco et al., 2009), polynomial regression (Peixoto, 1987; Brusco et al., 2009b; Brusco and Steinley, 2010), factor analysis (Kano and Harada, 2000; Hogarty et al., 2004), cluster analysis (Brusco and Cradit, 2001; Steinley and Brusco, 2008; Krzanowski and Hand, 2009), and discriminant analysis (McCabe, 1975; McKay and Campbell, 1982a,b; Pacheco et al., 2006). Specifically, the problem of variable selection in PCA has been investigated by Jolliffe (1972, 1973), Robert and Escoufier (1976), McCabe (1984), Bonifas et al. (1984), Krzanowski (1987a), Falguerolles and Jmel (1993), Mori et al. (1994), Jolliffe (2002), Duarte Silva (2002), Cadima et al. (2004), Mori et al. (2007) and Brusco et al. (2009a), among others.
These studies sought to obtain PCs based on a subset of variables, in such a way that they retain as much information as possible in comparison with PCs based on all the variables. To address this problem, it is necessary to tackle two secondary problems: (1) the establishment of an objective criterion that can measure the quality or fitness of every subset of variables; and (2) the development of solution procedures for finding optimal, or at least near-optimal, subsets based on these criteria.

The methods proposed by Jolliffe (1972, 1973) consider PC loadings, and those of McCabe (1984) and Falguerolles and Jmel (1993) use a partial covariance matrix to select a subset of variables which, as far as possible, maintains information on all variables. Robert and Escoufier (1976) and Bonifas et al. (1984) used the RV-coefficient, and Krzanowski (1987a,b) used Procrustes analysis to evaluate the closeness between the configuration of PCs computed from the selected variables and the configuration based on all of the variables. Tanaka and Mori (1997) discussed a method called ''modified PCA'' (MPCA) to derive PCs which are computed using only a selected subset of variables but which represent all of the variables, including those that were not selected. Since MPCA naturally includes variable selection procedures in its analysis, its criteria can be used directly to detect a reasonable subset of variables. Further criteria may be considered based, for example, on the influence analysis of variables and on predictive residuals, using the concepts reported in Tanaka and Mori (1997) and in Krzanowski (1987a,b), respectively; for more details see Iizuka et al. (2003).

Thus, the existence of several methods and criteria is one of the typical features of variable selection in multivariate methods without external variables, such as PCA (here the term ''external variable'' denotes a variable to be predicted or explained using the information derived from other variables). Moreover, the existing methods and criteria often provide different results (selected subsets of variables), which is a further typical feature. This occurs because each criterion leads to its own reasonable variable selection method; we cannot say, therefore, that one is better than another. These features are not observed in multivariate methods with external variable(s), such as multiple regression analysis (Mori et al., 2007). In practical applications of variable selection, it is desirable to have a computational environment in which those who want to select variables can easily apply a method suited to their own selection purposes and/or can try various methods and choose the best one by comparing the results.

In many studies, the initial variables are divided into previously defined groups. In these cases it is required, or at least recommended, to use variables from all of the groups considered. This happens, for example, in the construction of composite indicators, which are used in several areas (economy, society, quality of life, nature, technology, etc.) as measures of the evolution of regions or countries in those areas.


The composite indicators should try to cover all points of view of the analysed phenomenon (which may be identified with the different groups of variables) and should therefore contain at least one variable from each group (or satisfy other similar conditions), so that they encompass all of these points of view. The importance of composite indicators is explained in Nardo et al. (2005a,b) and Bandura (2008), among other references. The convenience of using variables from all the groups considered is explicitly indicated at least in Nardo et al. (2005a), Ramajo-Hernández and Márquez-Paniagua (2001) and López-García and Castro-Núñez (2004). In several of the examples mentioned in these references and links (see http://composite-indicators.jrc.ec.europa.eu/articles_books.htm) there are previously defined groups of variables, and every group participates in the final composite indicator. For example, Tangian (2007) builds a composite indicator to measure working conditions; in that work 10 groups of variables are considered (physical environment, health, time factors, etc.). In Chan et al. (2005) a composite indicator is built as an analytic tool to examine the quality of life in Hong Kong; the index is released annually and consists of 20 variables grouped into three groups: social (10 variables), economic (7) and environmental (3). In Nardo et al. (2005a,b) a Technology Achievement Index is proposed, with the following groups: creation of technology (2 variables), diffusion of recent innovations (2), diffusion of old innovations (2) and human skills (2). In López-García and Castro-Núñez (2004) an indicator of regional economic activity in Spain is built, with the following groups: agriculture (2 variables), construction (1), industry (4), merchandise services (9) and non-merchandise services (2). Further examples can be found in the literature; some of them can be taken from Bandura (2008), an annual survey of about 170 international composite indicators.

Several restrictions can be considered in order to ensure the ''participation'' of every group in the final solution. In this work we consider the simplest one: to take at least one variable from each group. This constraint was considered in Calzada et al. (2011). Nevertheless, other harder and/or more sophisticated restrictions can be imposed, depending on different questions (the initial size of every group, the opinion of experts, etc.); in any case, this simplest constraint suffices to illustrate the difficulty of these kinds of problems.

A variable selection problem is studied in this work for subsequent use in PCA, with the restriction that the set of selected variables contains at least one variable from each of the previously defined groups. The objective function under consideration is the percentage of the total variation explained by the first k Principal Components of the selected variables, that is, the sum of the first k eigenvalues divided by the sum of all the eigenvalues. As Cadima and Jolliffe (2001) note, the first few principal components of a set of variables are derived variables with optimal properties in terms of approximating the original variables. Although other choices are possible (see Jolliffe, 2002, Section 10.1, and Jolliffe, 1989), the principal components of interest are conventionally those with the largest variance.
How many of these to keep can be determined in a variety of ways (Jolliffe, 2002, Section 6.1, and Richman, 1992, review many of the methods that have been suggested). This paper deals with the k principal components simultaneously because, if we retain k PCs, we are interested in interpreting the space spanned by those PCs rather than the individual PCs considered separately. The restriction that there should be at least one variable from each group in the selected set ensures that all viewpoints are considered. In addition, it helps to avoid the selection of variables that are highly correlated with each other, since the variables of the same group are usually more highly correlated with each other than with the variables of the other groups. There are no references in the literature for this specific problem.

One difficulty of this problem is the following: the objective function under consideration (the percentage of the total variation explained by the first k Principal Components) requires a large number of calculations every time it is evaluated (the computation and summation of eigenvalues). If an exploration is performed in which many solutions are analysed, the total computation time of the search might be excessive, due to the evaluation time of the objective function at each solution. This difficulty can be overcome by designing methods in which the search is not guided directly by the objective function, but by alternative functions that are related to it and require fewer calculations. The other main difficulty comes from the restriction that selected subsets should have at least one element from each group. The process of constructing solutions (in the context of both exact and heuristic methods) should take this restriction into account, as otherwise the constructed solutions might not comply with it and might, therefore, be rejected. The inclusion of a strategy which, as far as possible, prevents the generation of unfeasible solutions, and in consequence the performance of unnecessary calculations, therefore appears necessary.

In this paper, two Branch and Bound-based methods are proposed. Two alternative guide functions that differ from the objective function will be used in these exact methods, specifically in the branch process. In addition, a pre-selection strategy, which favours the generation of feasible solutions, will also be used in the branch process. As will be confirmed later on, the alternative guide functions and the pre-selection reduce the total computing time considerably, thereby achieving faster and more efficient explorations.

The remainder of this work is organized as follows. Section 2 is devoted to the problem definition and to some theoretical results that will be used later on. Section 3 describes the proposed guide functions. Section 4 describes the two Branch and Bound methods. The computational experiences are reported in Section 5; they analyse, above all, the effect of the use of the guide functions and of the pre-selection strategy. Finally, Section 6 presents the conclusions of the study.


2. Definition of the problem

2.1. Prior definitions

Consider a data matrix X corresponding to m cases characterized by n variables. We shall label the set of variables V = {1, 2, . . . , n} (the variables are identified by their indices for the sake of simplicity). Let x_ij be the value of variable j in case i, i = 1, . . . , m, j = 1, . . . , n, and let x_j = (x_1j, x_2j, . . . , x_mj)′ be the column vector with the values of variable j, j = 1, . . . , n, so that X = (x_ij)_{i=1..m; j=1..n} = (x_1 x_2 · · · x_n).

Consider the correlation matrix Σ = (ρ_ij)_{i,j=1..n}; in other words, ρ_ij is the correlation between variables i and j. For any subset of variables S ⊂ V, the correlation submatrix of the variables of S is denoted by Σ^(S). Given a small fixed positive integer k (for example, k = 2, 3 or 4), define

f(S) = the sum of the largest k eigenvalues of the matrix Σ^(S).

Logically, |S| ≥ k is required; if |S| = k, then obviously f(S) = k (the trace of a k × k correlation matrix).
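Since f(S) drives everything that follows, a minimal computational sketch may be helpful. It assumes numpy; the helper name f_S is illustrative, not from the paper.

```python
import numpy as np

def f_S(corr, S, k):
    """Sum of the k largest eigenvalues of the correlation submatrix of S.

    corr: n x n correlation matrix; S: iterable of selected variable indices.
    """
    idx = sorted(S)
    sub = corr[np.ix_(idx, idx)]       # correlation submatrix Sigma^(S)
    eig = np.linalg.eigvalsh(sub)      # eigenvalues in ascending order
    return float(eig[-k:].sum())       # sum of the k largest
```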

2.2. Formulation of the problem

Consider a partition of the set of variables V into q groups, that is, V = ∪_{r=1}^{q} G_r, where G_r denotes each of the groups of variables into which V is divided. Let p ∈ N with p ≥ k, p ≥ q and p ≤ n. The problem may then be defined as

Maximize f(S)  (1)

subject to

|S| = p,  (2)
S ∩ G_r ≠ ∅, r = 1..q,  (3)
S ⊂ V.  (4)

The optimum solution and the optimum value of the problem defined by (1)–(4) are denoted by S*_p and g(p) respectively, that is,

g(p) = f(S*_p) = max{f(S) : |S| = p, S ⊂ V, S ∩ G_r ≠ ∅ ∀r = 1..q}.
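For very small instances, the problem can be solved by direct enumeration, which also makes the constraints concrete. The following sketch assumes the f_S helper above; all names are illustrative.

```python
from itertools import combinations

def solve_by_enumeration(corr, groups, p, k):
    """Brute-force solution of (1)-(4); groups is a list of sets of indices."""
    n = corr.shape[0]
    best_S, best_val = None, float("-inf")
    for S in combinations(range(n), p):          # constraints (2) and (4)
        chosen = set(S)
        if any(chosen.isdisjoint(G) for G in groups):
            continue                             # violates constraint (3)
        val = f_S(corr, S, k)                    # objective (1)
        if val > best_val:
            best_S, best_val = S, val
    return best_S, best_val                      # S*_p and g(p)
```

The number of candidate subsets grows as C(n, p), which is exactly why the exact methods of Section 4 rely on pruning rather than enumeration.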

2.3. Observations and theoretical results

Let Σ be the correlation matrix defined in Section 2.1 (or, in general, any correlation matrix) and let λ_1 ≥ λ_2 ≥ · · · ≥ λ_k be its largest k eigenvalues in decreasing order. It is well known that

λ_r = u′_r · Σ · u_r, ∀r = 1..k,

where {u_1, u_2, . . . , u_k} is the set of orthonormal eigenvectors respectively associated with λ_1, λ_2, . . . , λ_k.

Theorem 1. Let Σ be the correlation matrix defined in Section 2.1 and λ_1 ≥ λ_2 ≥ · · · ≥ λ_k its largest k eigenvalues in decreasing order. Then

$$\sum_{r=1}^{k} \lambda_r \;=\; \max\left\{ \sum_{r=1}^{k} v_r' \,\Sigma\, v_r \;:\; \|v_r\| = 1,\; v_s'\, v_r = 0,\; r, s = 1..k,\; r \neq s \right\}.$$

To prove this result, it may be confirmed that the above set of orthonormal eigenvectors {u_1, u_2, . . . , u_k} satisfies the Lagrange optimality conditions. Two corollaries arise as a consequence of this theorem. The proofs of these corollaries are given in the Appendix, and their results will be used in the design of the Branch and Bound algorithms described in Section 4.
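Theorem 1 is easy to check numerically: any orthonormal k-frame gives a trace form bounded by the top-k eigenvalue sum, which the top-k eigenvectors attain. A small sketch, assuming numpy (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 6))
Sigma = np.corrcoef(data, rowvar=False)          # a 6 x 6 correlation matrix
k = 2

eigval, eigvec = np.linalg.eigh(Sigma)           # ascending eigenvalues
top_k_sum = eigval[-k:].sum()

V, _ = np.linalg.qr(rng.standard_normal((6, k))) # random orthonormal k-frame
assert np.trace(V.T @ Sigma @ V) <= top_k_sum + 1e-12
assert np.isclose(np.trace(eigvec[:, -k:].T @ Sigma @ eigvec[:, -k:]), top_k_sum)
```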


3. Search for values and guide functions

The problem defined by (1)–(4) involves an important difficulty, due to the computation time needed to calculate the function f(S) for any S ⊂ V. The methods that determine the largest k eigenvalues of a positive-definite symmetric matrix of order p and, as a result, their sum, are iterative methods based on the QR decomposition (Francis, 1961–1962; Kublanovskaya, 1961). These methods usually take several iterations to converge, and each iteration requires θ(p³) operations. If the size of the problem (the value of n) is not small, the number of solutions to evaluate will be large, and the required computation time can therefore be high.

Moreover, when a set of variables S changes, even slightly (a new variable is incorporated, an existing one is eliminated, or both simultaneously), the eigenvalues of the corresponding correlation matrix change, and there is no way (to the best of the authors' knowledge) of calculating the eigenvalues of the new matrix from the eigenvalues of the former matrix that would reduce the calculation time. In other words, although the set of variables S′ may be obtained by a small modification of S, the calculation of f(S′) has to be done by applying the methods mentioned earlier, and the information relating to S and f(S) cannot be reused for a faster calculation of f(S′). However, such information can be used as a guide or approximation in some cases.

More specifically, for the sake of simplicity, let us assume that S = {1, . . . , p}, and let Σ^(S) be its correlation matrix, λ_1 ≥ λ_2 ≥ · · · ≥ λ_p its eigenvalues ordered from largest to smallest, and u_1, u_2, . . . , u_p the corresponding eigenvectors, verifying ∥u_i∥ = 1 and u′_i · u_j = 0 for all i, j ∈ {1, . . . , p}, i ≠ j. Also define:

– U = (u_1 u_2 · · · u_p) ∈ R^{p×p};
– x*_i = the column vector of the standardized variable i, i = 1..p;
– X*^(p) = the data matrix corresponding to the standardized variables of S, i.e. X*^(p) = (x*_1 x*_2 · · · x*_p);
– z_i = X*^(p) · u_i, i.e. the column vector of the values of the i-th component.

Consider now the matrix

$$U^* = \begin{pmatrix} U & \mathbf{0} \\ \mathbf{0}' & 1 \end{pmatrix} \in \mathbb{R}^{(p+1)\times(p+1)},$$

where 0 = (0, 0, . . . , 0)′ ∈ R^p, and consider any t ∈ V − S (in this case, any t ∈ {p + 1, . . . , n}). It may easily be verified that (z_1 z_2 · · · z_p x*_t) = (x*_1 x*_2 · · · x*_p x*_t) · U*. In other words, the set of columns (z_1 z_2 · · · z_p x*_t) may be obtained by an orthonormal transformation from the set of columns (x*_1 x*_2 · · · x*_p x*_t), and the variance–covariance matrices of both sets of columns therefore have the same eigenvalues. We refer to these two matrices as C and D respectively. Thus:

$$D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 & \sigma_1 \\ 0 & \lambda_2 & \cdots & 0 & \sigma_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \lambda_p & \sigma_p \\ \sigma_1 & \sigma_2 & \cdots & \sigma_p & 1 \end{pmatrix},$$

in which σ_i is the covariance between the columns z_i and x*_t, i = 1, . . . , p. Both matrices have the same eigenvalues and, although the second has a simpler appearance, the calculation of its eigenvalues still requires (to the best of our knowledge) the use of the aforementioned methods. Now, if we consider the standardized columns z*_i = z_i/√λ_i, i = 1, . . . , p, then the variance–covariance matrix E of the columns (z*_1 z*_2 · · · z*_p x*_t) is as follows:

$$E = \begin{pmatrix} 1 & 0 & \cdots & 0 & \sigma_1^* \\ 0 & 1 & \cdots & 0 & \sigma_2^* \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & \sigma_p^* \\ \sigma_1^* & \sigma_2^* & \cdots & \sigma_p^* & 1 \end{pmatrix},$$

in which σ*_i is the covariance between the columns z*_i and x*_t, i = 1, . . . , p (and also their correlation, as they are standardized variables). The eigenvalues λ*_i of the matrix E may easily be calculated:

$$\lambda_1^* = 1 + \sqrt{\sigma_1^{*2} + \sigma_2^{*2} + \cdots + \sigma_p^{*2}}, \qquad \lambda_i^* = 1,\; i = 2, \ldots, p, \qquad \lambda_{p+1}^* = 1 - \sqrt{\sigma_1^{*2} + \sigma_2^{*2} + \cdots + \sigma_p^{*2}}.$$

These eigenvalues λ*_i are obviously not the same as those of D (and therefore of C), as the columns z_i have been replaced by the standardized columns z*_i; however, they give an idea of the contribution of the column x*_t to the sum of the eigenvalues of C. As the aim is to maximize the sum of the largest eigenvalues of C, it could be thought that this can be achieved by maximizing the largest values λ*_i and, more specifically, the radicand that appears in λ*_1. We first calculate σ*_i, i = 1, . . . , p, in order to evaluate this expression:

$$\sigma_i^* = \operatorname{cov}(x_t^*, z_i^*) = \frac{1}{\sqrt{\lambda_i}} \sum_{j=1}^{p} u_{ij}\, \rho_{jt},$$

in which u_{ij} denotes the components of the vector u_i. Thus

$$\sum_{i=1}^{p} \sigma_i^{*2} = \sum_{i=1}^{p} \frac{1}{\lambda_i} \left( \sum_{j=1}^{p} u_{ij}\, \rho_{jt} \right)^{2}. \qquad (5)$$

The value of (5), the radicand that appears in λ*_1, requires a calculation time of θ(p²) once the eigenvectors u_i have been obtained. The sum of the squared covariances of the column x*_t with the columns z*_i can give us a relative measure of the contribution of variable t to the objective function when it is added to S (in comparison with the other variables that may be added). In the same way, a further measure of that contribution is given by the sum of the squared covariances of the column x*_t with the original (that is, non-standardized) columns z_i. If σ_i is the covariance between x*_t and z_i, then this sum takes the following form:

$$\sum_{i=1}^{p} \sigma_i^{2} = \sum_{i=1}^{p} \left( \sum_{j=1}^{p} u_{ij}\, \rho_{jt} \right)^{2} = (\rho_{1t} \cdots \rho_{pt}) \left( \sum_{i=1}^{p} u_i\, u_i' \right) (\rho_{1t} \cdots \rho_{pt})' = (\rho_{1t} \cdots \rho_{pt})\, I_{p}\, (\rho_{1t} \cdots \rho_{pt})',$$

where I_p is the identity matrix of order p. The last equality is a consequence of the orthonormality of the columns u_i. Thus:

$$\sum_{i=1}^{p} \sigma_i^{2} = \sum_{i=1}^{p} \rho_{it}^{2}. \qquad (6)$$

This expression requires a calculation time of θ(p) and, in addition, requires no prior calculation of the eigenvectors u_i.

Let S be any set of variables, S ⊂ V; for each t ∈ V − S define h0(S, t) = f(S ∪ {t}), that is, the value of the objective function when t is added to S. Similarly, we shall denote expressions (5) and (6) by h1(S, t) and h2(S, t) respectively. In Section 5, we shall examine the efficacy of both functions as ''guides'' for finding the variables t ∈ V − S with the largest values of h0(S, t).

4. Description of the methodology

4.1. First Branch and Bound method

Corollaries 1 and 2 (in the Appendix) allow the design of an exact Branch and Bound (BnB) method, similar to those found in the literature for several variable selection problems (Brusco et al., 2009a,b; Brusco and Steinley, 2011). This method finds the optimal solution S*_p and the value g(p) for all values of p (with p ≥ k, p ≥ q, p ≤ n), when the value of n is moderate.

The BnB algorithm performs a recursive analysis of the set of solutions by means of a search tree. Each node of the tree corresponds to a set of solutions. When a node J is explored, it is determined whether the set of associated solutions can improve some of the values g(p) found up to that moment. If no solution associated with node J can improve any value of g, the exploration of that node ends; otherwise, the associated set is divided into two subsets, associated with two nodes K and L that emerge from node J, and nodes K and L are subsequently explored. Initially, the origin node, the one that corresponds to all the solutions, is explored.

More specifically, the solutions associated with each node J are determined by two subsets A, B ⊂ V such that A ∩ B = ∅. The set of solutions of J consists of all subsets S ⊂ V that contain A (A ⊂ S) and contain no element of B (B ∩ S = ∅). (Put another way, the elements of A are ''fixed'' and those of B ''forbidden''.) The division of node J into nodes K and L entails the determination of an element a ∈ V − A − B. Subsequently, the sets A′ = A ∪ {a}, B′ = B, A′′ = A and B′′ = B ∪ {a} are defined, and nodes K and L are respectively associated with the solutions determined by A′ and B′ (K) and by A′′ and B′′ (L). Fig. 1 illustrates this recursive division.


Fig. 1. Branch process in the first Branch and Bound method.

Let A and B be two subsets A, B ⊂ V such that A ∩ B = ∅. The exploration of the node associated with them (the procedure Exploration_node) proceeds as sketched below.
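A compact, self-contained sketch of this exploration, reconstructed from the description above: the incumbent bookkeeping and helper names are assumptions of this sketch, and the comments (7) and (8) mark the bound test and the branching choice referred to in the remarks below.

```python
import numpy as np

def bnb(corr, groups, k):
    """Sketch of the first BnB scheme: A = fixed variables, B = forbidden."""
    n = corr.shape[0]
    V = frozenset(range(n))
    pmin = max(k, len(groups))
    g = {p: float("-inf") for p in range(pmin, n + 1)}   # incumbents g(p)

    def f(S):
        idx = sorted(S)
        return float(np.linalg.eigvalsh(corr[np.ix_(idx, idx)])[-k:].sum())

    def feasible(S):
        return all(S & G for G in groups)                # one element per group

    def explore(A, B):                                   # Exploration_node
        if len(A) >= pmin and feasible(A):
            g[len(A)] = max(g[len(A)], f(A))             # record this solution
        free = V - A - B
        if not free:
            return
        # (7) bound test: every S in this node satisfies A ⊆ S ⊆ V − B, so by
        # Corollaries 1 and 2, if f(V − B) ≤ g(|A|) nothing here improves g.
        if len(A) >= pmin and f(V - B) <= g[len(A)]:
            return
        a = max(free, key=lambda v: f(A | {v}))          # (8) guide: f, h1 or h2
        explore(A | {a}, B)                              # node K: a is fixed
        explore(A, B | {a})                              # node L: a is forbidden

    explore(frozenset(), frozenset())
    return g
```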

Thus, the BnB method itself is very simple: it reduces to exploring the origin node, i.e. calling Exploration_node with A = ∅ and B = ∅.
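For illustration, a toy run of the sketch above (the instance is invented):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.standard_normal((500, 9))
corr = np.corrcoef(sample, rowvar=False)
groups = [set(range(0, 3)), set(range(3, 6)), set(range(6, 9))]

g = bnb(corr, groups, k=2)       # g(p) for every feasible p
print({p: round(v, 3) for p, v in g.items()})
```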

Finally, the following points should be made:

– The restriction tested in line (7) of the Exploration_node procedure ensures that no solution in the node can improve the values of g, so the exploration of that node can be ended. This is based on Corollaries 1 and 2 of the Appendix: any solution S in the node verifies A ⊂ S ⊂ V − B and, therefore, if the condition in line (7) is fulfilled, then f(S) ≤ f(V − B) ≤ g(|A|) ≤ g(|S|), so S will not improve g(|S|).
– The way the variable a is chosen in line (8) obviously does not change the final result, but it can be important for the total calculation time. In our case, the a ∈ V − A − B whose addition to A most increases the objective function is chosen; in other words, a is chosen such that f(A ∪ {a}) = max{f(A ∪ {v}) : v ∈ V − A − B}. As commented earlier, the calculation of f can consume considerable time; it is therefore proposed, as an alternative, to choose the a ∈ V − A − B that maximizes either h1(A, a) or h2(A, a).


– Furthermore, the restriction of having one element of each group in the solutions obtained can be taken into account when choosing the element a to enter. Regardless of the guide function in use (f, h1 or h2), unnecessary explorations may be avoided if preference is given to the variables of those groups that have no element in A. More formally, a subset of variables Pre_Sel ⊂ V − A − B may be formed, and a chosen within this subset. The subset Pre_Sel (pre-selected variables) is defined as follows: if A ∩ G_r ≠ ∅ ∀r ∈ {1..q}, then Pre_Sel = V − A − B; otherwise, Pre_Sel = {v ∈ (V − A − B) ∩ G_r : r ∈ {1..q}, A ∩ G_r = ∅}. The effect of using this alternative way of selecting a will be analysed in the computational tests.

4.2. Second Branch and Bound method

A second Branch and Bound method (BnB2) will now be explained. The difference from the first one is that two or more branches can emerge from each node of the search tree. The idea of this new method is to obtain the greatest possible advantage from the information already available, so that unnecessary divisions (those leading to solutions that do not improve the current ones) are identified as soon as possible in the exploration of each node. As before, each node is determined by two subsets A, B ⊂ V such that A ∩ B = ∅ and, as in the BnB method, the solutions associated with the node are all the subsets S ⊂ V that contain A (A ⊂ S) and contain no element of B (B ∩ S = ∅). The exploration of each node in this case may be sketched as follows.
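A sketch of this node exploration follows, under stated assumptions: corr, groups, g, k and V are as in the BnB sketch above (hoisted to module scope); the greedy selection mirrors the procedure described next, while the exact branch composition (fix a_i, forbid a_1, . . . , a_{i−1}) is our reading of Fig. 2, not a verbatim transcription; incumbent updates of g are omitted for brevity.

```python
import numpy as np

def f(S):
    idx = sorted(S)
    return float(np.linalg.eigvalsh(corr[np.ix_(idx, idx)])[-k:].sum())

def sum_sq_corr(v, L):
    return sum(corr[v, b] ** 2 for b in L if b != v)

def explore_bnb2(A: frozenset, B: frozenset):
    L, removed = V - B, []
    # (10)-(11): greedily remove the element with the largest sum of squared
    # correlations with the rest of L (a cheap proxy for the largest drop in
    # f(L)), until f(L) <= g(|A|) or L no longer meets every group.
    while (L - A) and f(L) > g.get(len(A), float("-inf")) \
            and all(L & G for G in groups):
        a = max(L - A, key=lambda v: sum_sq_corr(v, L))
        removed.append(a)
        L = L - {a}
    # One branch per a_i: a_i fixed, a_1..a_{i-1} forbidden; solutions that
    # avoid every a_i lie inside the final L and are pruned by construction.
    for i, a in enumerate(removed):
        explore_bnb2(A | {a}, B | frozenset(removed[:i]))
```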

The size t of the subset {a_1, a_2, . . . , a_t} is not fixed. For each node defined by A and B, the way in which this subset is determined in line (10) is described below.


Fig. 2. Branch process in the second Branch and Bound method.

In this procedure, the values a_r selected are those that cause the largest decrease in the value of f(L_{r−1}) when the corresponding element is eliminated from L_{r−1} (in other words, from V − B − {a_1, a_2, . . . , a_{r−1}}). In order to determine these elements a_r, as the calculation time of the corresponding values of f can be high, the sum of their squared correlations with the other elements of L_{r−1} is used as the guide function. Initially, the element a_1 ∈ V − B − A with the largest value of the sum of squared correlations is selected; subsequently a_2, a_3, and so on, are successively determined in the same recursive way. The process continues until a set L_t is reached such that either g(|A|) ≥ f(L_t) or L_t no longer contains elements from every group. Note that, in the first iteration, the sum of the squared correlations of each element v ∈ L_0 − A with the other elements of L_0, which appears in line (11), requires θ(|L_0 − A| − 1) operations. However, this sum may be updated very easily in the subsequent iterations, in the following way:

$$\sum_{\substack{b \in L_{r-1} \\ b \neq v}} \operatorname{corr}^2(v, b) \;=\; \sum_{\substack{b \in L_{r-2} \\ b \neq v}} \operatorname{corr}^2(v, b) \;-\; \operatorname{corr}^2(v, a_{r-1}).$$
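A minimal sketch of this incremental update (illustrative names, assuming the corr matrix of the earlier sketches):

```python
def init_sums(corr, L):
    """Sum of squared correlations of each v in L with the rest of L."""
    return {v: sum(corr[v, b] ** 2 for b in L if b != v) for v in L}

def remove_and_update(sums, corr, a):
    """Drop element a from L: one O(1) subtraction per remaining element."""
    sums.pop(a)
    for v in sums:
        sums[v] -= corr[v, a] ** 2
```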

As in the BnB method, in order to avoid unnecessary explorations, preference may be given to the choice of groups of variables that have no element in A. Thus, instead of considering L_{r−1} − A, line (11) considers the elements v ∈ Pre_Sel ⊂ L_{r−1} − A, where the subset Pre_Sel is defined in the following way: if A ∩ G_r ≠ ∅ ∀r ∈ {1..q}, then Pre_Sel = L_{t−1} − A; otherwise, Pre_Sel = {v ∈ (L_{t−1} − A) ∩ G_r : r ∈ {1..q}, (A ∪ {a_1, a_2, . . . , a_{t−1}}) ∩ G_r ≠ ∅}. Fig. 2 illustrates this recursive division. Finally, as in the earlier case, the BnB2 method itself is very simple.

5. Computational experiences

5.1. Design of correlation matrix samples

A series of correlation matrix samples was generated for the different computational tests. The process of generating these matrices, similar to that used in Brusco et al. (2009a,b), consists of designing population correlation matrices L; from each population correlation matrix L, a set of vectors following the normal distribution with correlation matrix L is generated and, finally, the corresponding sample correlation matrix Σ is obtained.


The method of generating vectors from a multivariate normal distribution of order n, with zero means and correlation matrix L, that is, N(0, L), is as follows. A lower triangular matrix T of order n is determined such that T · T′ = L; subsequently, vectors z with distribution N(0, I_n) are generated, and y = T · z is calculated. The vectors y calculated in this way follow the distribution N(0, L). There are several ways of obtaining the lower triangular matrix T such that T · T′ = L; in our case, the square root method was used, which can be found in the texts by Naylor (1977) and Rubinstein (1981). Different methods to generate the values of a normal distribution N(0, 1) may also be found in these texts.

The population correlation matrices L are designed according to a simple pattern: the correlation between two different variables takes one of two values, a high value if they belong to the same group and a low one otherwise. The population matrices therefore depend on the following parameters (a generation sketch is given after the list):

– n: the number of variables, and therefore the order of the matrix;
– nblocks: the number of groups;
– sizeblock: the size of each group (to simplify, groups of equal size are assumed, so sizeblock = n/nblocks);
– wcor: the correlation between variables of the same group (assumed to be higher than the correlation between variables of different groups);
– lcor: the correlation between variables of different groups.
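A minimal sketch of this generation pipeline, assuming numpy; the function names are illustrative, and numpy's Cholesky factor plays the role of the lower triangular matrix T:

```python
import numpy as np

def population_matrix(nblocks, sizeblock, wcor, lcor):
    """Block-patterned population correlation matrix L."""
    n = nblocks * sizeblock
    L = np.full((n, n), lcor)
    for b in range(nblocks):                  # within-group correlations
        s = slice(b * sizeblock, (b + 1) * sizeblock)
        L[s, s] = wcor
    np.fill_diagonal(L, 1.0)
    return L

def sample_correlation(L, m=1000, seed=0):
    """Sample correlation matrix Sigma from m draws of N(0, L)."""
    T = np.linalg.cholesky(L)                 # T @ T.T == L
    z = np.random.default_rng(seed).standard_normal((m, L.shape[0]))
    y = z @ T.T                               # rows follow N(0, L)
    return np.corrcoef(y, rowvar=False)
```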

A population matrix L is shown below with the following values: n = 12, nblocks = 4, sizeblock = 3, wcor = 0.7 and lcor = 0.2.

1 0.7 0.7  0.2  0.2  0.2 0.2  0.2  0.2  0.2  0.2 0.2

0.7 1 0.7 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

07 0.7 1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

0.2 0.2 0.2 1 0.7 0.7 0.2 0.2 0.2 0.2 0.2 0.2

0.2 0.2 0.2 0.7 1 0.7 0.2 0.2 0.2 0.2 0.2 0.2

0.2 0.2 0.2 0.7 0.7 1 0.2 0.2 0.2 0.2 0.2 0.2

0.2 0.2 0.2 0.2 0.2 0.2 1 0.7 0.7 0.2 0.2 0.2

0.2 0.2 0.2 0.2 0.2 0.2 0.7 1 0.7 0.2 0.2 0.2

0.2 0.2 0.2 0.2 0.2 0.2 0.7 0.7 1 0.2 0.2 0.2

0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 1 0.7 0.7

0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.7 1 0.7

0.2 0.2 0.2  0.2  0.2  0.2 . 0.2 0.2  0.2  0.7  0.7 1



As may be seen, in order to simplify the design, the variables of the same group are consecutive, starting with variable 1: the first block is formed by variables 1, 2 and 3, the second by variables 4, 5 and 6, and so on. Finally, to remain consistent with the previous definitions, m is the number of vectors or cases generated following the distribution N(0, L), from which Σ is subsequently obtained. A value of m = 1000 was used.

5.2. Use of the guide functions h1 and h2 to add elements to solutions

In this section, a set of computational tests is described to check the efficiency and efficacy of the two functions h1 and h2 when they are used, instead of the objective function f, as guides to determine the best elements to add to a solution S. More specifically, suppose that we have a solution S and we wish to add an element v ∈ V − S that increases the value of f(S ∪ {v}). The value of f(S ∪ {v}) may be calculated for each v ∈ V − S although, as commented earlier, this may require a lengthy calculation time. The alternative mentioned in Section 3 applies the functions h1 and h2, which use less calculation time. In this sub-section the quality of these approximations is tested.

A set of tests was therefore conducted; each test considers a set V with its corresponding correlation matrix, a subset S ⊂ V obtained at random, and a preset value of k. The following were calculated in each test (a sketch is given after the list):

– fmax = max{f(S ∪ {v}) : v ∈ V − S};
– fmin = min{f(S ∪ {v}) : v ∈ V − S};
– v1 = arg max{h1(S, v) : v ∈ V − S};
– v2 = arg max{h2(S, v) : v ∈ V − S};
– r1 = (f(S ∪ {v1}) − fmin)/(fmax − fmin);
– r2 = (f(S ∪ {v2}) − fmin)/(fmax − fmin).
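A sketch of the two guide functions and of the quality indicators r1 and r2, assuming numpy and the f_S helper of Section 2 (names illustrative):

```python
import numpy as np

def guides(corr, S, t):
    """h1(S, t) per expression (5) and h2(S, t) per expression (6)."""
    idx = sorted(S)
    lam, U = np.linalg.eigh(corr[np.ix_(idx, idx)])  # eigenpairs of Sigma^(S)
    rho = corr[idx, t]                               # correlations of S with t
    proj = U.T @ rho                                 # u_i' . rho for each i
    h1 = float(np.sum(proj ** 2 / lam))              # (5): needs eigenvectors
    h2 = float(np.sum(rho ** 2))                     # (6): theta(p), no eigenvectors
    return h1, h2

def quality(corr, S, k):
    """r1 and r2 for a given partial solution S."""
    cand = [v for v in range(corr.shape[0]) if v not in S]
    fval = {v: f_S(corr, set(S) | {v}, k) for v in cand}
    fmax, fmin = max(fval.values()), min(fval.values())
    v1 = max(cand, key=lambda v: guides(corr, S, v)[0])
    v2 = max(cand, key=lambda v: guides(corr, S, v)[1])
    return ((fval[v1] - fmin) / (fmax - fmin),
            (fval[v2] - fmin) / (fmax - fmin))
```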

The values r1 and r2 can be taken as quality indicators of the candidate variables suggested by h1 and h2 for inclusion in S. Moreover, v3 ∈ V − S was chosen at random and r3 = (f(S ∪ {v3}) − fmin)/(fmax − fmin) was calculated. These tests were divided into sets according to the values of n (= |V|), p (= |S|) and k (∈ {2, 3}). In all, 20 combinations of these values, in other words 20 sets, were considered, and 100 tests were conducted for each one. The sets are named by the values of the parameters; thus, 20-6-2 denotes the set with n = 20, p = 6 and k = 2. The 20 sets under consideration are shown in Table 1.


Table 1. Division into sets.

 #  Set        #  Set
 1  20-6-2    11  50-18-2
 2  20-6-3    12  50-18-3
 3  20-9-2    13  100-12-2
 4  20-9-3    14  100-12-3
 5  25-8-2    15  100-18-2
 6  25-8-3    16  100-18-3
 7  25-12-2   17  100-24-2
 8  25-12-3   18  100-24-3
 9  50-12-2   19  100-30-2
10  50-12-3   20  100-30-3

Table 2. Summary of results.¹

Set        m1       m2       m3       Y1    Y2    Y3   T1        T2        T0
20-6-2     0.99101  0.98185  0.39025  999   998   203  0.000061  0.000000  0.000500
20-6-3     0.98737  0.97902  0.48697  1000  1000  487  0.000047  0.000000  0.000501
20-9-2     0.95115  0.93721  0.35196  969   956   204  0.000140  0.000000  0.000747
20-9-3     0.98828  0.97478  0.45421  997   997   284  0.000094  0.000000  0.000748
25-8-2     0.97395  0.96831  0.32228  989   995   142  0.000047  0.000015  0.000935
25-8-3     0.99472  0.98838  0.46189  1000  1000  382  0.000096  0.000000  0.000907
25-12-2    0.93317  0.90631  0.36185  948   915   220  0.000157  0.000000  0.001580
25-12-3    0.95742  0.93010  0.36159  993   982   243  0.000247  0.000000  0.001699
50-12-2    0.95063  0.95421  0.22190  963   982   72   0.000300  0.000000  0.004160
50-12-3    0.98021  0.98152  0.28572  988   999   120  0.000219  0.000000  0.004070
50-18-2    0.91520  0.89546  0.27501  910   881   115  0.000530  0.000000  0.007427
50-18-3    0.93772  0.92604  0.25350  981   978   125  0.000486  0.000015  0.006760
100-12-2   0.96283  0.94783  0.19646  965   982   77   0.000345  0.000000  0.009166
100-12-3   0.98690  0.96361  0.23799  998   995   143  0.000267  0.000016  0.009298
100-18-2   0.94968  0.92309  0.18365  958   941   42   0.000688  0.000015  0.017865
100-18-3   0.97772  0.95779  0.21339  986   988   68   0.000537  0.000000  0.018336
100-24-2   0.92245  0.85352  0.21521  916   790   62   0.001341  0.000000  0.030685
100-24-3   0.95992  0.89764  0.20109  991   925   66   0.001079  0.000016  0.029221
100-30-2   0.87862  0.80894  0.25126  819   667   61   0.001872  0.000015  0.042016
100-30-3   0.93784  0.82939  0.20992  968   779   72   0.002096  0.000015  0.044597

Note that four values of n were considered: 20, 25, 50 and 100. A single correlation matrix was considered for each value of n; thus, four sample correlation matrices were generated from four population matrices, as explained in Section 5.1, with sizes n = 20, 25, 50 and 100. The values of the other parameters of the population matrices are: nblocks = n/5, sizeblock = 5, wcor = 0.7 and lcor = 0.2. A summary of the test results by sets is shown in Table 2. Specifically:

– the mi columns indicate the average of the values ri, i = 1..3;
– the Yi columns indicate the number of tests in which ri obtained a value higher than 0.75, i = 1..3;
– the Ti columns indicate the average time, in seconds, used to obtain vi in each test, i = 1..3;
– the T0 column indicates the average time used to obtain fmax.

The following conclusions may be drawn from the results shown in Table 2:

– Guide functions h1 and h2 yield results that are very close to those that would be obtained using the objective function, but with a much lower computation time.
– In the case of h1, the corresponding values of m1 are above 0.90 in all cases except the set 100-30-2, where the value drops to 0.878. In 14 sets 0.95 is exceeded, and there are even two sets in which 0.99 is exceeded.

¹ Values shown as 0.000000 in the T columns of Table 2 indicate times lower than 10⁻⁶ s.

Table 3. Average calculation times (s) and explored nodes for each BnB variant.

                Computation time                     Number of explored nodes
n     Variant 0   Variant 1   Variant 2     Variant 0     Variant 1     Variant 2
12    0.2495      0.291       0.183         3662.00       3748.75       3744.50
15    2.636       2.936       1.840         21491.50      20922.50      20989.75
18    23.687      25.707      15.510        126838.25     127766.25     127576.00
21    206.474     207.667     121.340       938034.75     947198.25     952928.25
24    1682.827    1644.170    936.152       5892405.50    5987401.75    5985164.50

If we observe column Y1, it is clear that in all the sets except 100-30-2 the number of tests where r1 is greater than 0.75 is well above 900. Finally, the computation time used in the calculation of v1 (T1) is very much lower than that used to calculate fmax (T0).
– With regard to the use of h2, the values of m2 are also above 0.90 in 15 of the 20 sets; in the other 5 sets they are above 0.80, and 4 of these correspond to the larger values of p. Similarly, the values of column Y2 are above 900 in 16 of the 20 cases. Finally, the calculation time used by h2 to obtain v2 is very much lower than that used to calculate fmax; in 13 of the 20 sets the average time T2 is less than 10⁻⁶ s.
– In general, the use of h1 obtains better results than h2; nevertheless, the calculation time of h2 is clearly lower. Moreover, in both cases a certain deterioration of the results is evident for high values of p (p = 24, 30), although they are still good.
– From the values of m3 and Y3 it is clear that randomly selecting elements of V − S gives results that are much worse than those obtained with h1 or h2; in fact, the values of m3 are below 0.5. The distribution of the values of f(S ∪ {v}) therefore does not by itself favour the production of high values; rather, the contrary appears to be true. In consequence, the high values obtained when h1 or h2 are used must be attributed to the quality of these functions as guides.

In Section 5.3, the effect of the use of these guide functions in the different Branch and Bound algorithms is analysed.

5.3. Analysis of the first Branch and Bound method with different guide functions

In this sub-section and in Sections 5.4 and 5.5, different variants of both Branch and Bound methods are described and analysed; a table summarizing the features of every variant is given at the end of Section 5.5. As commented in Section 4.1, the way of selecting the element a ∈ V − A − B in line (8) of the Exploration_node procedure obviously does not change the final result of the BnB method, but it may affect its computation time. In this sub-section, three variant ways of selecting this element a are compared:

– a = arg max{f(A ∪ {v}) : v ∈ V − A − B} (Variant 0);
– a = arg max{h1(A, v) : v ∈ V − A − B} (Variant 1);
– a = arg max{h2(A, v) : v ∈ V − A − B} (Variant 2).

In all, 20 correlation matrix samples were generated from 20 population matrices obtained by combining the following parameter values: n = 12, 15, 18, 21, 24; nblocks = n/3; sizeblock = 3; wcor = 0.5, 0.7; and lcor = 0.1, 0.2. Thus, five different sizes n were considered, with four matrices per size. The three variants of the BnB algorithm were executed for each matrix and each value of k (k = 2, 3). Each variant obviously obtains the optimum S*_p and the values g(p) for all p ∈ {k + 1, . . . , n}; the interest here is in the difference in the calculation times used by each variant. Table 3 shows the average times of each variant for each value of n, together with the average number of explored nodes (in other words, the number of times the procedure Exploration_node is executed).

The following may be observed in Table 3:

– Variant 2 uses less computation time than variants 0 and 1. This reduction becomes more noticeable as n increases and is close to 45% for n = 24. Variants 0 and 1 use similar computation times; variant 1 is perhaps slightly faster for the largest matrices (n = 24). The significance of these differences is analysed below.
– The three variants present a very similar evolution of the number of explorations. If we compare variants 0 and 2, we observe that the latter needs a slightly higher number of explorations (except for n = 15). Thus, the explorations of variant 0 appear to be slightly more effective but, taking the computation time into account, the explorations of variant 2 are more efficient.

Table 4 shows the results of several t-tests of differences of means (Means) in calculation time for each pair of variants under consideration and for each value of n. It can be seen from Table 4 that variant 1 only uses less calculation time than variant 0 in the larger-sized problems, and this difference is not significant (tail probability of 0.475). Variant 2 has significantly shorter calculation times than the other two variants for all values of n; the tail probabilities are all less than or equal to 0.05, except in two cases (var. 0–var. 1 with n = 21 and n = 24).


Table 4. t-tests of the differences in mean calculation times shown in Table 3.

n     Pair            Means
12    Var. 0–Var. 1    −0.041125
      Var. 0–Var. 2     0.066125
      Var. 1–Var. 2     0.107250
15    Var. 0–Var. 1    −0.300000
      Var. 0–Var. 2     0.795750
      Var. 1–Var. 2     1.095750
18    Var. 0–Var. 1    −2.020125
      Var. 0–Var. 2     8.176375
      Var. 1–Var. 2    10.196500
21    Var. 0–Var. 1    −1.193125
      Var. 0–Var. 2    85.133375
      Var. 1–Var. 2    86.326500
24    Var. 0–Var. 1    38.656500
      Var. 0–Var. 2   746.674625
      Var. 1–Var. 2   708.018125

Table 5. Average calculation times (s) for each variant of BnB.

n     Variant 0   Variant 1   Variant 2   Variant 3   Variant 4   Variant 5
12    0.2495      0.291       0.183       0.213       0.251       0.152
15    2.636       2.936       1.840       2.065       2.280       1.433
18    23.687      25.707      15.510      17.710      18.451      11.059
21    206.474     207.667     121.340     145.495     140.413     85.938
24    1682.827    1644.170    936.152     1068.740    1009.731    589.427

5.4. Analysis of the preference for groups without elements in A

As commented in Section 4.1, in addition to the choice of suitable guide functions, the efficiency of the Branch and Bound procedure may be improved if, in each node exploration, preference is given to the choice of elements from those groups that have no element in A, the intention being to avoid unnecessary explorations. The following three new variants, i.e. ways of selecting a in line (8), are proposed to analyse the efficiency of this strategy:

– a = arg max{f(A ∪ {v}) : v ∈ Pre_Sel} (Variant 3);
– a = arg max{h1(A, v) : v ∈ Pre_Sel} (Variant 4);
– a = arg max{h2(A, v) : v ∈ Pre_Sel} (Variant 5);

where Pre_Sel was defined in Section 4.1. Note that these variants correspond to variants 0, 1 and 2 respectively, with the set V − B − A replaced by its subset Pre_Sel. Tests were conducted to determine the computation time used by these three new variants, with the same values of k as before. Table 5 shows the average results for each value of n; the average results of variants 0, 1 and 2 are included for a better comparison.

As can be seen in Table 5, the variants that use the subset Pre_Sel reduce the computation time with respect to the corresponding variants that do not use this pre-selection: variant 3 uses less computation time than variant 0, variant 4 less than variant 1, and variant 5 less than variant 2 (a reduction of more than 30% of the calculation time for the largest matrices). Moreover, comparing variants 3, 4 and 5 with each other leads to conclusions similar to those obtained for variants 0, 1 and 2: the computation times of variants 3 and 4 are similar, although for the larger problems variant 4 is slightly quicker, and variant 5 uses less computation time than both of them.

Tables 6 and 7 show the results of several t-tests of means applied to the calculation times of different pairs of variants; the objective of these tests is to check whether the differences in calculation time are significant. Specifically, Table 6 contrasts the differences in calculation time between variants 0 and 3, variants 1 and 4, and variants 2 and 5; in other words, the intention is to test the pre-selection effect. As can be seen from Table 6, regardless of the guide function in use, the use of pre-selection significantly improves the calculation times: all tail probabilities are less than or equal to 0.001.

Finally, Table 7 contrasts the differences in calculation times of variants 3, 4 and 5 taken in pairs; the idea is to compare the effect of the different guide functions when the pre-selection of elements is applied. As can be seen from Table 7, the results are similar to those shown in Table 4. Variant 4 uses less computation time than variant 3 only for the larger problems (n = 21 and 24); this difference is significant for n = 24. Moreover, variant 5 uses significantly less computation time than the other two for all values of n; the tail probabilities are all less than or equal to 0.05, except in two cases (var. 3–var. 4 with n = 18 and n = 21).

In short, the use of suitable guide functions (especially h2) in the exploration of each node, combined with a simple variable pre-selection strategy (which avoids unnecessary explorations), considerably reduces the computation time.

Table 6. t-tests of mean calculation times highlighting the pre-selection effect.

n     Pair            Means
12    Var. 0–Var. 3     0.036625
      Var. 1–Var. 4     0.039250
      Var. 2–Var. 5     0.031500
15    Var. 0–Var. 3     0.571625
      Var. 1–Var. 4     0.656750
      Var. 2–Var. 5     0.407500
18    Var. 0–Var. 3     5.976750
      Var. 1–Var. 4     7.255875
      Var. 2–Var. 5     4.451625
21    Var. 0–Var. 3    60.978625
      Var. 1–Var. 4    67.253500
      Var. 2–Var. 5    35.402000
24    Var. 0–Var. 3   614.086250
      Var. 1–Var. 4   634.438625
      Var. 2–Var. 5   346.725250

Table 7. t-tests of mean calculation times between variants 3, 4 and 5.

n     Pair            Means
12    Var. 3–Var. 4    −0.038500
      Var. 3–Var. 5     0.061000
      Var. 4–Var. 5     0.099500
15    Var. 3–Var. 4    −0.214875
      Var. 3–Var. 5     0.631625
      Var. 4–Var. 5     0.846500
18    Var. 3–Var. 4    −0.741000
      Var. 3–Var. 5     6.651250
      Var. 4–Var. 5     7.392250
21    Var. 3–Var. 4     5.081750
      Var. 3–Var. 5    59.556750
      Var. 4–Var. 5    54.475000
24    Var. 3–Var. 4    59.008875
      Var. 3–Var. 5   479.313625
      Var. 4–Var. 5   420.304750

In the case of the larger problems, if variant 0 (average time 1682.827 s) is compared with variant 5 (589.427 s), the reduction is around 66% of the calculation time.

5.5. Comparison of the two Branch and Bound methods (BnB and BnB2)

In this sub-section, the functioning of the second Branch and Bound method (BnB2), described in Section 4.2, is analysed. A first set of tests examines the effect of using variable pre-selection during the selection of the set of variables {a_1, a_2, . . . , a_t} in the exploration of each node. Two variants of the method are considered, according to the set of variables from which the element a_t in line (11) is selected:

– Variant 0: v ∈ L_{r−1} − A;
– Variant 1: v ∈ Pre_Sel ⊂ L_{r−1} − A;

where Pre_Sel is defined in Section 4.2. The results obtained for both variants, with the same matrices used in Section 5.4, are shown below. Specifically, Table 8 shows the average results (time and number of explorations); the results obtained by variant 5 of the BnB method are included for a better comparison of the two Branch and Bound methods.

It may be noted from Table 8 that variant 1 of BnB2 conducts fewer explorations and uses less computation time than variant 0; the use of pre-selection in BnB2 is therefore also advantageous. However, variant 1 of the BnB2 method does not outperform the computation time of the best variant of the BnB method (variant 5), although it does involve a smaller number of explorations.

Table 9 shows the results of several means t-tests, for each value of n, on the above calculation times. Specifically, the calculation times of variants 0 and 1 of BnB2 (BnB2.v0 and BnB2.v1) are contrasted, as are the calculation times of variant 1 of BnB2 and variant 5 of BnB (BnB.v5); Table 9 shows the differences of means (Means). The results of Table 9 show that the differences between the calculation times are clearly significant and, moreover, become more significant as n grows; all tail probabilities are less than or equal to 0.02.


Table 8. Comparison of the two Branch and Bound methods.

          Computation time                    Number of explored nodes
n     BnB2 v.0   BnB2 v.1    BnB v.5     BnB2 v.0       BnB2 v.1      BnB v.5
12    0.199      0.16575     0.152       3757.38        2482.00       4715.75
15    2.131      1.59525     1.433       28139.50       16387.13      29181.50
18    20.409     13.90925    11.059      205088.38      104086.13     167619.25
21    188.393    115.76575   85.938      1480388.38     651873.75     988215.25
24    1673.336   894.73213   589.427     10458303.00    3862981.88    5344688.75

Table 9. t-tests of the differences in mean calculation times shown in Table 8.

n     Pair                Means
12    BnB2.v0–BnB2.v1      0.033125
      BnB2.v1–BnB.v5       0.013875
15    BnB2.v0–BnB2.v1      0.536125
      BnB2.v1–BnB.v5       0.162000
18    BnB2.v0–BnB2.v1      6.499625
      BnB2.v1–BnB.v5       2.850625
21    BnB2.v0–BnB2.v1     72.627625
      BnB2.v1–BnB.v5      29.827250
24    BnB2.v0–BnB2.v1    778.604125
      BnB2.v1–BnB.v5     305.305375

Table 10. Summary of the features of all variants.

Method                           Variant    Guide function   Filters (pre-selection)
First Branch and Bound (BnB)     BnB.v0     f                No
                                 BnB.v1     h1               No
                                 BnB.v2     h2               No
                                 BnB.v3     f                Yes
                                 BnB.v4     h1               Yes
                                 BnB.v5     h2               Yes
Second Branch and Bound (BnB2)   BnB2.v0    h2               No
                                 BnB2.v1    h2               Yes

Table 10 summarizes the features of all the variants analysed in this section. Finally, we must point out that similar computational experiments were performed with different correlation matrix structures, and similar results were obtained.

6. Conclusions

A variable selection problem for use in Principal Component Analysis (PCA) has been analysed in this work. In this case, the set of original variables is divided (partitioned) into disjoint groups, each group corresponding to a variable type. The problem consists in the selection of variables, with the restriction that the selected set should have at least one variable of each type or group. As previously explained, this problem has a wide scope of application, above all in the design of composite indicators in sociological and economic studies, among others: the different groups of variables correspond to different viewpoints or aspects of the problem under analysis, and these synthetic indicators should encompass all of these aspects and therefore contain variables from all of the groups. The objective function under consideration is the sum of the first k (k = 2, 3) eigenvalues of the correlation matrix of the subset of selected variables. The problem expressed in this way has no earlier known references. In addition to the complexity associated with the selection of variables, the problem has two added difficulties: the evaluation of the objective function and the aforementioned restriction that the subset of selected variables should contain elements from all of the groups.

In this work, two Branch and Bound methods have been proposed to obtain exact solutions. Two strategies were proposed in order to overcome, or at least alleviate, the aforementioned difficulties: the first is the use of fast ''guide'' functions, as alternatives to the objective function, for evaluation in the division processes; the second is ''filtering'' or pre-selecting, at each node of the search tree, those variables that ''help'' to fulfil the restriction that there should be elements from each group.


In the computational tests, it was noted that each of the two strategies was efficient by itself, achieving important and significant reductions in calculation time; their joint use led to reductions of more than 65% in large-sized problems. In any case, exact methods can only be used on problems of moderate size, owing to the calculation time that they require. In this work, problems with 24 original variables have been solved; even though immediate solutions are not required in the majority of applications, it might not be viable to use exact methods for problems involving more than 40 variables. In future work, the design of heuristic methods that can be applied to larger-sized problems will be analysed.

Finally, we have to point out that the problem addressed in this work can be considered, or formulated, as a multi-objective optimization problem. From our point of view there are three objectives: (1) minimizing k (in order to obtain the simplest possible representation of the data); (2) minimizing p (in order to reduce the cost of collecting the data, together with the other benefits of variable selection explained in this and the referenced works); and (3) maximizing the sum of the k highest eigenvalues (in order to maximize the variance explained by the components). These objectives are in "conflict", as explained in the work. When the Branch and Bound methods are executed with different input values, the set of non-dominated solutions is obtained; from this set, the best solution (combination) must then be selected, depending on the point of view and/or preferences of the analyst. Very often this selection is performed by building mixed objective functions in which these preferences are incorporated. In any case, it is always advisable to perform a deep analysis of the set of non-dominated solutions.
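As an illustration of this multi-objective reading, a minimal sketch of extracting the non-dominated combinations from a pool of exact solutions might look as follows; representing each solution as a (k, p, value) triple is our assumption, not the paper's notation.

```python
def non_dominated(solutions):
    """Keep the Pareto-optimal triples (k, p, value), where k and p
    are to be minimised and value (the sum of the k largest
    eigenvalues) is to be maximised."""
    def dominates(a, b):
        no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
        strictly = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
        return no_worse and strictly
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]
```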

Acknowledgements

This work was partially supported by FEDER and the Spanish Ministry of Science and Innovation (Project ECO2008-06159/ECON), the Regional Government of Castile-Leon (Spain) (Project BU008A10-2), the University of Burgos and "CajaBurgos" (Spain, Internal Projects). Their support is gratefully acknowledged.

Appendix

Corollary 1. If S ⊂ S′, then f(S) ≤ f(S′).

With no loss of generality, and to simplify the proof, assume that k = 2 and that S′ contains just one element more than S, for example S = {1, 2, ..., t} and S′ = {1, 2, ..., t, t+1}. Moreover, consider the vectors v1, v2 ∈ R^t such that f(S) = v1′ · Σ(S) · v1 + v2′ · Σ(S) · v2, with ‖v1‖ = ‖v2‖ = 1 and v1′ · v2 = 0.

Subsequently, v1∗, v2∗ ∈ R^{t+1} are defined as v1∗ = (v1, 0) and v2∗ = (v2, 0); i.e., v1∗ and v2∗ have the same components as v1 and v2, with a 0 added in position t+1. Therefore ‖v1∗‖ = ‖v2∗‖ = 1, v1∗′ · v2∗ = 0 and, moreover,

f(S) = v1′ · Σ(S) · v1 + v2′ · Σ(S) · v2 = v1∗′ · Σ(S′) · v1∗ + v2∗′ · Σ(S′) · v2∗ ≤ f(S′),

where the inequality holds because (v1∗, v2∗) is just one feasible orthonormal pair for the maximization that defines f(S′). It may be confirmed, for example, that

$$
v_1^{*\prime}\,\Sigma(S')\,v_1^{*}
= (v_1',\, 0)\,
\begin{pmatrix} \Sigma(S) & a \\ a' & \sigma_{t+1\,t+1} \end{pmatrix}
\begin{pmatrix} v_1 \\ 0 \end{pmatrix}
= \big(v_1'\,\Sigma(S),\; v_1' a\big)
\begin{pmatrix} v_1 \\ 0 \end{pmatrix}
= v_1'\,\Sigma(S)\,v_1,
$$

where a = (σ_{1,t+1}, σ_{2,t+1}, ..., σ_{t,t+1})′ ∈ R^t.
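Corollary 1 is also easy to check numerically. The snippet below is a sketch under the assumption that f is implemented as the sum of the k largest eigenvalues of the correlation submatrix; it verifies the monotonicity on a random example.

```python
import numpy as np

def f(corr, S, k=2):
    """Sum of the k largest eigenvalues of the correlation submatrix on S."""
    idx = sorted(S)
    return float(np.sort(np.linalg.eigvalsh(corr[np.ix_(idx, idx)]))[-k:].sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
R = np.corrcoef(X, rowvar=False)       # correlation matrix of 10 variables

S = {0, 1, 2}
assert f(R, S) <= f(R, S | {5}) + 1e-12   # f(S) <= f(S') for S subset of S'
```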

The next corollary is a consequence of the first.

Corollary 2. ∀p, p′ ∈ {k+1, ..., n−1}, if p < p′ then g(p) ≤ g(p′).

Indeed, to simplify, let p′ = p + 1 and assume (relabelling the variables if necessary) that the optimal set for g(p) is Sp∗ = {1, ..., p}; define S′ = {1, ..., p, p+1}. Obviously Sp∗ ⊂ S′. In addition, S′ ∩ Gr ≠ ∅ for r = 1, ..., q, because Sp∗ ∩ Gr ≠ ∅, r = 1, ..., q. Therefore

g(p) = f(Sp∗) ≤ f(S′) ≤ max{ f(S) : |S| = p + 1, S ⊂ V, S ∩ Gr ≠ ∅, r = 1, ..., q } = g(p + 1).
