Estimation of an oblique structure via penalized likelihood factor analysis

Computational Statistics and Data Analysis 79 (2014) 120–132


Kei Hirose (Graduate School of Engineering Science, Osaka University, Japan)
Michio Yamamoto (Department of Biomedical Statistics and Bioinformatics, Kyoto University Graduate School of Medicine, Japan)

Article history: Received 21 February 2013; received in revised form 16 May 2014; accepted 16 May 2014; available online 21 May 2014.

Keywords: Nonconvex penalty; Oblique structure; Rotation technique; Penalized likelihood factor analysis

Abstract

The problem of sparse estimation via a lasso-type penalized likelihood procedure in a factor analysis model is considered. Typically, model estimation assumes that the common factors are orthogonal (i.e., uncorrelated). However, if the common factors are correlated, the lasso-type penalization method based on the orthogonal model frequently estimates an erroneous model. To overcome this problem, factor correlations are incorporated into the model. Together with the parameters of the orthogonal model, these correlations are estimated by a maximum penalized likelihood procedure. Entire solutions are computed by the EM algorithm with coordinate descent, enabling the application of a wide variety of convex and nonconvex penalties. The proposed method is applicable even when the number of variables exceeds that of observations. The effectiveness of the proposed strategy is evaluated by Monte Carlo simulations, and its utility is demonstrated through real data analysis.

1. Introduction

Factor analysis is used to explore the covariance structure among a set of observed random variables. The technique constructs a reduced number of random variables called common factors. Traditional exploratory factor analysis adopts the following two-step approach: the model is first estimated by the maximum likelihood method, assuming uncorrelated (orthogonal) common factors; sparse factor loadings are then sought by rotation techniques such as the varimax method (Kaiser, 1958) and the promax method (Hendrickson and White, 1964). However, overparameterization, which often leads to unstable estimation, is a recognized problem in the maximum likelihood method (e.g., Akaike, 1987). In particular, the commonly used algorithms for maximum likelihood factor analysis (e.g., Jöreskog, 1967; Jennrich and Robinson, 1969; Clarke, 1970; Lawley and Maxwell, 1971) are generally inapplicable when the number of variables exceeds that of observations. Furthermore, rotation techniques often yield insufficiently sparse solutions. To handle these problems, a penalized likelihood procedure such as the lasso (Tibshirani, 1996) may be adopted.

Lasso-type penalized likelihood factor analysis has recently been studied by several researchers. Ning and Georgiou (2011) and Choi et al. (2011) obtained sparse factor loadings by the weighted lasso and numerically demonstrated that the penalization method often outperforms the rotation technique with maximum likelihood. Hirose and Yamamoto (2014) showed that the penalization method is a generalization of the rotation technique with maximum likelihood; by imposing nonconvex penalties such as the minimax concave penalty (MC+; Zhang, 2010) and the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001), they achieved solutions sparser than those yielded by the lasso. Recently, Das et al. (2014)






numerically investigated the effectiveness of SCAD in supersaturated designs and showed that SCAD is an important tool for identifying true variables.

As is usual in exploratory factor analysis, the above studies assumed that the common factors are uncorrelated (orthogonal). In some cases, however, this assumption should be relaxed (e.g., Mulaik, 2010). We also found that, if orthogonal common factors are assumed in the model, the lasso-type penalization method frequently yields an incorrect result. In our experience, this method often generates a dense column vector in the estimated loading matrix ("dense" meaning that all elements are nonzero), even if the corresponding column vector is sparse in the true loading matrix. To handle this fundamental problem, we incorporate factor correlations into the model; these are estimated alongside the parameters of the orthogonal model by a maximum penalized likelihood procedure. Following the ideas of Hirose and Yamamoto (2014), we introduce an EM algorithm (Rubin and Thayer, 1982) with pathwise coordinate descent for nonconvex penalties (Mazumder et al., 2011). Our algorithm generates entire solutions for a wide variety of convex and nonconvex penalties, including the lasso, SCAD, and MC+ families. Furthermore, the proposed methodology provides sparser solutions than the rotation technique with maximum likelihood, and it is applicable even when the number of variables exceeds that of observations.

The remainder of this paper is organized as follows. Section 2 describes the framework of the basic factor model and maximum likelihood estimation. In Section 3, we show that lasso-type penalized likelihood factor analysis based on the orthogonal model often deviates from the oblique structure. A penalized factor analysis with oblique common factors is introduced in Section 4; this section also presents a computational algorithm, based on the EM algorithm and coordinate descent, for obtaining the entire solutions. Section 5 presents numerical analyses of both artificial and real datasets. Concluding remarks are given in Section 6.

2. Maximum likelihood factor analysis

Let X = (X_1, ..., X_p)^T be a p-dimensional observable random vector with mean vector μ and variance–covariance matrix Σ. The factor analysis model (e.g., Mulaik, 2010) is

\[
X = \mu + \Lambda F + \varepsilon,
\]

where Λ = (λ_{ij}) is a p × m matrix of factor loadings, and F = (F_1, ..., F_m)^T and ε = (ε_1, ..., ε_p)^T are unobservable random vectors. The elements of F and ε are called common factors and unique factors, respectively. Both F and ε are assumed to be multivariate-normally distributed with E(F) = 0, E(ε) = 0, E(FF^T) = I_m, E(εε^T) = Ψ, and they are independent (i.e., E(Fε^T) = O). Here I_m is the m × m identity matrix, and Ψ is a p × p diagonal matrix whose ith diagonal element ψ_i is called a unique variance. Under these assumptions, the observable random vector X is multivariate-normally distributed with variance–covariance matrix Σ = ΛΛ^T + Ψ.

Let x_1, ..., x_N be N observations randomly sampled from the p-dimensional normal population N_p(μ, ΛΛ^T + Ψ). The factor loadings and unique variances are estimated by maximizing the log-likelihood function ℓ^{ort}(Λ, Ψ) (where "ort" abbreviates the term orthogonal),

\[
\ell^{\mathrm{ort}}(\Lambda, \Psi) = -\frac{N}{2}\left[ p \log(2\pi) + \log|\Lambda\Lambda^{T} + \Psi| + \mathrm{tr}\{(\Lambda\Lambda^{T} + \Psi)^{-1} S\} \right].
\]

Here, S = (s_{ij}) is the sample variance–covariance matrix.
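As a point of reference, ℓ^{ort} can be evaluated directly from S. The following is a minimal numpy sketch (the function name is ours, not part of the authors' software):

```python
import numpy as np

def loglik_ort(Lambda, psi, S, N):
    """Orthogonal-model log-likelihood l^ort(Lambda, Psi).

    Lambda : (p, m) loading matrix
    psi    : (p,) unique variances (diagonal of Psi)
    S      : (p, p) sample covariance matrix
    N      : sample size
    """
    p = S.shape[0]
    Sigma = Lambda @ Lambda.T + np.diag(psi)   # model covariance
    _, logdet = np.linalg.slogdet(Sigma)       # numerically stable log|Sigma|
    trace_term = np.trace(np.linalg.solve(Sigma, S))
    return -0.5 * N * (p * np.log(2 * np.pi) + logdet + trace_term)
```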
Since the solutions cannot be expressed in closed form, the maximum likelihood estimates are obtained by iterative algorithms (such as those proposed by Jöreskog, 1967; Jennrich and Robinson, 1969; Clarke, 1970; Rubin and Thayer, 1982). In practical situations, the maximum likelihood method often yields unstable estimates because of overparameterization (e.g., Akaike, 1987). To obtain stable estimates, we employ a penalized likelihood procedure.

3. Penalized likelihood factor analysis based on the orthogonal model

This section first introduces lasso-type penalized likelihood factor analysis based on the orthogonal model (Choi et al., 2011; Ning and Georgiou, 2011; Hirose and Yamamoto, 2014). We then show that such an analysis cannot recover the oblique structure.

3.1. Penalized maximum likelihood estimation

In the penalized maximum likelihood procedure, the estimates of the factor loadings and unique variances, here denoted Λ̂^{ort} and Ψ̂^{ort} respectively, are obtained by maximizing the penalized log-likelihood function, i.e.,

\[
(\hat{\Lambda}^{\mathrm{ort}}, \hat{\Psi}^{\mathrm{ort}}) = \arg\max_{\Lambda, \Psi} \ \ell^{\mathrm{ort}}_{\rho}(\Lambda, \Psi),
\]

where ℓ^{ort}_ρ(Λ, Ψ) is the penalized log-likelihood function

\[
\ell^{\mathrm{ort}}_{\rho}(\Lambda, \Psi) = \ell^{\mathrm{ort}}(\Lambda, \Psi) - N \sum_{i=1}^{p} \sum_{j=1}^{m} \rho P(|\lambda_{ij}|).
\]

Here P(·) is a penalty function, and ρ is a regularization parameter.
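Continuing the sketch above, the penalized criterion with the lasso penalty P(|x|) = |x| can be written as follows (names are again ours; this reuses loglik_ort from the Section 2 sketch):

```python
import numpy as np

def penalized_loglik_ort(Lambda, psi, S, N, rho):
    """l^ort_rho(Lambda, Psi) with the lasso penalty P(|x|) = |x|."""
    penalty = N * rho * np.abs(Lambda).sum()   # N * rho * sum_ij |lambda_ij|
    return loglik_ort(Lambda, psi, S, N) - penalty
```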


The lasso-type penalty function P (·) produces sparse solutions for some ρ , i.e., some of the factor loadings can be estimated as exactly zero. The lasso is continuous and fast; however, it is biased, and the estimated model is overly dense (Zou, 2006; Zhao and Yu, 2007; Zhang, 2010). Sparser models can generally be achieved by a nonconcave penalization procedure such as MC+ (Zhang, 2010) and SCAD (Fan and Li, 2001). An example is the MC+ penalty (Zhang, 2010), given by

\[
\rho P(|\theta|; \rho, \gamma) = \rho \int_{0}^{|\theta|} \left(1 - \frac{x}{\rho\gamma}\right)_{+} dx
= \rho \left( |\theta| - \frac{\theta^{2}}{2\rho\gamma} \right) I(|\theta| < \rho\gamma) + \frac{\rho^{2}\gamma}{2}\, I(|\theta| \geq \rho\gamma).
\]

Here γ is a tuning parameter that controls the concavity of the penalty. For each ρ > 0, γ → ∞ yields the soft-threshold operator (the lasso penalty), and γ → 1+ produces the hard-threshold operator.
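As an illustration, a minimal numpy sketch of this penalty (the function name is ours); the γ → ∞ limit visibly approaches the lasso penalty ρ|θ|:

```python
import numpy as np

def mcplus_penalty(theta, rho, gamma):
    """MC+ penalty rho * P(|theta|; rho, gamma), vectorized over theta."""
    a = np.abs(theta)
    inner = rho * (a - a**2 / (2 * rho * gamma))  # region |theta| < rho*gamma
    flat = rho**2 * gamma / 2                     # constant for |theta| >= rho*gamma
    return np.where(a < rho * gamma, inner, flat)

theta = np.linspace(-2, 2, 5)
print(mcplus_penalty(theta, rho=0.5, gamma=2.0))
# Large gamma approaches the lasso penalty rho * |theta|:
print(mcplus_penalty(theta, rho=0.5, gamma=1e8))
print(0.5 * np.abs(theta))
```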

3.2. Problem of the lasso based on the orthogonal model

Lasso-type penalization based on the orthogonal model performs well when the true common factors are uncorrelated. In practice, however, the true common factors may often be correlated; that is, E[FF^T] = Φ, where Φ = (φ_{ij}) is the factor correlation matrix. In this case, the covariance matrix of the observed variables X is given by ΛΦΛ^T + Ψ. If the factors are highly intercorrelated, i.e., the off-diagonal elements of Φ deviate substantially from zero, the model estimated by the lasso based on the orthogonal model may differ widely from the true factor structure. This phenomenon is illustrated in the following example.

Example 3.1. Suppose that the true factor loadings, unique variances, and factor correlations are, respectively, given by

\[
\Lambda = \begin{pmatrix} 0.9 & 0.9 & 0.9 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.9 & 0.9 & 0.9 \end{pmatrix}^{T}, \quad
\Psi = 0.19\, I_{6}, \quad
\Phi = \begin{pmatrix} 1.0 & \phi \\ \phi & 1.0 \end{pmatrix}.
\]
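A minimal sketch of the sampling scheme used in this example (numpy assumed; the helper name is ours, and the lasso fit itself is not reproduced here):

```python
import numpy as np

def generate_example_data(phi, N=50, seed=0):
    """Draw N observations from N(0, Lambda Phi Lambda^T + Psi)."""
    rng = np.random.default_rng(seed)
    Lambda = np.array([[0.9, 0.0]] * 3 + [[0.0, 0.9]] * 3)  # 6 x 2 loadings
    Psi = 0.19 * np.eye(6)
    Phi = np.array([[1.0, phi], [phi, 1.0]])
    Sigma = Lambda @ Phi @ Lambda.T + Psi
    return rng.multivariate_normal(np.zeros(6), Sigma, size=N)

X = generate_example_data(phi=0.6)  # 50 x 6 data matrix
```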

In this example, we set φ = 0.2, 0.4, 0.6, and 0.8, and we generated 50 observations from X ∼ N_6(0, ΛΦΛ^T + Ψ). The model was estimated by the penalized likelihood method with P(|θ|) = |θ| (i.e., the lasso) and ρ = 0.01. The estimated factor loadings for each φ are listed in Table 1.

Because some of the factor loadings produced by the lasso are zero, λ̂_{12}, λ̂_{22}, and λ̂_{32} are also approximately or exactly zero. However, λ̂_{41}, λ̂_{51}, and λ̂_{61} grow with increasing φ, although their true values are zero. Thus, the lasso based on the orthogonal model fails to approximate the true factor structure.

This failure is closely related to the rotation problem. In the orthogonal model, the true covariance matrix ΛΦΛ^T + Ψ is estimated by Λ̂^{ort}(Λ̂^{ort})^T + Ψ̂^{ort}, implying that Λ̂^{ort} approximates ΛG. Here G = (g_{ii′}) is an m × m matrix satisfying GG^T = Φ. The matrix G cannot be the identity matrix unless the factor correlation matrix Φ is the identity matrix, so Λ̂^{ort} is not necessarily close to Λ even if Λ̂^{ort} ≈ ΛG.

Furthermore, the matrix G has rotational indeterminacy, since GG^T = G*(G*)^T = Φ, where G* = GT with T an arbitrary orthogonal matrix. This rotational indeterminacy of G implies that ℓ^{ort}(ΛG, Ψ) = ℓ^{ort}(ΛG*, Ψ). Thus, if we express Λ̂^{ort} and Ψ̂^{ort} as Λ̂^{ort} = ΛG and Ψ̂^{ort} = Ψ, respectively, the matrix G is obtained by solving the following problem:

\[
\max_{G} \ \ell^{\mathrm{ort}}_{\rho}(\Lambda G, \Psi) \quad \text{s.t. } GG^{T} = \Phi,
\]

which is equivalent to

\[
\min_{G} \ \sum_{i=1}^{p} \sum_{j=1}^{m} P(|\breve{\lambda}_{ij}|) \quad \text{s.t. } GG^{T} = \Phi, \tag{1}
\]

where λ̆_{ij} is the (i, j)th element of ΛG.

For simplicity, we assume that the true factor loadings Λ possess a perfect simple structure, that is, each row contains at most one nonzero element. The expression in (1) can then be written as

\[
\min_{G} \ \sum_{i=1}^{m} \sum_{i'=1}^{m} P(w_{ii'} |g_{ii'}|) \quad \text{s.t. } GG^{T} = \Phi, \tag{2}
\]

where the w_{ii′} are positive values. Because the objective function is based on the L1 loss, some of the elements of G become exactly zero. Empirically, one of the elements of G often becomes 1. When g_{qr} = 1, we have g_{qi′} = 0 (i′ ≠ r) and g_{ir} = φ_{ir} (i ≠ q). In this case, all elements of the rth column are nonzero unless φ_{ir} = 0, and the estimated model does not approximate a perfect simple structure.

Example 3.2. In Example 3.1, the problem described in (2) is written as

\[
\min_{G} \ (|g_{11}| + |g_{12}| + |g_{21}| + |g_{22}|) \quad \text{s.t. } GG^{T} = \begin{pmatrix} 1.0 & \phi \\ \phi & 1.0 \end{pmatrix}.
\]


Table 1
Loading matrices estimated by the lasso assuming orthogonal common factors for φ = 0.2, 0.4, 0.6, and 0.8.

          φ = 0.2        φ = 0.4        φ = 0.6        φ = 0.8
       0.92   0.10     0.92   0.09     0.93   0.09     0.94   0.04
       0.92   0.00     0.92   0.00     0.93   0.00     0.93  −0.04
       0.92   0.03     0.92   0.03     0.92   0.03     0.93   0.00
       0.15   0.88     0.32   0.82     0.46   0.72     0.56   0.54
       0.02   0.95     0.20   0.92     0.34   0.87     0.45   0.81
       0.05   0.89     0.23   0.85     0.36   0.76     0.46   0.60

Table 2
Loading matrices estimated by the lasso based on oblique common factors for φ = 0.2, 0.4, 0.6, and 0.8.

          φ = 0.2        φ = 0.4        φ = 0.6        φ = 0.8
       0.91   0.07     0.91   0.07     0.90   0.09     0.90   0.09
       0.92  −0.03     0.93  −0.03     0.93  −0.01     0.94   0.00
       0.92   0.00     0.92   0.00     0.92   0.03     0.91   0.05
       0.10   0.88     0.11   0.85     0.18   0.78     0.25   0.64
      −0.03   0.95    −0.04   0.96     0.00   0.94     0.00   0.93
       0.00   0.89     0.01   0.88     0.06   0.83     0.11   0.71

Solving for G, we obtain

\[
G = \begin{pmatrix} 1 & 0 \\ \phi & \sqrt{1 - \phi^{2}} \end{pmatrix}.
\]

In this case, we have

\[
\Lambda G = \begin{pmatrix} 0.9 & 0.9 & 0.9 & 0.9\phi & 0.9\phi & 0.9\phi \\ 0.0 & 0.0 & 0.0 & 0.9\sqrt{1-\phi^{2}} & 0.9\sqrt{1-\phi^{2}} & 0.9\sqrt{1-\phi^{2}} \end{pmatrix}^{T}.
\]

Therefore, when φ becomes large, λ̂_{41}, λ̂_{51}, and λ̂_{61} (= 0.9φ) can be far from zero although their true values are zero. In addition, λ̂_{42}, λ̂_{52}, and λ̂_{62} (= 0.9√(1 − φ²)) may stray significantly from their true values.

4. Estimation of an oblique structure by penalized likelihood factor analysis

As described in Section 3.2, adopting the orthogonal model in lasso-type penalized likelihood factor analysis frequently yields a wrong oblique structure. In this section, we estimate the oblique model by maximum penalized likelihood with oblique common factors.

4.1. Model estimation

Let ℓ(Λ, Ψ, Φ) be the log-likelihood function based on the oblique model,

\[
\ell(\Lambda, \Psi, \Phi) = -\frac{N}{2}\left[ p \log(2\pi) + \log|\Lambda\Phi\Lambda^{T} + \Psi| + \mathrm{tr}\{(\Lambda\Phi\Lambda^{T} + \Psi)^{-1} S\} \right], \tag{3}
\]

and let ℓ_ρ(Λ, Ψ, Φ) be the corresponding penalized log-likelihood function,

\[
\ell_{\rho}(\Lambda, \Psi, \Phi) = \ell(\Lambda, \Psi, \Phi) - N \sum_{i=1}^{p} \sum_{j=1}^{m} \rho P(|\lambda_{ij}|). \tag{4}
\]

We simultaneously estimate the factor loadings, unique variances, and factor correlations, denoted Λ̂^{obl}, Ψ̂^{obl}, and Φ̂^{obl} respectively (where "obl" abbreviates the term oblique), by the maximum penalized likelihood procedure:

\[
(\hat{\Lambda}^{\mathrm{obl}}, \hat{\Psi}^{\mathrm{obl}}, \hat{\Phi}^{\mathrm{obl}}) = \arg\max_{\Lambda, \Psi, \Phi} \ \ell_{\rho}(\Lambda, \Psi, \Phi).
\]
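Analogously to the orthogonal case, the oblique log-likelihood in (3) can be sketched as follows (numpy; the function name is ours):

```python
import numpy as np

def loglik_obl(Lambda, psi, Phi, S, N):
    """Oblique-model log-likelihood l(Lambda, Psi, Phi) of Eq. (3)."""
    p = S.shape[0]
    Sigma = Lambda @ Phi @ Lambda.T + np.diag(psi)  # model covariance
    _, logdet = np.linalg.slogdet(Sigma)
    trace_term = np.trace(np.linalg.solve(Sigma, S))
    return -0.5 * N * (p * np.log(2 * np.pi) + logdet + trace_term)
```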

Example 4.1. The lasso assuming the oblique model was applied to the dataset used in Example 3.1. The estimated factor loadings are listed in Table 2. Our procedure approximated the true factor structure much more accurately than the orthogonal model, whose results are summarized in Table 1.

Remark 4.1. Traditionally, exploratory factor analysis uses the correlation matrix, not the covariance matrix. On the other hand, the covariance matrix is adopted in confirmatory factor analysis and LISREL (Jöreskog and Sörbom, 1996). Since we focus on exploratory factor analysis, we are interested in the correlation matrix.


The solutions using the covariance and correlation matrices are related as follows. The penalized log-likelihood function in (4) can be expressed in terms of the correlation matrix R = V^{-1/2} S V^{-1/2} (V = diag(S)) as

\[
\ell_{\rho}(\Lambda, \Psi, \Phi) = \ell(\tilde{\Lambda}, \tilde{\Psi}, \Phi) - N \sum_{i=1}^{p} \sum_{j=1}^{m} \rho P\!\left( \sqrt{s_{ii}}\, |\tilde{\lambda}_{ij}| \right) + \mathrm{const.},
\]

where Λ̃ = V^{-1/2} Λ and Ψ̃ = V^{-1/2} Ψ V^{-1/2}. In particular, the penalized log-likelihood function based on the lasso can be written as

\[
\ell_{\rho}(\Lambda, \Psi, \Phi) = \ell(\tilde{\Lambda}, \tilde{\Psi}, \Phi) - N \sum_{i=1}^{p} \sum_{j=1}^{m} \tilde{\rho}_{i} |\tilde{\lambda}_{ij}| + \mathrm{const.},
\]

where ρ̃_i = √(s_{ii}) ρ. Thus, the lasso solution obtained from the covariance matrix is equivalent to the weighted-lasso solution obtained from the correlation matrix. Such an explicit relationship between the two solutions is not established if nonconvex penalties such as MC+ or SCAD are imposed.

4.2. Algorithm

As is well known, solutions obtained by lasso-type regularization generally cannot be expressed in closed form, mainly because the penalty term includes a nondifferentiable function. In regression analysis, several researchers have proposed fast algorithms to obtain the entire solutions (e.g., least angle regression, Efron et al., 2004; the coordinate descent algorithm, Friedman et al., 2007; generalized path seeking, Friedman, 2012). The fastest of these is the coordinate descent algorithm (Friedman et al., 2010), which is applicable to a wide variety of convex and nonconvex penalties (Breheny and Huang, 2011; Mazumder et al., 2011). For this reason, we employ the coordinate descent algorithm to obtain the entire solutions in the present analysis.

In the coordinate descent algorithm, each step is fast if each coordinate-wise maximization is given by an explicit formula. However, an explicit formula may not be derivable from the log-likelihood function in (3). Here, we derive an explicit formula by applying the EM algorithm (Rubin and Thayer, 1982) to penalized likelihood factor analysis. The nonconcave function is maximized by the coordinate descent algorithm in the maximization step of the EM algorithm. Because the complete-data log-likelihood function takes a quadratic form, an explicit formula is available for each coordinate-wise maximization.

4.2.1. Update equations for the fixed regularization parameter

First, we provide the update equations of the factor loadings, unique variances, and factor correlations when ρ and γ are fixed. Suppose that Λ_old, Ψ_old, and Φ_old are the current values of the factor loadings, unique variances, and factor correlations, respectively. To estimate the model, we maximize the expectation of the complete-data penalized log-likelihood function ℓ^{C}_ρ(Λ, Ψ, Φ):

\[
E[\ell^{C}_{\rho}(\Lambda, \Psi, \Phi)] = -\frac{N}{2} \sum_{i=1}^{p} \log \psi_{i}
- \frac{N}{2} \sum_{i=1}^{p} \frac{s_{ii} - 2\lambda_{i}^{T} b_{i} + \lambda_{i}^{T} A \lambda_{i}}{\psi_{i}}
- \frac{N}{2} \log|\Phi| - \frac{N}{2}\, \mathrm{tr}(\Phi^{-1} A)
- N \sum_{i=1}^{p} \sum_{j=1}^{m} \rho P(|\lambda_{ij}|) + \mathrm{const.}, \tag{5}
\]

where b_i = M^{-1} Λ_old^T Ψ_old^{-1} s_i and A = M^{-1} + M^{-1} Λ_old^T Ψ_old^{-1} S Ψ_old^{-1} Λ_old M^{-1}. Here M = Λ_old^T Ψ_old^{-1} Λ_old + Φ_old^{-1}, and s_i is the ith column vector of S. The complete-data penalized log-likelihood function is derived in the Appendix. The new parameters (Λ_new, Ψ_new, Φ_new) are computed by maximizing the expected complete-data penalized log-likelihood function, i.e.,

\[
(\Lambda_{\mathrm{new}}, \Psi_{\mathrm{new}}, \Phi_{\mathrm{new}}) = \arg\max_{\Lambda, \Psi, \Phi} E[\ell^{C}_{\rho}(\Lambda, \Psi, \Phi)]. \tag{6}
\]
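A sketch of the moments entering (5) and (6) (numpy; the names are ours), collecting the vectors b_i as the columns of a matrix B:

```python
import numpy as np

def e_step(Lambda_old, psi_old, Phi_old, S):
    """E-step quantities for the oblique EM update.

    Returns B, whose ith column is b_i = M^{-1} Lambda^T Psi^{-1} s_i, and
    A = M^{-1} + M^{-1} Lambda^T Psi^{-1} S Psi^{-1} Lambda M^{-1}.
    """
    Psi_inv = np.diag(1.0 / psi_old)
    M = Lambda_old.T @ Psi_inv @ Lambda_old + np.linalg.inv(Phi_old)
    M_inv = np.linalg.inv(M)
    W = M_inv @ Lambda_old.T @ Psi_inv   # m x p helper matrix
    B = W @ S                            # b_i is the ith column of B
    A = M_inv + W @ S @ W.T
    return B, A
```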

The solution in (6) is generally not expressed in closed form because the penalty term includes a nondifferentiable function; hence, the solution is obtained by the coordinate descent algorithm.

Let λ̃_i^{(j)} be the (m − 1)-dimensional vector (λ̃_{i1}, λ̃_{i2}, ..., λ̃_{i(j−1)}, λ̃_{i(j+1)}, ..., λ̃_{im})^T. The parameter λ_{ij} can be updated by maximizing (5) while the other parameters λ̃_i^{(j)}, Ψ, and Φ are held fixed. That is, we solve the following problem:

\[
\tilde{\lambda}_{ij} = \arg\min_{\lambda_{ij}} \left[ \frac{1}{2\psi_{i}} \left\{ a_{jj} \lambda_{ij}^{2} - 2\left( b_{ij} - \sum_{k \neq j} \tilde{\lambda}_{ik} a_{kj} \right) \lambda_{ij} \right\} + \rho P(|\lambda_{ij}|) \right]
= \arg\min_{\lambda_{ij}} \left[ \frac{1}{2} \left( \lambda_{ij} - \frac{b_{ij} - \sum_{k \neq j} \tilde{\lambda}_{ik} a_{kj}}{a_{jj}} \right)^{2} + \frac{\psi_{i} \rho}{a_{jj}}\, P(|\lambda_{ij}|) \right].
\]


Equivalently, we minimize the following penalized squared-error loss function:

\[
S(\tilde{\theta}) = \arg\min_{\theta} \left\{ \frac{1}{2} (\theta - \tilde{\theta})^{2} + \rho^{*} P(|\theta|) \right\},
\]

where, comparing with the display above, θ̃ and ρ* correspond to (b_{ij} − Σ_{k≠j} λ̃_{ik} a_{kj})/a_{jj} and ψ_i ρ/a_{jj}, respectively. The solution S(θ̃) can be expressed in closed form for various convex and nonconvex penalties. For example, imposing the MC+ penalty yields the update equation

\[
S(\tilde{\theta}) = \begin{cases} \dfrac{\mathrm{sgn}(\tilde{\theta})\, (|\tilde{\theta}| - \rho^{*})_{+}}{1 - 1/\gamma^{*}} & \text{if } |\tilde{\theta}| \leq \rho^{*}\gamma^{*}, \\[2mm] \tilde{\theta} & \text{if } |\tilde{\theta}| > \rho^{*}\gamma^{*}. \end{cases}
\]
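A minimal sketch of this threshold operator (the function name is ours; γ* > 1 is assumed):

```python
import numpy as np

def mcplus_threshold(theta_tilde, rho_star, gamma_star):
    """Closed-form coordinate update S(theta~) for the MC+ penalty."""
    a = abs(theta_tilde)
    if a <= rho_star * gamma_star:
        soft = np.sign(theta_tilde) * max(a - rho_star, 0.0)  # soft threshold
        return soft / (1.0 - 1.0 / gamma_star)                # rescaled
    return theta_tilde  # no shrinkage in the flat region of the penalty
```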

After updating Λ to Λ_new via the coordinate descent algorithm, the new values Ψ_new and Φ_new are obtained by maximizing the expected penalized log-likelihood function in (5) as follows:

\[
(\psi_{i})_{\mathrm{new}} = s_{ii} - 2(\lambda_{i})_{\mathrm{new}}^{T} b_{i} + (\lambda_{i})_{\mathrm{new}}^{T} A (\lambda_{i})_{\mathrm{new}} \quad (i = 1, \ldots, p),
\]
\[
\Phi_{\mathrm{new}} = \arg\min_{\Phi} \{ \log|\Phi| + \mathrm{tr}(\Phi^{-1} A) \},
\]

where (ψ_i)_new is the ith diagonal element of Ψ_new, and (λ_i)_new = ((λ_{i1})_new, ..., (λ_{im})_new)^T. Φ_new may not be readily expressed in explicit form, because the diagonal elements of Φ are constrained to equal 1. Therefore, the off-diagonal elements of Φ_new are estimated by the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization procedure.

4.2.2. Pathwise algorithm

A pathwise algorithm for the orthogonal case was proposed by Hirose and Yamamoto (2014). Here, this algorithm is applied to the oblique case. The pathwise algorithm efficiently yields solutions on a grid of increasing ρ values P = {ρ_1, ..., ρ_K} and a grid of increasing γ values Γ = {γ_1, ..., γ_T}, where γ_T specifies the lasso penalty (e.g., γ_T = ∞ for the MC+ family). First, we compute the lasso solutions for P = {ρ_1, ..., ρ_K} along a decreasing sequence of ρ values, starting with the largest value ρ = ρ_K, for which the estimated loading matrix is Λ̂ = O. Next, the value γ_{T−1} is selected, and solutions are produced for the sequence P = {ρ_1, ..., ρ_K}. The solution at (γ_{T−1}, ρ_k) is computed from the solution at (γ_T, ρ_k), leading to improved and smoother objective value surfaces (Mazumder et al., 2011). In the same way, the solution at (γ_t, ρ_k) for t = T − 2, ..., 1 is computed from the solution at (γ_{t+1}, ρ_k).
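The warm-start scheme can be sketched as follows, assuming a hypothetical routine fit_fixed(rho, gamma, init) that runs the EM/coordinate-descent iterations to convergence for one pair of tuning parameters:

```python
def pathwise(fit_fixed, rhos, gammas, zero_start):
    """Pathwise sketch. `rhos` and `gammas` are increasing grids;
    gammas[-1] encodes the lasso limit (e.g., a very large gamma)."""
    sols = {}
    warm = zero_start                      # all-zero loadings at rho = rho_K
    for rho in reversed(rhos):             # lasso path, largest rho first
        warm = fit_fixed(rho, gammas[-1], warm)
        sols[(gammas[-1], rho)] = warm
    for t in range(len(gammas) - 2, -1, -1):   # then decrease gamma
        for rho in rhos:
            sols[(gammas[t], rho)] = fit_fixed(
                rho, gammas[t], sols[(gammas[t + 1], rho)])  # warm start
    return sols
```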

4.3. Selection of tuning parameters

In the proposed modeling procedure, the tuning parameters ρ and γ must be appropriately selected. This is achieved by the following two selection procedures.

4.3.1. Model selection criteria

The selection of the tuning parameters can be viewed as a model selection and evaluation problem. We apply the following model selection criteria:

\[
\mathrm{AIC} = -2\ell(\hat{\Lambda}, \hat{\Psi}, \hat{\Phi}) + 2\, df(\gamma_{t}, \rho_{k}),
\]
\[
\mathrm{BIC} = -2\ell(\hat{\Lambda}, \hat{\Psi}, \hat{\Phi}) + df(\gamma_{t}, \rho_{k}) \log N,
\]
\[
\mathrm{CAIC} = -2\ell(\hat{\Lambda}, \hat{\Psi}, \hat{\Phi}) + df(\gamma_{t}, \rho_{k}) (\log N + 1),
\]

where df(γ_t, ρ_k) denotes the degrees of freedom at γ = γ_t and ρ = ρ_k. The degrees of freedom df(γ_t, ρ_k) measure the complexity of the estimated model (Ye, 1998). If the model is estimated by the unpenalized maximum likelihood procedure, the degrees of freedom equal the number of parameters. In penalized likelihood factor analysis, however, estimating the degrees of freedom is a difficult task. Thus, we apply two estimation procedures that are commonly used in regression modeling. In the first procedure, the degrees of freedom are estimated by the number of nonzero parameters (Breheny, 2013). The second procedure is based on a reparameterization of the penalty function (Mazumder et al., 2011): the MC+ penalty is reparameterized so that the degrees of freedom are held constant while γ varies. Thus, for any γ, the degrees of freedom can be computed from the estimated degrees of freedom of the lasso, themselves estimated as the number of nonzero parameters (Zou et al., 2007). These estimation procedures were developed for the regression model; mathematical support for the factor analysis model is currently lacking, and will be developed in our future research.

4.3.2. Goodness-of-fit indices

An estimated model with sufficiently sparse factor loadings may be easily interpreted. However, an excessively sparse model does not fit the data. Therefore, the selected tuning parameters should yield both a sparse solution and appropriate


values of the following commonly-used goodness-of-fit indices:

\[
\mathrm{SRMR} = \sqrt{ \frac{2}{p(p+1)} \sum_{i \leq j} \frac{(s_{ij} - \hat{\sigma}_{ij})^{2}}{s_{ii} s_{jj}} }, \qquad
\mathrm{RMSEA} = \sqrt{ \frac{\hat{f}}{df} - \frac{1}{N} },
\]
\[
\mathrm{GFI} = 1 - \frac{\mathrm{tr}[\{\hat{\Sigma}^{-1}(S - \hat{\Sigma})\}^{2}]}{\mathrm{tr}[\{\hat{\Sigma}^{-1} S\}^{2}]}, \qquad
\mathrm{AGFI} = 1 - \frac{p(p+1)(1 - \mathrm{GFI})}{p(p+1) - 2\, df},
\]
\[
\mathrm{CFI} = \frac{(\hat{f}_{0} - df_{0}/N) - (\hat{f} - df/N)}{\hat{f}_{0} - df_{0}/N},
\]

where Σ̂ = Λ̂Φ̂Λ̂^T + Ψ̂, f̂ = log|Σ̂| − log|S| + tr(Σ̂^{-1}S) − p, f̂_0 = log|diag(S)| − log|S|, and df_0 = p(p − 1)/2. Indicators of good model fit are SRMR ≤ 0.05, RMSEA ≤ 0.08, and CFI ≥ 0.90 (Hu and Bentler, 1999).

Remark 4.2. Model selection criteria and goodness-of-fit indices were discussed in Sections 4.3.1 and 4.3.2, respectively. Both can be used interactively to select an appropriate model from all solutions. For example, we first select a model that minimizes the BIC. If the selected model yields a poor CFI score, it may be modified to score higher on the CFI. Even if the selected model scores highly on the CFI, a sparser (or denser) solution may be preferred for interpretation purposes; in this case, the selected model may be fine-tuned rather than significantly modified. In this way, we can select the tuning parameters by the BIC and CFI.

4.4. Treatment of improper solutions

As is well known, the maximum likelihood estimates of unique variances can be zero or negative. Such estimates, regarded as improper solutions, have been investigated by many researchers (e.g., Van Driel, 1978; Anderson and Gerbing, 1984; Kano, 1998). Recently, the EM algorithm in maximum likelihood factor analysis has been shown to avoid improper solutions (Adachi, 2013). However, when improper solutions exist, the EM algorithm frequently slows down. To circumvent this problem, we add a penalty with respect to Ψ to (4). This penalty, based on the ideas of Martin and McDonald (1975) and Hirose et al. (2011), modifies Eq. (4) to

\[
\ell^{*}_{\rho}(\Lambda, \Psi, \Phi) = \ell_{\rho}(\Lambda, \Psi, \Phi) - \frac{N}{2}\, \eta\, \mathrm{tr}(\Psi^{-1/2} S \Psi^{-1/2}),
\]

where η is a tuning parameter. Note that tr(Ψ^{-1/2} S Ψ^{-1/2}) → ∞ as ψ_i → 0; thus, the penalty term prevents the occurrence of improper solutions. The parameter η can be selected by a generalized Bayesian information criterion (Konishi et al., 2004) derived by Hirose et al. (2011), although such a criterion is not easily derivable under lasso-type penalization. In practice, the penalty term prevents improper solutions even when η is as small as 0.001.

5. Numerical examples

5.1. Monte Carlo simulations

The following models were evaluated in the simulation study:

\[
\text{Model (A)}: \quad \Lambda = \begin{pmatrix} 0.9 & 0.9 & 0.9 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.8 & 0.8 & 0.8 \end{pmatrix}^{T},
\]
\[
\text{Model (B)}: \quad \Lambda = \begin{pmatrix}
0.9 \cdot \mathbf{1}_{25} & \mathbf{0}_{25} & \mathbf{0}_{25} & \mathbf{0}_{25} \\
\mathbf{0}_{25} & 0.8 \cdot \mathbf{1}_{25} & \mathbf{0}_{25} & \mathbf{0}_{25} \\
\mathbf{0}_{25} & \mathbf{0}_{25} & 0.7 \cdot \mathbf{1}_{25} & \mathbf{0}_{25} \\
\mathbf{0}_{25} & \mathbf{0}_{25} & \mathbf{0}_{25} & 0.6 \cdot \mathbf{1}_{25}
\end{pmatrix},
\]

where 1_q is the q-dimensional vector (1, ..., 1)^T, and 0_q is the q-dimensional zero vector. Model (B) is a considerably larger model than Model (A). For each model, 1000 datasets were generated such that x ∼ N_p(0, ΛΦΛ^T + Ψ), where Ψ = diag(I_p − ΛΦΛ^T). The factor correlations were structured in two ways: an equicorrelation structure and a first-order autoregressive (AR) structure. The equicorrelation structure can be written as Φ = φ · 1_m 1_m^T + (1 − φ) · I_m. In the first-order AR structure, the (i, j)th element of Φ is defined by φ_{ij} = φ^{|i−j|}. Note that the two structures are equivalent in Model (A). To investigate the strength of the correlations under which the new method demonstrates improvement over existing methods, we varied φ as φ = 0.0, 0.2, 0.4, 0.6, and 0.8. The performance of the penalized likelihood procedure based on the oblique model was compared with that based on the orthogonal model. The number of observations N was set to 50, 100, and 200.

The mean squared error (MSE) of the factor loadings, the sparsity, the true positive rate (TPR), and the true negative rate (TNR) yielded by Models (A) and (B) are plotted in Fig. 1. The MSE is defined as

\[
\mathrm{MSE} = \sum_{s=1}^{1000} \| \hat{\Lambda}^{(s)} - \Lambda \|^{2} / (1000\, pm),
\]

where Λ̂^{(s)} is the estimated loading matrix for the sth dataset. The sparsity is defined as the proportion of zero elements. The true sparsities of Models (A) and (B) (indicated by the horizontal dashed lines in Fig. 1) are 0.5 and 0.75, respectively. The TPR (TNR) indicates the proportion of cases for which the nonzero (zero) factor loadings are correctly set to nonzero (zero). The MC+ penalty was imposed, and the tuning parameters ρ and γ were selected by the BIC. The degrees of freedom were estimated by the number of nonzero parameters. The penalized likelihood procedure is available in the R package fanc (available at http://cran.r-project.org/web/packages/fanc).

From Fig. 1, we observe the following:

• The proposed procedure outperformed the orthogonal model, especially when the factors were highly correlated.
• In Model (A), estimation based on the oblique model significantly outperformed that based on the orthogonal model even at relatively low correlations (e.g., φ = 0.4). When φ ≥ 0.4, the factor structure yielded by the orthogonal model significantly differed from the true structure in many instances.
• In Model (B), the orthogonal model performed relatively well unless the factors were extremely highly correlated (φ = 0.8). Provided that the regularization parameter ρ was relatively large, estimation based on the orthogonal structure recovered the true structure; if ρ was too small, the estimated model completely differed from the true model. The BIC selected an appropriate value of ρ when φ ≤ 0.6.
• The results for the equicorrelation structure in Model (B) were similar to those for the AR structure.
• When the true factor structure was orthogonal (i.e., φ = 0), the oblique and orthogonal models returned similar results.

We also compared the performance of various procedures based on the oblique model: MC+ with the AIC, lasso with the AIC, MC+ with the BIC, and lasso with the BIC. Fig. 2 plots the MSE, sparsity, TPR, and TNR obtained by these four approaches. We observe the following:

• MC+ with the BIC generally selected the correct model.
• The MC+ penalty generated sparser solutions than the lasso.
• The BIC selected the appropriate model more often than the AIC.
• The TPR, TNR, and sparsity were essentially independent of φ in all methods. However, the MSE increased as φ increased, especially in Model (B).

5.2. Analysis of Harman's psychological test data

The usefulness of our procedure is illustrated by Harman's psychological test data (Harman, 1976), which record the scores of 145 subjects on 24 psychological tests. The dataset is one of several available in the R software. The model was estimated by the penalized likelihood procedure using the MC+ penalty; both the orthogonal and oblique models were evaluated. The tuning parameters γ and ρ were selected by the BIC, and the degrees of freedom were estimated by the number of nonzero parameters.

Table 3 lists the goodness-of-fit indices, selected tuning parameters, and number of nonzero loadings. Both the orthogonal and oblique models yielded similarly high goodness-of-fit indices, indicating that both models might fit the observed data. However, the number of nonzero parameters in the orthogonal model was more than twice that in the oblique model; thus, the oblique model yielded a much sparser loading matrix. Furthermore, the estimated factor correlation matrix was

\[
\hat{\Phi} = \begin{pmatrix}
1.00 & 0.72 & 0.49 & 0.71 \\
0.72 & 1.00 & 0.71 & 0.79 \\
0.49 & 0.71 & 1.00 & 0.63 \\
0.71 & 0.79 & 0.63 & 1.00
\end{pmatrix},
\]

indicating that the four factors were relatively well correlated.

Table 4 lists the factor loadings estimated by MC+ based on the orthogonal and oblique models. In the orthogonal model, all elements of the second column of the estimated loading matrix were relatively large. The four factors may be interpreted as (i) spatial visualization, (ii) comprehensive ability, (iii) speed, and (iv) memory. In the oblique model, by contrast, the second column of the loading matrix became sparse, and the second factor may be interpreted as "logical verbal reasoning"; the remaining three factors were interpreted identically to the orthogonal model.

In addition, we compared the performance of our method with the traditional two-step approach (i.e., maximum likelihood estimation followed by the rotation technique). Here, the rotation criterion was the promax criterion, which is widely used to estimate oblique structures. We found that our method yielded estimates similar to those obtained by the promax rotation technique when (ρ, γ) = (0.02, 4.0). The loading matrices estimated by both methods are listed in Table 5.


Fig. 1. Comparison of orthogonal and oblique models. Performances were evaluated by four indices (MSE, Sparsity, TNR, and TPR). The x-axis indicates φ . The horizontal dashed lines in the Sparsity results indicate the true sparsity.

A significant feature of our technique is that, merely by changing the tuning parameters, we can obtain sparser or denser solutions than those listed in Table 5. Using the model selection criteria and goodness-of-fit indices, the analyst can select an appropriate model from these solutions. For this reason, our method potentially offers greater flexibility than the promax rotation technique when evaluating this dataset.


Fig. 2. Comparison of various penalized likelihood procedures. Performances were evaluated by four indices (MSE, Sparsity, TNR, and TPR). The x-axis indicates φ . The horizontal dashed lines in the Sparsity results indicate the true sparsity.

6. Concluding remarks

In exploratory factor analysis, the lasso based on the orthogonal model frequently fails to approximate the oblique structure. We identified the source of this disadvantage as the rotation problem of factor loadings.


Table 3
Goodness-of-fit indices, selected tuning parameters, and number of nonzero loadings (Active) for the 24 psychological test data.

              GFI    AGFI   CFI    RMSEA  SRMR   ρ      γ      Active
Orthogonal    0.86   0.80   0.86   0.13   0.05   0.14   1.10   58
Oblique       0.81   0.77   0.80   0.18   0.06   0.21   1.10   28

Table 4
Factor loadings estimated by MC+ based on the orthogonal and oblique models in analyses of the 24 psychological test data.

Items                       Orthogonal model               Oblique model
Visual perception            0.57  0.45  0.04  0.00        0.67  0.00  0.00  0.00
Cubes                        0.38  0.28  0.00  0.00        0.44  0.00  0.00  0.00
Paper form board             0.52  0.29  0.00  0.00        0.81  0.00 −0.36  0.00
Flags                        0.42  0.40  0.00  0.00        0.54  0.00  0.00  0.00
General information         0.00  0.78 −0.18  0.00        0.00  0.80  0.00  0.00
Paragraph comprehension      0.00  0.77 −0.32  0.00        0.00  0.83  0.00  0.00
Sentence completion          0.00  0.79 −0.25 −0.16        0.00  0.83  0.00  0.00
Word classification          0.12  0.69  0.00  0.00        0.00  0.53  0.24  0.00
Word meaning                 0.00  0.79 −0.37  0.00        0.00  0.85  0.00  0.00
Addition                    −0.32  0.52  0.64  0.00       −0.88  0.00  1.42  0.00
Code                         0.00  0.51  0.33  0.26        0.00  0.00  0.62  0.00
Counting dots                0.00  0.42  0.61  0.00        0.00 −0.39  0.90  0.00
Straight curved capitals     0.25  0.52  0.36  0.00        0.00  0.00  0.64  0.00
Word recognition             0.00  0.37  0.00  0.46        0.00  0.00  0.00  0.51
Number recognition           0.00  0.32  0.00  0.46        0.00  0.00  0.00  0.52
Figure recognition           0.37  0.35  0.00  0.42        0.00  0.00  0.00  0.60
Object number                0.00  0.40  0.00  0.50        0.00  0.00  0.00  0.61
Number figure                0.22  0.39  0.27  0.35        0.00  0.00  0.00  0.65
Figure word                  0.18  0.37  0.00  0.24        0.00  0.00  0.00  0.50
Deduction                    0.30  0.56  0.00  0.00        0.65  0.00  0.00  0.00
Numerical puzzles            0.26  0.51  0.32  0.00        0.00  0.00  0.63  0.00
Problem reasoning            0.30  0.55  0.00  0.00        0.65  0.00  0.00  0.00
Series completion            0.36  0.61  0.00  0.00        0.73  0.00  0.00  0.00
Arithmetic problems          0.00  0.65  0.26  0.00        0.00  0.00  0.70  0.00

Table 5
Loading matrices estimated by two different procedures in analyses of the 24 psychological test data. The first estimation procedure is the promax rotation technique with the maximum likelihood method; the second is the penalized likelihood approach using the MC+ penalty. The tuning parameters were (ρ, γ) = (0.02, 4.0).

Items                       Promax rotation                Proposed method
Visual perception            0.83 −0.09 −0.04 −0.02        0.74  0.00  0.02  0.00
Cubes                        0.53 −0.03 −0.07 −0.02        0.47  0.00  0.00  0.00
Paper form board             0.71 −0.03 −0.24 −0.01        0.65  0.01 −0.16  0.00
Flags                        0.62  0.09 −0.08 −0.08        0.53  0.12  0.00 −0.04
General information         −0.02  0.79  0.11 −0.04        0.00  0.73  0.18  0.00
Paragraph comprehension     −0.01  0.82 −0.09  0.09        0.01  0.78  0.00  0.11
Sentence completion          0.01  0.89  0.05 −0.14        0.00  0.82  0.12 −0.07
Word classification          0.24  0.53  0.11 −0.07        0.21  0.50  0.18  0.00
Word meaning                −0.02  0.88 −0.12  0.08        0.00  0.83 −0.01  0.09
Addition                    −0.32  0.06  0.97  0.03       −0.32  0.00  0.93  0.13
Code                        −0.04  0.01  0.47  0.31        0.00  0.00  0.49  0.33
Counting dots                0.20 −0.19  0.76 −0.09        0.13 −0.19  0.74  0.00
Straight curved capitals     0.48 −0.02  0.45 −0.14        0.38  0.00  0.47 −0.04
Word recognition            −0.16  0.10 −0.06  0.65        0.00  0.09  0.00  0.56
Number recognition          −0.04 −0.01 −0.08  0.61        0.09  0.00  0.00  0.51
Figure recognition           0.38 −0.16 −0.18  0.57        0.45 −0.11 −0.09  0.48
Object number               −0.14 −0.00  0.10  0.66        0.00  0.00  0.15  0.57
Number figure                0.23 −0.22  0.20  0.45        0.28 −0.18  0.25  0.41
Figure word                  0.16  0.00  0.02  0.36        0.22  0.00  0.07  0.33
Deduction                    0.34  0.26 −0.08  0.20        0.35  0.27  0.00  0.19
Numerical puzzles            0.37 −0.03  0.33  0.07        0.33  0.00  0.37  0.12
Problem reasoning            0.34  0.24 −0.07  0.20        0.35  0.26  0.00  0.19
Series completion            0.49  0.20  0.05  0.07        0.45  0.23  0.13  0.10
Arithmetic problems         −0.02  0.25  0.44  0.18        0.00  0.23  0.48  0.22

To resolve the problem, we proposed maximum penalized likelihood factor analysis based on the oblique model. The effectiveness of our modeling strategy was investigated by Monte Carlo simulations and real data analyses. Simulation results showed that the mean squared


error was much smaller, and the true negative rate much higher, than those yielded by penalized likelihood factor analysis using the orthogonal model. Furthermore, the MC+ often produced sparser solutions than the lasso, enabling reconstruction of the true factor structure by imposing a nonconvex penalty. In analyses of Harman's psychological data, the second column vector of the loading matrix was dense in the orthogonal model, whereas our procedure produced a completely sparse loading matrix. As an interesting future research topic, we could extend our modeling procedure to structural equation modeling, such as LISREL, which expresses much more complex covariance structures among observable variables and common factors.

Acknowledgments

The authors would like to thank two anonymous reviewers for the constructive and helpful comments that improved the quality of the paper considerably.

Appendix. Derivation of the complete-data penalized log-likelihood function in the EM algorithm

Prior to applying the EM algorithm, we first regard the common factors f_n as missing values and maximize the following complete-data penalized log-likelihood function:

\[
\ell^{C}_{\rho}(\Lambda, \Psi, \Phi) = \sum_{n=1}^{N} \log f(x_{n}, f_{n}) - N \sum_{i=1}^{p} \sum_{j=1}^{m} \rho P(|\lambda_{ij}|),
\]

where f (xn , fn ) is the density function defined by



\[
f(x_{n}, f_{n}) = (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left\{ -\frac{(x_{n} - \Lambda f_{n})^{T} \Psi^{-1} (x_{n} - \Lambda f_{n})}{2} \right\} \cdot (2\pi)^{-m/2} |\Phi|^{-1/2} \exp\left\{ -\frac{f_{n}^{T} \Phi^{-1} f_{n}}{2} \right\}
\]
\[
= \prod_{i=1}^{p} (2\pi \psi_{i})^{-1/2} \exp\left\{ -\frac{(x_{ni} - \lambda_{i}^{T} f_{n})^{2}}{2\psi_{i}} \right\} \cdot (2\pi)^{-m/2} |\Phi|^{-1/2} \exp\left\{ -\frac{f_{n}^{T} \Phi^{-1} f_{n}}{2} \right\}.
\]

Next, the expectation of ℓ^{C}_ρ is computed with respect to the distribution f(f_n | x_n, Λ, Ψ, Φ):

\[
E[\ell^{C}_{\rho}(\Lambda, \Psi, \Phi)] = -\frac{N}{2} \sum_{i=1}^{p} \log \psi_{i}
- \frac{1}{2} \sum_{n=1}^{N} \sum_{i=1}^{p} \frac{x_{ni}^{2} - 2 x_{ni} \lambda_{i}^{T} E[F_{n}|x_{n}] + \lambda_{i}^{T} E[F_{n} F_{n}^{T}|x_{n}] \lambda_{i}}{\psi_{i}}
- \frac{N}{2} \log|\Phi|
- \frac{1}{2} \sum_{n=1}^{N} \mathrm{tr}\left( \Phi^{-1} E[F_{n} F_{n}^{T}|x_{n}] \right)
- N \sum_{i=1}^{p} \sum_{j=1}^{m} \rho P(|\lambda_{ij}|).
\]

For given values Λ_old, Ψ_old, and Φ_old, the posterior f(f_n | x_n, Λ_old, Ψ_old, Φ_old) is normally distributed with E[F_n|x_n] = M^{-1} Λ_old^T Ψ_old^{-1} x_n and E[F_n F_n^T|x_n] = M^{-1} + E[F_n|x_n] E[F_n|x_n]^T, where M = Λ_old^T Ψ_old^{-1} Λ_old + Φ_old^{-1}. Then, we have

\[
\sum_{n=1}^{N} E[F_{n}|x_{n}]\, x_{ni} = \sum_{n=1}^{N} M^{-1} \Lambda_{\mathrm{old}}^{T} \Psi_{\mathrm{old}}^{-1} x_{n} x_{ni} = N M^{-1} \Lambda_{\mathrm{old}}^{T} \Psi_{\mathrm{old}}^{-1} s_{i},
\]
\[
\sum_{n=1}^{N} E[F_{n} F_{n}^{T}|x_{n}] = \sum_{n=1}^{N} \left( M^{-1} + M^{-1} \Lambda_{\mathrm{old}}^{T} \Psi_{\mathrm{old}}^{-1} x_{n} x_{n}^{T} \Psi_{\mathrm{old}}^{-1} \Lambda_{\mathrm{old}} M^{-1} \right)
= N \left( M^{-1} + M^{-1} \Lambda_{\mathrm{old}}^{T} \Psi_{\mathrm{old}}^{-1} S \Psi_{\mathrm{old}}^{-1} \Lambda_{\mathrm{old}} M^{-1} \right).
\]

Letting b_i = M^{-1} Λ_old^T Ψ_old^{-1} s_i and A = M^{-1} + M^{-1} Λ_old^T Ψ_old^{-1} S Ψ_old^{-1} Λ_old M^{-1}, the expectation of ℓ^{C}_ρ in (5) can then be derived.

References

Adachi, K., 2013. Factor analysis with EM algorithm never gives improper solutions when sample covariance and initial parameter matrices are proper. Psychometrika 78 (2), 380–394.
Akaike, H., 1987. Factor analysis and AIC. Psychometrika 52 (3), 317–332.
Anderson, J., Gerbing, D., 1984. The effect of sampling error on convergence, improper solutions, and goodness-of-fit indices for maximum likelihood confirmatory factor analysis. Psychometrika 49 (2), 155–173.
Breheny, P., 2013. grpreg: Regularization paths for regression models with grouped covariates. R package version 2.5. URL http://cran.r-project.org/web/packages/grpreg/.
Breheny, P., Huang, J., 2011. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5 (1), 232.
Choi, J., Zou, H., Oehlert, G., 2011. A penalized maximum likelihood approach to sparse factor analysis. Stat. Interface 3 (4), 429–436.
Clarke, M., 1970. A rapidly convergent method for maximum-likelihood factor analysis. British J. Math. Statist. Psych. 23 (1), 43–52.
Das, U., Gupta, S., Gupta, S., 2014. Screening active factors in supersaturated designs. Comput. Statist. Data Anal. 77, 223–232.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression (with discussion). Ann. Statist. 32, 407–499.


Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Friedman, J.H., 2012. Fast sparse regression and classification. Int. J. Forecast. 28 (3), 722–738.
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R., 2007. Pathwise coordinate optimization. Ann. Appl. Stat. 1, 302–332.
Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. J. Statist. Software 33.
Harman, H., 1976. Modern Factor Analysis. University of Chicago Press.
Hendrickson, A., White, P., 1964. Promax: a quick method for rotation to oblique simple structure. Br. J. Stat. Psychol. 17 (1), 65–70.
Hirose, K., Kawano, S., Konishi, S., Ichikawa, M., 2011. Bayesian information criterion and selection of the number of factors in factor analysis models. J. Data Sci. 9 (1), 243–259.
Hirose, K., Yamamoto, M., 2014. Sparse estimation via nonconcave penalized likelihood in a factor analysis model. Stat. Comput. http://dx.doi.org/10.1007/s11222-014-9475-z, in press.
Hu, L.-t., Bentler, P.M., 1999. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct. Equ. Model. 6 (1), 1–55.
Jennrich, R., Robinson, S., 1969. A Newton–Raphson algorithm for maximum likelihood factor analysis. Psychometrika 34 (1), 111–123.
Jöreskog, K., 1967. Some contributions to maximum likelihood factor analysis. Psychometrika 32 (4), 443–482.
Jöreskog, K.G., Sörbom, D., 1996. LISREL 8 User's Reference Guide. Scientific Software.
Kaiser, H., 1958. The varimax criterion for analytic rotation in factor analysis. Psychometrika 23 (3), 187–200.
Kano, Y., 1998. Improper solutions in exploratory factor analysis: causes and treatments. In: Advances in Data Science and Classification: Proceedings of the 6th Conference of the International Federation of Classification Societies (IFCS-98), Università "La Sapienza", Rome, 21–24 July. Springer Verlag, p. 375.
Konishi, S., Ando, T., Imoto, S., 2004. Bayesian information criteria and smoothing parameter selection in radial basis function networks. Biometrika 91 (1), 27–43.
Lawley, D., Maxwell, A., 1971. Factor Analysis as a Statistical Method, Vol. 18. Butterworths, London.
Martin, J., McDonald, R., 1975. Bayesian estimation in unrestricted factor analysis: a treatment for Heywood cases. Psychometrika 40 (4), 505–517.
Mazumder, R., Friedman, J., Hastie, T., 2011. SparseNet: coordinate descent with nonconvex penalties. J. Amer. Statist. Assoc. 106, 1125–1138.
Mulaik, S., 2010. The Foundations of Factor Analysis, second ed. Chapman and Hall/CRC, Boca Raton.
Ning, L., Georgiou, T.T., 2011. Sparse factor analysis via likelihood and ℓ1 regularization. In: 50th IEEE Conference on Decision and Control and European Control Conference, pp. 5188–5192.
Rubin, D., Thayer, D., 1982. EM algorithms for ML factor analysis. Psychometrika 47 (1), 69–76.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58, 267–288.
Van Driel, O., 1978. On various causes of improper solutions in maximum likelihood factor analysis. Psychometrika 43 (2), 225–243.
Ye, J., 1998. On measuring and correcting the effects of data mining and model selection. J. Amer. Statist. Assoc. 93, 120–131.
Zhang, C., 2010. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38, 894–942.
Zhao, P., Yu, B., 2007. On model selection consistency of lasso. J. Mach. Learn. Res. 7 (2), 2541.
Zou, H., 2006. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101, 1418–1429. Zou, H., Hastie, T., Tibshirani, R., 2007. On the degrees of freedom of the lasso. Ann. Statist. 35, 2173–2192.