Computers Ops Res. Vol. 18, No. 8, pp. 655-661, 1991. Printed in Great Britain. All rights reserved.
0305-0548/91 $3.00 + 0.00. Copyright © 1991 Pergamon Press plc
AN ITERATIVE LINEAR PROGRAMMING SOLUTION TO THE EUCLIDEAN REGRESSION MODEL

TOM M. CAVALIER*‡ and BRIAN J. MELLOY†
Department of Industrial and Management Systems Engineering and Graduate Program in Operations Research, 207 Hammond Building, Pennsylvania State University, University Park, PA 16802, U.S.A.
(Received September 1990; in revised form April 1991)
Scope and Purpose-Many regression methods exist for estimating the multivariate linear model; the general superiority of one method versus another remains unresolved. Minimum sum of absolute errors (MSAE) regression, for example, is resistant to outliers and also provides a good starting solution for certain robust regression procedures [10]. This method, though, is only appropriate when the response of one or more variables is a function of independent predictor variables. The analog of this method when no such independent/dependent relationship exists is known as Euclidean regression, so named because it minimizes the sum of the normal Euclidean distances. While it is theorized that Euclidean regression will possess the same advantages as MSAE regression, its actual characteristics are unknown, because no practical method has existed for its solution. However, in this paper, a solution procedure is proposed which is based on the well-known linear programming algorithm. In the future, this solution procedure will enable the evaluation of the relative performance of this regression method and, ultimately, its widespread application.

Abstract-A mathematical programming approach is employed to solve the n-dimensional linear Euclidean regression problem, in which the objective is to determine a hyperplane which minimizes the sum of the normal Euclidean distances to a given set of data points. Although the problem is a nonlinear nonconvex programming problem, an efficient solution algorithm is developed which solves a succession of linear programming problems.
1. INTRODUCTION
The objective of many scientific investigations is the estimation of a curve or surface which is defined by the collective behavior of several variables. This surface is frequently determined by an ordinary least-squares (OLS) regression procedure, which estimates the response of one or more variables as a function of a set of independent predictor variables. However, in some instances, a dependent relationship may not exist among the variables. For example, "straightness" is a geometric feature of many part families, which is naturally represented by a line. There is, however, no physical evidence which supports predicting the y coordinates of the edge as a linear function of the x coordinates, or conversely, predicting the x coordinates as a linear function of the y coordinates. Thus, in this case, a conventional OLS regression model would not be appropriate. If the choice of a dependent variable is not clear, or if any of the other assumptions of OLS regression [1] are violated, another modeling procedure should be employed. Examples of alternative least-squares procedures include OLS-bisector, orthogonal and reduced major axis regression, and examples of procedures which are not least-squares based include Chebyshev, minimum sum of absolute errors (MSAE) and averaging methods.
*Tom M. Cavalier is an Associate Professor of Industrial Engineering at The Pennsylvania State University, and Chairperson of the Operations Research Program. He received a B.Sc. degree in Electrical Engineering from West Virginia Institute of Technology, an M.A. in Mathematics from West Virginia University and a Ph.D. in Industrial Engineering and Operations Research from Virginia Polytechnic Institute and State University. His research interests include mathematical programming, applied optimization and location theory.
†Brian J. Melloy is an Assistant Professor of Industrial Engineering at The Pennsylvania State University. Prior to his appointment at Penn State, he was an instructor at the University of South Florida, where he completed his Ph.D. in Industrial Engineering and Operations Research. He has also worked as a telecommunications analyst with GTE Service Corporation. His research interests include quality assurance and reliability engineering.
‡All correspondence should be addressed to: Dr T. M. Cavalier, College of Engineering, 207 Hammond Building, Pennsylvania State University, University Park, PA 16802, U.S.A.
(Detailed descriptions and comparisons of these methods may be found in Akritas et al. [2] and Branham [3], respectively.) Although these methods are relatively less well-known, they are applicable to a variety of disciplines, including astronomy (e.g. Branham [4]), biology (e.g. Gingerich and Smith [5]), chemistry (e.g. Krug et al. [6]), geology (e.g. Jones [7]), geometric tolerancing (e.g. Murthy and Abdin [8]), and physics (e.g. Miller and Dunn [9]).

OLS regression minimizes the sum of the squares of the deviations in one coordinate direction, whereas MSAE regression minimizes the sum of the absolute deviations. Orthogonal regression, which minimizes the sum of the squared Euclidean distances, is the analog of OLS regression. Euclidean regression, which is the subject of this paper, minimizes the sum of the Euclidean distances, and hence is the analog of MSAE regression.

The general superiority of one regression method versus another remains an open issue, as evidenced by ongoing research in this area (e.g. [2]). MSAE regression, for example, offers several advantages: it is resistant to outliers and error distributions with long tails, and it also provides a good starting solution for certain robust regression procedures [10]. It is conjectured, therefore, that the analogous Euclidean procedure will also afford these advantages. Nevertheless, the objective of this investigation is to provide a solution procedure for this problem so that it may be explored in conjunction with the other methods, not to establish its relative advantages or disadvantages.

The Euclidean problem is considerably more difficult than similar models (e.g. an orthogonal model), because it cannot be solved directly by finding a solution to the system of extremal equations (generated by taking the partial derivatives). It is for this reason that, to the knowledge of these authors, this method is not currently considered as a practical alternative to comparable regression techniques. The focus of the investigation herein will be on n-dimensional linear Euclidean regression models. In a related study, Megiddo and Tamir [11] have proposed a solution procedure for the 2-dimensional problem which is O(n log² n). This method is based on the result that the line which minimizes the sum of the Euclidean distances passes through at least two data points. Thus, the optimal solution can be found by essentially enumerating the solutions corresponding to each pair of data points. The 2-dimensional problem has an analog in location theory (access to a linear resource, such as a highway or utility), and hence the solution is referred to as a "1-line median". The solution method developed in this paper is applicable to this problem as well as the general n-dimensional problem.

This paper is organized as follows. In the following section, we provide a formal statement of the linear Euclidean model, develop a solution procedure for the resulting nonlinear nonconvex program and present an illustrative example. Some interesting observations concerning the dual problem are provided in Section 3, while Section 4 gives a statement of conclusions.
2. ANALYSIS

Let x_i, i = 1, ..., m be a given set of points in E_n, Euclidean n-space, and consider the problem of finding that hyperplane, p'x = α (a line in E_2, a plane in E_3), which minimizes the sum of the normal Euclidean distances from the hyperplane to each x_i, i = 1, ..., m. To formulate this as a mathematical programming problem, it is convenient to derive a concise representation of the distance from each x_i to the hyperplane p'x = α. Suppose that p'x_i = β < α for some i, and recall that p/||p|| is the unit normal vector to p'x = β. (The notation ||v|| represents the Euclidean norm of a vector v.) Then, as illustrated in Fig. 1, the point x_i + d_i p/||p|| lies on p'x = α if and only if d_i is the normal Euclidean distance from x_i to p'x = α. Thus,

    p'(x_i + d_i p/||p||) = p'x_i + d_i p'p/||p|| = p'x_i + d_i ||p|| = α,    (1)

and therefore,

    d_i = (α - p'x_i)/||p||.    (2)

Similarly, if p'x_i = β > α, then

    d_i = (p'x_i - α)/||p||.    (3)
Fig. 1. The normal Euclidean distance from x_i to p'x = α.
Therefore, in general, the normal Euclidean distance from a point x_i to the hyperplane p'x = α is given by

    d_i = |p'x_i - α|/||p||.    (4)
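For readers who wish to check the geometry numerically, the short sketch below simply evaluates (4); it is an illustrative fragment only (NumPy and the helper name normal_distance are assumptions, not part of the paper).

```python
import numpy as np

def normal_distance(x, p, alpha):
    """Normal Euclidean distance from a point x to the hyperplane p'x = alpha,
    i.e. |p'x - alpha| / ||p||, as in equation (4)."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return abs(p @ x - alpha) / np.linalg.norm(p)

# For instance, the point (3, 3) lies 1/sqrt(2) ~ 0.7071 from the line x_1 - x_2 = -1.
print(normal_distance([3, 3], [1.0, -1.0], -1.0))
```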
Thus, the problem of determining the hyperplane which minimizes the sum of the normal Euclidean distances from the hyperplane to each x_i, i = 1, ..., m can be expressed as follows:

    minimize   Σ_{i=1}^m |p'x_i - α|/||p||    (5)
    p, α unrestricted,    (6)

or equivalently,

(P)  minimize   Σ_{i=1}^m |p'x_i - α|    (7)
     subject to   ||p|| = 1    (8)
     p, α unrestricted.    (9)
This problem is not only nonlinear but also nonconvex, and as such, is difficult to solve. Unlike other regression problems which can be solved by simply finding a solution to the extremal equations generated by taking the partial derivatives, a solution of the extremal equations of (5) cannot be found directly. However, due to its special structure, an effective and reliable heuristic solution procedure can be developed. Before presenting the details of the algorithm, it is necessary to transform the problem into a more amenable form. First, the absolute values in the objective function can be eliminated in the usual way by introducing additional nonnegative variables u = (u_1, ..., u_m)' and v = (v_1, ..., v_m)'. Problem (P) then becomes

(P1)  minimize   Σ_{i=1}^m (u_i + v_i)    (10)
      subject to   p'x_i - α = u_i - v_i,  for i = 1, ..., m    (11)
      ||p|| = 1    (12)
      u, v ≥ 0    (13)
      p, α unrestricted.    (14)
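The substitution behind (10)-(11) is the standard split of a residual into nonnegative parts; a brief numerical illustration (not from the paper, assuming NumPy) follows.

```python
import numpy as np

# Split each residual r_i = p'x_i - alpha as r_i = u_i - v_i with u_i, v_i >= 0.
# At an optimum only one of the pair is positive, so u_i + v_i = |r_i|,
# which is exactly the quantity summed in objective (10).
r = np.array([1.5, -0.3, 0.0, -2.2])
u = np.maximum(r, 0.0)
v = np.maximum(-r, 0.0)
print(u - v)   # recovers r
print(u + v)   # equals |r|
```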
Now let A be the m×n matrix whose ith row is x_i' and let e = (1, 1, ..., 1)' ∈ E_m. Then (P1) can be recast in the following form:

(P2)  minimize   e'u + e'v    (15)
      subject to   Ap - eα - u + v = 0    (16)
      ||p|| = 1    (17)
      u, v ≥ 0    (18)
      p, α unrestricted.    (19)
Finally, note that since ||p|| = 1 in (17), then equivalently, p'p = ||p||² = 1. Therefore (17) can be replaced by p'p = 1 to form the equivalent problem,

(P3)  minimize   e'u + e'v    (20)
      subject to   Ap - eα - u + v = 0    (21)
      p'p = 1    (22)
      u, v ≥ 0    (23)
      p, α unrestricted.    (24)

Observe that (P3) is a linear programming problem except for constraint (22). In an attempt to linearize constraint (22), consider the problem:

(P4)  minimize   e'u + e'v    (25)
      subject to   Ap - eα - u + v = 0    (26)
      q'p = 1    (27)
      u, v ≥ 0    (28)
      p, α unrestricted.    (29)
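For a fixed q, (P4) is an ordinary linear program in (p, α, u, v) and can be handed to any LP solver. The sketch below sets up constraints (26)-(28) for SciPy's linprog; the function name solve_p4 and the use of SciPy are illustrative choices, not the paper's original Fortran implementation.

```python
import numpy as np
from scipy.optimize import linprog

def solve_p4(A, q):
    """Solve (P4) for a fixed vector q.

    A is the m x n data matrix whose ith row is x_i'; the LP variables are
    ordered (p, alpha, u, v).  Returns the optimal p, alpha and objective value.
    """
    m, n = A.shape
    nvar = n + 1 + 2 * m
    c = np.concatenate([np.zeros(n + 1), np.ones(2 * m)])      # e'u + e'v

    A_eq = np.zeros((m + 1, nvar))
    A_eq[:m, :n] = A                                            # A p
    A_eq[:m, n] = -1.0                                          # - e alpha
    A_eq[:m, n + 1:n + 1 + m] = -np.eye(m)                      # - u
    A_eq[:m, n + 1 + m:] = np.eye(m)                            # + v        (26)
    A_eq[m, :n] = q                                             # q'p = 1    (27)
    b_eq = np.zeros(m + 1)
    b_eq[m] = 1.0

    bounds = [(None, None)] * (n + 1) + [(0, None)] * (2 * m)   # p, alpha free; u, v >= 0  (28)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:n], res.x[n], res.fun
```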
Note that if q in problem (P4) could be chosen a priori as the optimal solution p* of problem (P3), then the two problems are equivalent. Thus, the basic strategy for solving problem (P) is to solve problem (P4) repeatedly and continually update the vector q. An initial vector q is determined by computing an OLS equation, q'x = β, in which an arbitrary x_j is chosen as the dependent variable. Thus, the algorithm is summarized as follows:

Step 1. Given x_i ∈ E_n, i = 1, ..., m, compute q'x = β using OLS techniques, where an arbitrary x_j is chosen as the dependent variable. (Note that q_j = 1.) Normalize q, i.e. replace q by q/||q||, and go to Step 2.
Step 2. Solve problem (P4) to get the optimal solution p̄, ᾱ, ū, v̄. If ||p̄|| = 1, stop with p̄, ᾱ as the optimal solution to problem (P). Otherwise, go to Step 3.
Step 3. Replace q by q = p̄/||p̄|| and return to Step 2.
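A compact sketch of Steps 1-3, reusing the solve_p4 helper from the previous fragment, is given below; the function name euclidean_regression and the convergence tolerance are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def euclidean_regression(X, j=0, tol=1e-9, max_iter=50):
    """Iterative LP heuristic of Steps 1-3 (uses solve_p4 from the sketch above)."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape

    # Step 1: OLS fit of column j on the remaining columns, written as q'x = beta.
    others = [k for k in range(n) if k != j]
    Z = np.column_stack([np.ones(m), X[:, others]])
    coef, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
    q = np.zeros(n)
    q[j] = 1.0                      # q_j = 1
    q[others] = -coef[1:]           # move the predictors to the left-hand side
    q /= np.linalg.norm(q)          # normalize q

    for _ in range(max_iter):
        p, alpha, _ = solve_p4(X, q)         # Step 2: solve (P4)
        norm_p = np.linalg.norm(p)
        if abs(norm_p - 1.0) < tol:          # ||p|| = 1: stop with (p, alpha)
            break
        q = p / norm_p                       # Step 3: update q and repeat

    p, alpha = p / norm_p, alpha / norm_p    # same hyperplane, unit normal
    return p, alpha, np.abs(X @ p - alpha).sum()
```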
The algorithm was coded in Fortran and run under normal batch load conditions on an IBM 3090-600S series computer. Using randomly generated test problems with up to n = 5 and m = 100, the algorithm required no more than 8 sec of CPU time to converge to the optimal solution, in a maximum of 3 iterations. This CPU time includes constructing the coefficient data as required in problem (P4), as well as generating a starting solution using OLS regression. Also, to help confirm that the algorithm was converging to the global optimal solution, the OLS regression solution was computed using each x_j, j = 1, ..., n, in turn, as the dependent variable. Each of these solutions was used as a starting solution for the algorithm, and in every case, the algorithm converged to the same optimal solution. Thus, the algorithm appears to be insensitive to the initial solution.

Example 1

Let x_1 = (1, 2)', x_2 = (2, 4)', x_3 = (3, 3)', x_4 = (4, 3)', x_5 = (5, 6)', and consider the problem of finding that hyperplane (line), p'x = p_1x_1 + p_2x_2 = α, which minimizes the sum of the normal Euclidean distances. Then problem (P4) becomes
    minimize   Σ_{i=1}^5 (u_i + v_i)    (30)

    subject to
        p_1 + 2p_2 - α - u_1 + v_1 = 0    (31)
        2p_1 + 4p_2 - α - u_2 + v_2 = 0    (32)
        3p_1 + 3p_2 - α - u_3 + v_3 = 0    (33)
        4p_1 + 3p_2 - α - u_4 + v_4 = 0    (34)
        5p_1 + 6p_2 - α - u_5 + v_5 = 0    (35)
        q'p = q_1 p_1 + q_2 p_2 = 1    (36)
        u, v ≥ 0    (37)
        p, α unrestricted.    (38)

Fig. 2. Results of Example 1: Euclidean regression; OLS regression with x_1 dependent; OLS regression with x_2 dependent.
Using OLS regression with x_1 as the dependent variable yields the line x_1 - 0.76x_2 = 0.26, whereas the corresponding line with x_2 as the dependent variable is -0.7x_1 + x_2 = 1.5. Now, using either q = (1, -0.76)' or q = (-0.7, 1)' as an initial vector in (36), the preceding algorithm yields the Euclidean regression line, 0.7071x_1 - 0.7071x_2 = -0.7071. These solutions are contrasted in Fig. 2.
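Using the hypothetical sketches given earlier, Example 1 can be reproduced as follows; the sign of the computed normal may be reversed, which describes the same line.

```python
import numpy as np

# Data of Example 1 (illustrative driver reusing solve_p4 and euclidean_regression).
X = np.array([[1, 2], [2, 4], [3, 3], [4, 3], [5, 6]], dtype=float)

p, alpha, total = euclidean_regression(X, j=0)   # start from OLS with x_1 dependent
print(p, alpha, total)
# Expected, up to a common sign of (p, alpha): p ~ (0.7071, -0.7071),
# alpha ~ -0.7071, with the sum of normal distances ~ 2.8284.
```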
3. THE DUAL PROBLEM
An interesting observation can be made concerning the solution to problem (P4) by examining the dual problem. Let π = (π_1, ..., π_m)' be the dual vector for constraints (26) and let λ be the dual variable for constraint (27). Then the dual problem is

(D4)  maximize   λ    (39)
      subject to   A'π + λq = 0    (40)
      e'π = 0    (41)
      π ≥ -e    (42)
      π ≤ e    (43)
      λ unrestricted.    (44)
Note that this dual problem is one of assigning a weight π_i to each point x_i. In particular, (41)-(43) simply mean that

    Σ_{i=1}^m π_i = 0    (45)

and

    -1 ≤ π_i ≤ 1,  for i = 1, ..., m.    (46)

Table 1. The dual solution and the weighted sums of Example 1

    i     x_i       π_i     x_{i1}π_i    x_{i2}π_i
    1     (1, 2)    -0.5    -0.5         -1.0
    2     (2, 4)    -1.0    -2.0         -4.0
    3     (3, 3)     1.0     3.0          3.0
    4     (4, 3)     1.0     4.0          3.0
    5     (5, 6)    -0.5    -2.5         -3.0
    Sum              0.0     2.0         -2.0
Thus, the dual problem is determining weights π_i ∈ [-1, 1] which sum to zero. Also, by complementary slackness, if p'x_i > α, then the corresponding dual variable π_i = 1, whereas if p'x_i < α, then π_i = -1. Only those points x_i with p'x_i = α can have a dual variable with π_i ∈ (-1, 1). Finally, let x_i = (x_{i1}, ..., x_{in})' and note that (40) can be rewritten as

    Σ_{i=1}^m x_{ij}π_i = -λp_j,  for j = 1, ..., n.    (47)
This asserts that the weighted sums of each component are proportional to the p vector, with proportionality constant -λ. (Note that λ is the optimal objective value, that is, the sum of the normal Euclidean distances.) The results of the dual problem for Example 1 are summarized in Table 1. Observe that conditions (45) and (46) are satisfied, with only the points x_1 and x_5 lying on the resulting hyperplane (see Fig. 2). Also noting that λ = 2.8284, then clearly condition (47) is satisfied since

    (2.0, -2.0) = -2.8284(-0.7071, 0.7071),    (48)

that is,

    (Σ_{i=1}^5 x_{i1}π_i, Σ_{i=1}^5 x_{i2}π_i) = -λ(p_1, p_2).    (49)
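The arithmetic behind (45)-(48) is easy to reproduce; the fragment below simply re-checks the numbers of Table 1 (NumPy assumed; the snippet is illustrative and not part of the paper).

```python
import numpy as np

X = np.array([[1, 2], [2, 4], [3, 3], [4, 3], [5, 6]], dtype=float)
pi = np.array([-0.5, -1.0, 1.0, 1.0, -0.5])          # dual weights from Table 1

print(pi.sum())                                      # condition (45): 0.0
print(np.all(np.abs(pi) <= 1.0))                     # condition (46): True
print(X.T @ pi)                                      # weighted sums: ~(2.0, -2.0)
print(-2.8284 * np.array([-0.7071, 0.7071]))         # right-hand side of (48)

# lambda equals the optimal primal objective, i.e. the sum of the normal distances:
p, alpha = np.array([0.7071, -0.7071]), -0.7071
print(np.abs(X @ p - alpha).sum() / np.linalg.norm(p))   # ~2.8284
```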
4. CONCLUSION
In this paper, the multiple linear Euclidean regression problem was investigated. This is a technique for fitting a hyperplane to a set of data points in which a dependent relationship does not exist among the variables. The method minimizes the sum of the normal Euclidean distances from each data point to the hyperplane. Although the problem is a nonlinear nonconvex programming problem, an efficient solution procedure was developed which utilizes a sequence of linear programming problems. Since the Euclidean model is the analog of MSAE regression, it is conjectured that it will afford the same advantages. The development of the solution procedure herein enables the evaluation of this supposition, and ultimately, the use of Euclidean regression as an alternative to other comparable methods.

Acknowledgements-The authors wish to acknowledge Dr Steven F. Arnold and Michael G. Akritas, both of whom are Associate Professors of Statistics at the Pennsylvania State University, for their constructive comments and insights.

REFERENCES
1. J. W. Tukey, Instead of Gauss-Markov least-squares, what? In Applied Statistics (Edited by R. P. Gupta), pp. 351-372. North-Holland, Amsterdam (1975).
2. M. G. Akritas, G. J. Babu, E. D. Feigelson and T. Isobe, Linear regression in astronomy. I. Astrophys. J. 364, 104-113 (1990).
3. R. L. Branham Jr, Alternatives to least-squares. Astr. J. 87, 928-937 (1982).
4. R. L. Branham Jr, The FK4 equinox from meridian observations of minor planets. Astr. J. 84, 1399-1401 (1979).
5. P. D. Gingerich and B. H. Smith, Allometric scaling in the dentition of primates and insectivores. In Size and Scaling in Primate Biology (Edited by W. L. Jungers), pp. 257-272. Plenum Press, New York (1985).
6. R. R. Krug, W. G. Hunter and R. A. Greiger-Block, Enthalpy-entropy compensation: an example of the misuse of least squares and correlation analysis. In Chemometrics: Theory and Application (Edited by B. R. Kowalski), pp. 192-218. American Chemical Society, Washington (1977).
7. T. A. Jones, Fitting straight lines when both variables are subject to error. Math. Geol. 11, 1-25 (1979).
8. T. S. R. Murthy and S. Z. Abdin, Minimum zone evaluation of surfaces. Int. J. Mach. Tool Des. Res. 20, 123-136 (1980).
9. B. P. Miller and H. E. Dunn, Orthogonal least-squares line fit with variable scaling. Computers Phys. 2, 59-61 (1988).
10. S. C. Narula, The minimum sum of absolute errors regression. J. Qual. Technol. 19, 37-45 (1987).
11. N. Megiddo and A. Tamir, Finding least-distances lines. SIAM J. Algebr. Disc. Meth. 4, 207-211 (1983).