Artificial Neural Networks: A New Methodology for Industrial Market Segmentation
Kelly E. Fish, James H. Barnes, and Milam W. Aiken

Neural networks are a type of artificial intelligence computing that has generated considerable interest across many disciplines during the past few years. The authors explore the potential of artificial neural networks in assisting industrial marketers faced with a segmentation problem by comparing their classification ability with discriminant analysis and logistic regression. The neural networks achieve higher hit ratios on holdout samples than the other methodologies. A marketer in a business-to-business situation may be able to segment a market more accurately, thereby improving efficiency for sales forces and other promotional activities, by using artificial neural networks.
Address correspondence to Kelly E. Fish, Department of Economics and Business Administration, Rhodes College, 2000 North Parkway, Memphis, TN 38112.
INTRODUCTION

Neural networks have generated considerable interest across such disciplines as psychology, engineering, business, medicine, and computer science. Initial research indicates that they appear well suited to problems of classification, as well as pattern recognition, nonlinear feature detection, and nonlinear forecasting [25]. The purpose of this study is to explore the potential of neural networks as another methodology for market segmentation in an industrial marketing setting: specifically, a situation where each potential customer may be classified into a group for targeting purposes. Our results suggest that neural networks are more accurate in classification than either discriminant analysis or logistic regression and can improve industrial marketers' ability to segment markets.
BACKGROUND

A neural network is a nonlinear type of model that receives its inspiration from the neural architecture of the human brain [25]. A single neuron from the brain has three basic components: dendrites that provide for input from neighboring neurons, a soma for processing, and axons for output to other neurons. Each neuron may be connected to thousands of neighboring neurons via this network of dendrites and axons. This network allows the brain to achieve speed and power through massive parallel processing [7]. Similarly, an artificial neural network contains interconnections analogous to dendrites, processing nodes similar to somas in many regards, and output connections that represent the axons.

The first published attempt to model the brain's neural system in a computational sense came in the 1943 work of McCulloch and Pitts [15]. Their nets appeared to be reliable computational devices, yet they lacked the ability to learn. Subsequently, Hebbian learning theory [8] specifically addressed the question of how the human brain's neurons facilitate learning. Inspired by Hebb's work, Rosenblatt introduced the perceptron in 1958 and demonstrated that this simple, single-layered neuron model, when used in concert with a learning rule, could solve intriguing problems [19]. Due to exaggerated claims and unrealistic expectations, perceptrons stirred controversy during the time of their emergence. The end result was that the field of perceptrons and neural networks was almost completely abandoned by the late sixties. The final blow was apparently the demonstration that the perceptron was not computationally complete, because a single-layered perceptron could not solve a linearly nonseparable problem such as the XOR problem [16]. Shortly thereafter, funding for neural network research ceased. During the years that followed, a few die-hard scientists continued research efforts focusing primarily on multilayered perceptrons (neural networks).
KELLY E. FISH is a Marketing Instructor at Rhodes College, Department of Economics and Business Administration, Memphis, Tennessee. JAMES H. BARNES is a Professor of Marketing and Pharmacy Administration and holder of the Morris Lewis Lectureship at the University of Mississippi. MILAM W. AIKEN is an Assistant Professor of Management and Marketing (MIS) at the School of Business Administration, University of Mississippi.
The source of their inspiration was largely the 1957 work of the Russian mathematician Kolmogorov [12]. The superposition theorem of Kolmogorov, along with its advancements [9, 11, 14, 21], posits that any continuous function can be computed by using linear summations combined with a nonlinear, continuously increasing function of one variable. Computer application inquiry [17, 20, 24] resulted in a methodology (backpropagation) that successfully incorporates the Kolmogorov theorem by advocating a multilayer framework, with the first layer providing a linear summation that is then transformed through a nonlinear, continuously increasing function (usually sigmoidal) in the second layer to produce an output.
CLASSIFICATION METHODOLOGIES

One of the objectives of discriminant analysis is to classify objects into one of two or more mutually exclusive and collectively exhaustive categories by a set of independent metric variables. The dependent variable, group membership, is determined by the discriminant score S_i, which is a linear function of the independent variables:

S_i = b_0 + b_1 X_i1 + ... + b_n X_in        (1)

where X_ij is the ith entity's value of the jth independent variable, b_j is the coefficient of the jth variable, and S_i is the discriminant score for the ith entity. Classification for a two-group problem is determined as follows:

Classify entity i as belonging to Group 1 if S_i > S_cr;
Classify entity i as belonging to Group 2 if S_i < S_cr,

with S_cr as the critical value (or cutting value) for the discriminant score, which is defined for the groups as:

S_cr = (N_1 S_1 + N_2 S_2) / (N_1 + N_2)        (2)

where S_1 and S_2 are the centroids for Groups 1 and 2, respectively, and N_1 and N_2 represent the number of entities in Groups 1 and 2, respectively. There are four basic underlying assumptions involved with discriminant analysis:

1. predictor variables are multivariate normally distributed
2. equal covariance matrices for the groups
3. independence of predictor variables
4. linear functional form

The violation of any assumption may adversely affect the classification results.
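As an illustration (not taken from the original study), the two-group rule in Equations (1) and (2) can be sketched in a few lines of Python; the ratings, group labels, and sample sizes below are invented placeholders.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: 100 firms rated on 7 attributes, with made-up group labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
y = (X[:, [0, 2, 5, 6]].sum(axis=1) > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X).ravel()            # discriminant score S_i for each firm

# Cutting value as in Equation (2): weighted average of the group centroids
s1, s2 = scores[y == 0].mean(), scores[y == 1].mean()
n1, n2 = (y == 0).sum(), (y == 1).sum()
s_cr = (n1 * s1 + n2 * s2) / (n1 + n2)

# Assign each firm to the group whose centroid lies on its side of the cut
pred = np.where(scores > s_cr, int(s2 > s1), int(s2 <= s1))
print(f"cutting value = {s_cr:.3f}, hit ratio = {(pred == y).mean():.2%}")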
Logistic analysis offers a more robust alternative for classification by relaxing assumptions 1 and 2 above. Yet logistic regression assumes independence among predictor variables, as well as a logistic functional form, and violations may negatively affect performance [1, 22]. The logit model may be defined by letting {y_t} (t = 1, 2, ..., n) be an independent series of binary random variables taking values of 1 and 0; then:

Pr(y_t = 1 | x_t) = [1 + exp(β_1 + β_2'x_t)]^(-1)        (3)

where β_1 is a scalar unknown parameter, β_2 is a k-vector of unknown parameters, and x_t is a k-vector of independent variables. In a two-group classification problem:

Classify entity x_t into Group 1 if Pr(y_t = 1 | x_t) > 0.5;
Classify entity x_t into Group 2 if Pr(y_t = 0 | x_t) > 0.5.
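A comparable sketch (again illustrative, with invented data) applies the logit rule in Equation (3): fit the model, then assign each entity to Group 1 whenever the fitted probability exceeds 0.5.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 7))                          # predictor vectors x_t
y = (X @ rng.normal(size=7) + rng.normal(size=100) > 0).astype(int)

logit = LogisticRegression(max_iter=1000).fit(X, y)
p1 = logit.predict_proba(X)[:, 1]                      # Pr(y_t = 1 | x_t)
group = np.where(p1 > 0.5, 1, 2)                       # Group 1 versus Group 2
print(f"hit ratio = {((p1 > 0.5).astype(int) == y).mean():.2%}")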
One of the prime advantages of neural network modeling is that no a priori assumptions are made about the relationship being modeled. Rather, the neural network examines data involving the phenomenon in question and maps an approximation of the relationship (i.e., the underlying function). The most popular method by which the neural network teaches itself about the relationship is backpropagation. In this method the input-output relationship of each node is represented by the following equation:
y = φ( Σ_{i=1}^{k} w_i x_i - θ )        (4)
where the output y is derived from the x_i inputs, the w_i connection weights, the activation threshold θ, and the differentiable transfer function φ. In a standard feedforward network, vectors of data are fed in at each input node (each input node usually represents an independent variable). The data are multiplied by the various connection weights and then summed at each hidden node. When each hidden node reaches a specified activation level, the summed totals are then transformed by the nonlinear transfer function to provide an output, either to another hidden-node layer or to a final output destination (see Figure 1).
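The per-node computation in Equation (4) can be written directly; the weights and threshold below are arbitrary illustrative values, not estimates from the study.

import numpy as np

def node_output(x, w, theta):
    # y = phi(sum_i w_i * x_i - theta), with a sigmoid transfer function phi
    net = np.dot(w, x) - theta
    return 1.0 / (1.0 + np.exp(-net))

x = np.array([0.2, 0.7, 0.1])          # inputs arriving from the previous layer
w = np.array([0.5, -0.3, 0.8])         # connection weights on those inputs
print(node_output(x, w, theta=0.1))    # value passed on to the next layer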
The next phase involves a backward pass through the network, where a gradient descent method adjusts the connection weights in order to optimize some objective function, usually (and in the case of this study) the sum of squared errors (SSE). The final output values computed by the algorithm are compared against the actual output values, with the SSE for any population member being the squared difference between the corresponding output values computed by the algorithm and the actual values. This adjustment of weights is often referred to as the "learning" phase of the process, giving it an artificial intelligence quality. More precisely, each connection weight is changed by an amount proportional to the product of the error signal, δ, available to the node receiving input and the output of the node with which it is connected. This learning is governed by the generalized delta rule:

Δ_p w_ji = η δ_pj o_pi        (5)
where Δ_p w_ji is the change made to the weight connecting the ith node to the jth node following the presentation of pattern p (input vector). The learning rate, η, has a notable effect on network performance. A small value (< 0.2) means that the network will make a large number of iterations, but too large a value (> 0.8) might result in the network missing the actual minimum. The symbol δ_pj represents the error signal of p at the jth node, and o_pi is the output of the ith node given the presentation of p. The task now is to determine what the error signal should be at each node in order to optimize the objective function. This involves a recursive computation that can be implemented by propagating the error signals backward through the network (thus, backpropagation). We begin by computing δ_pj for each output node, which is simply the difference between the actual and desired output multiplied by the derivative of the transfer function:
δ_pj = (t_pj - o_pj) φ'_j(net_pj)        (6)

where φ'_j(net_pj) is the derivative of the transfer function for the jth node, evaluated at the net input (net_pj) to that node, t_pj is the target output for the jth component of p, and o_pj is the jth element of the actual output pattern produced by the presentation of p.
FIGURE 1. Hidden layer node: weighted input connections (x1, ..., xn with weights w1, ..., wn) feed a linear summation and transfer function within the processing node, which produces the output connection y.
This allows the computation of weight changes for all connections that feed into the final (output) layer.
Next, the neural network computes δ's for all hidden nodes, again using a recursive formula in which the sum runs over the nodes k of the succeeding layer:
δ_pj = φ'_j(net_pj) Σ_k δ_pk w_kj        (7)
This allows for weight changes to all hidden nodes, which have no target outcomes. The above represents a gradient descent method for determining the weights in any feedforward network with a differentiable transfer unit [20]. Classification for a two-group situation is:

Classify entity i as belonging to Group 1 if Y_i > 0.5;
Classify entity i as belonging to Group 2 if Y_i < 0.5,

where Y_i is the network output for the ith entity.
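Pulling Equations (4) through (7) together, the following compact sketch trains a one-hidden-layer network by backpropagation and then applies the 0.5 classification rule. It is a simplified stand-in for the commercial package used in the study (no bias nodes, synthetic data, arbitrary architecture), not the authors' code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 7))                       # training patterns
t = (X[:, 0] + X[:, 2] > 0).astype(float)          # target group membership (0/1)

eta = 0.6                                          # learning rate
W1 = rng.normal(scale=0.5, size=(7, 6))            # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(6, 1))            # hidden-to-output weights

for _ in range(2000):                              # training iterations
    for x, tp in zip(X, t):                        # present one pattern p at a time
        h = sigmoid(x @ W1)                        # hidden-node outputs o_pi
        y = sigmoid(h @ W2)                        # network output
        delta_out = (tp - y) * y * (1.0 - y)       # Eq. (6): (t - o) * phi'(net)
        delta_hid = h * (1.0 - h) * (W2.ravel() * delta_out)   # Eq. (7)
        W2 += eta * np.outer(h, delta_out)         # Eq. (5): delta_w = eta * delta * o
        W1 += eta * np.outer(x, delta_hid)

pred = (sigmoid(sigmoid(X @ W1) @ W2).ravel() > 0.5).astype(float)
print(f"training-sample hit ratio = {(pred == t).mean():.2%}")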
EXPERIMENTAL STUDY

Data

We chose previously published data sets for our comparisons. This offers some advantages in that both data sets have already yielded verifiable classification results: one is used in a popular textbook, whereas the other is found in a comparison study with logistic analysis and can also be located in a well-known statistical text [22]. The use of these data sets facilitates replication with neural networks and other methodologies.

The first data set (N = 100) [6] involves a hypothetical industrial supplier known as HATCO. The data are used for both a two-group classification problem (purchasing approach used by a firm: total value analysis versus specification buying) and a three-group classification problem (type of buying situation facing the firm: new task versus modified rebuy versus straight rebuy). Both problems use seven metric predictor variables as potentially important for classification:

X1 Delivery speed: the amount of time it takes to deliver the product once an order has been confirmed
X2 Price level: the perceived level of price charged by product suppliers
X3 Price flexibility: the perceived willingness of HATCO representatives to negotiate price
X4 Manufacturer's image: the overall image of the manufacturer/supplier
X5 Service: the overall level of service in maintaining a satisfactory relationship between the supplier and the purchaser
X6 Sales force's image: the overall image of the manufacturer/supplier's sales force
X7 Product quality: the perceived level of quality of a particular product
The survey is of existing customers, collected for segmentation purposes in a business-to-business situation [6]. The respondents, purchasing managers of firms that buy from HATCO, rated HATCO on each of the variables with a metric bipolar scale.

The second data set (N = 50) is found in a study comparing discriminant analysis with logistic regression [18]. The variables concern demographic data for each state in the United States, with the two-group classification problem involving percent change in population from the 1960 census to the 1970 census (below versus above the median change for all states). Such data might be used in a study to determine potential new markets. The problem includes five predictor variables, three metric and two nonmetric:

X1 Income: per capita income
X2 Births: birth rate
X3 Deaths: death rate
X4 Urban: 0 or 1 as population is less than or greater than 70% urban
X5 Coast: 0 or 1 in absence or presence of coastline

The data were obtained from census records.
Network Configuration

Although any continuous mapping or measurable function can be approximated by a network with one hidden layer, given sufficiently many hidden nodes [5, 10], one may argue for the use of a two-hidden-layer configuration by asserting that with a single hidden layer there is a tendency for the nodes therein to interact globally, making it difficult to improve the approximation at one point without worsening it elsewhere. A configuration with two hidden layers, however, allows the first layer to partition the input space into regions, whereas the second hidden layer can then compute the desired function within each region. With this configuration the effects of the nodes are isolated, and the approximations in different regions can be adjusted independently of one another [3]. We chose such a configuration for our experiment, using NeuroForcaster, a neural network development program from NIBS Limited of Singapore (see Figure 2). The HATCO data set has seven input nodes, whereas the other data set used five. Notice the inclusion of a bias node on each hidden layer in our network. The node provides a fictitious input value of 1 to its connection and participates in the learning process like any other connection weight.
FIGURE 2. Network design for the HATCO data: an input layer, two hidden layers (each with a bias node), and an output layer.
Although its inclusion in the network is optional, we have found that it sometimes aids convergence of the weights to a satisfactory solution. The program was run on a 50-MHz 486 PC; network specifics for each problem are shown in Table 1.
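NeuroForcaster itself is a proprietary package, but the same kind of two-hidden-layer design (six nodes per layer, sigmoid transfer functions, bias terms handled automatically by the library) can be approximated with an off-the-shelf tool. The sketch below uses scikit-learn's MLPClassifier on invented stand-in data and should not be read as a reproduction of the original software or results.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 7))        # stand-in for the seven HATCO ratings
y = (X[:, 0] + X[:, 2] > 10).astype(int)     # stand-in purchase-approach groups

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)    # 60% training, 40% holdout

net = MLPClassifier(hidden_layer_sizes=(6, 6), activation="logistic",
                    max_iter=5000, random_state=0)
net.fit(X_train, y_train)
print(f"holdout hit ratio = {net.score(X_test, y_test):.2%}")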
RESULTS

For the HATCO data, all methodologies trained on a randomly selected 60% of the data, using the remaining 40% as a validation sample.
The holdout sample hit ratios are shown in Table 2 for both the two-group and three-group problems. The discriminant analysis results are those reported originally [6], with X1, X3, X6, and X7 included in the discriminant function for the two-group problem and X1, X2, X3, and X7 included for the three-group problem. The logistic regression results were achieved using BMDP's Dynamic, release 7.0, LR and PR programs, respectively. In stepwise logistic regression X1, X2, X3, X6, and X7 entered the equation for the two-group problem, and X1 and X2 entered for the three-group problem.
TABLE 1. Neural Network Configuration

Classification Problem     Hidden Layers      Training Iterations   Error (%)   Learning Rate   Transfer Function
HATCO two-group            2; 6 nodes each    44,500                4.5         0.6             Sigmoid
HATCO three-group          2; 6 nodes each    104,000               4.29        0.6             Sigmoid
Census data two-group      2; 6 nodes each    95,800*               16.24*      0.6             Sigmoid

* Average over five runs.

TABLE 2. Holdout Sample Hit Ratios

Classification Problem           Discriminant Analysis (%)   Logistic Regression (%)   Neural Network (%)
HATCO two-group                  92.5                        92.5                      97.5
HATCO three-group                67.5                        50                        80
Census data two-group (means)    68                          72                        74
The neural network used data input from all seven variables but ignored data that did not contribute to convergence by giving their connections very low weightings.

With the census data, we used a type of jackknifing designed for smaller data sets [13]. The three methodologies trained on 40 observations (states) with a holdout sample of ten states. This procedure was repeated on five separate occasions to achieve predictions for each of the states. Again, the results are shown in Table 2, with both the discriminant analysis and logistic regression results taken from the original study [18]. All five variables were included for all methodologies.
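The rotation used for the census data amounts to training on 40 states and scoring the 10 held-out states, five times, so that every state is classified exactly once. A small sketch of that procedure (with placeholder data, and logistic regression standing in for any of the three methods) looks like this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))                 # stand-ins for income, births, deaths, urban, coast
y = (X[:, 0] + X[:, 3] > 0).astype(int)      # stand-in for above/below median growth

hits = 0
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    hits += (model.predict(X[test_idx]) == y[test_idx]).sum()

print(f"pooled holdout hit ratio = {hits / len(y):.2%}")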
CONCLUSIONS AND MANAGERIAL IMPLICATIONS

The results involving discriminant analysis and logistic regression are somewhat predictable. The first data set includes all metric independent variables; therefore, the assumption of multivariate normality may be tenable, and discriminant analysis outperforms logistic regression. However, when nonmetric independent variables are introduced in the second data set, the same assumption cannot hold. Additionally, the original study states that none of the metric variables are normally distributed either.
It is not surprising, then, that logistic regression offers better results than discriminant analysis under those circumstances.

The results of the neural network are encouraging for several reasons. First, this methodology achieved higher hit ratios than the others on all classification problems. Because hit rates on holdout samples are regarded as a good check on the external validity of the classification function [22], we feel that neural networks can improve industrial marketers' ability to correctly segment their markets. More accurate classification of accounts can improve sales force effectiveness, as well as other promotional activities. Second, these results were achieved with relatively small samples. Some researchers feel that methods that make no rigid assumptions about the functional form require large databases, such as scanner panels, to be accurate [1, 2]. Yet our neural networks made no such assumptions and did not require a large database to make accurate predictions. Thus, firms throughout the channel, with large or small customer bases, may take advantage of the classification accuracy that neural networks possess.

It is also worth noting that although backpropagation is the most popular neural network modeling technique, it may not be the most accurate. Because the gradient search process proceeds in a point-to-point fashion and is intrinsically local in scope, convergence to a local rather than a global error minimum is a distinct possibility [9, 23]. Neural network researchers are now developing genetic algorithms to address this problem. A genetic algorithm's search of the error surface sweeps from one population of points to another.
By searching the parameter space in many directions simultaneously, the probability of convergence to a local optimum is greatly reduced.

Neural networks, or their perceptron forerunners, are computing processes that were introduced over 30 years ago. Their acceptance was hindered in part by the inability of the computers of that day to handle their intensive computational demands. However, the development of high-speed personal computers has rekindled interest in them during the past decade. We feel that they do indeed possess the potential to aid industrial marketers in many of the issues that they face, including market segmentation.
REFERENCES

1. Abe, Makoto, A Moving Ellipsoid Method for Nonparametric Regression and Its Application to Logit Diagnostics with Scanner Data, Journal of Marketing Research 28, 339-349 (1991).
2. Bult, Jan Roelf, Semiparametric Versus Parametric Classification Models: An Application to Direct Marketing, Journal of Marketing Research 30, 380-390 (1993).
3. Chester, Daniel L., Why Two Hidden Layers Are Better Than One, IEEE International Conference on Neural Networks I, 265-268 (1990).
4. Freeman, James A., and Skapura, David M., Neural Networks: Algorithms, Applications, and Programming Techniques. Addison-Wesley, Reading, MA, 1992.
5. Funahashi, Ken-ichi, On the Approximation of Continuous Mappings by Neural Networks, Neural Networks 2, 183-192 (1989).
6. Hair, Joseph F. Jr., Anderson, Rolph E., Tatham, Ronald L., and Black, William C., Multivariate Data Analysis with Readings, 3rd ed. Macmillan, New York, 1992.
7. Hawley, Delvin D., Johnson, John D., and Raina, Dijjotam, Artificial Neural Systems: A New Tool for Financial Decision-Making, Financial Analysts Journal, November-December, 63-72 (1990).
8. Hebb, Donald O., The Organization of Behavior. Wiley and Sons, Inc., New York, 1949.
9. Hecht-Nielsen, Robert, Kolmogorov's Mapping Neural Network Existence Theorem, IEEE First International Conference on Neural Networks III, 11-14 (1987).
10. Hornik, Kurt, Stinchcombe, Maxwell, and White, Halbert, Multilayer Feedforward Networks Are Universal Approximators, Neural Networks 2, 359-366 (1989).
11. Irie, Bunpei, and Miyake, Sei, Capabilities of Three-Layer Perceptrons, IEEE Second International Conference on Neural Networks I, 641-648 (1988).
12. Kolmogorov, Andrei N., On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition, Doklady Akademii Nauk USSR 114, 953-956 (1957).
13. Lachenbruch, Peter A., An Almost Unbiased Method of Obtaining Confidence Intervals for the Probability of Misclassification in Discriminant Analysis, Biometrics 23, 639-645 (1967).
14. Lorentz, George G., The 13th Problem of Hilbert, Proceedings of Symposia in Pure Mathematics, American Mathematical Society 28, 419-430 (1976).
15. McCulloch, Warren S., and Pitts, Walter, A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics 5, 115-133 (1943).
16. Minsky, Marvin, and Papert, Seymour, Perceptrons. MIT Press, Cambridge, 1969.
17. Parker, D., Learning Logic, Technical Report TR-87, Center for Computational Research in Economics and Management Science, MIT, Cambridge, 1985.
18. Press, S. James, and Wilson, Sandra, Choosing Between Logistic Regression and Discriminant Analysis, Journal of the American Statistical Association 73, 699-705 (1978).
19. Rosenblatt, Frank, The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psychological Review 65, 386-408 (1958).
20. Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J., Learning Internal Representations by Error Propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., MIT Press, Cambridge, 318-362, 1986.
21. Sprecher, David A., On the Structure of Continuous Functions of Several Variables, Transactions of the American Mathematical Society 115, 340-355 (1965).
22. Stevens, James, Applied Multivariate Statistics for the Social Sciences, 2nd ed. Lawrence Erlbaum Associates, Inc., Hillsdale, NJ, 1992.
23. Wasserman, Philip D., Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.
24. Werbos, Paul J., Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Dissertation, Harvard University, 1974.
25. White, Halbert, Some Asymptotic Results for Learning in Single Hidden-Layer Feedforward Network Models, Journal of the American Statistical Association 84, 1003-1013 (1989).