Computers Ops Res. Vol. 21, No. 3, pp. 249-263, 1994. Copyright © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0305-0548/94 $6.00 + 0.00
PREDICTING GRADUATE STUDENT SUCCESS: A COMPARISON OF NEURAL NETWORKS AND TRADITIONAL TECHNIQUES

BILL C. HARDGRAVE,† RICK L. WILSON‡§ and KENT A. WALSTROM¶

†Computer Information Systems and Quantitative Analysis, College of Business Administration, University of Arkansas, Fayetteville, AR 72701, ‡Department of Management, College of Business Administration, Oklahoma State University, Stillwater, OK 74078 and ¶Department of Office and Information Systems, College of Business Administration, Central Michigan University, Mount Pleasant, MI 48859, U.S.A.

†B. C. Hardgrave is an Assistant Professor in the Computer Information Systems and Quantitative Analysis Department at the University of Arkansas. He has published several articles in the areas of information systems and operations research. His current research interests include prototyping information systems, object-oriented technology, and neural network applications.
‡R. L. Wilson is currently an Assistant Professor of Management Science and Information Systems at Oklahoma State University. He received his Ph.D. in MIS from the University of Nebraska-Lincoln. Dr Wilson has published in journals such as Decision Support Systems, Information and Management, International Journal of Production Research, and International Journal of Production and Operations Management, among others. His current research interests include neural networks, decision support systems, integrated management science applications, and the use of database technology.
§Author for correspondence.
¶K. A. Walstrom is an Assistant Professor in the Department of Office and Information Systems at Central Michigan University. He has published in the areas of information systems and operations research. His current research interests include executive information systems, strategic information system planning, and neural network applications.
(Received September 1992; accepted February 1993)

Scope and Purpose: As the number of applicants to graduate programs increases, the importance of adequately evaluating each applicant also increases. Specifically, an indication of the applicant's potential success in the graduate program is a key element in the acceptance decision. This paper evaluates the ability of five different models (least squares regression, stepwise regression, discriminant analysis, logistic regression, and neural networks) to predict the academic success of graduate students. The last of these, neural networks, has not previously been examined as a prediction device. Neural networks have proven effective in situations where both discriminant analysis and regression have previously been used, especially when the data violates restrictive assumptions of statistical models (e.g. normally distributed data). The prediction of academic success appears to be a good test of a neural network's ability to outperform incumbent statistical models.

Abstract: The decision to accept a student into a graduate program is a difficult one. The admission decision is based upon many factors which are used to predict the success of the applicant. Regression analysis has typically been used to develop a prediction mechanism. However, as is shown in this paper, these models are not particularly effective in predicting success or failure. Therefore, this paper explores other methods of prediction, including the biologically inspired, non-parametric statistical approach of neural networks, in terms of their ability to predict academic success in an MBA program. This study found that (1) past studies may have been addressing the decision problem incorrectly, (2) predicting success and failure of graduate students is difficult given the easily obtained quantitative data describing the subjects that is typically used for such a purpose, and (3) non-parametric procedures such as neural networks perform at least as well as traditional methods and are worthy of further investigation.
1. INTRODUCTION
In the 2 yr period from 1989 to 1991, applicants to graduate schools increased 10-15%. This trend is expected to continue in the near future [1]. Due to the recent recession, many new college graduates are entering graduate school instead of the job market as a way to get ahead of the competition and wait until the market improves [1].

The decision on whether to admit a student into a graduate program is important for both the student and the institution. For the student, the decision determines whether he/she has an opportunity for advanced education and, potentially, a brighter future. For the institution, quality students
may impact a school's reputation; admitting poor-performing students could have an adverse effect [2]. However, the admission decision process is difficult. The qualifications of a student, which include both qualitative and quantitative data, are typically examined in an attempt to predict the academic success of the student. A prediction of academic success is normally used as an indicator of admission potential. Various factors, including undergraduate grade point average (GPA), GMAT scores, letters of reference, and work experience, have been used as potential predictors of academic success [3]. The most prevalently cited statistical procedures for this prediction problem have been standard least squares and stepwise regression. Discriminant analysis has also been used. However, predictors of academic success in this setting have not performed at a desired level of accuracy.

Several explanations have been offered for the poor performance exhibited by statistical procedures. First, there is restriction of range in both undergraduate GPA as an independent variable and graduate GPA as a dependent variable. Theoretically, GPA ranges from 0 to 4, but in reality the majority of undergraduate GPAs tend to fall between 2.00 and 3.50, and graduate GPAs fall between 3.00 and 4.00, creating an extremely skewed distribution [4]. Second, since studies in graduate GPA prediction are limited to those students who were accepted, enrolled, and received grades, a biased sample is created [3]. Third, applicants with low GMAT scores, low undergraduate GPAs, and the like, would not be likely to apply for admission, which would further bias the sample [4].

Neural networks represent a technology based on mimicking the human brain's process of learning. In essence, they can be thought of as biologically inspired, non-parametric statistical tools. These systems learn by example and dynamically modify themselves to fit the data presented. Neural networks have proven effective in situations where both discriminant analysis and regression have previously been used, in spite of inherent data problems. Thus, they represent a comparative alternative to the incumbent techniques.

It is the purpose of this paper to compare the effectiveness of various statistical procedures and neural networks in predicting the academic success of entering students in a graduate program. The application setting used in this article is predicting the success of student applicants in an MBA program. The paper is organized as follows. Section 2 discusses previous attempts at predicting academic success; neural networks are also discussed in this section. Section 3 presents the methods used in this study. Section 4 examines the results of both the basic experimental study and an auxiliary study undertaken to provide further insight into the performance of the comparative methods. Section 5 identifies implications of this study for the researcher and decision maker. Section 6 concludes the paper.
2. BACKGROUND
2.1. Application problem

While most graduate programs are growing, graduate schools of business have witnessed the largest growth. In 1990, some 700 American business schools awarded approx. 75,000 MBAs; this was equivalent to a quarter of all master's degrees awarded. Considering that business schools turn away an average of three applicants for every one they accept, business schools may have evaluated around 300,000 applications before deciding to accept those 75,000 [5]. The primary criterion for admission is generally acknowledged to be the combination of factors which best predicts academic success in the program. Academic success is measured as the first-year-average (FYA) GPA, completed GPA, or some combination thereof. Some of the factors used as predictors include GMAT scores, undergraduate GPA, undergraduate major, GPA in major, essays, letters of reference, interviews, and work experience [3]. Traditionally, researchers interested in predicting academic success have applied statistical procedures such as least squares regression, stepwise regression, and discriminant analysis. However, these models have not been effective. Correlation coefficients between combinations of predictors and actual performance are often in the range of 0.2-0.4 [3].
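One reason reported correlations stay this low is the restriction of range described in the Introduction. The short Python simulation below (our illustration, not part of the original study; the latent-aptitude model and all numeric values are invented) shows how truncating a sample to admitted students attenuates the observed correlation between undergraduate and graduate GPA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented latent "aptitude" trait driving both undergraduate (UGPA)
# and graduate (GGPA) grade point averages.
n = 10_000
aptitude = rng.normal(0.0, 1.0, n)
ugpa = np.clip(2.8 + 0.5 * aptitude + rng.normal(0.0, 0.4, n), 0.0, 4.0)
ggpa = np.clip(3.3 + 0.3 * aptitude + rng.normal(0.0, 0.3, n), 0.0, 4.0)

def corr(x, y):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

# Full applicant pool vs the range-restricted "admitted and graded" sample.
admitted = (ugpa > 3.0) & (ggpa > 3.0)
print("full pool: r =", round(corr(ugpa, ggpa), 2))
print("admitted:  r =", round(corr(ugpa[admitted], ggpa[admitted]), 2))
```

Squaring the two correlations shows how an apparently reasonable relationship in the full pool can shrink to an R² well under 0.2 once only admitted students are observed.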
2.2. Statistical procedures

2.2.1. Least squares regression analysis studies. Least squares multiple regression is well suited for determining the predictive power of several independent variables on a dependent variable [6]. This procedure is widely used for predicting GPA in MBA programs. However, least squares regression requires certain assumptions about the data to hold before the technique is considered valid. The most restrictive is the assumption of normally distributed data. A violation of this assumption may severely limit the ability of this model. Most studies involving a least squares regression model investigate several combinations of factors in an attempt to find the combination that explains the most variance (i.e. the highest R²). Fisher and Resnick [3] found the best combination of factors to be undergraduate GPA and GMAT total; the resulting R² was 0.08. Deckro and Woundenberg's [7] study concluded that the combination of GMAT total, undergraduate GPA, junior/senior GPA, hours required in program, age, sex, ethnic background, and attendance (full-time/part-time) provided the best predictor of academic success (R² = 0.15). Gayle and Jones [8] reported an R² of 0.17 when using GRE scores (the GMAT was not available at the time of their study), age, and undergraduate GPA.

2.2.2. Stepwise regression analysis studies. Stepwise regression is used for situations in which several potential variables are available, but it is uncertain whether all of them are valuable and necessary [9]. Stepwise regression is bound by the same restrictive assumptions as least squares multiple regression. Studies using stepwise regression to predict MBA GPA have been concerned with identifying the significant predictor variables. Remus and Wong [2] found GMAT total, GMAT verbal, place of student's residence and undergraduate education, and assistantship status to be significant predictors. Together, these factors explained 16% of the variance in GPA (i.e. R² = 0.16). Of eleven variables tested in Paolillo's [10] study, only three proved to be significant: GMAT total, junior/senior GPA, and attendance (full-time/part-time). The combination of these factors provided an R² of 0.19. Baird's [11] study identified undergraduate GPA, self-confidence, and awards received as significant predictive factors (R² = 0.17). Sobol [12] found only two factors, undergraduate GPA and GMAT total, to be significant contributors; the resulting R² value was 0.19. Deckro and Woundenberg [7] found GMAT total, sex, undergraduate GPA, junior/senior GPA, attendance status, and hours required in program to be significant predictors. Together, the factors explained 15% of the variance in GPA. Graham [13] found the only significant factor to be GMAT total; the R² was 0.17.

2.2.3. Discriminant analysis studies. Discriminant analysis (DA) analyzes the differences between groups and provides a way to classify any case into the group which it most closely resembles. Discriminant analysis formally requires a number of restrictive assumptions to be met, the most significant being that the data be multivariate normally distributed [9]. Nonetheless, for problems involving categorization, this is a widely applied statistical method. Categorization techniques such as DA would appear to be more appropriate than regression analysis for predicting MBA success, since this method could place an MBA applicant into a category, such as successful/not successful, rather than trying to predict a specific GPA.
However, only one study has used discriminant analysis in this capacity. Remus and Wong [2] used GMAT total, assistantship status, GMAT verbal, and previous admission status to build a DA model to predict whether a student would earn less than a 3.00 GPA (failure) or greater than a 3.00 GPA (success). Their model accurately predicted 64% of the cases.

2.2.4. Logistic regression. Although previous studies did not reveal any use of logistic regression models in predicting GPA, it does seem an appropriate technique. Much like DA, logistic regression is used to classify data into groups based upon characteristics of the data. However, logistic regression does not require strict normality of the data as DA does, and has other desirable characteristics that, some researchers postulate, make it a better categorization tool than DA (e.g. see [14-16]). Like DA, logistic regression is applicable to group classification problems with more than just two groups. Details of logistic regression can be found in [15, 16].
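As a concrete (and purely illustrative) rendering of the two categorization approaches just described, the sketch below fits linear discriminant analysis and multinomial logistic regression to invented three-group data using scikit-learn; neither the data nor the library reflects the tools actually used in this study.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Invented three-group data standing in for high-risk, questionable-risk
# and no-risk applicants described by two numeric predictors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(center, 1.0, (50, 2))
               for center in (-1.5, 0.0, 1.5)])
y = np.repeat([0, 1, 2], 50)

for name, model in [("discriminant analysis", LinearDiscriminantAnalysis()),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    model.fit(X, y)
    # Each case is assigned to the group with the largest posterior
    # probability, mirroring the classification rule used later (Section 3.5).
    accuracy = (model.predict(X) == y).mean()
    print(f"{name}: training accuracy = {accuracy:.2f}")
```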
2.3. Neural networks

Fig. 1. Multi-layered perceptron (input nodes, hidden nodes, output node).
There are numerous types of neural network paradigms that have been proposed for distinct problem domains (e.g. see [17-19]). An appropriate neural model that has previously been used for forecasting, prediction, and general decision making is the multi-layered, feed-forward perceptron model. Multi-layered networks have continuously valued neurons or processing elements, are trained in a supervised manner, and consist of one or more layers of nodes (called hidden layers) between the input and output nodes [17]. A typical multi-layered, feed-forward perceptron model is shown in Fig. 1. Input nodes are where information is presented to the network, output node(s) provide the decision made by the neural network, and the hidden nodes, in essence, contain the information regarding the proper mapping of inputs to proper decisions (outputs).

For supervised learning, the back-propagation algorithm has become the de facto standard [20]. Back-propagation is an iterative gradient-descent algorithm designed to minimize the mean squared error between the actual output of a node and the desired output as specified in the training set [17]. Weight adjustment starts at the output nodes, where the error measure is readily available, and then proceeds by propagating the error measure back through the layers toward the input nodes [18]. More detailed information regarding neural networks can be found in summary papers such as [18, 19, 21].

Several successful applications of neural networks have appeared in the literature, including: (1) financial market applications, such as bankruptcy prediction [22], bank failure prediction [23], and stock market prediction [24]; (2) manufacturing problems, such as quality control [25] and machine diagnosis [26]; and (3) multi-criteria decision making [19]. These are just a few of the many applications in which neural networks have been applied. In general, neural networks can work fairly well wherever traditional statistical procedures, such as regression analysis and discriminant analysis, have been used. Neural networks are not subject to the same assumptions as statistical models, thus relieving the model builder from testing assumptions that are critical to the validity of traditional statistical techniques [27].

When should a neural network application be considered? Some criteria for making this decision include: (1) the application is data intensive [26, 28]; (2) the data contains complex relationships between many factors [25]; (3) the data is 'noisy' [26, 28]; (4) underlying distributions are unknown [26]; and (5) other technologies are not adequate [28]. Thus, on the basis of these factors, predicting GPA appears to be an ideal neural network application.
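To make the back-propagation description above concrete, here is a minimal self-contained Python sketch of a one-hidden-layer feed-forward perceptron trained by gradient descent on mean squared error. The eight-input, ten-hidden-node shape anticipates the configuration of Section 3.4, but the data, learning rate and epoch count are invented, and the code is not the BrainMaker implementation used in the study.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden, n_out = 8, 10, 1              # input, hidden and output nodes

W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))   # input-to-hidden weights
W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))  # hidden-to-output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Invented training data: 100 cases of 8 features, with a binary target.
X = rng.normal(0.0, 1.0, (100, n_in))
t = (X[:, 0] + X[:, 1] > 0).astype(float)[:, None]

lr = 0.5
for _ in range(2000):
    h = sigmoid(X @ W1)        # forward pass: hidden layer activations
    y = sigmoid(h @ W2)        # forward pass: output node activation
    err = y - t                # error at the output node
    # Back-propagate: output-layer delta first, then hidden-layer delta,
    # adjusting weights from the output back toward the inputs.
    d_out = err * y * (1.0 - y)
    d_hid = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= lr * h.T @ d_out / len(X)
    W1 -= lr * X.T @ d_hid / len(X)

print("training accuracy:", ((y > 0.5) == (t > 0.5)).mean())
```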
3. METHOD

3.1. Sample

Two samples of entering MBA students from a major southwestern university were used in this
study. The first sample consisted of 156 students entering the program for the 1986-1987 school year. The second sample contained 141 entering MBA students for the 1989-1990 school year. These samples represent the same information used by the Graduate Management Admission Council in its two most recent validity studies [29, 30]. Since this experiment is designed to test the predictive effectiveness of various models, two data sets are required: a training set and a testing (i.e. cross-validation) set. A training set is used to generate the prediction models, while the testing set is used to assess the predictive accuracy of the models. In this study, the 1986-1987 data set was used as the training set (i.e. to build the prediction models), while the 1989-1990 data set was used as the testing set.

3.2. Independent variables

The literature review provided a broad list of factors that have been used as predictors of academic performance. From this list, the variables commonly used by GMAC for their validity study have been included in this study [29, 30]. These variables include: GMAT total, GMAT verbal, GMAT quantitative, undergraduate GPA, sex, attendance (full-time/part-time), and work experience. These variables have all been extensively used in prior studies, as indicated in Section 2.2. Another variable, age, was added to this list because of its inclusion in several previous studies.

3.3. Dependent variable

In this study, the dependent variable is the academic success of the student. A surrogate representation of academic success is the FYA GPA. As shown in the literature review, most studies have used GPA as a continuous variable. However, the decision maker is interested in predicting the likelihood of student success, not a continuous, numerical GPA. Consultation with the key decision makers at the university where this study took place indicated interest in identifying three general categories of individuals: (1) high-risk (low likelihood of academic success); (2) questionable-risk (borderline individuals); and (3) no-risk. This information would be useful for two purposes: (1) the identification of those individuals in the high-risk category, which would affect the admit/no-admit decision, and (2) the identification of those individuals of questionable risk who may need more attention and counseling as they embark on their graduate career. Therefore, treating FYA GPA as a categorical variable seems most appropriate from a decision maker's perspective. In this study, for comparative reasons, we approach the problem from two different perspectives, using first-year GPA as both a continuous and a categorical variable. The three categories used to describe FYA GPA are: (1) <3.00; (2) 3.00-3.30; and (3) >3.30. While these choices for category delineation are somewhat arbitrary, for the particular decision setting in our study they represent the high-risk (<3.00), questionable-risk (3.00-3.30), and no-risk (>3.30) categories mentioned above. The choice of cut-offs will obviously depend on the specific application. Table 1 identifies the composition of the training and testing sets by category.

3.4. Implementation of models

All of the different approaches to this prediction problem were implemented via personal computer based tools. SYSTAT [31] and Quattro Pro were used to develop the regression based predictive models (both ordinary least squares and stepwise elimination). A value of 0.15 for alpha was used for inclusion and removal decisions during the stepwise analysis (a sketch of this selection rule follows).
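The stepwise rule just mentioned can be pictured as a simple forward selection on coefficient p-values. The fragment below is an illustrative reconstruction (not the SYSTAT/Quattro Pro procedure actually used; the data and column names are invented) that admits the predictor with the smallest p-value so long as it falls below alpha = 0.15. A full stepwise procedure would also re-test already-included variables for removal.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Invented applicant data with stand-in predictor names.
rng = np.random.default_rng(3)
n = 150
df = pd.DataFrame({
    "gmat_total": rng.normal(520, 70, n),
    "ugpa": rng.normal(3.0, 0.4, n),
    "age": rng.normal(25, 3, n),
})
df["fya_gpa"] = 2.0 + 0.001 * df.gmat_total + 0.2 * df.ugpa + rng.normal(0, 0.35, n)

def forward_stepwise(y, candidates, data, alpha=0.15):
    """Greedy forward selection: add the best predictor while p < alpha."""
    chosen = []
    candidates = list(candidates)
    while candidates:
        pvals = {}
        for c in candidates:
            fit = sm.OLS(y, sm.add_constant(data[chosen + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        chosen.append(best)
        candidates.remove(best)
    return chosen

print(forward_stepwise(df.fya_gpa, ["gmat_total", "ugpa", "age"], df))
```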
SYSTAT was also used to implement the discriminant analysis and logistic regression models
Table 1. Composition of training and testing sets by category

Category                        Training set (1986-1987)   Testing set (1989-1990)
High-risk (<3.00)                          17                         15
Questionable-risk (3.00-3.30)              47                         51
No-risk (>3.30)                            92                         75
Total                                     156                        141
in this study. SYSTAT fits the standard multivariate general linear model and uses information regarding the prior probabilities of the training set in determining the discriminating functions. In developing the polytomous multinomial logistic regression models, SYSTAT uses standard non-linear approximation techniques (Newton-Raphson) in determining maximum likelihood estimates for the required model parameters.

BRAINMAKER is an implementation of a multi-layer perceptron neural network which utilizes back-propagation in training. Each network generated had eight input nodes, one representing each independent variable, one hidden layer of ten nodes, and either one or two output nodes. The choice of network structure is based upon the existing heuristic design guidance provided by studies such as [28, 32-36]. Suggestions for the appropriate number of nodes in the hidden layer range from one-half the number of input nodes [34] to two times the number of input nodes plus one [36]. For this particular study, ten hidden nodes were chosen since this number is approximately halfway between the two aforementioned design 'suggestions'. The network was implemented with one hidden layer, primarily to keep the configuration as simple as possible. The use of one hidden layer is also consistent with most heuristics [28, 36]. Obviously, one could add multiple hidden layers. Because the purpose of our study is to compare the predictive performance of different techniques, we opted not to experiment with the hidden layer structure of our neural networks.

Another consideration was the number of output nodes to use in the problem. The results of two different configurations, one and two output nodes, are represented in this study. For the one output node case, depending upon whether the network was used to predict a continuous or categorical FYA GPA, the output represented either the predicted GPA or was used to differentiate between the three categories. Thus, a value of 0 was indicative of high-risk students, a value of 0.5 indicative of students of questionable risk, and a value of 1 represented students of no-risk. With two output nodes, specific node values identified high-risk (HR) students (HR = 1, NR = 0) and no-risk (NR) students (HR = 0, NR = 1). Node output of HR = 0.5 and NR = 0.5 was used to indicate students of questionable risk. Because the questionable-risk category represented cases which were between the two extreme categories of high-risk and no-risk, it was felt that these two configurations (one and two output nodes) best represented the situation. One could also argue for using three output nodes (one for each category). Such a network was also evaluated, with results comparable to the one and two output node networks. For parsimony, only the one and two output node network results will be discussed further. Since there are no established design guidelines for optimal neural network structure, there may be better configurations for this problem. As such, the predictive results of the neural networks in this study provide a lower bound on their potential performance.

In this particular implementation of back-propagation, a training tolerance is one significant parameter which must be specified. Output node values are examined by the training procedure as training cases (and the desired classifications) are presented to the network. For instance, in predicting with a single categorical output node, consider a high-risk training case that had the output node valued at 0.85.
When training a neural network, a certain amount of variation away from the desired values (indicating the correct category) is typically allowed at the output layer when determining whether weight adjustment to the network should occur via back-propagation. Such variation is referred to as the training tolerance. Thus a training tolerance of 0.2 would allow 20% variation of each output node away from the desired value before network weights would be modified. The previous example would satisfy a training tolerance of 0.2 (i.e. 1 - 0.85 < 0.2) and would not initiate weight correction. This training tolerance parameter is also applicable when training networks with two output nodes. The networks represented in this study were trained with a tolerance level of 0.10 until no further improvements occurred in training set classification. The training tolerance employed is based on heuristic guidelines (e.g. [34]). None of the techniques, including neural networks, classified with 100% accuracy on the training set.
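The tolerance rule can be stated in a few lines. The function below is only our reading of the description above (the BrainMaker internals are not given in the paper): a training case triggers weight adjustment only when some output node strays from its desired value by more than the tolerance.

```python
import numpy as np

def needs_weight_update(outputs, desired, tolerance):
    """True when any output node misses its desired value by more
    than the training tolerance, so back-propagation should fire."""
    outputs, desired = np.asarray(outputs), np.asarray(desired)
    return bool((np.abs(desired - outputs) > tolerance).any())

# The example above: output 0.85 against a desired value of 1.0.
print(needs_weight_update([0.85], [1.0], tolerance=0.20))  # False: within 0.2
print(needs_weight_update([0.85], [1.0], tolerance=0.10))  # True: outside 0.10
```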
3.5. Assessing predictive accuracy: correct predictions
The principal information to be gathered in this study is the determination of the ability of each model to predict 1989-1990 student applicant performance on the basis of models developed using
the 1986-1987 student database. Therefore, the number of students predicted in the proper risk category is the chief measure used in comparing the various approaches. In the first set of results, comparing the regression-based models and the continuous neural network model, a continuous GPA is predicted. Thus, for both approaches, if the predicted GPA falls within the same category as the actual GPA, it is treated as a correct prediction. For the categorical approaches, regression is used to predict the numeric category, while Mahalanobis distances are used in the SYSTAT discriminant analysis to calculate posterior probabilities for each case, indicating the likelihood of group membership. As previously mentioned, prior probabilities were incorporated based upon the composition of the training set. The group with the highest posterior probability is therefore used as the discriminant analysis prediction for that case, and correct or incorrect predictions are determined in this manner. Predictive accuracy for logistic regression is assessed in much the same manner as for discriminant analysis: the group for which the log-likelihood probability is greatest is used as the prediction for that case, and correct and incorrect predictions are determined appropriately.

When evaluating the predictive capability of neural networks in the categorical approaches, a testing threshold, similar to the training tolerance, is specified. This testing threshold identifies how stringent the allowable variation in output nodes can be when predicting group membership. In this study, a testing threshold of 0.249 was used. This value allowed clear delineation between the three different categories, irrespective of the number of output nodes. It is on this basis that correct and incorrect classifications were determined for the neural network models. As previously mentioned, none of the techniques classified with 100% accuracy on the training set. Ultimately, the true measure of any forecasting or prediction model is how well it generalizes to cases it has not previously seen [37]. Thus, the measure of predictive accuracy in our study is concerned only with cross-validation accuracy.

Besides comparing the predictive accuracy between models, it is also relevant to ascertain whether the prediction models provide any value added above pure chance [37, 38]. The underlying concept in comparing a prediction technique to pure chance is to consider what can be done simply by guessing at the predictions. For instance, in the training set used in this study, 10.9% of the cases are high-risk, 30.1% are questionable-risk, and 59.0% are no-risk. If this were representative of the proportions in the testing set, one could achieve 59% accuracy simply by 'predicting' each case to be no-risk! In this study, the concern is not only overall predictive accuracy, but category accuracy as well. Therefore, a similar concept, the proportional chance criterion, is an appropriate measure to determine whether a model predicts significantly better than chance. This criterion says that a pure chance model will 'predict' correct classifications of test cases with the same percentage as the proportion of cases in the training set. Thus, the expected number of correct classifications in the testing sample of category g by chance, e_g, is equal to the number of cases of that category, n_g, multiplied by the proportion of those cases in the training set, b_g.
Letting o_g represent the observed number of correct classifications of category g, one can calculate a normal test statistic [37]:

z_g = (o_g - e_g) / sqrt(n_g b_g (1 - b_g)),    where e_g = n_g b_g,
to determine whether a prediction model significantly differs from pure chance (i.e. does it add any value in the decision process?). For this study, one could expect (by pure chance) 1.635 (10.9%) correct classifications of the high-risk group, 15.35 (30.1%) correct classifications of the questionable-risk group, and 44.25 (59%) correct classifications of the no-risk group (61.235 total, or 43.3% overall). These values will be used as the expected correct predictions by chance in determining the statistical significance of the models under comparison.
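In code, the proportional chance computation is brief. The sketch below reproduces the expected counts quoted above and, as an illustration, evaluates one observed count (the 54 correct no-risk predictions of the two-output network in Table 7) against chance; the z formula follows the reconstruction given above.

```python
import math

def chance_z(observed, n_cases, train_prop):
    """Normal test statistic comparing observed correct classifications
    of a category with the proportional chance expectation."""
    expected = n_cases * train_prop                       # e_g = n_g * b_g
    sd = math.sqrt(n_cases * train_prop * (1 - train_prop))
    return (observed - expected) / sd

# Testing-set sizes and training-set proportions from the text.
groups = [("high-risk", 15, 0.109),
          ("questionable-risk", 51, 0.301),
          ("no-risk", 75, 0.590)]
for name, n_g, b_g in groups:
    print(f"{name}: expected correct by chance = {n_g * b_g:.3f}")

# Example: 54 observed correct no-risk predictions vs 44.25 expected.
print(f"z = {chance_z(54, 75, 0.590):.2f}")
```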
4. RESULTS
This section provides the correct percentage classifications for the different methods employed in this graduate student success prediction problem. As previously stated, the 1986-1987 data was
used to generate the predictive models, while the 1989-1990 data was utilized in deriving the predictive classification percentages.

4.1. Predictive accuracy: continuous models
Since the first regression equation generated produces a continuous GPA value, it was necessary to place the predicted GPA value in its proper GPA category. Table 2 contains the results of applying the regression equation to the testing data, and the subsequent percentage of correct categorizations. Table 3 provides similar information regarding the use of neural networks as a 'regression tool'. As seen in Table 2, the regression equation predicted 74 of the 141 data points (52%) correctly. Of those, the model did not correctly predict any from the high-risk category. While the neural network did much better (see Table 3) at predicting the high-risk category, it did so only at the expense of accurately predicting the other two categories. Thus, it appeared that the neural network continually underpredicted GPA, while the regression model predictions were mostly in the upper two categories.

A second regression analysis was undertaken to determine whether performance would be better if the model was designed to predict a specific category rather than a continuous GPA. To do this, the GPAs from the training data were transformed to 0, 1, and 2, to reflect the high-risk, questionable-risk and no-risk categories, respectively. The resulting regression equation was then used to predict category placement for the testing data. This model produced the same percentage results as shown in Table 2 for the first model. The only difference between the two models is the R² value: the model predicting a specific GPA had an R² of 0.1534, while the non-continuous predictive model had an R² of 0.1075. Finally, a stepwise procedure was undertaken to predict student GPA.

Table 2. Ordinary least squares regression results
Table 3. Neural network (continuous GPA value) results

                        Predicted
Actual            <3.00   3.00-3.30   >3.30   Actual total   % correct
<3.00                10           3       2             15         67%
3.00-3.30            27          17       7             51         33%
>3.30                22          23      30             75         40%
Predicted total      59          43      39            141

Total correct: (10 + 17 + 30) / 141 = 40%
Table 4. Stepwise regression results
Table 5. Discriminant analysis results
This approach includes only strong predictors of GPA. Significant predictor variables included in this regression model were GMAT quantitative and undergraduate GPA. Together, these variables accounted for 13.2% of the variance (i.e. R²). As with the first full regression model, the stepwise model was used to predict a specific GPA, which was then categorized and evaluated. The results are given in Table 4. Not surprisingly, the stepwise regression model produced results similar to those of the full regression model. The overall prediction accuracy was 55%, just slightly better than the full regression model. This model also failed to correctly predict any cases in the high-risk category. Interestingly, excluding some predictor variables marginally improved the overall predictive accuracy.

4.2. Predictive accuracy: categorical models
Table 5 indicates the correct predictive percentages for the discriminant analysis model. The DA model correctly predicted 53% of the GPAs, but failed to accurately predict any cases in the high-risk category. Note how DA appears to be more accurate than the regression model in predicting the no-risk cases (81% compared to 67%), but less accurate in predicting those of questionable risk (27% compared to 53%). The predictive accuracy results of the logistic regression model are shown in Table 6. Overall, the logistic model had a prediction rate of 50%. It, too, was unable to correctly predict any cases in the high-risk category. Its results were comparable to those of discriminant analysis.

4.3. Neural network model accuracy

Table 7 identifies the predictive results of the neural network approach to this problem, for both the one and two output node cases. The two output node case appeared to perform the best, correctly predicting 78 of the 141 cases (55%). For the questionable-risk category, 47% were
Table 6. Logistic regression results

                        Predicted
Actual            <3.00   3.00-3.30   >3.30   Actual total   % correct
<3.00                 0           2      13             15          0%
3.00-3.30             2          14      35             51         27%
>3.30                 5          13      57             75         76%
Predicted total       7          29     105            141

Total correct: (0 + 14 + 57) / 141 = 50%
Table 7. Neural network (categorical) results

(a) One output neuron (only the correct classifications were recoverable: 0 high-risk, 27 questionable-risk, 45 no-risk)

Total correct: (0 + 27 + 45) / 141 = 51%

(b) Two output neurons

                        Predicted
Actual            <3.00   3.00-3.30   >3.30   Actual total   % correct
<3.00                 0           8       7             15          0%
3.00-3.30             0          24      27             51         47%
>3.30                 0          21      54             75         72%
Predicted total       0          53      88            141

Total correct: (0 + 24 + 54) / 141 = 55%
correctly predicted; 72% were correctly predicted from the no-risk category. As with the previous models, the neural network model did not correctly predict any cases in the high-risk category. The performance of the one-output neural network was not quite as good, but still comparable. It performed marginally better at classifying the questionable-risk students, but a little worse with the no-risk students and, ultimately, overall. Table 8 provides an overall summary of the performance of each of the models. Also indicated in Table 8 are those instances where the model provided predictions significantly better than pure chance, both by groups and in aggregate.
Table 8. Comparison of models (percentage correctly predicted by grade point category; significance vs pure chance in parentheses; rows for the categorical neural network models were not recoverable from this copy)

Model                         High risk        Questionable       No risk            Total
                              (<3.00)          (3.00-3.30)        (>3.30)
Least squares regression       0%              49% (p < 0.01)     65%                52% (p < 0.05)
Stepwise regression            0%              53% (p ≤ 0.001)    67%                55% (p < 0.01)
Neural network (continuous)   67% (p < 0.001)  33%                40%                40%
Discriminant analysis          0%              27%                81% (p ≤ 0.001)    53% (p < 0.01)
Logistic regression            0%              27%                76% (p < 0.01)     50% (p ≤ 0.05)
Again, neural networks, as a comparative methodology, performed as well as the other approaches.

4.4. Auxiliary experiment: effect of training set composition
It appears that all approaches suffered from a composition bias, due to the skewed composition of the students in the training set. The neural network model was the only approach that provided predictions significantly different from chance in three categories (questionable, no-risk, and overall). Nonetheless, the neural network still showed signs of being affected by the uneven distribution of categories in the training set. Past research in both the traditional and neural network areas has shown that predictive accuracy tends to improve as one 'smooths' or balances the composition of cases in the training set (e.g. [22, 39]).

To further investigate and gain insight into the behavior of the different models, an auxiliary experiment was undertaken under conditions where the training set comprised an equal number of cases from each category. From the original training data set of 156 cases, ten distinct training sets were randomly created using Monte Carlo resampling techniques (sketched below). Each of the ten sets had 17 cases from each category (51 total). Each of the five comparative methods then used the multiple training sets individually to generate a predictive model. The accuracy of each resultant model was assessed by testing it on the entire original test set.

The average correct classification rates for the 10 different trials, for each of the comparative methods, are shown in Table 9. As before, significant differences from pure chance prediction are highlighted (note that 33.3% predictive accuracy could occur by pure chance given the balanced composition of the training sets). As is seen, the techniques employing categorization approaches (DA, logistic regression and the neural network technique) differ significantly from pure chance predictions, while the regression based approaches do not. Closer inspection of the regression functions generated (as seen through the correct classifications) shows that the regression equations predicted most cases in the middle range. Thus, they obtained their correct classifications by predicting students to be assigned to the middle category. Illustrative of the poor performance of regression in this problem, some of the stepwise regression equations resulting from particular training sets included only a constant (i.e. the model predicted one value for all test cases). This gives further credence to the contention that the regression models provided little insight into predicting student risk.
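The balanced resampling can be sketched as follows; the 17/47/92 pool composition is derived from the training-set proportions reported in Section 3.5 rather than taken from the original data file, and the code is our illustration of the procedure, not the original implementation.

```python
import pandas as pd

# Training pool: 17 high-risk, 47 questionable-risk, 92 no-risk cases
# (156 total, matching the 10.9% / 30.1% / 59.0% proportions).
pool = pd.DataFrame({
    "risk": ["high"] * 17 + ["questionable"] * 47 + ["no"] * 92,
})

# Ten balanced training sets of 17 cases per category (51 cases each).
# With only 17 high-risk cases available, every set reuses all of them.
balanced_sets = []
for trial in range(10):
    parts = [pool[pool.risk == r].sample(17, random_state=trial)
             for r in ("high", "questionable", "no")]
    balanced_sets.append(pd.concat(parts))

print(balanced_sets[0]["risk"].value_counts())
```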
Table 9. Average classifications for 10 different trials (pure chance = 33.3%; cells lost in reproduction are marked "—")

Model                                    High risk   Questionable      No risk   Total
Least squares regression                    10%      76% (p < 0.01)      18%      38%
Stepwise regression                          3%      84% (p < 0.01)      11%      36%
Discriminant analysis                       44%      46% (p < 0.05)      42%       —
Logistic regression                         42%      45% (p < 0.05)      41%       —
Neural network (categorical), 1 output       —            —               —       51%
Neural network (categorical), 2 outputs      —            —               —       50% (p < 0.01)
The results of DA, logistic regression, and the two neural network models were quite similar. Note that, as when using the entire training set, there are differences in performance between the one and two output neural models. Ultimately, these models performed a little better overall, though not significantly, than DA and logistic regression.

5. DISCUSSION

Prediction of graduate student GPA and/or success or failure risk is obviously a difficult process. As seen from the literature review, the highest R² was 0.19 when using regression analysis, and the discriminant analysis study reported earlier classified 64% accurately [2]. However, the study reporting 64% accuracy used only two categories of students. Further, most past studies used the same sample to determine the model's predictive accuracy. Thus, this present study addresses a more difficult problem, both in the number of student risk categories used and in its more rigorous (and useful) evaluation of prediction accuracy. Given this, the results of the case study used in this paper are comparable to those of past studies.

Apparent from the results is that none of the methodologies, other than neural networks used as a continuous predictor model, could accurately predict the high-risk students. However, the neural network regression model did such a poor job in the other categories, and overall, that it is probably not the 'best' approach. While there is some 'cost' in misclassifying any student, not being able to differentiate the high-risk students would be a large flaw in any model. From observing the two different types of models, predicting continuous or categorical outputs, the regression approach appeared to predict values fairly close to 3.3 (the break point) in every case, which would produce seemingly good results in the upper categories by just 'guessing'. However, the categorical based models performed comparably.

The reasons identified in the Introduction offer some explanation for the poor performance of statistical procedures as predictors. The data obviously violates several assumptions required by statistical models (e.g. normally distributed data, independent data, etc.). When these assumptions are violated, statistical models can perform poorly. Previous studies have shown that neural networks may be less susceptible to the assumptions required of statistical models. This is not apparent in this study. While neural networks perform as well as the incumbent techniques, they do not appear to overcome the anomalies possessed by the data.

The first implication of this study's results is that the quantitative data usually used when assessing an applicant's ability to perform in a graduate program does not give much insight into one's risk
of academic failure. Because of the inherent shortcomings in the data, as described in the Introduction, this is not surprising. However, it does show the graduate program director that more than just standardized scores, previous academic performance, and past work experience ultimately affects whether the candidate will be successful in the program. Thus, for decision situations similar to this case, a decision maker should work to expand the information included in the analysis above and beyond that which has been previously used.

Second, regression is probably not the right approach to this decision problem. A decision maker is really only interested in the prediction of relative success or failure, not actual GPA. For universities similar in data composition to the case illustrated in this study, a categorical technique appears to be more appropriate. Regression tends to ignore the extreme cases (those students who do exceptionally well and those who do exceptionally poorly) in determining its predictive equation. Categorical approaches, when trained on a 'balanced' set of examples, were shown to provide more accurate predictions. However, as mentioned above, the need for more pertinent data supersedes in importance the impact of using a categorical technique.

Third, neural networks, as an alternative classification methodology, show promise. While they performed with predictive accuracy comparable to the incumbent techniques, it was disappointing that they did not provide better results. One consideration is that the performance of the neural network models represents, in essence, a lower bound of performance. The plethora of different technical implementation factors that exist, and, at present, their unknown impact on neural network performance, offers hope for increased performance through experimentation [36]. For instance, there did appear to be an impact on network performance when the number of output nodes was changed. While the point of this study was not to investigate the effect of different network architectures on performance, it did illustrate that these issues may affect predictive accuracy and represent a continuing, active stream of research which should be pursued.

An alternative viewpoint to the above discussion is that the relatively poor performance of the methodology was a positive sign. Since the data does not capture particularly well the underlying constructs which shape the success/failure continuum, neural networks do not create something from nothing. Since neural networks are a relatively new technology, it is insightful to know that a neural model will not simply 'fake' a relationship when a strong one does, in fact, not exist.

Neural networks are arguably easier for modelers to use. As seen in previous studies predicting GPA, the modeler tried combinations of variables until the 'best' combination was obtained. Additionally, the modeler would try several models until deciding on the 'best' model. For neural network modeling, this type of process is not required. Neural networks are able to determine which variables are important without being instructed by the modeler. In this respect, modeling with neural networks is much easier. It relieves the decision maker from: (1) deciding which variables to include; (2) deciding which model is the best performer; and (3) worrying about the parametric assumptions of traditional statistical approaches.
Such an approach also has drawbacks: (1) information regarding the significance of individual variables is not as readily tested or understood as in the incumbent techniques; and (2) the aforementioned wide variety of technical implementation factors makes training neural networks an ambiguous and arduous task.
6. CONCLUSION
This paper has evaluated the ability of five different models (least squares regression, stepwise regression, discriminant analysis, logistic regression and neural networks) to predict the success or relative failure of graduate MBA students. Consistent with previous studies on graduate student success prediction, the models produced poor results, with the best model accurately predicting about 60% of the cases. Overall, the categorical models (DA, logistic regression and neural networks) appeared to outperform the regression models. Predictive accuracy among the categorical models was comparable. The primary reason for the poor performance of the models in this study is the composition of the data. The data used in this decision problem suffers from range restriction and is inherently biased. Thus, there is strong evidence that the data used as independent variables in this and other studies are not good predictors of academic success.
Previous studies on neural networks have shown that neural models may outperform other statistical methods, especially when the data does not conform to the assumptions required of many traditional techniques. In this study, however, neural networks did not significantly outperform the incumbent statistical approaches. While this might be disappointing to neural network researchers, the finding is also very valuable: neural networks are not a panacea for all problems. Neural network models, in this case, could not learn variable relationships because such relationships did not exist. Also, since the study of neural networks has not reached a state of maturation, the performance of neural networks in this paper represents a lower bound for performance. Thus, as the field matures, and as more prescriptive guidance on the technical implementation issues of neural networks becomes available, the performance of neural models can be expected to improve.

This study also provides insight to those individuals who make graduate student admission decisions. Since neural networks are non-parametric in nature, they may represent an easier approach to model building. Previous studies in predicting graduate student success have been exercises in discovering the proper combination of independent variables and analysis tools. The use of neural networks relieves the decision maker of the task of selecting a proper method and variables, as the network itself decides what is important to the decision situation. An additional insight for decision makers from this study is that the data typically used to predict success or failure of graduate students is not sufficiently adequate. Less reliance on purely quantitative data, and the use of more qualitative data (such as personal interviews), may be in order. Evidence of the increased use of non-quantitative data can be seen at Harvard: as of 1985, Harvard uses an essay application, and no longer utilizes GMAT scores, in the admission decision [40].

Acknowledgements: The authors wish to thank the anonymous reviewers for their valuable comments. The authors would also like to thank Cynthia Gray and Jill Long for their assistance in the collection of data and consultation in MBA admission procedures.
REFERENCES

1. D. E. Blum, Facing bleak job prospects, many recent graduates look to advanced degrees for competitive edge. Chron. High. Educ. 38, A1, A35-A36 (October 1991).
2. W. Remus and C. Wong, An evaluation of five models for the admission decision. College Student J. 16, 53-59, 76 (1982).
3. J. B. Fisher and D. A. Resnick, Standardized testing and graduate business school admission: a review of issues and an analysis of a Baruch College MBA cohort. College Univ. 55, 137-148 (Winter 1990).
4. J. Abedi, Predicting graduate academic success from undergraduate academic performance: a canonical correlation study. Educ. Psychol. Measur. 51, 151-160 (1991).
5. P. Haynes, Management education: passport to prosperity. The Economist 318, 3-7 (March 2, 1991).
6. D. J. Hamilton, Multiple regression analysis and prediction of GPA upon degree completion. College Student J. 24, 91-96 (1990).
7. R. F. Deckro and H. W. Woundenberg, M.B.A. admission criteria and academic success. Decis. Sci. 8, 765-769 (1977).
8. J. B. Gayle and T. H. Jones, Admission standards for graduate study in management. Decis. Sci. 4, 421-425 (1973).
9. W. R. Klecka, Discriminant Analysis. Sage, Beverly Hills, Calif. (1980).
10. J. G. P. Paolillo, The predictive validity of selected admissions variables relative to grade point average earned in a master of business administration program. Educ. Psychol. Measur. 42, 1163-1167 (1982).
11. L. L. Baird, Comparative prediction of first year graduate and professional school grades in six fields. Educ. Psychol. Measur. 35, 941-946 (1975).
12. M. G. Sobol, GPA, GMAT, and SCALE: a method for quantification of admissions criteria. Res. Higher Educ. 20, 77-88 (1984).
13. L. D. Graham, Predicting academic success of students in a master of business administration program. Educ. Psychol. Measur. 51, 721-727 (1991).
14. J. H. Aldrich and F. D. Nelson, Linear Probability, Logit, and Probit Models. Sage, Beverly Hills, Calif. (1984).
15. J. Anderson, Logistic discrimination. In Handbook of Statistics (Edited by P. Krishnaiah and L. Kanal), Vol. 2, pp. 169-191. North-Holland, Amsterdam (1982).
16. S. J. Press and S. Wilson, Choosing between logistic regression and discriminant analysis. J. Am. Statist. Assoc. 73, 699-705 (1978).
17. R. P. Lippmann, An introduction to computing with neural nets. IEEE ASSP Mag. 4-22 (April 1987).
18. E. Masson and Y. Wang, Introduction to computation and learning in artificial neural networks. Eur. J. Opl Res. 47, 1-28 (July 1990).
19. J. Wang and B. Malakooti, A feedforward neural network for multiple criteria decision making. Computers Ops Res. 19, 151-167 (1992).
20. D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation. In Parallel Distributed Processing (Edited by D. E. Rumelhart and J. L. McClelland), Vol. 1, pp. 318-362. MIT Press, Cambridge (1986).
21. F. Zahedi, An introduction to neural networks and a comparison with artificial intelligence and expert systems. Interfaces 21, 25-38 (1991).
22. R. L. Wilson and R. Sharda, Bankruptcy prediction using neural networks. Decis. Support Syst. In press (1993).
23. K. Tam and M. Kiang, Managerial applications of neural networks: the case of bank failure predictions. Mgmt Sci. 38, 926-947 (July 1992).
24. M. B. Fishman, D. S. Barr and W. L. Loick, Using neural nets in market analysis. Techn. Anal. Stocks Commod. 18-25 (April 1991).
25. W. VerDuin, Solving manufacturing problems with neural nets. Automation 54-58 (July 1990).
26. L. I. Burke, Introduction to artificial neural systems for pattern recognition. Computers Ops Res. 18, 211-220 (1991).
27. H. W. Denton, M. S. Hung and B. A. Osyk, A neural network approach to the classification problem. Expert Syst. Applic. 1, 417-424 (1990).
28. D. Bailey and D. Thompson, How to develop neural-network applications. AI Expert 38-47 (June 1990).
29. GMAC Validity Study Service. Educational Testing Service and Graduate Management Admission Council, Princeton, N.J. (March 1989).
30. GMAC Validity Study Service. Educational Testing Service and Graduate Management Admission Council, Princeton, N.J. (February 1992).
31. L. Wilkinson, SYSTAT: The System for Statistics. SYSTAT Inc., Evanston, Ill. (1989).
32. M. Caudill, Neural network primer: part III. AI Expert 28-33 (March 1988).
33. M. Caudill, Neural network training tips and techniques. AI Expert 56-61 (January 1991).
34. J. L. Lawrence, Introduction to Neural Networks, 2nd Edn. California Scientific Software, Grass Valley, Calif. (1991).
35. R. L. Wilson, The strategic organizational use of neural networks: an exploratory study. Unpublished dissertation, University of Nebraska (1990).
36. R. L. Wilson, Business implementation issues for neural networks. J. Computer Inform. Syst. 32, 15-19 (Spring 1992).
37. C. J. Huberty, Issues in the use and interpretation of discriminant analysis. Psychol. Bull. 95, 156-171 (1984).
38. J. F. Hair Jr, R. E. Anderson and R. L. Tatham, Multivariate Data Analysis with Readings, 2nd Edn. Macmillan, New York (1987).
39. A. Jain and B. Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice. In Handbook of Statistics (Edited by P. Krishnaiah and L. Kanal), Vol. 2, pp. 835-855. North-Holland, Amsterdam (1982).
40. A. McGrath, GMAT flunks out at Harvard. Forbes 136, 199-200 (September 1985).