Intelligence 77 (2019) 101398
Estimating the true extent of gender differences in scholastic achievement: A neural network approach

Philipp Manuel Loesche⁎

Zentrum für Empirische Pädagogische Forschung, University of Koblenz-Landau, Germany
ARTICLE INFO

Keywords: Gender differences; Sex differences; Neural networks; Deep learning; Scholastic achievement; Large-scale assessment

ABSTRACT
In this study neural networks are employed to analyze individual item scores in a large-scale achievement test. They are able to correctly identify the participant's gender in 65.1% of cases, performing much better than a competing model based on differences in subject domain performances. It follows that substantial gender-related information is contained in the items, and comparisons based on performance can only provide a limited view of gender differences in scholastic achievement. An exploratory view of what the networks learn is presented and perspectives for further research are discussed.
1. Introduction

There is a rich history of research into gender differences in scholastic achievement, with several meta-analyses and reviews providing insights into various subjects (e.g., Hyde, Fennema, & Lamon, 1990; Hyde & Linn, 1988; Johnson, 1996; Voyer & Voyer, 2014; Willingham & Cole, 1997). Typically, these results were obtained by comparing the performance of girls and boys in different subjects, measured through aggregate scores across multiple items. This method provides a broad summary of the abilities to handle certain groups of tasks and of the gender differences in those abilities. However, gender differences in the way boys and girls approach specific items or item types cannot easily be accounted for and are likely to remain hidden.

Such approaches might differ in various ways. Different mental representations of concepts and problems, for example, might result in tendencies to prefer certain algorithms for specific kinds of tasks. Some differences of this kind have been discovered, often in the mathematics domain. For example, it has been shown that children use different strategies to solve basic arithmetic problems (Siegler, 1987, 1988) and that the frequency with which a strategy is chosen is related to gender (Bailey, Littlefield, & Geary, 2012). Males seem to perform better when problems are unconventional in nature (Gallagher, Levin, & Cahalan, 2002; Gallagher & Lisi, 1994; Halpern, 2012), possibly indicating a better understanding of the underlying concepts. Gallagher and Lisi (1994) connected the difference in conventional vs. unconventional problem types to a difference in employment of conventional vs. unconventional strategies, with females relying more on the
former. Similarly, Fennema, Carpenter, Jacobs, Franke, and Levi (1998, p. 11) found that girls tend to “use more concrete strategies like modeling and counting” and boys tend “to use more abstract strategies that reflected conceptual understanding”. Another finding is that males regularly score higher in tests of spatial abilities (Linn & Petersen, 1985; Voyer & Voyer, 2014), which have been found to mediate some aspects of mathematics achievement (Casey, Nuttall, Pezaris, & Benbow, 1995; Ganley & Vasilyeva, 2011; Geary, 1996; Klein, Adi-Japha, & Hakak-Benizri, 2010).

It is difficult to decide how relevant such differences are for overall scholastic achievement. Conventional methods can only account for effects that have already been discovered. There is also no standard way of integrating these effects into a bigger picture. Moreover, large-scale studies designed to measure scholastic achievement almost exclusively employ IRT models. These models assume performance to be determined by a single latent trait. Consequently, gender differences in item difficulties should only arise from gender differences in the measured trait. Items that deviate too much from this view are usually excluded. Thus, if girls and boys differ in the way they approach an item, the item will only be included in the assessment if these approaches lead to results that are roughly in line with the average gender difference in the measured trait. Naturally, this conceals the largest effects of gender differences in approaches, making it difficult to judge their extent. Additionally, item types that are reflective of such differences are more difficult to discover.

There is, however, a way of assessing whether different approaches play a role, even without knowing what exactly they might consist of.

⁎ Corresponding author at: Zentrum für Empirische Pädagogische Forschung (zepf), Universität Koblenz-Landau, Campus Landau, Bürgerstraße 23, 76829 Landau in der Pfalz, Germany. E-mail address: [email protected].
https://doi.org/10.1016/j.intell.2019.101398 Received 23 January 2019; Received in revised form 22 August 2019; Accepted 14 September 2019. 0160-2896/ © 2019 Elsevier Inc. All rights reserved.

In recent years, machine learning has been a thriving subject in both research and practical application. It involves using algorithms to analyze large amounts of data, looking for patterns and connections. In what is called supervised learning, the algorithm's ability to identify a target class is maximized. Logistic regression falls under this category, but far more powerful methods are available. Naturally, the algorithm's ability to correctly identify the target class hinges on the information that is available for analysis. Correct identifications are only possible if the data provides enough information. However, the reverse is also true: if the algorithm is able to identify the target class, it follows that sufficient information is contained in the data, enabling it to do so.

We can apply this principle to the question at hand, namely whether boys and girls differ in ways that are not accurately represented by performance differences in broad domains of scholastic achievement. The degree to which an algorithm is able to identify children's gender by analyzing the response patterns in an achievement test gives us an idea about the extent of the gender differences contained in the responses. If it exceeds what would be expected from our current understanding of gender differences in measures of scholastic achievement, it follows that this understanding is limited and that there are aspects to those measures that are left to be discovered.

In the present paper I will follow this line of thought by applying artificial neural networks to a large-scale school assessment in which elementary school children were tested for their mathematics, reading comprehension and listening comprehension skills. The next section gives a summary of what is known about gender differences in those subjects. It is followed by a short introduction to artificial neural networks.
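The inference logic described here (above-chance identification of the target class on held-out data implies that the data contains class-related information) can be illustrated with a small sketch. The Python code below fits a logistic regression by plain gradient descent on synthetic data and checks its held-out accuracy; all data and settings are illustrative and are not the models used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features that differ slightly between classes.
n = 2000
y = rng.integers(0, 2, size=n)                  # target class (e.g., 0 = girl, 1 = boy)
X = rng.normal(size=(n, 2)) + 0.5 * y[:, None]  # class 1 shifted slightly upward

X_train, y_train = X[:1500], y[:1500]           # training set
X_test, y_test = X[1500:], y[1500:]             # held-out test set

# Logistic regression fitted by plain gradient descent on the log-loss.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X_train @ w + b)))    # predicted P(class 1)
    w -= 0.5 * X_train.T @ (p - y_train) / len(y_train)
    b -= 0.5 * np.mean(p - y_train)

# Held-out accuracy clearly above 50% implies that the features carry
# class-related information; the same logic is applied to gender in this study.
accuracy = np.mean(((X_test @ w + b) > 0) == y_test)
```

The point of the sketch is the direction of the inference: the classifier's held-out accuracy is a measure of how much class-related information the features contain, regardless of what that information consists of.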
1.1. Gender differences in scholastic achievement

Compared with other subjects taught in school, gender differences in mathematics performance have received the most attention. Through aggregate reviews and meta-analyses (Benbow, 1988; Hedges & Nowell, 1995; Hyde et al., 1990; Johnson, 1996) a look at the population as a whole has been presented, usually favoring males by a small margin. There are, however, several mediating factors, most importantly age and selectivity of the sample, as well as the specifics of the assessed task (Hyde et al., 1990; Willingham & Cole, 1997), that call the usefulness of such a general conclusion into question.

The age of the assessed students seems to play a role in whether gender differences are found. In elementary school, differences in mathematics seem to be small. Integrating results from 67 studies, Hyde et al. (1990) found a minuscule advantage for girls in the elementary school age group (d = −0.06). In international studies, similarly small advantages favoring boys are usually found (Baye & Monseur, 2016; Johnson, 1996). After elementary school, boys start to perform slightly better. Over the course of the school years, that advantage slowly grows, peaking in high school. But even then the differences are usually still small (Baye & Monseur, 2016; Halpern et al., 2007; Hyde et al., 1990; Johnson, 1996; Willingham & Cole, 1997).

Some content-related gender differences have been discovered. Girls regularly display a slight advantage in computation/arithmetic early in school (Hyde et al., 1990; Johnson, 1996; Willingham & Cole, 1997). Computation also makes up an important part of the TIMSS number domain. In this area, however, boys consistently perform better in both the fourth- and eighth-grade assessments (Mullis, Martin, & Foy, 2008; Mullis, Martin, Foy, & Arora, 2012; Mullis, Martin, Foy, & Hopper, 2016; Mullis, Martin, Gonzalez, & Chrostowski, 2004). Another content domain where a female advantage has been reported is algebra (Halpern et al., 2007). This is supported by TIMSS, where girls performed better in algebra in all of the last four eighth-grade assessments. Interestingly, however, this does seem to depend on the type of task, as boys have often been found to perform better on both algebraic and arithmetic word problems (Geary, 1996; Geary, Saults, Liu, & Hoard, 2000).

Males exhibit a strong advantage in some spatial tasks, most importantly ones that involve mental rotation (Linn & Petersen, 1985; Voyer, Voyer, & Bryden, 1995), and this translates into higher scores in areas of mathematics achievement that feature such tasks, for example, the space/shape domain in the PISA assessments (Else-Quest, Hyde, & Linn, 2010; Liu & Wilson, 2009). Geometry might seem related to spatial tasks, and Hyde et al. (1990) reported a male advantage in this area, but this is not supported by the TIMSS assessments, where girls tend to have an advantage. As is the case with algebra, gender differences in geometry likely depend on the specific requirements of the task (Geary, 1996).

It has to be emphasized that in large-scale studies of overall mathematics achievement, all of the mentioned differences relating to age and content domains are quite small. They are also often subject to considerable cross-country variation. The largest gender differences are probably found for tasks that require spatial abilities like PISA's space/shape domain, but even in this case the effect sizes are smaller than d = 0.20.

It has been fairly well established that there are more males than females among the top performers in mathematics. This effect is present even before adolescence and increases with the selectivity of the sample (Baye & Monseur, 2016; Benbow, 1988; Hedges & Nowell, 1995; Hyde et al., 1990; Wai, Cacchio, Putallaz, & Makel, 2010; Willingham & Cole, 1997). However, gender differences are often found to be non-existent at lower levels of achievement or even favor females (Baye & Monseur, 2016; Stoet & Geary, 2013; Willingham & Cole, 1997). Accordingly, this finding is usually attributed to a greater male variability in performance, something that seems to hold true for many, but not all, measures of cognitive ability (Baye & Monseur, 2016; Halpern, 2012; Halpern et al., 2007; Hedges & Nowell, 1995; Strand, Deary, & Smith, 2006; Willingham & Cole, 1997).

Compared to mathematics, gender differences in verbal abilities display a somewhat reversed picture. When differences are found, they almost always favor females. As is the case with mathematics, the results vary with the specific abilities that are assessed. The largest gender differences relating to verbal abilities are usually found in writing performance (see, e.g., Halpern et al., 2007; Willingham & Cole, 1997), but writing was not assessed in the test at hand. Of all verbal abilities, reading achievement has received the most attention. Hyde and Linn (1988) found no gender difference in reading comprehension in their meta-analysis of verbal abilities. This finding is contradicted by large-scale assessments, which almost universally show female advantages. Hedges and Nowell (1995) aggregated several of them and found effect sizes ranging from d = 0.0 to 0.15 in favor of females. In what is probably the most thorough review of US-based large-scale assessments, Willingham and Cole (1997) found an average effect size of d = 0.20. There was, however, considerable variation, with individual effect sizes ranging from d = 0.0 to 0.46, suggesting that the specifics of the assessed task play an important role. The review included studies ranging from fourth grade to twelfth grade; however, the age of the students did not seem to make a meaningful difference. The results of international assessments of reading comprehension are also in favor of girls. Baye and Monseur (2016) reviewed several years of PIRLS and PISA and reported average effect sizes of d = 0.23 and 0.40 for these assessments, both in favor of girls.

Male variance in reading is typically 10% to 20% higher than female variance (Baye & Monseur, 2016; Willingham & Cole, 1997). In combination with the average female advantage, this results in a less pronounced advantage for women among the top performers. The flip side is that far more men score at the lower end of the spectrum (Baye & Monseur, 2016; Stoet & Geary, 2013; Willingham & Cole, 1997).

Listening comprehension in general has received far less attention than reading comprehension, and this is especially true for gender differences in these subjects. Johnson (1996) reported results from two national surveys. In the 1992 run of the Scottish Assessment of
Achievement Programme (AAP), a sample-based national survey used to monitor scholastic achievement at various age levels, girls performed better when only auditory information was provided, but performed worse when a video was used. In the 1989 run of the Dutch PPON, a periodic assessment in primary education, eight-year-old girls did notably better in listening and reading comprehension than boys of the same age. However, the same was not true for 12-year-old students. In a more recent study, Lehto and Anttila (2003) compared listening comprehension of both expository and narrative passages in Grades 2, 4 and 6 and found consistent, but mostly small, performance differences in favor of girls. However, fewer than 40 children were tested at each grade level. An advantage for girls in listening comprehension is also supported by the similarities between listening and reading comprehension. Listening comprehension is usually seen as a precursor for reading comprehension and some scholars have even argued that both share the same underlying processes, with reading comprehension only involving an additional decoding stage (see Diakidoy, Stylianou, Karefillidou, & Papageorgiou, 2005 for a discussion; also Royer, Sinatra, & Schumer, 1990). Consequently, it seems reasonable to assume that at least some of the gender differences for reading comprehension are also present for listening comprehension. In summary, mean differences in mathematics and verbal abilities are quite small. The average advantage of boys in mathematics is almost negligible. Girls perform better on average in verbal abilities but the differences are still small. However, the extent of gender differences varies along the performance spectrum with the most pronounced differences being situated at the ends. The highest scorers in mathematics and the lowest scorers in verbal abilities are more likely to be boys. Girls are more prevalent among the highest scorers in verbal abilities. 
Apart from these cases, differences in performances should not be sufficient to infer children's gender much above chance level.
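For reference, the two statistics that recur throughout this section, Cohen's d and the variance ratio, can be computed as follows. This is a minimal sketch with simulated scores; note that the sign convention for d varies across the cited sources, and here positive values indicate a female advantage.

```python
import numpy as np

def cohens_d(girls, boys):
    """Standardized mean difference; with this convention, > 0 means girls scored higher."""
    n1, n2 = len(girls), len(boys)
    pooled_var = ((n1 - 1) * np.var(girls, ddof=1)
                  + (n2 - 1) * np.var(boys, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(girls) - np.mean(boys)) / np.sqrt(pooled_var)

def variance_ratio(girls, boys):
    """Ratio of score variances; > 1 indicates greater variability among boys."""
    return np.var(boys, ddof=1) / np.var(girls, ddof=1)

# Simulated scores only: girls' mean slightly higher, boys' scores more variable.
rng = np.random.default_rng(0)
girls = rng.normal(0.10, 1.00, size=5000)
boys = rng.normal(0.00, 1.10, size=5000)

d = cohens_d(girls, boys)         # positive: the simulated girls' mean is higher
vr = variance_ratio(girls, boys)  # > 1: the simulated boys' scores vary more
```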
1.2. Artificial neural networks

Artificial Neural Network (ANN) is an umbrella term for a certain class of machine learning methods. They have been employed successfully for several tasks that were previously thought to require human intelligence, such as image recognition, natural language processing, or playing Go at a high level (Graves, Mohamed, & Hinton, 2013; Krizhevsky, Sutskever, & Hinton, 2012; Silver et al., 2016). This section provides a basic introduction to them. In order to facilitate understanding and for the sake of brevity, some of the more technical details are left out. A thorough introduction to neural networks can be found in Goodfellow, Bengio, and Courville (2016).

ANNs were originally inspired by assumptions about how information processing takes place in the brain. Their main building blocks are the neurons, usually called units, and the connections between them. One of the simplest ways to model an ANN is the Multilayer Perceptron (MLP), depicted in Fig. 1.

[Fig. 1. A multilayer perceptron with a single hidden layer (input layer, hidden layer, output layer). Circles represent units, arrows represent weights.]

The units in an MLP are organized hierarchically into layers. Units in adjacent layers are connected to each other. These connections are unidirectional. Units receive input from units in the layer above, calculate their status, and pass it on to units in the layer below. This creates a flow of information from top to bottom, or from input to output.

The units in the input layer (input units) are equivalent to independent variables in a regression model. Each continuous explanatory variable is represented by one unit. The state of that unit corresponds to the measurement of that variable. Similar to regression, categorical variables need to be dummy-coded and spread over multiple units. Every unit in the input layer has a connection to every unit in the next layer, the hidden layer. These connections take the state of the input unit, multiply it by a weight that is specific to that connection, and pass the result as input on to the hidden unit. As an example, take an input unit indicating whether the first answer to a test was given correctly and a connection from that unit to the first hidden unit with a weight of 0.5. The input unit takes a value of 1 for correct answers and 0 otherwise. The first hidden unit will receive an input of 1 × 0.5 = 0.5 for all test-takers that answered correctly and an input of 0 × 0.5 = 0 for all others. Weights can be thought of as the ANN equivalent of the coefficients in a regression model. They define the relationship between inputs and outputs using real-valued numbers. Finding a good set of weights is the goal of the training process described below.

In the hidden layer, each hidden unit sums up all the inputs it receives. Because all hidden units are connected to all input units, these sums only differ due to the different connection weights. In order to calculate its state, each hidden unit then applies a transformation to the sum of its inputs. Different transformations can be used. One of the most popular ones is to simply copy the sum if it has a positive sign and otherwise to set the state to zero.1 After the state is obtained, every hidden unit passes it on through a weighted connection to every unit in the next layer. This next layer can either be another hidden layer or the output layer.

The design of the output layer reflects the dependent variables similarly to the way the input layer mirrors the independent variables. If the network is supposed to detect whether or not a case belongs to a certain class, for example, girl, a single output unit suffices. The output unit estimates the probability that a case belongs to the corresponding class. In order to calculate that probability, all incoming connections from the hidden layer are summed up and a transformation function is applied. A natural choice for that transformation function is the logistic function, which is used in a similar way in logistic regression,2 because it always results in a value between zero and one, making it interpretable as a probability.

As described above, the number of input units and output units is directly related to the number of independent and dependent variables featured. This is not true for the number of hidden layers and units. In general, networks with more and bigger layers can unearth and model more complex relations. However, estimating the parameters becomes more difficult, and sample size requirements also increase with network size. In practice it is common to try several arrangements.

The quality of the network's predictions hinges on finding a good set of weights. This is done through an iterative process, usually referred to as training. Initially the weights are assigned random values. Then a single case is run through the network, which estimates a probability of the case belonging to each target class. After comparing these predictions to the actual class, the weights are adjusted through a method called backpropagation. Backpropagation makes it possible to determine the way every single weight has contributed to a prediction. This information is then used to change the weights in a way that makes a correct prediction more likely.3 After the weights have been changed, the next case is run through the network. This process is usually continued until the quality of the predictions stops improving.

Conventional statistical methods assume a certain predefined relationship between independent and dependent variables. In multiple regression, for example, it is assumed that the dependent variable is a linear function of the predictors. A multilayer perceptron makes no such assumption. It has a high degree of freedom in determining the relationship between input and output. This does not come without a cost, however. In classical statistical models the number of estimated parameters is usually very small, and the parameters can be interpreted within the scope of the models' assumptions. Neural networks often feature thousands or even millions of parameters. They are also highly interconnected. Due to this, it is not humanly possible to understand what a network has learned by interpreting its parameters. However, insight into its predictive capabilities can still be gained, for example, by analyzing the distribution of its predictions or by selecting the cases with the most extreme predictions and examining them for patterns.

One major concern with neural networks is generalization. It is very easy to train a neural network that performs well on the data it was trained on, but will not generalize beyond that. Several techniques to improve generalization exist, and they are often used in combination. The most important tool, however, is a rigorous segregation of the data used for training the networks and the data used to evaluate their performance. This is achieved by splitting the data into three sets. The training set contains the data the network uses to estimate the weights. The validation set is used to monitor the training process and to compare networks with different architectures, for example, networks with different numbers of hidden layers and units. It is used to measure prediction accuracy, but it never informs the network on how the weights need to be adjusted. The test set is only used to obtain a final measure of the network's prediction accuracy. As its data has never been used either in training or selecting the network's architecture, it provides an unbiased estimate of the network's capability to detect the target class.

1 A unit that applies this transformation is called a rectified linear unit (ReLU).
2 A multilayer perceptron with no hidden layers and a logistic binary output unit is essentially identical to a logistic regression model.
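Concretely, the forward pass just described can be written out in a few lines. The sketch below (plain Python/numpy) computes the state of a single-hidden-layer MLP with a ReLU hidden layer and a logistic output unit; the layer sizes and random weights are purely illustrative, since in practice the weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 12, 8   # illustrative sizes, e.g., 4 items x 3 response categories
W1 = rng.normal(0, 0.5, size=(n_inputs, n_hidden))  # input -> hidden weights
W2 = rng.normal(0, 0.5, size=(n_hidden, 1))         # hidden -> output weights

def forward(x):
    """One pass through a single-hidden-layer MLP with one output unit."""
    hidden = np.maximum(0.0, x @ W1)  # ReLU: copy positive sums, else set state to 0
    logit = (hidden @ W2).item()      # weighted sum arriving at the output unit
    return 1 / (1 + np.exp(-logit))   # logistic function: interpretable as a probability

x = np.zeros(n_inputs)
x[0] = 1.0            # e.g., the first item was scored as correct
p = forward(x)        # a value between 0 and 1, e.g., an estimate of P(boy)
```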
1.3. Current investigation

Most discussion of gender differences in scholastic achievement focuses on performance in broad domains. However, several studies have revealed gender-related differences in strategic approaches to certain item types. Such differences are interesting because they indicate differences in mental representations of problems or in the cognitive processes involved in solving them. A better understanding of these differences could therefore not only lead to a better understanding of how gender affects scholastic achievement but also of the processes by which it does.

Currently, it is not known to what extent girls and boys differ in their approaches to scholastic tasks. Performance differences in large-scale assessments of scholastic achievement are mostly small. However, this does not necessarily indicate that girls and boys approach tasks in a similar way, because similar performances might be achieved through different means. The main purpose of this study is thus to investigate the extent to which gender affects scholastic achievement in ways that are not captured by performance comparisons. To this end, a series of neural networks is trained to infer children's gender from their item-scoring patterns in a large-scale assessment of scholastic achievement. The degree to which this is possible is a measure of the extent to which girls and boys differ in their response patterns. In order to create a perspective from which these results can be judged, a series of logistic regression models is trained to infer gender from domain performances. If the neural networks are able to detect gender with greater accuracy than the logistic regression models, it follows that girls and boys differ in ways that are not captured by performance comparisons. In that case, it will be interesting to explore what the neural networks have learned. Because the networks' parameters are not directly interpretable, there is no easy way to answer this. However, the networks' predictions can be considered a summary of what they have learned. By exploring these predictions, some insight into their abilities can be gained.

3 The basic idea is to assign a cost to wrong predictions, for example, the squared error. Because a neural network is essentially a mathematical function that transforms inputs into outputs, this cost can be expressed as a function of the inputs, the weights, and the correct class. Backpropagation uses the chain rule to calculate the gradient of the cost with respect to the weights. Because a local minimum of a function can be found by following its negative gradient, the calculated gradient can be used to update the weights and bring the cost closer to a minimum.

2. Method

2.1. Data source and preparation

The data stems from the 2016 run of the VERA (Vergleichsarbeiten) assessments. VERA is an annual nationwide survey in which German third-grade elementary school children are tested for their abilities in mathematics and usage of the German language. For each subject, two content domains are tested. The 2016 run featured Reading Comprehension (19 items) and Listening Comprehension (25 items) in the language part. Numbers and Operations (17 items) and Patterns and Structures (12 items) were assessed in mathematics. A Rasch model is used to model performance in each content domain separately. The final items were selected from a much larger pool of items, after they had been shown to possess adequate psychometric properties in a field trial. The test was administered by the children's teachers, who also scored the items. A detailed scoring manual was provided for that purpose. The items could be scored as correct, false, or not answered, but the Rasch model did not differentiate between the last two. The teachers were also asked to record each child's gender, whether he or she had special needs requirements, and whether he or she had sufficient comprehension of the German language.

Administration and evaluation of the VERA assessments lie within the responsibility of the 16 federal states of Germany. Data could be obtained from seven of the states, resulting in a base sample of 299,057 children. However, as most states mandated participation only in some of the content domains, close to half of the sample had to be excluded due to incomplete data. Children with special needs requirements or insufficient comprehension of the German language were also excluded, leaving a final sample of 149,465 children for analysis. Girls accounted for 49.6% of the sample, boys for 50.4%.

As preparation for the prediction models, two-thirds of the sample was randomly assigned to the training set; the rest was evenly divided between the validation and test sets. As it was unclear what effect the randomness of the assignment might have on the outcome, the split was repeated ten times, creating ten partitions of training, validation, and test sets. The training and validation sets were only used for model building and selection. All subsequent analysis was performed on the test sets, repeated independently for each of the ten partitions. Because the test data were not used in any stage of the model building process, this guarantees that overfitting to random patterns in the training data cannot possibly improve accuracy on the test sets.

2.2. Analysis
In order to relate the gender differences in VERA to those found in other assessments of scholastic achievement, two statistics were computed for each of the four domains. The magnitude of gender differences in performance was compared using Cohen's d. Gender differences in variability were compared using variance ratios. Logistic regression was used to predict gender solely from the domain performances. One model was built for each of the ten sample
partitions. The training set was used to estimate the model parameters. The model was then applied to the test set, resulting in a probability estimate of the children's gender. In order to compute a measure of prediction accuracy, these estimates were dichotomized according to which gender was predicted with greater likelihood. If the predicted gender matched the recorded gender, the prediction was considered correct. An accuracy score was obtained by calculating the percentage of correct predictions. Improvement over chance (Huberty & Olejnik, 2006) was also calculated. This index measures the proportional reduction in misclassification errors under a model compared to prediction by chance. Here, prediction by chance was modeled as always predicting the more frequent gender. This results in an accuracy slightly above 50%.

Artificial Neural Networks were used to predict gender from the individual item scores. Three input units were created for each item, one for each possible answer (correct, false, no answer). One output unit represented the predicted gender. For each sample partition, an automated search process was used to determine a well-performing network. The process repeatedly trained networks while varying the number of hidden layers and units along with some other network parameters.4 All networks were trained on the training sets. For each partition, the network that yielded the highest accuracy on the validation set was selected. These networks were then applied to the test sets to compute an accuracy score and improvement over chance.

Some techniques were used to prevent the networks from overfitting to the data during training. Firstly, a penalty was applied to the squared size of the weights. This is called L2 regularization and has the effect of keeping the weights small unless they reliably help to improve prediction. Another technique that was used is to turn off randomly selected units during the training phase.
This is known as Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) and establishes a need for redundancy in the networks. Because it changes the input data, it can also be viewed as a way of preventing the networks from “memorizing” individual cases.
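As an illustration, the sketch below (plain numpy) shows what the L2 penalty and a dropout mask amount to computationally. The 219 inputs follow from the 73 items with three input units each described above; the hidden layer size and penalty strength are assumed values, and this is not the implementation used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 219, 64   # 73 items x 3 response categories; 64 is illustrative
W = rng.normal(0, 0.1, size=(n_inputs, n_hidden))

# L2 regularization: adding lam * sum(W**2) to the cost contributes
# 2 * lam * W to the gradient, shrinking every weight a little at each
# training step unless it reliably helps to improve prediction.
lam = 1e-4
l2_cost = lam * np.sum(W ** 2)
l2_grad = 2 * lam * W

# Dropout: during each training step, every hidden unit is silenced with
# probability p_drop, which forces the network to build in redundancy.
p_drop = 0.5
x = (rng.random(n_inputs) < 0.5).astype(float)  # an arbitrary response pattern
hidden = np.maximum(0.0, x @ W)                 # ReLU hidden states
keep = rng.random(n_hidden) >= p_drop           # True = unit stays active this step
hidden_dropped = np.where(keep, hidden, 0.0)    # silenced units contribute nothing
```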
Table 1
Effect sizes (d) and variance ratios (VR) for gender differences in domain performance.

Domain                     d        VR
Reading comprehension       0.18    1.03
Listening comprehension     0.15    0.92
Numbers and operations     −0.17    1.13
Patterns and structures    −0.22    1.09

Note: Effect sizes greater than zero indicate girls performed better. Variance ratios greater than one indicate higher variance in boys' performance.

Table 2
Sample size of the test set (N) and percentage of children whose gender was identified correctly by the logistic regression (LR) and neural network (NN) models, for each sample partition.

Sample partition   N        LR      NN
1                  24,911   59.2%   65.2%
2                  24,912   59.2%   65.2%
3                  24,912   58.7%   64.9%
4                  24,908   58.9%   65.1%
5                  24,908   59.3%   65.4%
6                  24,911   59.5%   65.0%
7                  24,910   59.4%   65.1%
8                  24,912   59.0%   64.7%
9                  24,912   58.8%   64.4%
10                 24,910   59.1%   65.2%
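The two indices reported in Table 1 can be computed from raw domain scores as follows (a minimal numpy sketch using the standard definitions; the simulated scores are illustrative, not the VERA data):

```python
import numpy as np

def cohens_d(girls, boys):
    """Standardized mean difference (positive = girls performed better)."""
    n_g, n_b = len(girls), len(boys)
    pooled_var = ((n_g - 1) * np.var(girls, ddof=1) +
                  (n_b - 1) * np.var(boys, ddof=1)) / (n_g + n_b - 2)
    return (np.mean(girls) - np.mean(boys)) / np.sqrt(pooled_var)

def variance_ratio(girls, boys):
    """Boy-to-girl variance ratio (> 1 = boys vary more)."""
    return np.var(boys, ddof=1) / np.var(girls, ddof=1)

# Simulated reading scores roughly matching the first row of Table 1.
rng = np.random.default_rng(0)
girls = rng.normal(0.18, 1.0, 10_000)
boys  = rng.normal(0.00, np.sqrt(1.03), 10_000)
print(round(cohens_d(girls, boys), 2), round(variance_ratio(girls, boys), 2))
```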
3. Results

3.1. Gender differences in domain performances

The gender differences in the means are listed in Table 1. They followed typical patterns, with boys performing better in mathematics and girls performing better in the German language domains. Table 1 also shows the boy-to-girl variance ratios for all four domains. As expected, the boys varied more in their performances in mathematics. This was not true for the performance in Listening Comprehension, which showed higher variance for the girls.

3.2. Gender detection accuracy

The accuracy scores for the logistic regression and neural network models are presented in Table 2. In the median over all 10 sample partitions, logistic regression models predicted gender correctly in 59.1% of cases, yielding a 17.6% improvement over chance. Using the same data, the neural networks performed considerably better, with a median prediction accuracy of 65.1%. This corresponds to a 29.7% reduction in misclassification errors compared to prediction by chance. The variation in accuracy between sample partitions was similar for both types of models.

3.3. What do the networks learn?

The above results constitute clear evidence that gender differences in scholastic achievement tests exist that are not captured by a comparison of domain performances. Neural networks are evidently able to pick up patterns that are indicative of children's gender. Due to the complexity of the network models, there is no easy way to tell what those patterns consist of. However, some insight into what they learn can be gained by examining their predictions more closely, investigating how they differ from the predictions of the logistic regression models, and focusing on the cases they predict with high certainty.

The neural networks output a probability estimate of how likely a child is to be a boy.5 These estimates are a summary of what the networks have learned about the gender of each child. As can be seen in Fig. 2, the estimated probabilities coincide almost perfectly with the true rates. Thus, the probabilities assigned to the subjects by the networks can be considered reasonably unbiased estimates of those subjects' gender.6 As such, they can be interpreted as the amount of evidence pointing towards a gender that the networks were able to gather.

An effect size estimate of the gender differences unearthed by the networks can be constructed by comparing the probabilities assigned to girls and boys. Fig. 3 shows the distribution of these probabilities. The sample partition means ranged from 0.44 to 0.45 for the girls and from 0.55 to 0.57 for the boys. The pooled standard deviations ranged from 0.13 to 0.16, yielding Cohen's d effect sizes between 0.74 and 0.78. Because, in all likelihood, the networks were not able to recover all information indicating the children's gender, this can be considered a lower-bound estimate of the true differences. For contrast, a similar analysis of the logistic regression models resulted in effect sizes between 0.44 and 0.47.

5 The decision on the gender used as the target class was arbitrary. Here, boy was chosen because of the way gender was coded in the dataset. As the probability of being a boy and the probability of being a girl add up to one, the probability of each gender can easily be derived from the other.
6 Consider the alternative where, for example, both a 60% and a 70% prediction turn out correct in 65% of cases. The networks' predictions would have the same quality on average, but the individual probabilities could not be well interpreted.

However, a further look into the spread of the predictions supports a view of gender differences that goes beyond simple group comparisons. Fig. 3 reveals both substantial differences and considerable overlap in
4 See Appendix A for details on the networks and the building process.
Fig. 2. Percentage of boys by estimated probability that a child is a boy. Groups were formed by rounding the estimated probabilities to single percentage values. The vertical axis shows the fraction of boys within these groups. Bars represent standard errors.
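The grouping behind Fig. 2 can be reproduced schematically as follows (a sketch with simulated, well-calibrated predictions, not the study's data; function and variable names are my own):

```python
import numpy as np

def calibration_table(probs, is_boy):
    """Observed fraction of boys for each predicted probability, rounded
    to single percentage values (the grouping used for Fig. 2)."""
    probs, is_boy = np.asarray(probs), np.asarray(is_boy)
    pct = np.round(probs * 100).astype(int)
    return {p: is_boy[pct == p].mean() for p in np.unique(pct)}

# Toy data: outcomes drawn at exactly the predicted rates, so the
# observed fraction in each group should sit close to the prediction.
rng = np.random.default_rng(1)
probs = rng.uniform(0.3, 0.7, 50_000)
is_boy = rng.uniform(size=probs.size) < probs
table = calibration_table(probs, is_boy)
print(table[50])   # close to 0.50 for a calibrated model
```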
Fig. 3. Distribution of estimated gender probabilities by gender for each sample partition.
the probabilities assigned to girls and boys. In over 40% of cases, the networks assign a probability of less than 60% to the predicted gender, indicating a high degree of uncertainty. Correspondingly, only 56% of those predictions turn out to be correct, barely exceeding prediction by chance. On the other hand, there are also cases in which the networks assign a high probability to the outcome. Looking at the 10% of cases that the networks deem most likely to be a girl and the 10% they deem most likely to be a boy, 80% (girl) and 79% (boy) of those predictions turn out to be correct. It follows that there are some scoring patterns that are strongly indicative of a child's gender. However, only some children exhibit such patterns.

This is not to say that distinct groups emerged. As can be seen in Fig. 4, the predictions are generally distributed rather smoothly, closely resembling a normal distribution. This suggests that the networks base their predictions on a gradual accumulation of evidence rather than relying on a few distinct patterns. Consequently, the removal of individual items should not affect the networks' predictions to a great extent.

Fig. 4. Frequency distribution of estimated gender probabilities by sample partition.

A better understanding of the networks' performance can be facilitated by contrasting their predictions with those of the logistic regression models and by looking for patterns in the cases they predict with high certainty. To this end, the models corresponding to the fifth sample partition were examined more closely. A single partition had to be chosen because it was not clear what would be learned from a joint analysis of different models built upon different sets of data. The fifth sample partition was chosen because of the high accuracy of both the neural network (65.4%) and the logistic regression (59.3%) models.

Both models predicted the same gender for 72.6% of the 24,908 children in the test set, with a prediction accuracy of 67.0%. When the models disagreed (27.4% of cases), the neural network predicted 61.2% of the cases correctly and the logistic regression model 38.8%. These are the most interesting cases because there is no reason why the network should be unable to use the information provided by the domain performances in a similar way as the logistic regression model. It follows that in cases in which the models do not agree, the network was able to gather enough contradicting information about the children's gender to overcome the prediction suggested by the domain performances.

In order to investigate whether these differing predictions appear at certain locations in the performance spectrum, each of the content domain performances was binned into terciles of low, average, and high performance. The combination of these terciles was used to assign each child to one of the resulting 81 performance groups. Table 3 shows the 10 performance combinations for which the models disagree the most. What all of these cases have in common is that the higher performance in one subject falls between the two performances in the other subject. Consequently, the logistic regression model was not able to accurately classify those cases; it predicted the correct gender in only 51.7% of them. The neural network, on the other hand, was still able to pick up patterns that indicated the children's gender. The lack of a clear order in the domain performances only resulted in a minor drop of accuracy, to 63.7%. The same lack of hierarchy was also found in an analysis of the performance combinations for which the network showed the highest gains in accuracy compared to the logistic regression model. Although this was a main driver of the differences between the network predictions and those of the logistic regression model, it was not the only one.

Almost half of the cases (47.3%) in the test set consisted of children whose performances in one subject exceeded those in the other subject. Both models were more accurate in classifying these cases, but there was still considerable disagreement. The network predicted a different gender for 17.9% of these cases and was more accurate when doing so (59.5% vs. 40.5%). It follows that, even when the performances took a gender-typical shape, some children exhibited response patterns that allowed the network to accurately classify them differently. This would not be possible if gender differences in the aggregate performances were a sufficiently accurate reflection of the gender differences in the underlying tasks.

However, an analysis of the cases the network was most certain about showed that the aggregate performances are still important for the network's predictions. In 6.2% of the cases, the network predicted the gender with a probability of more than 80%. Most, but not all, of these cases showed a gender-typical performance pattern, with boys scoring higher in mathematics and girls scoring higher in the verbal domains. Correspondingly, the logistic regression model usually predicted the same gender.

A look into the performance constellations the network was least certain about revealed an interesting pattern. Most of these cases were children that scored in the highest terciles in all four content domains. About 9% of the cases in the test set fell into this group, and only 61.2% of them were correctly classified by the network. This lack of ability to accurately discriminate at the high end of the performance spectrum could indicate that the best-performing children used similar approaches and thus exhibited similar response patterns, regardless of their gender. However, it might also be caused by a lack of items difficult enough to discriminate between proficient users of different strategies.
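The binning into 3^4 = 81 performance groups described above can be sketched as follows (illustrative numpy code; the simulated domain scores and all names are my own):

```python
import numpy as np

def to_terciles(scores):
    """Bin one domain's scores into low (0), average (1), high (2) performance."""
    cuts = np.quantile(scores, [1 / 3, 2 / 3])
    return np.digitize(scores, cuts)          # 0, 1 or 2 per child

def performance_group(domain_scores):
    """Combine the four domain terciles into one of 3**4 = 81 group labels."""
    terciles = np.column_stack([to_terciles(s) for s in domain_scores])
    return (terciles * 3 ** np.arange(4)).sum(axis=1)

# Four simulated domains: RC, LC, NO, PS.
rng = np.random.default_rng(2)
domains = [rng.normal(size=1000) for _ in range(4)]
groups = performance_group(domains)
print(groups.min(), groups.max(), len(np.unique(groups)))
```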
Table 3
The 10 combinations of performance terciles (PFT) that are most often classified differently (DA) by the neural network and the logistic regression model.

PFT-RC   PFT-LC   PFT-NO   PFT-PS   N     DA
+        −        0        0        130   45.4%
0        −        −        0        301   44.2%
0        +        0        +        181   43.1%
0        +        −        +        28    42.9%
+        −        −        +        12    41.7%
−        0        −        0        385   41.3%
−        0        +        −        34    41.2%
0        0        −        +        66    40.9%
−        +        −        0        69    40.6%
+        −        +        0        107   40.2%

Note: (+, 0, −) denote (high, average, low) performance in Reading Comprehension (RC), Listening Comprehension (LC), Numbers and Operations (NO) and Patterns and Structures (PS). DA is the percentage of cases the models disagreed about.

4. Discussion

The results presented above strongly indicate that boys and girls differ in ways that are not captured by a comparison of aggregate performances in broad domains. Using information contained in the individual item scores, neural networks were able to correctly identify the gender of 65.1% of the children. Some of these children could be identified with high certainty. Differing performances in subject domains were a good indicator of gender, but only for the children that showed a gender-typical pattern. Even in those cases, however, response patterns were still helpful in identifying gender, allowing the neural networks to outperform the logistic regression models.

Evidently, children differ in their approaches to at least some of the items, and those differences are in some way related to gender. In this context, the term ‘approach’ should be understood in its broadest possible meaning. For example, items might feature knowledge that is more prevalent in one gender or evoke certain gender-related stereotypes. The most obvious candidates, however, are differences in mental representations and solution strategies. Some gender differences of this kind relating to problem types and solution strategies were presented in the introduction. More research into this area could lead to a better understanding not only of the extent to which gender influences scholastic achievement, but also of the ways in which it does.

In contrast, the current discussion of gender differences in scholastic achievement is often focused on the size of performance differences between girls and boys. Such comparisons have their merit. They present a high-level summary of the skills required to solve certain types of tasks. As such, they are probably the most important indicator of gender differences in scholastic abilities and their effect on occupational choices and opportunities. However, the ability to solve a specific task is not solely determined by a one-dimensional underlying ability but is shaped by a combination of factors. And, as demonstrated here, gender differences in overall performance are only partly reflective of gender differences in those factors. This has to be kept in mind when assessing the effects of socio-cultural variables such as gender equity. They might affect scholastic abilities in various ways without necessarily producing greater parity in overall performance. Likewise, even in the absence of gender differences in performance, such variables could still have an effect on the underlying abilities.
5. Limitations and further research

Naturally, the results presented here need to be replicated. VERA is a well-designed assessment that has been running for more than a decade, but until the differences discovered here can also be found in other surveys, they should be treated with care. Particular caution is warranted because of the undefined nature of the patterns the networks based their predictions on. Neural networks are prone to fitting random correlations, but because the data used to test the networks were separated from the data used to build them, this is not a concern here; it would only make their predictions less accurate. Other non-replicable factors might still be an issue. Most importantly, items might feature knowledge that is more prevalent in one gender. If this affects the likelihood of solving these items, the neural networks might use this information, regardless of whether this knowledge is considered part of the measured ability. However, such items also affect traditional measures, and great care is usually taken to avoid them. Thus, it is not likely that they are a main driver of the differences discovered here.

It also has to be noted that replicable effects might not be limited to strategic approaches. For example, it has been discussed whether males perform better in multiple-choice tasks (Liu & Wilson, 2009). If this is true, it might indicate that different response formats induce the use of different strategies. However, it might also be caused by gender-related differences in risk-taking, which is not usually considered an element of scholastic achievement.

The main drawback of the technique presented here is that it serves mainly to measure the extent of gender differences in scholastic achievement. This allows for a rough quantification of these differences, demonstrating how much is not understood about them. But, especially in light of the questions raised above, this is of limited use if the nature of these differences is not discovered. Understanding them will primarily involve traditional methods of research, guided by theories. Neural networks, or similarly powerful methods, could support this process by identifying items that are informative about a person's gender. Some ways one might go about this are proposed below.

5.1. Narrowing down the subject

Both verbal and mathematics tasks were used as a basis for predicting gender in this study. It is not clear how much each of these subjects, or their interaction, contributed to the predictions. This could be clarified by training networks on each content domain separately. Differences in approaches to certain items should be easier to discover by focusing on a narrower set of related tasks, for example, mathematics or algebra.

5.2. Focus on extreme predictions

Some children's gender was recognized with a high degree of accuracy. It follows that these children exhibited strong gender-typical scoring patterns. By focusing on these children, one could gain insight into what these patterns are and then hypothesize about why they are typical for the respective gender. An extension of this would be to change the responses of these children in a way that maximizes the predicted probability and thereby find prototypical scoring patterns for boys and girls.

5.3. Identify predictive response patterns

A similar strategy would be to identify pairs or triplets of items that have high predictive power and analyze what they have in common. One way of doing this would be to compute interactions between all item pairings, test their predictive accuracy using logistic regression models, and investigate the pairings that yielded the highest accuracy. Due to the large number of possible pairings (e.g., 812 for the mathematics subtest used here), steps to control Type I errors would need to be taken. Larger groups of items with high predictive power could be identified using decision tree methods, but only if the groups feature items that are predictive on their own.
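A toy version of the pairwise search proposed in Section 5.3 might look as follows (a sketch with simulated item data and a hand-rolled logistic regression; nothing here is taken from the study, and the planted dependency is deliberately simple):

```python
import numpy as np
from itertools import combinations

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Minimal logistic regression via gradient descent; returns training accuracy."""
    X = np.column_stack([np.ones(len(X)), X])        # add intercept column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return ((X @ w > 0) == y).mean()

# Six simulated dichotomous items; "gender" depends on the
# interaction of items 0 and 1 (both solved), not on any item alone.
rng = np.random.default_rng(3)
items = rng.integers(0, 2, size=(400, 6))
gender = items[:, 0] & items[:, 1]

# Score every item pair with main effects plus their interaction term.
scores = {}
for i, j in combinations(range(items.shape[1]), 2):
    feats = np.column_stack([items[:, i], items[:, j], items[:, i] * items[:, j]])
    scores[(i, j)] = fit_logistic(feats, gender)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))   # the planted pair (0, 1) wins
```

A full search over all item pairings would additionally need the multiple-comparison safeguards mentioned above.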
5.4. Remove information

Some items might not be informative of gender on their own but only through interaction with other items. One way of assessing how much such items contribute to gender detection would be to randomly redistribute the answers to an item among the subjects and measure the impact on prediction accuracy. As the answers are no longer tied to a person, they cannot be informative of that person's gender. The resulting drop in accuracy would thus be a measure of their contribution to gender detection.

Declaration of Competing Interest

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Appendix A. Network design and selection process

All networks were made up of rectified linear units in the hidden layers and a single sigmoid output unit representing gender. The networks were trained in batches of 128 cases at a time. Adam (Kingma & Ba, 2014) was used to minimize the cross-entropy loss between predictions and targets. Cross-entropy loss strongly penalizes very wrong predictions, making them less likely. To combat overfitting, L2 regularization and dropout (Srivastava et al., 2014) were used. The dropout rates were 0.8 at the input units and 0.5 at the hidden units. Batch normalization (Ioffe & Szegedy, 2015) was used as an additional regularizer and to speed up training.

The learning rate determines how strongly a network changes its parameters during training. It was multiplied by a factor of 0.8 every time the network had been trained on all cases (one epoch), leading to more fine-grained adjustments as the training progressed. The training was stopped if validation accuracy did not improve over the last 10 epochs. The network was then reverted to the state in which it showed the highest validation accuracy.

The setup of the hidden layers was determined by a search process. Ten possible configurations of hidden layers and units were tried, featuring between 1 and 8 layers and between 100 and 800 units per layer. In the first round of the search process, a hidden layer configuration was randomly drawn, a network using that configuration was trained, and its validation accuracy was computed. This was repeated 600 times. Then the 400 networks that performed best on the validation set were selected, and the fraction with which each configuration was represented in that group was computed. This fraction then served as the probability with which each configuration would be drawn in the next round. This way, configurations that performed better were more likely to be tried again. This process was repeated several times with continuously decreasing numbers of trained and selected networks. In total, 2000 networks were trained for each sample partition.

The initial learning rate and the size of the L2 penalty were similarly determined by this process. Both parameters were given a starting range from which values were drawn. In the selection stage, this range was capped to include only the values present in the best-performing models. For both parameters the range was tied to the hidden layer configuration, so that for each configuration a range of suitable parameters could be determined and subsequently used for training. For each sample partition, the network that performed best on the validation set was chosen for further evaluation.
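The redistribution check proposed in Section 5.4 amounts to a permutation-importance measure and can be sketched as follows (a toy predictor and simulated data; all names are my own, not the study's implementation):

```python
import numpy as np

def permutation_importance(predict, items, gender, col, rng):
    """Drop in accuracy after randomly redistributing one item's answers
    among the subjects, so the answers are no longer tied to a person."""
    base = (predict(items) == gender).mean()
    shuffled = items.copy()
    rng.shuffle(shuffled[:, col])                  # redistribute column `col`
    return base - (predict(shuffled) == gender).mean()

# Toy setup: only item 0 carries gender information, and the stand-in
# "model" is a predictor that reads exactly that item.
rng = np.random.default_rng(4)
items = rng.integers(0, 2, size=(10_000, 5))
gender = items[:, 0]
predict = lambda X: X[:, 0]

drop_informative = permutation_importance(predict, items, gender, 0, rng)
drop_uninformative = permutation_importance(predict, items, gender, 3, rng)
print(drop_informative)    # large drop: the item contributed to detection
print(drop_uninformative)  # no drop: the item contributed nothing
```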
References

Bailey, D. H., Littlefield, A., & Geary, D. C. (2012). The codevelopment of skill at and preference for use of retrieval-based processes for solving addition problems: Individual and sex differences from first to sixth grades. Journal of Experimental Child Psychology, 113(1), 78–92. https://doi.org/10.1016/j.jecp.2012.04.014
Baye, A., & Monseur, C. (2016). Gender differences in variability and extreme scores in an international context. Large-Scale Assessments in Education, 4(1), 541. https://doi.org/10.1186/s40536-015-0015-x
Benbow, C. P. (1988). Sex differences in mathematical reasoning ability in intellectually talented preadolescents: Their nature, effects, and possible causes. Behavioral and Brain Sciences, 11(2), 169–232.
Casey, M. B., Nuttall, R., Pezaris, E., & Benbow, C. P. (1995). The influence of spatial ability on gender differences in mathematics college entrance test scores across diverse samples. Developmental Psychology, 31(4), 697–705. https://doi.org/10.1037/0012-1649.31.4.697
Diakidoy, I.-A. N., Stylianou, P., Karefillidou, C., & Papageorgiou, P. (2005). The relationship between listening and reading comprehension of different types of text at increasing grade levels. Reading Psychology, 26(1), 55–80. https://doi.org/10.1080/02702710590910584
Else-Quest, N. M., Hyde, J. S., & Linn, M. C. (2010). Cross-national patterns of gender differences in mathematics: A meta-analysis. Psychological Bulletin, 136(1), 103–127. https://doi.org/10.1037/a0018053
Fennema, E., Carpenter, T. P., Jacobs, V. R., Franke, M. L., & Levi, L. W. (1998). A longitudinal study of gender differences in young children's mathematical thinking. Educational Researcher, 27(5), 6–11. https://doi.org/10.3102/0013189X027005006
Gallagher, A., Levin, J., & Cahalan, C. (2002). Cognitive patterns of gender differences on mathematics admissions tests. ETS Research Report Series, 2002(2), i–30. https://doi.org/10.1002/j.2333-8504.2002.tb01886.x
Gallagher, A., & Lisi, R. (1994). Gender differences in scholastic aptitude test: Mathematics problem solving among high-ability students. Journal of Educational Psychology, 86(2), 204–211. https://doi.org/10.1037/0022-0663.86.2.204
Ganley, C. M., & Vasilyeva, M. (2011). Sex differences in the relation between math performance, spatial skills, and attitudes. Journal of Applied Developmental Psychology, 32(4), 235–242. https://doi.org/10.1016/j.appdev.2011.04.001
Geary, D. C. (1996). Sexual selection and sex differences in mathematical abilities. Behavioral and Brain Sciences, 19(2), 229–247.
Geary, D. C., Saults, S. J., Liu, F., & Hoard, M. K. (2000). Sex differences in spatial cognition, computational fluency, and arithmetical reasoning. Journal of Experimental Child Psychology, 77(4), 337–353. https://doi.org/10.1006/jecp.2000.2594
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.
Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. https://doi.org/10.1109/ICASSP.2013.6638947
Halpern, D. F. (2012). Sex differences in cognitive abilities (4th ed.). New York, NY: Psychology Press.
Halpern, D. F., Benbow, C. P., Geary, D. C., Gur, R. C., Hyde, J. S., & Gernsbacher, M. A. (2007). The science of sex differences in science and mathematics. Psychological Science in the Public Interest, 8(1), 1–51. https://doi.org/10.1111/j.1529-1006.2007.00032.x
Hedges, L. V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269(5220), 41–45.
Huberty, C. J., & Olejnik, S. (2006). Applied MANOVA and discriminant analysis (2nd ed.). Hoboken, NJ: Wiley-Interscience. https://doi.org/10.1002/047178947X
Hyde, J. S., Fennema, E., & Lamon, S. J. (1990). Gender differences in mathematics performance: A meta-analysis. Psychological Bulletin, 107(2), 139–155. https://doi.org/10.1037/0033-2909.107.2.139
Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104(1), 53–69. https://doi.org/10.1037/0033-2909.104.1.53
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning (pp. 448–456). PMLR.
Johnson, S. (1996). The contribution of large-scale assessment programmes to research on gender differences. Educational Research and Evaluation, 2(1), 25–49. https://doi.org/10.1080/1380361960020102
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. ArXiv e-prints. Retrieved from https://arxiv.org/abs/1412.6980
Klein, P. S., Adi-Japha, E., & Hakak-Benizri, S. (2010). Mathematical thinking of kindergarten boys and girls: Similar achievement, different contributing processes. Educational Studies in Mathematics, 73(3), 233–246. https://doi.org/10.1007/s10649-009-9216-y
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 25, pp. 1097–1105). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Lehto, J. E., & Anttila, M. (2003). Listening comprehension in primary level grades two, four and six. Scandinavian Journal of Educational Research, 47(2), 133–143. https://doi.org/10.1080/00313830308615
Linn, M. C., & Petersen, A. C. (1985). Emergence and characterization of sex differences in spatial ability: A meta-analysis. Child Development, 56(6), 1479–1498. https://doi.org/10.2307/1130467
Liu, O., & Wilson, M. (2009). Gender differences in large-scale math assessments: PISA trend 2000 and 2003. Applied Measurement in Education, 22(2), 164–184. https://doi.org/10.1080/08957340902754635
Mullis, I. V. S., Martin, M. O., & Foy, P. (2008). TIMSS 2007 international mathematics report: Findings from IEA's Trends in International Mathematics and Science Study at the fourth and eighth grades. Chestnut Hill, MA: TIMSS & PIRLS International Study Center.
Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. Chestnut Hill, MA: TIMSS & PIRLS International Study Center.
Mullis, I. V. S., Martin, M. O., Foy, P., & Hopper, M. (2016). TIMSS 2015 international results in mathematics. Retrieved from http://timssandpirls.bc.edu/timss2015/international-results/wp-content/uploads/filebase/full%20pdfs/T15-International-Results-in-Mathematics.pdf
Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., & Chrostowski, S. J. (2004). TIMSS 2003 international mathematics report: Findings from IEA's Trends in International Mathematics and Science Study at the fourth and eighth grades. Chestnut Hill, MA: TIMSS & PIRLS International Study Center.
Royer, J. M., Sinatra, G. M., & Schumer, H. (1990). Patterns of individual differences in the development of listening and reading comprehension. Contemporary Educational Psychology, 15(2), 183–196. https://doi.org/10.1016/0361-476X(90)90016-T
Siegler, R. S. (1987). The perils of averaging data over strategies: An example from children's addition. Journal of Experimental Psychology: General, 116(3), 250–264. https://doi.org/10.1037/0096-3445.116.3.250
Siegler, R. S. (1988). Individual differences in strategy choices: Good students, not-so-good students, and perfectionists. Child Development, 59(4), 833–851. https://doi.org/10.2307/1130252
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Stoet, G., & Geary, D. C. (2013). Sex differences in mathematics and reading achievement are inversely related: Within- and across-nation assessment of 10 years of PISA data. PLoS One, 8(3), e57988. https://doi.org/10.1371/journal.pone.0057988
Strand, S., Deary, I. J., & Smith, P. (2006). Sex differences in cognitive abilities test scores: A UK national picture. The British Journal of Educational Psychology, 76(Pt 3), 463–480. https://doi.org/10.1348/000709905X50906
Voyer, D., Voyer, S., & Bryden, M. P. (1995). Magnitude of sex differences in spatial abilities: A meta-analysis and consideration of critical variables. Psychological Bulletin, 117(2), 250–270. https://doi.org/10.1037/0033-2909.117.2.250
Voyer, D., & Voyer, S. D. (2014). Gender differences in scholastic achievement: A meta-analysis. Psychological Bulletin, 140(4), 1174–1204. https://doi.org/10.1037/a0036620
Wai, J., Cacchio, M., Putallaz, M., & Makel, M. C. (2010). Sex differences in the right tail of cognitive abilities: A 30 year examination. Intelligence, 38(4), 412–423. https://doi.org/10.1016/j.intell.2010.04.006
Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum Associates.