Technological Forecasting & Social Change 149 (2019) 119756
Contents lists available at ScienceDirect
Technological Forecasting & Social Change journal homepage: www.elsevier.com/locate/techfore
A novel machine learning approach for evaluation of public policies: An application in relation to the performance of university researchers
María Teresa Ballestar (a,⁎), Luis Miguel Doncel (b), Jorge Sainz (c), Arturo Ortigosa-Blanch (d)
a ESIC Business & Marketing School, Spain
b Universidad Rey Juan Carlos, Spain
c University of Bath, UK and Universidad Rey Juan Carlos, Spain
d ESIC Business & Marketing School, Spain
ARTICLE INFO
Keywords: Research evaluation; Machine learning; Longitudinal clustering; Incentive-based policies
ABSTRACT
Research has become the main reference point for academic life in modern universities. Research incentives have been a controversial issue because of the difficulty of identifying who the main beneficiaries are and what the long-term effects are. Still, new policies including financial incentives have been adopted to increase research output at all possible levels. Little literature has been devoted to the response to those incentives. To bridge this gap, we carry out our analysis with data from a six-year program developed in Madrid (Spain). Instead of using a traditional econometric approach, we design a machine learning multilevel model to discover on whom, when, and for how long those policies have an effect. The empirical model consists of an automated nested longitudinal clustering (ANLC) performed in two stages. Firstly, it performs a stratification of academics, and secondly, it performs a longitudinal segmentation for each group. The second stage considers the researchers’ sociodemographic and academic information and the evolution of their performance over time in the form of the annual percentage variation of their marks over the period. The new methodology, whose robustness is tested with a multilayer perceptron artificial neural network with a back-propagation learning algorithm, shows that tenure track researchers respond better to incentives than tenured researchers, and also that gender plays an important role in academia. These discoveries are relevant to administrations and universities for understanding the productivity of academics working under long-term incentive-based programs, and the drawbacks and inequalities to address in order to maximize the generation of knowledge.
1. Introduction
The design of incentives for academics and the evaluation of their performance is a pressing issue for higher education institutions and funding agencies. Over the last decades, academia has witnessed a shift in the balance between teaching and research in favor of the latter. This process is heavily promoted by the increasing internationalization of students, the parallel emergence of international rankings (Shanghai, Times Higher Ed, European Commission, etc.) to reduce informational asymmetries, and the global competition for research funds from private and public organizations (Hicks, 2012). The logic behind this behavior is the belief, difficult to measure, that good research is related to good teaching. This is driving professors to actively promote their accomplishments and their visibility in the search for funding from financing bodies and their own institutions (Auranen and Nieminen, 2010). The emergence of networks of knowledge rapidly translates into a differentiation among researchers. Those who are in top tier universities play in a different league and have average or below average impact even if they have higher productivity (Taylor, 2011). To increase the dominance of their own researchers, different countries have built incentive systems meant to promote the quality of their institutions and researchers, aiming to increase their global impact. Countries learn from the policy experiments of their peers (Dobbin et al., 2007; Easterly and Levine, 1997). Two of the best known are the 2014 Research Excellence Framework (REF) and the 2008 Research Assessment Exercise (RAE) in the United Kingdom. It is difficult to overstate their relevance; “with 0.88% of global population, 3.2% of global R&D expenditure, and 4.1% of global researchers, accounting for 9.5% of research downloads, 11.6% of citations, and 15.9% of the world's most
⁎ Corresponding author.
E-mail addresses: [email protected] (M.T. Ballestar), [email protected] (L.M. Doncel), [email protected] (J. Sainz), [email protected] (A. Ortigosa-Blanch).
https://doi.org/10.1016/j.techfore.2019.119756
Received 12 June 2019; Received in revised form 2 September 2019; Accepted 19 September 2019
0040-1625/ © 2019 Elsevier Inc. All rights reserved.
M.T. Ballestar, et al.
can be used in several policy issues with large social and economic implications that go further than policy management but also have a theoretical impact (Kleinberg et al., 2015; Kleinberg et al., 2018). While these methods clearly have many advantages, Athey (2017) points out that ML-driven policies may deprive stakeholders of the knowledge about how and why policies are made, raising issues like transparency, interpretability, fairness, or discrimination. But as she concludes, the need for explanations of real-world policy issues through big data will bring positive new methodological advances and the possibility of their implementation for other policies based on data science. Our methods of choice, clustering and artificial neural networks, are recommended by Mjolsness and DeCoste (2001) in their seminal paper in Science about machine learning methods, given their ability to “… improve their performance of a task on the basis of their own previous experience.” They have been widely used before in areas like ecology (Kattge et al., 2011), engineering (LaConte et al., 2005), employment (Cvecic and Sokolic, 2018; Wang, 2019) and business (Ballestar et al., 2018a,b; 2019), and follow a validation scheme similar to the one designed by Pers et al. (2009). The application of these methods to higher education has mainly concerned student and teacher evaluation (Sin and Muthu, 2015). Although this analysis, like any knowledge structure, is heavily dependent on the initial structure, some predictions can be established. This is the case with academic tenure, which is a traditional topic in higher education research (Musselin, 2005; Chait, 2009; Epstein and Fischer, 2017) as it has a lasting effect on how academics behave, especially in Continental Europe, where, in many cases, contracts are linked to public sector job security.
Our analysis will provide new results on the effect of different work contracts, which is relevant for designing new career paths, as European countries have already done, following the recommendations by the European Commission and the OECD (Wang et al., 2018). The analysis of the effects of evaluation on research strategies is a much less-travelled path. None of the previous contributions in the literature reviewed have the same number of evaluations to study the long-term effect of incentives on the behavior of academics. We test the performance of the program over time and how the characteristics of the researchers correspond to different responses to the incentives. To do so, we developed an innovative machine learning method that processes and learns from a longitudinal dataset in two different stages. Our contribution is dual. Firstly, it focuses on the development of a robust ML method and, secondly, that method is used to evaluate the effects of economic incentives on research in long-term public policies. We focus on two main hypotheses that are relevant for policy design. First, employment status, in terms of contractual relationship, is key to how incentives affect academics (H1) and, secondly, the effects of this type of policy dilute over time (H2). Both issues have already been discussed at some length in the analysis of the British system (Hobbs and Roberts, 2016), but the lack of data covering an extended period and a large number of individuals makes our results very relevant to the discussions. For the purpose of robustness, we test the results of our ML model, based on an Automated Nested Longitudinal Clustering (ANLC) method, by using a completely new ML model based on a multilayer perceptron (MLP) artificial neural network (ANN) with a back-propagation learning algorithm, as designed in Ballestar et al. (2019), which will also allow for forecasting the results of future policies.
Both hypotheses are documented in the literature on human resources (see the reviews by Jenkins et al., 1998 or Wright and Boswell, 2002) and in the experimental approach by Camerer and Hogarth (1999). The use of incentives has been the subject of a heated debate on its effects on academic integrity, reproducibility and focus (see for example Edwards and Roy, 2017; Chambers et al., 2015; Lakens et al., 2018). To our knowledge, none have used a dynamic ML approach such as ours, which allows for real-time traceability and reporting in a dynamic environment.
highly-cited articles” (Hobbs and Roberts, 2016). Some of the results of both exercises were as expected from the first theoretical models in the literature: increase in the hiring of quality academics, pressure for results which widely differ across disciplines and across universities, an improvement in the international rankings and higher visibility of UK universities, which have become international competitors (Taylor, 2011; Tietze, 2018). The competition between departments results in a “hunting season” for top tier academics across the globe, as REF and RAE periods approach. There were also unexpected side results. There is an acute case of short-termism in research. Much of the effort was put into the periods close to the evaluations, so academics were more focused on the volume of the output than on its quality, which affected its impact. Long-term projects are sidelined even if the expectations of their output are large because of the acute need for funding. Also, and although it was not generalized, there has been a surge in the cases of malpractice in academia that has affected the credibility of the whole profession and has increased the call for higher ethical standards for researchers, journals and the entire system (Pontille and Torny, 2010; Rothstein and Uslaner, 2005). To further analyze these issues, which are similar to other public policy assessments, we take advantage of a newly available dataset of an incentives program implemented in the Madrid Region to boost the productivity of the public universities from 2005 to 2010. Instead of using traditional econometrics we developed a data science method, consisting of an Automated Nested Longitudinal Clustering (ANLC) performed in two stages that allows us to find the main results of the project, its dynamic behavior and, through machine learning, facilitates the application of the method to the evaluation of public policies. 
Our results show that long-term incentive programs are more effective with tenure-track researchers, who benefit from them in two different ways at the same time: gaining tenure and obtaining the economic rewards. This finding is valid across areas of knowledge and universities. Also, we see that the effects of the program fade over time and that gender plays an important role in academia. The comparison of the performance between men and women varies depending on their contractual relationship with the university. Men reach higher marks in the tenured researchers’ group, and women in the tenure-track researchers’ group, where researchers are significantly younger. In the next sections we survey the literature on Artificial Intelligence (AI), specifically Machine Learning, and also the literature on public policy evaluation. We then analyze our dataset by using a novel multilevel machine learning approach that applies stratification in the first stage and an unsupervised machine learning method in the second stage. We also validate the robustness of this novel method, analyze the results and propose further developments in the conclusions section. The validation of the robustness of the ANLC also represents an innovation, as we develop a completely new supervised machine learning method, a multilayer perceptron (MLP) artificial neural network (ANN) with a back-propagation learning algorithm, to test the results.

2. Theoretical framework
The use of AI in public policy design is a relatively novel area of study. One of the AI instruments that is becoming widely used is Machine Learning (ML). Recently, Athey and Imbens (2017) addressed the advantages of these methods, pointing out that “Machine learning methods provide important new tools to improve estimation of causal effects” which can reduce the “…reliance of these estimates on modeling assumptions…”, enhancing the “…credibility of policy analysis”. Chalfin et al. (2016) point out some additional advantages, as ML allows for a trade-off between bias and variance, while with traditional econometric methods prediction errors are a function of variance as well as bias when looking for accuracy in out-of-sample predictions. These new ML methods are able to generate better predictions that
3. Data collection
from sexennium, 2.8 points from projects and 1.5 points from quinquennium. Researchers in the sample reach an average of 3.41 points for sexennium, 2.01 points for projects and 0.98 points for quinquennium. Therefore, the average of the total mark is 6.45, with a growth of 9.36% within the six years.
b) Tenure track researchers can obtain a maximum of 4.1 points coming from accreditation, 4.4 points from projects and 1.5 points from quinquennium. Researchers in the sample reach an average of 1.27 points for accreditation, 3.11 points for projects and 0.46 points for quinquennium. Therefore, the average of the total mark is 5.46, with a growth of 40.8% within the six years.
We analyze a monetary incentives program developed in the Region of Madrid (Spain) from 2005 to 2010, when it was terminated due to budgetary constraints related to the financial crisis. Also, after 2010 the national government imposed hiring restrictions on the universities, which would make it impossible to track the diverse effects. The participants were the academics of the six public universities of Madrid, around 25,000 individuals, some of whom hold high positions in international rankings in different fields of knowledge (Aguion et al., 2010). We included in the sample only the individuals who participated during the whole six years of the program. By doing this we can properly analyze the evolution of their performance over time with our ML method, as most clustering methods do not allow for the presence of missing values (Yu et al., 2014). We applied the marginalization method, a well-known method for handling missing data without biasing the sample. Bias is one of the problems of other methods such as imputation, where the estimates of the missing values are inherently less reliable than the observed data. Marginalization is a better solution as it does not create any new data values (Wagstaff, 2004). Hence, researchers who did not participate during the whole period were eliminated, leaving us with a sample of 5861 researchers. The recommended sample size for this method is 2^k cases (k = number of variables) and preferably 5 * 2^k cases (Formann, 1984; Dolnicar, 2002), which would be a minimum of 640 individuals, far fewer than the sample in this research. As the sample size is large enough, we can model it without imputing missing data, thereby avoiding the bias that imputation may introduce.
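The complete-case filtering and the sample-size rule of thumb described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the function names and the toy data are hypothetical, and k = 7 is inferred here only because it is the value for which 5 * 2^k gives the 640 figure quoted in the text.

```python
# Hypothetical layout: researcher id -> {year: total mark}; a missing year or
# None means the researcher did not take part in the program that year.
YEARS = range(2005, 2011)


def complete_cases(records):
    """Keep only researchers with a mark in every program year (nothing is imputed)."""
    return {
        rid: marks
        for rid, marks in records.items()
        if all(marks.get(year) is not None for year in YEARS)
    }


def minimum_sample_size(k, preferred=True):
    """Rule of thumb cited in the text: 2^k cases, preferably 5 * 2^k."""
    base = 2 ** k
    return 5 * base if preferred else base


# Toy data (assumed values, for illustration only).
records = {
    "r1": {2005: 5.0, 2006: 5.5, 2007: 6.0, 2008: 6.0, 2009: 6.5, 2010: 6.5},
    "r2": {2005: 4.0, 2006: None, 2007: 4.5, 2008: 5.0, 2009: 5.0, 2010: 5.5},
    "r3": {2005: 7.0, 2006: 7.0, 2007: 7.5},
}
kept = complete_cases(records)
print(sorted(kept))            # ['r1']
print(minimum_sample_size(7))  # 640
```

Dropping incomplete records keeps every analyzed value an observed one, at the cost of a smaller sample; the rule of thumb is then a quick check that the remaining sample is still large enough.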
Before making the selection of the individuals in the sample we carefully checked year by year that the characteristics of the non-selected were fully represented in the data sample by checking the variables as follows. Researchers who joined the program were of two types depending on their contractual relationship with the university, tenured (permanent - civil servant contract) (4279 researchers, 73% of the sample) or tenure track (temporary or permanent but non-civil servant contracts) (1582 researchers, 26.99% of the sample). The criteria of the program to evaluate the productivity of the researchers vary according to type of contractual relationship with the university. Each year researchers who took part in the program got a total mark for their productivity in academia which is made up of three different aspects of their performance. Projects and quinquennium are common criteria for both tenured and tenure track researchers, but the third criterion, sexennium, is only for tenured researchers, while accreditation is focused on tenure track researchers. Projects consist of research projects within a delimited period of time whose aim is to increase the knowledge on a specific topic. These projects have to provide income to the institution where the researcher works. This means they are financed by public funds external to the university. Quinquennium consists of a positive assessment of the researcher's teaching activity over five years which may be full-time or part-time. Sexennium consists of a positive assessment of the research activity over a six-year period for tenured researchers. Contract non-tenured PhD professors need to obtain a favorable report from the National Agency for Quality Assessment and Accreditation of Spain (ANECA). This report is called accreditation. This accreditation is needed to apply for entry into the civil servant's university teaching bodies and become a tenured researcher. 
The overall mark the researcher can obtain each year goes from zero to ten and it is used to distribute the economic incentives among researchers up to the total budget available for the year which in 2010 was 15,000,000€. The overall annual mark is calculated as follows:
The data from the six years were aggregated at the researcher level in a single table with 5861 records and 96 variables. The first 12 variables contain the researcher's characteristics and the following 84 were calculated to measure the researcher's performance over the years from a longitudinal perspective. Of these, 14 were relevant for the empirical analysis. Descriptive analysis is detailed in the following sections.

3.1. Researchers’ characteristics
The researchers’ characteristic variables used in the model were the type of contractual relationship with the university, gender, and area of knowledge.

3.1.1. Type of contractual relationship
Researchers have two types of contractual relationship with the university: civil servants of university teaching bodies providing full-time service, also called tenured researchers, who represent 73.01% of the sample (4279 researchers), and tenure track researchers (some of whom have a permanent relationship with the university but are not civil servants and hypothetically may be furloughed), who represent 26.99% of the sample (1582 researchers).

3.1.2. Gender distribution
Men accounted for 62.3% of the sample (mean age 52.05 years) and women accounted for 37.7% of the sample (mean age 50.12 years). This distribution varies depending on the type of contractual relationship the researcher has with the university. In the tenured group men accounted for 64.9% of the sample (mean age 54.00 years) and women for 35.1% of the sample (mean age 52.97 years). In the tenure track group men accounted for 55.3% of the sample (mean age 45.88 years) and women for 44.7% of the sample (mean age 44.06 years).

3.1.3. Area of knowledge
Researchers belong to 168 different areas of knowledge, with a high concentration in biology and biomedicine (13.3%), followed by applied economics (3.5%), computer sciences and AI (3.4%), medicine (2.1%), and the rest of the areas of knowledge with weights of less than 2%.

3.2. Researchers’ longitudinal performance
The longitudinal performance of the researchers is calculated in the form of eleven numerical variables. The evolution of their performance over time is evaluated from two different perspectives: by analyzing the absolute value of the total marks obtained each year within the program, and by considering the annual percentage variation of those marks.

3.2.1. Annual marks
Six variables were calculated with the total marks as the addition of the individual marks obtained in each evaluated category on a yearly basis (from 2005 to 2010). Total marks for tenured researchers are calculated as the addition of sexennium, projects and quinquennium marks, while total marks for tenure track researchers are calculated as
a) Tenured researchers can obtain a maximum of 5.7 points coming
the addition of accreditation, projects and quinquennium marks.
to the last period where the p-value was 0.015.
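A one-way ANOVA comparison between the two contractual groups, of the kind reported here, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function name and the toy marks are hypothetical, and only the F statistic and its degrees of freedom are computed. Obtaining the p-value requires the F distribution, for example through a statistics library such as SciPy's `scipy.stats.f_oneway`.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA across the groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy annual marks for one year (assumed values, not the study's data).
tenured = [6.0, 6.5, 7.0, 6.5]
tenure_track = [4.0, 4.5, 5.0, 4.5]
f_stat, df_b, df_w = one_way_anova_f(tenured, tenure_track)
print(round(f_stat, 2), df_b, df_w)  # 48.0 1 6
```

A large F relative to its degrees of freedom indicates that the difference between group means is large compared with the spread inside each group, which is the evidence used to justify stratifying by contract type.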
3.2.2. Annual percentage variation of marks
Five variables were calculated to measure the annual percentage variation of achieved marks from 2005 to 2010 at researcher level. The percentage variation allows for classification of researchers according to the evolution of their performance over time, independently of their baseline, which could bias the results if we were using absolute values instead.
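As a rough illustration of how the longitudinal variables could be derived, the sketch below sums the component marks into a yearly total and then computes the five year-over-year percentage variations from six yearly totals. The function names and the toy values are hypothetical, and returning `None` when the previous year's mark is zero is an assumption made here; the text does not say how such cases were handled.

```python
def total_mark(components):
    """Total annual mark as the sum of the component marks for that year."""
    return sum(components.values())


def annual_pct_variation(marks):
    """Year-over-year percentage change; None where the previous mark is zero (assumption)."""
    variations = []
    for prev, curr in zip(marks, marks[1:]):
        if prev == 0:
            variations.append(None)  # percentage change is undefined for a zero baseline
        else:
            variations.append(100.0 * (curr - prev) / prev)
    return variations


# Hypothetical tenure track researcher: component marks for one year.
first_year = {"accreditation": 1.0, "projects": 2.5, "quinquennium": 0.5}
print(total_mark(first_year))  # 4.0

# Hypothetical total marks for 2005 to 2010 (six values give five variations).
marks_2005_2010 = [4.0, 5.0, 5.0, 5.5, 5.5, 5.5]
print(annual_pct_variation(marks_2005_2010))  # [25.0, 0.0, 10.0, 0.0, 0.0]
```

Working with relative changes puts a researcher who moves from 4.0 to 5.0 and one who moves from 8.0 to 10.0 on the same footing, which is the baseline independence the text refers to.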
4.2. Second stage of the multilevel model: clustering process
In the second stage of the multilevel model, we apply a two-step cluster analysis to group researchers based on their characteristics and longitudinal performance (Heggeseth et al., 2015) within each of the two groups (tenured and tenure track researchers) identified as significantly different in the first stage of this analysis. Two-step analysis has several advantages over other clustering methods, such as being able to manage and analyze large datasets of categorical and continuous variables and to select the number of clusters automatically (Jones and Nagin, 2007). The two-step clustering method gives the best results when all variables are independent, categorical variables follow a multinomial distribution and continuous variables follow a normal distribution (Hox et al., 2017). According to empirical internal testing, the procedure is robust to violations of the assumptions for both categorical and continuous variables. And so, “because cluster analysis does not involve hypothesis testing and calculation of observed significance levels, other than for descriptive follow-up, it's perfectly acceptable to cluster data that may not meet the assumptions for best performance” (Norušis, 2014; Ballestar et al., 2018a). Our two groups of data met the criteria described by Norušis (2014). On the one hand, the first two-step cluster analysis is performed on the tenured researchers’ group with 4279 records, where we have one categorical variable (gender) and five continuous variables (annual percentage variation of achieved marks from 2005 to 2010). On the other hand, the second two-step cluster analysis is performed on the tenure track researchers’ group with 1582 records, where we have two categorical variables (gender and area of knowledge) and five continuous variables (annual percentage variation of achieved marks from 2005 to 2010).
The two-step cluster analysis method follows these steps. Firstly, the raw data were pre-clustered using the log-likelihood distance as the similarity criterion. Here, a sequential process was used in which standardized data records were merged into an existing pre-cluster or assigned to a new pre-cluster, whichever led to the largest log-likelihood. Secondly, the pre-clusters were combined using agglomerative hierarchical clustering under the Schwarz criterion (BIC), which yielded three clusters in each of the two groups. We used silhouette validation (Rousseeuw, 1987) to evaluate the consistency of the clustering structure. This measures cohesion between elements within a cluster and separation between clusters. The silhouette coefficient ranges from -1 to 1, where -1 means that the model is poor and 1 means that the model is optimal. Values greater than 0.5 indicate good model quality (Kaufman and Rousseeuw, 1990; Ballestar et al., 2018a). The silhouette coefficient was 0.9 for the tenured researchers’ group and 0.8 for the tenure track researchers’ group, so the two models were robust. The predictor importance relates to the importance of each variable of the model in making a prediction. It does not relate to the model accuracy or to whether or not the prediction is accurate. In the tenured researchers’ group the highest importance is held by the gender variable, followed by the annual percentage variation over the researchers’ achieved marks from periods 5, 1, 2, 4 and 3. In the tenure track researchers’ group the highest importance is also held by the gender variable, followed by the annual percentage variation over the researchers’ achieved marks from periods 1, 2, 3, 5 and 4, and finally the area of knowledge.
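The silhouette validation described above can be illustrated with a small, self-contained sketch. It assumes Euclidean distance on purely numeric features and uses hypothetical toy points; the study's two-step procedure relies on a log-likelihood distance over mixed categorical and continuous variables, so this shows the measure itself rather than reproducing the reported coefficients.

```python
from math import dist
from statistics import mean


def silhouette_coefficient(points, labels):
    """Mean silhouette width over all points, using Euclidean distance (assumption)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance from p to the other members of its own cluster
        # (the distance from p to itself is 0, so dividing by len - 1 excludes it).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster.
        b = min(
            mean(dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(scores)


# Two well-separated toy clusters (assumed values, for illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_coefficient(points, labels)
print(round(score, 2))  # 0.9
```

Values near 1 mean each point is much closer to its own cluster than to the nearest other cluster, which is the sense in which coefficients above 0.5 are read as good cluster quality.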
4. Empirical analysis and results
This investigation applies data science methods to public policy analysis. In this case we developed a novel approach, a machine learning multilevel model consisting of an ANLC performed in two stages. The first stage performs a stratification of the academics depending on their contractual relationship with the university, a confounding variable. The main characteristic of a confounding variable is that it correlates with both the predictor of interest and the outcome (Anderson et al., 1980; Frank, 2000). We use the confounding variable to perform the stratification and later analyze the resulting groups independently in a second stage (Austin and Brunner, 2004; Austin, 2011). The second stage performs a longitudinal segmentation for each of these two groups, taking into consideration not only the researchers’ characteristics, such as sociodemographic and academic information, but also the evolution of their performance over time in the form of the annual percentage variation of their marks over a period of six years. The aim is to identify the different groups of researchers depending on their characteristics and also their response to the incentive-based program within the six years that it lasted.

4.1. First stage of the multilevel model: stratification
In the first stage of the multilevel model, a stratification method was used to classify the researchers’ sample into two groups depending on their type of contractual relationship with the university. The criteria to evaluate researchers’ performance are different for tenured researchers than for tenure track researchers, and these differences are expected to have an impact on the total marks researchers can achieve and on the evolution of their performance too.
Therefore, the contractual relationship, a categorical variable, acts as a confounding variable, and this stratification stage mitigates its effects when analyzing researchers’ longitudinal performance (Cochran, 1968; Anderson et al., 2009). Six one-way ANOVA tests were conducted to confirm significant differences between the two groups in terms of researchers’ average annual marks within the six years of the program. The goal of the ANOVA testing was to confirm the suitability of the type of contractual relationship between the researcher and the university as a confounding variable and, therefore, as the variable for performing the stratification (Pourhoseingholi et al., 2012). As a result, the first group is made up of 4279 tenured researchers (73.01% of the sample) and the second group is made up of 1582 tenure track researchers (26.99% of the sample). The ANOVA tests revealed statistically significant differences between the two groups in terms of average marks obtained within the six years that the program lasted. With the aim of guaranteeing the robustness of the findings, another five one-way ANOVA tests were conducted to confirm significant differences between the two groups, this time in terms of annual percentage variation over researchers’ achieved marks from 2005 to 2010, confirming that the differences between the two groups concern not only their total marks but also the evolution of their performance over time. The ANOVA tests revealed significant differences between the two groups in terms of the five periods of annual percentage variation over researchers’ marks; all the p-values were 0.000 but one, corresponding
4.3. Clustering structure for tenured and tenure track researchers’ groups
On the one hand, tenured researchers were grouped into three clusters based on their gender and their annual percentage variation
Fig. 1. Cluster distribution chart for tenured and tenure track researchers’ groups.

Table 1
Researchers’ average annual marks from 2005 to 2010.

Annual marks within the whole six-year program (mean)
Group                     Sample size   2005   2006   2007   2008   2009   2010   Overall average
Total portfolio           5,861         5.62   5.88   6.22   6.39   6.51   6.52   6.19
Tenured researcher        4,279         6.09   6.25   6.49   6.61   6.67   6.66   6.46
Tenure track researcher   1,582         4.37   4.86   5.51   5.79   6.10   6.15   5.46

Annual marks within the whole six-year program (std. deviation)
Group                     Sample size   2005   2006   2007   2008   2009   2010   Overall
Total portfolio           5,861         3.06   3.03   3.01   2.99   2.99   2.97   3.01
Tenured researcher        4,279         3.04   3.04   3.03   3.01   3.02   2.99   3.02
Tenure track researcher   1,582         2.75   2.77   2.84   2.87   2.89   2.91   2.84
Table 2
Annual percentage variation over researchers’ achieved marks from 2005 to 2010.

Annual percentage variation of marks within the six-year program (mean)
Group                     Sample size   Variation 1      Variation 2      Variation 3      Variation 4      Variation 5      Overall variation
                                        (2006 vs 2005)   (2007 vs 2006)   (2008 vs 2007)   (2009 vs 2008)   (2010 vs 2009)   (2010 vs 2005)
Total portfolio           5,861         4.52%            5.85%            2.66%            2.00%            0.09%            15.95%
Tenured researcher        4,279         2.74%            3.71%            1.86%            0.91%            -0.15%           9.36%
Tenure track researcher   1,582         11.20%           13.32%           5.21%            5.35%            0.78%            40.75%

Annual percentage variation of marks within the six-year program (std. deviation)
Group                     Sample size   Std. Deviation 1   Std. Deviation 2   Std. Deviation 3   Std. Deviation 4   Std. Deviation 5   Std. Deviation Overall
Total portfolio           5,861         0.18               0.24               0.12               0.09               0.004              0.37
Tenured researcher        4,279         0.12               0.16               0.09               0.04               0.01               0.24
Tenure track researcher   1,582         0.35               0.46               0.20               0.22               0.03               0.71
over the achieved marks (Fig. 1). The percentage of the sample in each cluster was as follows: Cluster 1 (63.4%), Cluster 2 (34.4%) and Cluster 3 (2.2%). The smallest cluster (Cluster 3) had 93 researchers, and the largest cluster (Cluster 1) had 2,715 researchers. Cluster profiles appear in Table 3.
On the other hand, tenure track researchers were grouped into three clusters based on their gender, area of knowledge and their annual percentage variation over the achieved marks (Fig. 1). The percentage of the sample in each cluster was as follows: Cluster 1 (52.7%), Cluster 2 (4.2%) and Cluster 3 (43.1%). The smallest cluster (Cluster 2) had 66
Table 3 Cluster profiles: centroids of continuous variables and frequencies of categorical variables for tenured and tenured track of researchers.

Tenured group of researchers

Gender                                       Cluster 1   Cluster 2   Cluster 3   Combined
Man      Frequency                               2,715           0          61      2,776
         Percent                                97.80%       0.00%       2.20%    100.00%
Women    Frequency                                   0       1,471          32      1,503
         Percent                                 0.00%      97.87%       2.13%    100.00%

Centroids                                    Cluster 1   Cluster 2   Cluster 3   Combined
perc_variation_1   Mean                           2.96        1.91       18.68       2.74
                   Std. Deviation                 0.13        0.08        0.31       0.12
perc_variation_2   Mean                           3.44        3.28       36.82       3.71
                   Std. Deviation                 0.16        0.14        0.73       0.16
perc_variation_3   Mean                           1.67        1.98        8.64       1.86
                   Std. Deviation                 0.08        0.09        0.23       0.09
perc_variation_4   Mean                           0.99        0.95       -3.95       0.91
                   Std. Deviation                 0.05        0.04        0.12       0.04
perc_variation_5   Mean                          -0.68       -0.16       27.20      -0.15
                   Std. Deviation                 0.03        0.01        0.77       0.01

Tenured track group of researchers

Gender                                       Cluster 1   Cluster 2   Cluster 3   Combined
Man      Frequency                                 834          41           0        875
         Percent                                95.31%       4.69%       0.00%    100.00%
Women    Frequency                                   0          25         682        707
         Percent                                 0.00%       3.54%      96.46%    100.00%

Area of knowledge                            Cluster 1   Cluster 2   Cluster 3   Combined
biology and biomedicine      Frequency             128          16          69        213
                             Percent            60.10%       7.50%      32.40%    100.00%
applied economics            Frequency              34           4          17         55
                             Percent            61.80%       7.30%      30.90%    100.00%
computer sciences and AI     Frequency              38           2          14         54
                             Percent            70.40%       3.70%      25.90%    100.00%
medicine                     Frequency              17           1          15         33
                             Percent            51.50%       3.00%      45.50%    100.00%
rest of areas of knowledge   Frequency             617          43         567      1,227
                             Percent            50.29%       3.50%      46.21%    100.00%

Centroids                                    Cluster 1   Cluster 2   Cluster 3   Combined
perc_variation_1   Mean                           8.20      209.68       11.88      11.20
                   Std. Deviation                 0.25        0.99        0.40       0.35
perc_variation_2   Mean                          11.66       63.20       13.25      13.32
                   Std. Deviation                 0.39        0.93        0.49       0.46
perc_variation_3   Mean                           5.54       15.69        4.28       5.21
                   Std. Deviation                 0.21        0.37        0.18       0.20
perc_variation_4   Mean                           4.77       13.09        5.51       5.35
                   Std. Deviation                 0.19        0.36        0.24       0.22
perc_variation_5   Mean                           0.56       16.43     -0.0013       0.78
                   Std. Deviation                 0.02        0.51      0.0001       0.03
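The cluster sizes, their shares of each group and the gender mix within each cluster can all be derived from the gender frequencies in Table 3. The sketch below is only an illustration of that arithmetic; the helper `cluster_summary` and the dictionary layout are assumptions made for this example, not part of the original analysis.

```python
# Gender frequencies per cluster (clusters 1, 2, 3), copied from Table 3.
gender_counts = {
    "tenured": {"men": [2715, 0, 61], "women": [0, 1471, 32]},
    "tenure_track": {"men": [834, 41, 0], "women": [0, 25, 682]},
}


def cluster_summary(counts):
    """Return, per cluster, its size, its share of the group and its share of women."""
    sizes = [men + women for men, women in zip(counts["men"], counts["women"])]
    group_total = sum(sizes)
    summary = []
    for size, women in zip(sizes, counts["women"]):
        summary.append({
            "size": size,
            "share_of_group_pct": round(100 * size / group_total, 2),
            "women_pct": round(100 * women / size, 2),
        })
    return summary


summaries = {group: cluster_summary(c) for group, c in gender_counts.items()}
for group, rows in summaries.items():
    for number, row in enumerate(rows, start=1):
        print(group, "cluster", number, row)
```

For example, the mixed tenured cluster (Cluster 3) comes out at 93 researchers, 2.17% of the tenured group, of whom 34.41% are women.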
researchers, and the largest cluster (Cluster 1) had 834 researchers. Cluster profiles appear in Table 3. The centroids for the continuous variables appear in the top part, and the frequencies for the categorical variable appear in the bottom part.
5. Robustness of the model
The robustness of the two-stage automated nested longitudinal clustering (ANLC) was tested by developing an additional machine learning model: a predictive model based on a multilayer perceptron (MLP) artificial neural network (ANN) with a back-propagation learning algorithm. ANNs are mathematical models that can handle the analysis of large datasets even when complex relationships exist between the input and output variables. In the model, the input variables are the independent ones, while the output variable corresponds to the dependent one (Ballestar et al., 2018a). This research uses a feed-forward MLP ANN model, which is one of the most popular types of ANN (Kavzoglu and Mather, 2003; Hu and Weng, 2009).
The ANN was trained on the same data sample used to construct the ANLC model, partitioned into 69.7% (4080 researchers) for training and 30.3% (1777 researchers) for validation. The ANN input variables are also the same variables used to develop the ANLC multilevel model (type of contractual relationship, gender, area of knowledge and researcher's longitudinal performance), and the output variable is the segment to which the researcher belongs, corresponding to one of the six clusters calculated by the ANLC multilevel model (three clusters for tenured researchers and three clusters for tenure track researchers). Accordingly, the ANN has three layers: the input layer with 192 units (receiving values from eight independent/input variables), a hidden layer (with seventeen units) and the output layer (with six units, one per cluster of researchers). In our MLP ANN, the hyperbolic tangent was the activation function for all units in the hidden layer, and the softmax function was the activation function for the six units in the output layer.
This new model fulfills two purposes at the same time. On the one hand, it allows the ANLC multilevel model to be validated with a completely independent data science method. On the other hand, it is also a predictive model that can be implemented in real time to classify new samples of researchers that need to be evaluated (Ballestar et al., 2019).
The overall classification accuracy of the ANN was 99.2% (an error rate of 0.8%). The confusion matrix in Table 4 shows the cases classified correctly and incorrectly for each of the six categories (clusters of researchers) of the dependent variable. We used the AUC as the main classification performance indicator, as it can be more informative than overall accuracy under certain circumstances, for example when the classes are imbalanced. For a classifier that performs no worse than chance, the AUC ranges from 0.5 to 1, where 1 means that the model classifies perfectly and 0.5 means that it classifies at random. This indicator
Table 4 Confusion matrix.
Classification (rows: observed cluster; columns: predicted cluster). T1 to T3 denote tenured_cluster_1 to tenured_cluster_3; TT1 to TT3 denote tenure_track_cluster_1 to tenure_track_cluster_3.

Sample     Observed                    T1      T2      T3     TT1     TT2     TT3   Percent Correct
Training   tenured_cluster_1         1879       0       1       0       1       0        99.9%
           tenured_cluster_2            0    1053       1       0       0       0        99.9%
           tenured_cluster_3           12       1      44       0       5       2        68.8%
           tenure_track_cluster_1       0       0       0     572       3       0        99.5%
           tenure_track_cluster_2       0       0       4       3      36       0        83.7%
           tenure_track_cluster_3       0       0       0       0       0     463       100.0%
           Overall Percent          46.3%   25.8%    1.2%   14.1%    1.1%   11.4%        99.2%
Testing    tenured_cluster_1          830       0       1       0       0       0        99.9%
           tenured_cluster_2            0     415       0       0       0       0       100.0%
           tenured_cluster_3            3       1      20       0       7       0        64.5%
           tenure_track_cluster_1       0       0       0     258       0       0       100.0%
           tenure_track_cluster_2       0       0       2       2      18       2        75.0%
           tenure_track_cluster_3       0       0       0       0       0     218       100.0%
           Overall Percent          46.9%   23.4%    1.3%   14.6%    1.4%   12.4%        99.0%

Dependent Variable: cluster_final
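The "Percent Correct" column and the overall accuracy in Table 4 follow directly from the cell counts. The sketch below recomputes them, assuming rows are observed clusters and columns are predicted clusters; the function name `percent_correct` is illustrative and not taken from the original analysis.

```python
# Confusion matrices from Table 4: rows are observed clusters, columns are
# predicted clusters, in the order tenured 1-3 then tenure track 1-3.
training = [
    [1879, 0, 1, 0, 1, 0],
    [0, 1053, 1, 0, 0, 0],
    [12, 1, 44, 0, 5, 2],
    [0, 0, 0, 572, 3, 0],
    [0, 0, 4, 3, 36, 0],
    [0, 0, 0, 0, 0, 463],
]
testing = [
    [830, 0, 1, 0, 0, 0],
    [0, 415, 0, 0, 0, 0],
    [3, 1, 20, 0, 7, 0],
    [0, 0, 0, 258, 0, 0],
    [0, 0, 2, 2, 18, 2],
    [0, 0, 0, 0, 0, 218],
]


def percent_correct(matrix):
    """Return per-class percent correct (row-wise) and overall accuracy, in percent."""
    per_class = [100 * row[i] / sum(row) for i, row in enumerate(matrix)]
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return per_class, 100 * correct / total


train_per_class, train_overall = percent_correct(training)
test_per_class, test_overall = percent_correct(testing)
print([round(p, 1) for p in train_per_class], round(train_overall, 1))
print([round(p, 1) for p in test_per_class], round(test_overall, 1))
```

The two small mixed-gender clusters (tenured_cluster_3 and tenure_track_cluster_2) account for most of the misclassified cases in both samples.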
was calculated for each of the six clusters of researchers, obtaining values between 0.987 and 1, meaning that the ANN model discriminated very well (Hosmer et al., 2013). In conclusion, the prediction made by the ANN agrees with the classification made by the ANLC model in 99.2% of the cases, indicating a high convergence between the two methods. The robustness of the ANLC is strongly validated by the MLP-ANN.
6. Discussion and implications
The clustering structures from both groups, tenured and tenure track researchers, are now described.
6.1. Tenured researchers
Tenured researchers (4279 individuals) are classified into three clusters, based on their gender and their annual percentage variation over the achieved marks. Cluster 1 is made up exclusively of male researchers (2715; 63.45% of the sample), Cluster 2 is made up exclusively of female researchers (1471; 34.38% of the sample) and Cluster 3 is a small group made up of a mix of both male and female researchers (93 researchers; 2.17% of the sample) (Table 3). Researchers from Cluster 1 have the highest marks within the six years of the program (average mark 6.69), higher than Cluster 2 (average mark 6.21) and Cluster 3 (average mark 3.69). Cluster 1 also shows a better evolution of its performance than Cluster 2 in periods 1 and 2 (from 2005 to 2007), reaching a peak annual percentage variation of 3.44% in the second period (Fig. 2).
Regarding the annual percentage variation of their performance, Clusters 1 and 2 show similar trends. They reach a peak in the second period, with a 3.44% increase for Cluster 1 and 3.28% for Cluster 2, but the researchers are not able to keep increasing their marks at that pace, and the incremental gain starts to decrease until the end of the program. Cluster 3 presents a more accentuated trend than Clusters 1 and 2: it also reaches a peak in the second period with an 18.68% increase in performance, and then the incremental gain decreases rapidly over the following two periods, down to −3.95% (from 2007 to 2009), to end with a sharp increase of 27.19% in the last period (Fig. 2).
This means that Cluster 1, made up of men, achieves better marks from the very beginning than the other two clusters, and is able to maintain a higher level of improvement in its performance in the first two periods (from 2005 to 2007) than the women's cluster (Cluster 2). After this second period, the gap in annual percentage variation of performance between the women of Cluster 2 and the men of Cluster 1 tends to narrow in periods 3 and 4, and the variation is even lower for Cluster 1 than for Cluster 2 in the last period, period 5 (Fig. 2).
The smallest group of researchers is Cluster 3, made up of women (32 researchers; 34.41% of the cluster) and men (61 researchers; 65.59% of the cluster). These researchers start the program with the lowest performance of the three groups (average mark 2.35 in 2005) but they present the fastest evolution of their annual percentage variation of marks (average mark 5.06 in 2010). This represents an increase of 115.53% from the first year of the program to the last, compared with 8.61% in Cluster 1 and 8.18% in Cluster 2 (Fig. 2).
6.2. Tenure track researchers
Tenure track researchers (1582 individuals) are classified into three clusters, based on their gender and their annual percentage variation over the achieved marks. Cluster 1 is made up exclusively of male researchers (834 researchers; 52.72% of the sample), Cluster 3 is made up exclusively of female researchers (682 researchers; 43.11% of the sample) and Cluster 2 is a small group made up of a mix of both male and female researchers (66 researchers; 4.17% of the sample) (Table 3). Researchers from Cluster 3 have the highest marks within the six years of the program (average mark 5.89), higher than Cluster 1 (average mark 5.28) and Cluster 2 (average mark 3.26). Cluster 3 also shows a better evolution of its performance than Cluster 1 in periods 1 and 2 (from 2005 to 2007), reaching a peak annual percentage variation of 13.25% in the second period (Fig. 2).
Regarding the annual percentage variation of their performance, Clusters 1 and 3 show similar trends. They reach a peak in the second period, with a 13.25% increase for Cluster 3 and 11.66% for Cluster 1, but the researchers are not able to keep increasing their marks at that pace, and the incremental gain starts to decrease until the end of the program. Cluster 2 presents a different trend: it reaches its peak in the first period with a 209.68% increase in performance, and then the incremental gain decreases rapidly, ending at 16.43% in the last period (Fig. 2).
This means that Cluster 3, made up of women, achieves better marks from the very beginning than the other two clusters, and is capable of maintaining a higher level of improvement in its performance in the first two periods (from 2005 to 2007) than the men's cluster (Cluster 1). Finally, this gap in annual percentage variation of performance between the women of Cluster 3 and the men of Cluster 1 tends to narrow in periods 3, 4 and 5 (Fig. 2).
The smallest group of researchers is Cluster 2 and it is made up of
Fig. 2. Average annual marks and annual percentage variation of marks for Clusters of the tenured researchers’ group and tenure track researchers’ group.
women (25 researchers; 37.88% of the cluster) and men (41 researchers; 62.12% of the cluster). These researchers start the program with a very low performance (average mark 0.66 in 2005) but they present a fast evolution of their annual percentage variation of marks (average mark 5.14 in 2010). This represents an increase of 669.88% from the first year of the program to the last, compared with 34.34% in Cluster 1 and 39.40% in Cluster 3 (Fig. 2).
6.3. Summary of findings
The findings support H1, showing that the employment status of the researchers with the university, in terms of contractual relationship, is key in how incentives affect them. The main reason is that the characteristics and performance of the two groups correspond to professionals who are at very different stages of their careers. On the one hand, incentives have little impact on tenured researchers (who represent 73.01% of the sample), as they increase their productivity by just 9.36% over the six years, compared with the increase of 40.75% for tenure track researchers (who represent 26.99% of the sample). This is consistent with previous literature as summarized by Dnes and Garoupa (2005). On the other hand, the baselines of their performance are also very different. Tenured researchers have an average of 6.08 points in the first year of the program compared with the 4.36 points of the tenure track researchers, suggesting that programs that evaluate researchers’ performance based on different criteria, and later rank them according to the outputs to distribute the incentives, can have unexpected outcomes that lead to inequalities and inefficiencies, as Rauber and Ursprung (2008) and Batterbury (2008) also find.
The findings also support H2, showing that incentive-based programs provide incremental growth up to the third year, when they start to decelerate. This means that this type of program has a positive impact on researchers’ performance in the short term, especially for the tenure track researchers (11.20% growth in the first period, from 2005 to 2006, and 13.32% growth in the second period, from 2006 to 2007). They reach a saturation point within the third year, which has a negative impact on the return of the program.
In addition, this research shows that gender also plays an important role in academia, being a relevant variable when analyzing performance within each of the two main groups. The comparison of performance between clusters of men and women varies depending on their contractual relationship with the university. Men reach the higher marks in the tenured researchers’ group, with an average mark of 6.86 in 2010 (the last year of the program), whereas women do so in the tenure track researchers’ group, with an average mark of 5.89.
Our results are robust and significant, and the use of the multilayer perceptron (MLP) artificial neural network (ANN) with a back-propagation learning algorithm facilitates their use in forecasting and improving public policies.
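The deceleration pattern discussed under H2 can be read from the combined annual variations reported in Table 3. The sketch below, a simple illustration rather than the authors' procedure, locates the period with the largest annual variation and compares it with the average variation afterwards; the helper names `peak_period` and `mean_after_peak` are hypothetical.

```python
# Combined annual percentage variations (periods 1 to 5), copied from Table 3.
variation = {
    "tenured": [2.74, 3.71, 1.86, 0.91, -0.15],
    "tenure_track": [11.20, 13.32, 5.21, 5.35, 0.78],
}


def peak_period(series):
    """Return the 1-based period with the largest variation, and that variation."""
    value = max(series)
    return series.index(value) + 1, value


def mean_after_peak(series):
    """Average variation over the periods that follow the peak period."""
    period, _ = peak_period(series)
    tail = series[period:]
    return sum(tail) / len(tail) if tail else float("nan")


peaks = {group: peak_period(s) for group, s in variation.items()}
tails = {group: round(mean_after_peak(s), 2) for group, s in variation.items()}
print(peaks)
print(tails)
```

For both groups the peak falls in period 2 (2006 to 2007), which ends in the third year of the program, and the average variation in the remaining periods is well below that peak.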
7. Conclusions
This paper proposes a new ML method to measure the success and long-term effects of incentives on public policies. We use this model to assess the efficiency of long-term incentive-based programs designed to boost research productivity by analyzing an anonymized individual-level data sample of 5,861 researchers who participated in a program in public universities in the Madrid Region from 2005 to 2010.
To our knowledge, this is the first research that focuses on researchers’ response to this kind of program over such an extensive length of time and number of individuals. We have also shown the advantages, for research in this area, of using data science methods such as machine learning. In this case, we have developed an automated nested longitudinal clustering (ANLC) that first performs a stratification of researchers depending on their contractual relationship with the university and then performs a longitudinal segmentation for each of the groups, in which their characteristics and performance over time are taken into account. Therefore, this paper bridges that gap and paves the way for new lines of research based on data analysis that can be readily implemented for the benefit of both organizations and researchers.
One of the main benefits of this research is that it enables us to understand the behavior and response to incentives of heterogeneous groups of researchers. Thus, organizations will be able to optimize the design of programs, maximizing scientific production and the development of the researchers’ path at the same time, knowing that research and innovation produce potentially large social benefits (Jaffe et al., 2005). Our results are in line with previous literature on incentives such as Jenkins et al. (1998), Camerer and Hogarth (1999) and Wright and Boswell (2002), and also with the recommendations with regard to the use of machine learning such as Chalfin et al. (2016), Athey (2017) and Athey and Imbens (2017). They can be used in the heated debate on the reproducibility of research results for academic promotions (Lakens et al., 2018).
The study has some limitations. Future research should further analyze the sample of researchers who did not participate during the whole period. Some of them did not enter at the very beginning and others dropped out before its end. Also, the use of other data science methods would yield additional insight into this issue.
References
Aghion, P., Dewatripont, M., Hoxby, C., Mas-Colell, A., Sapir, A., 2010. The governance and performance of universities: evidence from Europe and the US. Econ. Policy 25 (61), 7–59.
Anderson, S., Auquier, A., Hauck, W.W., Oakes, D., Vandaele, W., Weisberg, H.I., Bryk, A.S., Kleinman, J., 1980. Statistical Methods for Comparative Studies. Wiley, New York.
Anderson, S.R., Auquier, A., Hauck, W.W., Oakes, D., Vandaele, W., Weisberg, H.I., 2009. Statistical Methods for Comparative Studies: Techniques for Bias Reduction, Vol. 170. John Wiley & Sons.
Athey, S., 2017. Beyond prediction: Using big data for policy problems. Science 355 (6324), 483–485.
Athey, S., Imbens, G.W., 2017. The state of applied econometrics: causality and policy evaluation. J. Econ. Perspect. 31 (2), 3–32.
Auranen, O., Nieminen, M., 2010. University research funding and publication performance – an international comparison. Res. Policy 39 (6), 822–834.
Austin, P.C., 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav. Res. 46 (3), 399–424.
Austin, P.C., Brunner, L.J., 2004. Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Stat. Med. 23 (7), 1159–1178.
Ballestar, M.T., Grau-Carles, P., Sainz, J., 2018a. Customer segmentation in e-commerce: applications to the cashback business model. J. Bus. Res. 88, 407–414.
Ballestar, M.T., Soriano, D.R., Sanz, J., 2018b. Es el big data el siguiente paso en la digitalización de la empresa? Economía Industrial (409), 47–56.
Ballestar, M.T., Grau-Carles, P., Sainz, J., 2019. Predicting customer quality in e-commerce social networks: a machine learning approach. Rev. Manag. Sci. 1–15.
Batterbury, S., 2008. Tenure or permanent contracts in North American higher education? A critical assessment. Policy Futures Educ. 6 (3), 286–297.
Camerer, C.F., Hogarth, R.M., 1999. The effects of financial incentives in experiments: a review and capital-labor-production framework. J. Risk Uncertainty 19 (1-3), 7–42.
Chait, R., 2009. The Questions of Tenure. Harvard University Press.
Chalfin, A., Danieli, O., Hillis, A., Jelveh, Z., Luca, M., Ludwig, J., Mullainathan, S., 2016. Productivity and selection of human capital with machine learning. Am. Econ. Rev. 106 (5), 124–127.
Chambers, C.D., Dienes, Z., McIntosh, R.D., Rotshtein, P., Willmes, K., 2015. Registered reports: realigning incentives in scientific publishing. Cortex 66, A1–A2.
Cochran, W.G., 1968. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 295–313.
Cvecic, I., Sokolic, D., 2018. Impact of public expenditure in labour market policies and other selected factors on youth unemployment. Econ. Res.-Ekonomska Istraživanja 31 (1), 2060–2080.
Dnes, A., Garoupa, N., 2005. Academic tenure, posttenure effort, and contractual damages. Econ. Inquiry 43 (4), 831–839.
Dobbin, F., Simmons, B., Garrett, G., 2007. The global diffusion of public policies: Social construction, coercion, competition, or learning? Annu. Rev. Sociol. 33, 449–472.
Dolnicar, S., 2002. A review of unquestioned standards in using cluster analysis for data-driven market segmentation.
Easterly, W., Levine, R., 1997. Africa's growth tragedy: Policies and ethnic divisions. Q. J. Econ. 112 (4), 1203–1250. https://doi.org/10.1162/003355300555466.
Edwards, M.A., Roy, S., 2017. Academic research in the 21st century: maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environ. Eng. Sci. 34 (1), 51–61.
Epstein, N., Fischer, M.R., 2017. Academic career intentions in the life sciences: can research self-efficacy beliefs explain low numbers of aspiring physician and female scientists? PloS One 12 (9), e0184543.
Formann, A.K., 1984. Die latent-class-analyse: Einführung in Theorie und Anwendung. Beltz.
Frank, K.A., 2000. Impact of a confounding variable on a regression coefficient. Sociol. Methods Res. 29 (2), 147–194.
Heggeseth, B., Harley, K., Warner, M., Jewell, N., Eskenazi, B., 2015. Detecting associations between early-life DDT exposures and childhood growth patterns: a novel statistical approach. PloS One 10 (6), e0131443.
Hicks, D., 2012. Performance-based university research funding systems. Res. Policy 41 (2), 251–261.
Hobbs, F.R., Roberts, L.M., 2016. The stern review of the research excellence framework.
Hosmer Jr., D.W., Lemeshow, S., Sturdivant, R.X., 2013. Applied Logistic Regression. Wiley, Hoboken.
Hox, J.J., Moerbeek, M., Van de Schoot, R., 2017. Multilevel Analysis: Techniques and Applications. Routledge.
Hu, X., Weng, Q., 2009. Estimating impervious surfaces from medium spatial resolution imagery using the self-organizing map and multi-layer perceptron neural networks. Remote Sens. Environ. 113 (10), 2089–2102.
Jaffe, A.B., Newell, R.G., Stavins, R.N., 2005. A tale of two market failures: technology and environmental policy. Ecol. Econ. 54 (2-3), 164–174.
Jenkins Jr., G.D., Mitra, A., Gupta, N., Shaw, J.D., 1998. Are financial incentives related to performance? A meta-analytic review of empirical research. J. Appl. Psychol. 83 (5), 777.
Jones, B.L., Nagin, D.S., 2007. Advances in group-based trajectory modeling and an SAS procedure for estimating them. Sociol. Methods Res. 35 (4), 542–571.
Kattge, J., Diaz, S., Lavorel, S., Prentice, I.C., Leadley, P., Bönisch, G., Cornelissen, J.H.C., 2011. TRY–a global database of plant traits. Global Change Biol. 17 (9), 2905–2935.
Kaufman, L., Rousseeuw, P.J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Hoboken, NJ, USA.
Kavzoglu, T., Mather, P., 2003. The use of back propagating artificial neural networks in land cover classification. Int. J. Remote Sens. 24 (23), 4907–4938.
Kleinberg, J., Ludwig, J., Mullainathan, S., Obermeyer, Z., 2015. Prediction policy problems. Am. Econ. Rev. 105 (5), 491–495.
Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., Mullainathan, S., 2018. Human decisions and machine predictions. Q. J. Econ. 133 (1), 237–293.
LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X., 2005. Support vector machines for temporal classification of block design fMRI data. NeuroImage 26 (2), 317–329.
Lakens, D., Adolfi, F.G., Albers, C.J., Anvari, F., Apps, M.A., Argamon, S.E., Buchanan, E.M., 2018. Justify your alpha. Nat. Hum. Behav. 2 (3), 168.
Mjolsness, E., DeCoste, D., 2001. Machine learning for science: state of the art and future prospects. Science 293 (5537), 2051–2055.
Musselin, C., 2005. European academic labor markets in transition. Higher Educ. 49 (1-2), 135–154.
Norušis, M.J., 2014. SPSS 13.0 Statistical Procedures Companion. Prentice Hall.
Pers, T.H., Albrechtsen, A., Holst, C., Sørensen, T.I.A., Gerds, T.A., 2009. The validation and assessment of machine learning: a game of prediction from high-dimensional data. PLoS One 4 (8), e6287.
Pontille, D., Torny, D., 2010. The controversial policies of journal ratings: evaluating social sciences and humanities. Res. Eval. 19 (5), 347–360.
Pourhoseingholi, M.A., Baghestani, A.R., Vahedi, M., 2012. How to control confounding effects by statistical analysis. Gastroenterology and Hepatology from bed to bench 5 (2), 79.
Rauber, M., Ursprung, H.W., 2008. Life cycle and cohort productivity in economic research: the case of Germany. German Econ. Rev. 9 (4), 431–456.
Rothstein, B., Uslaner, E.M., 2005. All for all: equality, corruption, and social trust. World Polit. 58 (1), 41–72.
Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.
Sin, K., Muthu, L., 2015. Application of big data in education data mining and learning analytics – a literature review. ICTACT J. Soft Comput. 5 (4).
Taylor, J., 2011. The assessment of research quality in UK universities: peer review or metrics? Br. J. Manage. 22 (2), 202–217.
Tietze, S., 2018. Multilingual research, monolingual publications: management scholarship in English only? Eur. J. Int. Manage. 12 (1/2), 28–45.
Wagstaff, K., 2004. Clustering with missing values: No imputation required. Classification, Clustering, and Data Mining Applications. Springer, Berlin, Heidelberg, pp. 649–658.
Wang, D., 2019. International labour movement, public intermediate input and wage inequality: a dynamic approach. Econ. Res.-Ekonomska istraživanja 32 (1), 1–16.
Wang, J., Lee, Y.N., Walsh, J.P., 2018. Funding model and creativity in science: competitive versus block funding and status contingency effects. Res. Policy 47 (6), 1070–1083.
Wright, P.M., Boswell, W.R., 2002. Desegregating HRM: a review and synthesis of micro and macro human resource management research. J. Manage. 28 (3), 247–276.
Yu, H., Su, T., Zeng, X., 2014. A three-way decisions clustering algorithm for incomplete data. International Conference on Rough Sets and Knowledge Technology. Springer, Cham, pp. 765–776.