A comparison of random forest regression and multiple linear regression for prediction in neuroscience


Journal of Neuroscience Methods 220 (2013) 85–91


Basic Neuroscience

A comparison of random forest regression and multiple linear regression for prediction in neuroscience

Paul F. Smith a,∗, Siva Ganesh c, Ping Liu b

a Department of Pharmacology and Toxicology, The Brain Health Research Centre, University of Otago, Dunedin, New Zealand
b Anatomy, School of Medical Sciences, The Brain Health Research Centre, University of Otago, Dunedin, New Zealand
c Bioinformatics and Statistics, AgResearch Ltd., Palmerston North, New Zealand

Highlights

• Multiple linear regression is often used for prediction in neuroscience.
• Random forest regression is an alternative form of regression.
• It does not make the assumptions of linear regression.
• We show that linear regression can be superior to random forest regression.

Article info

Article history: Received 22 May 2013; received in revised form 13 August 2013; accepted 28 August 2013.

Keywords: Regression; Linear regression; Regression trees; Random forest regression; l-Arginine metabolism; Vestibular nucleus; Cerebellum

Abstract

Background: Regression is a common statistical tool for prediction in neuroscience. However, linear regression is by far the most common form of regression used, with regression trees receiving comparatively little attention.

New method: In this study, the results of conventional multiple linear regression (MLR) were compared with those of random forest regression (RFR), in the prediction of the concentrations of 9 neurochemicals in the vestibular nucleus complex and cerebellum that are part of the l-arginine biochemical pathway (agmatine, putrescine, spermidine, spermine, l-arginine, l-ornithine, l-citrulline, glutamate and γ-aminobutyric acid (GABA)).

Results: The R² values for the MLRs were higher than the proportion of variance explained values for the RFRs: 6/9 of them were ≥0.70, compared to 4/9 for the RFRs. Even the variables that had the lowest R² values for the MLRs, e.g. ornithine (0.50) and glutamate (0.61), had much lower proportion of variance explained values for the RFRs (0.27 and 0.49, respectively). The RSE values for the MLRs were lower than those for the RFRs in all but two cases.

Comparison with existing methods: In general, the MLRs seemed to be superior to the RFRs in terms of predictive value and error.

Conclusion: In the case of this data set, MLR appeared to be superior to RFR in terms of its explanatory value and error. This result suggests that MLR may have advantages over RFR for prediction in neuroscience with this kind of data set, but that RFR can still have good predictive value in some cases.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Linear regression is a part of the general linear model (GLM) that is often used to predict one variable from another in neuroscience. Simple linear regression can be expanded to include more than one predictor variable to become multiple linear regression. However, formal statistical tests of multiple linear regression, like simple linear regression, make assumptions regarding the distribution of the data, which cannot always be fulfilled. These assumptions are that

∗ Corresponding author. Tel.: +64 3 479 5747. E-mail address: [email protected] (P.F. Smith). 0165-0270/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.jneumeth.2013.08.024

the data are normally distributed, with homogeneity of variance, and that they are independent of one another (e.g. not autocorrelated) (Vittinghoff et al., 2005). Furthermore, the predictor variables should be numerical, although indicator variables can be used in order to include nominal variables (e.g., binary coding to represent male and female). The violation of the assumption of normality can sometimes be redressed using data transformation, which may also correct heterogeneity of variance, but other issues such as autocorrelation are not easily dealt with and may require methods such as time series regression (Ryan, 2009). Although modelling using regression trees has been used for over 25 years, its use in the neurosciences has been very limited. In regression tree modelling, a flowchart-like series of questions is asked


P.F. Smith et al. / Journal of Neuroscience Methods 220 (2013) 85–91

about each variable (‘recursive partitioning’), subdividing a sample into groups that are as homogeneous as possible by minimising the within-group variance, in order to determine a numerical response variable (Vittinghoff et al., 2005). The predictor variables can be numerical also, or they can be ordinal or nominal. By contrast with linear regression, no assumptions are made about the distribution of the data. The data are usually split into training and test data sets (e.g., 90:10) and the mean square error (MSE) between the model based on the training data and the test data is calculated as a measure of the model’s success. Variables are chosen to split the data based on the reduction in the MSE achieved after a split (i.e., the information gained). Unlike linear regression, interactions between different predictor variables are automatically incorporated into the regression tree model and variable selection is unnecessary because irrelevant predictors are excluded from the model. This makes complex, non-linear interactions between variables easier to accommodate than in linear regression modelling (Hastie et al., 2009). Breiman et al. (1984) extended the concept of regression trees by exploiting the power of computers to simultaneously generate hundreds of regression trees, known as ‘random forests’, which were based on a random selection of a subset of data from the training set. The various regression tree solutions are averaged in order to predict the target variable with the smallest MSE (Marsland, 2009). The aim of this study was to compare the results of a conventional multiple linear regression with those of random forest regression, using data on the expression of neurochemicals related to the l-arginine metabolic pathway in the rat hindbrain as an example. 
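The core idea just described — many regression trees, each grown on a bootstrap resample with a random subset of predictors considered at each split, their predictions averaged — can be sketched in a few lines. The paper used R's randomForest package; the following is an illustrative Python/scikit-learn sketch on synthetic data, not the authors' code:

```python
# Minimal sketch of bagged regression trees (the core of a random forest).
# Synthetic data: 58 samples and 10 predictors, mirroring the paper's design.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(58, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=58)

n_trees = 100
preds = np.zeros(len(y))
for _ in range(n_trees):
    idx = rng.integers(0, len(y), size=len(y))    # bootstrap resample
    tree = DecisionTreeRegressor(max_features=3,  # consider 3 predictors per split
                                 random_state=0)
    tree.fit(X[idx], y[idx])
    preds += tree.predict(X)
preds /= n_trees                                  # forest prediction = tree average

mse = float(np.mean((y - preds) ** 2))
```

Averaging across trees reduces the variance of any single overfitted tree, which is why a forest's error falls as trees are added.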
Two areas of the hindbrain concerned with the control of movement were investigated: the brainstem vestibular nucleus complex (VNC) and the cerebellum (CE), in young (4 month old) and aged (24 month old) rats (Liu et al., 2010). Chemical analyses were performed to determine the concentrations of 9 related neurochemicals that form a biochemical pathway that is critical for neuronal function (see Fig. 1): agmatine, putrescine, spermidine, spermine, l-arginine, l-ornithine, l-citrulline, glutamate and γ-aminobutyric acid (GABA). Although Fig. 1 presents certain causal connections between some of these neurochemical variables, the mechanisms through which they interact with one another are not completely understood and additional pathways, particularly feedback pathways, are possible (Mori and Gotoh, 2004). It is therefore of interest to determine whether the concentrations of one part of this complex neurochemical pathway can be predicted from the other parts.

2. Methods

2.1. Data set and variables

The data set was obtained from Liu et al. (2010). Male Sprague-Dawley rats (aged: 24 months old, n = 14; young: 4 months old, n = 14) were housed 3–5 per cage, maintained on a 12 h light–dark cycle and provided with ad lib. access to food and water. All experimental procedures were carried out in accordance with the regulations of the University of Otago Committee on Ethics in the Care and Use of Laboratory Animals. Animals were housed either in a standard rat cage or an enriched environment including toys and other novel objects, since enriched environments have been shown to reduce age-related memory impairment (Olson et al., 2006). Therefore, the sample sizes for the aged and young groups were divided according to the housing conditions. In order to achieve as large a sample size as possible, data from the VNC and CE were combined in the regression analyses, so that for each of the 9 neurochemical variables the total n was 58. This was considered to be a reasonable solution given the close physiological

Fig. 1. The arginine metabolic pathway showing the conversion of l-arginine to the neurotransmitter, nitric oxide (NO), and l-citrulline, by the enzyme, nitric oxide synthase (NOS), of which there are 3 isoforms; the conversion of l-arginine to agmatine by the enzyme, arginine decarboxylase (ADC), which is then converted to polyamines such as putrescine, spermidine and spermine by agmatinase and ornithine decarboxylase (ODC); and the conversion of l-arginine to l-ornithine by arginase, which is then converted to the same polyamines, which are essential for cell proliferation, differentiation and communication, including neuronal synaptic plasticity in the brain. The major excitatory neurotransmitter, glutamate, is one of the end products of l-arginine, and glutamate serves as a precursor for the synthesis of the major inhibitory neurotransmitter, GABA. Therefore, all of these neurochemicals are interconnected.

relationship between the VNC and CE (Liu et al., 2010). This meant that for the aged group with standard housing, n = 13; aged with enriched housing, n = 16; young with standard housing, n = 14; and young with enriched housing, n = 15. These smaller sample sizes were less important because age and enrichment were categorical variables that were never the target variables, but they were included in the regression analyses as predictor variables. A previous study using the same data set analysed the data using multivariate analyses of variance (MANOVAs), linear discriminant and cluster analyses (Liu et al., 2010), but the main interest in the latter case was the prediction of the age of the brain tissue based on the other variables rather than predicting neurochemical concentrations using regression analyses. Determination of the concentrations of agmatine, putrescine, spermidine, spermine, l-arginine, l-ornithine, l-citrulline, glutamate and γ-aminobutyric acid (GABA) was carried out using high performance liquid chromatography (HPLC) or a highly sensitive liquid chromatography/mass spectrometry (LC/MS/MS) method and expressed as µg/g of wet tissue weight (see Liu et al., 2008a, 2010 for details). The experimental design thus consisted of 2 main independent variables: age, with 2 levels (4 months old and 24 months old); and housing, with 2 levels (standard and enriched). There were 9 potential dependent variables corresponding to the concentrations of agmatine, putrescine, spermidine, spermine, l-arginine, l-ornithine, l-citrulline, glutamate and GABA. However, in any one regression analysis, only one of these continuous neurochemical variables was the target or y variable and the other 8 were included as predictor variables. Consequently, each analysis involved 10 predictor variables, i.e. 8 continuous variables and 2 categorical ones, and one dependent continuous neurochemical variable.
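The "one target at a time" design described above lends itself to a simple loop. A sketch in Python/pandas with scikit-learn rather than the R code the authors used; the neurochemical values here are synthetic, so the fitted R² values are meaningless:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
chems = ["agmatine", "putrescine", "spermidine", "spermine", "arginine",
         "ornithine", "citrulline", "glutamate", "GABA"]

# 58 synthetic observations; age and housing as 0/1 indicator variables.
df = pd.DataFrame(rng.normal(size=(58, 9)), columns=chems)
df["age"] = rng.integers(0, 2, size=58)      # 0 = young, 1 = aged
df["housing"] = rng.integers(0, 2, size=58)  # 0 = standard, 1 = enriched

r2 = {}
for target in chems:
    # The other 8 neurochemicals plus the 2 categorical variables = 10 predictors.
    predictors = [c for c in chems if c != target] + ["age", "housing"]
    model = LinearRegression().fit(df[predictors], df[target])
    r2[target] = model.score(df[predictors], df[target])  # resubstitution R²
```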


2.2. Statistical methods

2.2.1. Preliminary data inspection

The variance of the glutamate concentrations was substantially different from that of the other neurochemicals and this raised a concern about whether the assumptions for normal parametric statistical analysis would be violated (Liu et al., 2010). Although some data mining methods such as random forest and neural network regression do not require that assumptions such as normality and homogeneity of variance be met, this is not true of multiple linear regression (Marcoulides and Hershberger, 1997; Manly, 2005). There were more measurements than dependent variables in all cases, i.e., 58 samples versus 9 variables (Tabachnick and Fidell, 2007). Inspection of normal probability plots (Q–Q plots) and residuals versus fitted value plots for the different variables suggested that the assumptions of at least univariate normality and homogeneity of variance were likely to be upheld with this sample size (see Fig. 2). Transformations were investigated for the multiple linear regression but none that was attempted (natural log, square root etc.) resolved the remaining problems. However, multiple linear regression is believed to be reasonably robust against violation of its assumptions provided that the sample sizes for the different dependent variables are reasonably large and nearly equal, which they were in this case (Marcoulides and Hershberger, 1997; Manly, 2005). The total sample size for the 9 neurochemical variables was n = 58 in most cases; therefore the central limit theorem should have provided some protection against violation of the assumption of multivariate normality (Marcoulides and Hershberger, 1997; Tabachnick and Fidell, 2007). Marcoulides and Hershberger (1997) have argued that if the assumption of multivariate normality is met, then the assumption of homoskedasticity is likely to be met also.

2.2.2. Multiple linear and random forest regression

All analyses were conducted using the computer package R (2012). The data were split 90:10 into training and test data sets. Based on the considerations described above, multiple linear regressions (MLRs) were performed on the training data set, using one neurochemical variable at a time as the response variable, and the other 8 as predictor variables, in addition to the categorical predictor variables, age and housing. In all cases, the response neurochemical variable was a continuous variable, expressed as a concentration. The other 8 predictor neurochemical variables were also continuous variables, but age and housing were nominal and these were converted to binary indicator variables, where for age, young = 0 and aged = 1, and for housing, standard = 0 and enriched = 1. The success of the MLRs can be assessed by evaluating the magnitude of the adjusted R², the residual standard error (RSE) for the regression, the t test results for the individual predictor variables and the analysis of variance (ANOVA) for the regression. The validity of the regression can be investigated by inspecting the diagnostic plots for the residuals versus fitted values, the normal Q–Q plots, the scale-location plots and the residuals versus leverage plots, including Cook's distance (see Fig. 2). For formal significance tests, the α rate (type I error rate) is usually set at 0.05 for all comparisons. The R software function lm was utilised when fitting MLRs. RFR modelling requires choosing m, the number of variables (a subset of the available p predictor variables) used to determine the decision at a node of the tree. Since there were 10 predictor variables for any target neurochemical variable, it was decided to set m as the integer part of the square root of p, i.e. m = 3. The number of trees to be fitted was set at 1000.
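The fitting procedure just described — a 90:10 split, lm for the MLR and randomForest with m = 3 and 1000 trees for the RFR — translates roughly as follows in Python/scikit-learn (an illustrative stand-in for the authors' R calls, with synthetic data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(58, 10))            # 8 continuous + 2 indicator predictors
X[:, 8:] = (X[:, 8:] > 0).astype(float)  # age and housing coded 0/1
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=58)

# 90:10 training:test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

mlr = LinearRegression().fit(X_tr, y_tr)

# m = 3 variables tried at each split (integer part of sqrt(p) for p = 10).
rfr = RandomForestRegressor(n_estimators=1000, max_features=3,
                            random_state=0).fit(X_tr, y_tr)

mlr_mse = float(np.mean((y_te - mlr.predict(X_te)) ** 2))
rfr_mse = float(np.mean((y_te - rfr.predict(X_te)) ** 2))
```

On a genuinely linear data-generating process like this synthetic one, the MLR would be expected to beat the RFR on test error, which is one plausible reading of the paper's overall result.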
The optimum value for m was also determined using the tuneRF function of the R software, as an alternative to setting m = 3. The majority of the models resulted in tuneRF choosing m = 3 and in general, the overall results were very similar to the modelling under the choice of m = 3. Hence, only the
results associated with the latter approach are presented here. The function randomForest of the R software was utilised when fitting RFRs. The nature of RFR automatically provides tools for assessing its performance. Much of this information comes from using the "out-of-bag" (OOB) cases in the training set that have been left out of the bootstrapped training set. The magnitude of the residual error and of the (pseudo) R² can be computed for the OOB cases. RFR also provides a 'proportion of variance explained' for the overall model and a 'variable importance' score for each of the predictor variables. This can be regarded as comparable to the variable selection associated with stepwise MLR. It is clear that the 'internal' assessments of the two models, MLR and RFR, are incompatible for choosing the better of the two models. One solution is to compute the MSE and R² via the predicted values of the response of each observation in the training data set using the fitted model. Here, MSE and R² are defined as:

MSE = \frac{1}{58}\sum_{i=1}^{58}(y_i - \hat{y}_i)^2 \quad \text{and} \quad R^2 = 1 - \frac{\sum_{i=1}^{58}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{58}(y_i - \bar{y})^2}
where y_i and ŷ_i are the observed and predicted responses of the ith observation, and ȳ is the mean of all responses. This approach may be regarded as 'over-optimistic' because MSE and R² are obtained via 're-substitution', where the regression model is built using all 58 observations and then each observation is predicted using the fitted model. An alternative solution is to use a common test data set to assess the fitted models. This can be achieved by dividing the given data into training and test sets using, for example, a 90:10 split. Alternatively, a leave-one-out cross-validation (LOO-CV) approach may be utilised. Here, each observation is removed from the training data, a model is built based on the remaining n − 1 (or 57 in our case) observations and then the removed observation is predicted using the fitted model. This approach, while producing unbiased estimates, is prone to high variation, especially when the sample size is small. It was therefore decided to use the R² and residual standard error (RSE) criteria for MLR, and the proportion of variance explained and RSE criteria for RFR, based on a common 90:10 training:test data split, in order to evaluate the success of the regression analysis via the two modelling processes. This also makes it easier to compare the important subsets of predictor variables chosen by the two modelling processes.

3. Results

3.1. Multiple linear regression

Table 1 shows the results of the MLRs. The R² values ranged from 0.50 to 0.95. Although all of these regressions were statistically significant according to ANOVAs (data not shown), those with high R² values, e.g. ≥0.7, were GABA, spermidine, spermine, l-arginine, agmatine and l-citrulline. The highest R² (0.95) was for the prediction of l-citrulline from l-arginine, GABA and l-ornithine. However, this regression did not have the lowest RSE (12.37, compared to 0.49 for putrescine; see Table 1).
Inspection of the diagnostic plots suggested that the data for most variables were fairly normally distributed, i.e. the data were closely distributed along the straight line in the Q–Q plot (see Fig. 2 for an example for l-citrulline). Furthermore, the residuals versus fitted values plots suggested that the residuals were approximately randomly distributed (Fig. 2). Likewise, the scale-location and residuals versus leverage plots did not indicate any serious violation of the assumptions of MLR (Fig. 2). Therefore, it was concluded that the regression analyses were valid.
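The Q–Q check behind Fig. 2 can also be done numerically: the correlation between the ordered residuals and the theoretical normal quantiles is close to 1 when the points hug the straight line. A sketch with scipy on synthetic data, not the paper's:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(58, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=58)

# Residuals of an ordinary least-squares fit.
residuals = y - LinearRegression().fit(X, y).predict(X)

# probplot returns the Q-Q point coordinates and a least-squares line fit;
# r near 1 indicates approximately normally distributed residuals.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
```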



Fig. 2. Diagnostic plots for l-citrulline following multiple linear regression showing residuals versus fitted values, normal Q–Q, scale location and residuals versus leverage plots.

3.2. Random forest regression

Table 2 shows the results of the RFRs. The proportion of variance explained values ranged from 0.27 for l-ornithine to 0.94 for spermine. The RSEs ranged from 0.54 for agmatine to 293.71 for glutamate. Fig. 3 shows the order of variable importance for the RFR for spermine and Fig. 4 the decrease in error as a function of the number of trees. Fig. 5 shows the predicted versus the observed values for the test data, based on only 6 observations (i.e. 10% of 58). The pseudo R² was 0.98, although the sample size is very small.

3.3. Comparison of multiple linear regression and random forest regression

In order to compare the results of the two different kinds of regression, the R² values for the MLRs were compared to the proportion of variance explained values for the RFRs. Tables 1 and 2 show these and the RSE values for the 2 kinds of regressions for the 9 neurochemical variables, with the predictors listed in order of importance. For the RFRs, these variables were the ones that had the largest effect on the MSE (see Fig. 3) and for the MLRs, they

Table 1. Multiple linear regression.

Variable   R²     RSE      Significant predictor variables
GABA       0.78   32.03    glut***, cit**
Put        0.68   0.49     agm***, age*
Spd        0.85   14.24    spm***, age***
Spm        0.93   6.38     spd***, glut**
Arg        0.92   16.97    cit***
Glut       0.61   258.5    GABA***, spm**
Agm        0.76   0.38     put***, age*
Orn        0.50   15.35    age***, cit*
Cit        0.95   12.37    arg***, GABA**, orn*

Results of the multiple linear regression analyses (MLRs) showing the R² values, the RSEs, and the significant input variables. *** P ≤ 0.0001; ** P ≤ 0.001; * P ≤ 0.05.


Fig. 3. Variables in order of importance for the RFR for spermine, which had the highest proportion of variance explained (0.94).

were the statistically significant variables, listed in order from the smallest to the largest P value. In order to facilitate comparison, the same number of variables is shown for the RFRs as for the MLRs, i.e. if only 2 variables were significant for the MLR, then only the 2 most important variables for the RFR are shown.
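The importance ordering used here for the RFRs (Fig. 3) and the OOB 'proportion of variance explained' (Table 2) have rough analogues in scikit-learn's feature_importances_ and oob_score_ attributes — a stand-in for R randomForest's importance measures, shown on synthetic data in which predictor 0 carries the signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(58, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.3, size=58)   # predictor 0 dominates

rfr = RandomForestRegressor(n_estimators=1000, max_features=3,
                            oob_score=True, random_state=0).fit(X, y)

# Most important predictor first, as in Fig. 3.
importance_order = np.argsort(rfr.feature_importances_)[::-1]

# OOB 'proportion of variance explained', as reported in Table 2.
oob_r2 = float(rfr.oob_score_)
```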

It was apparent that the R2 values for the MLRs were higher than the proportion of variance explained values for the RFRs: 6/9 of them were ≥0.70 compared to 4/9 for the RFRs. Even the variables that had the lowest R2 values for the MLRs, e.g. ornithine (0.50) and glutamate (0.61), had much lower proportion of variance explained

Fig. 4. Decrease in error as a function of the number of trees for the RFR for spermine, which had the highest proportion of variance explained (0.94).

Fig. 5. Predicted versus observed values for the spermine test data.


Table 2. Random forest regression.

Variable   Prop. var. explained   RSE      Most important predictor variables
GABA       0.66                   39.40    cit, glut
Put        0.43                   0.64     agm, spm
Spd        0.72                   19.01    spm, arg
Spm        0.94                   5.88     arg, cit
Arg        0.92                   16.59    cit
Glut       0.49                   293.71   spm, GABA
Agm        0.52                   0.54     arg, cit
Orn        0.27                   18.40    age, put
Cit        0.90                   16.32    arg, spm, GABA

Results of the random forest regression models (RFRs) showing the proportion of variance explained values, the RSEs, and the most important predictor variables.

values for the RFRs (0.27 and 0.49, respectively). The RSE values for the MLRs were lower than those for the RFRs in all but two cases. The most important variables in the prediction of the target variables differed in some cases between the two types of regression. For GABA: l-citrulline and glutamate were common to the two regression analyses (2/2). For putrescine: only agmatine was common (1/2). For spermidine: only spermine was common (1/2). For spermine: there was no common variable (0/2). For l-arginine: only l-citrulline was common to the two regressions (1/1). For glutamate: spermine and GABA were common (2/2). For agmatine: there was no common variable (0/2). For l-ornithine: age was common (1/2). Finally, for l-citrulline: l-arginine and GABA were common (2/3). The fact that the R² values for the MLRs were higher than the proportion of variance explained values for the RFRs in 7/9 cases, and the RSEs were lower in 7/9 cases, suggested that the predictive value of the MLRs was greater than that of the RFRs in the case of this data set.

4. Conclusions

Experimental phenomena in biology in general, and in neuroscience in particular, usually involve the complex, non-linear interaction of multiple variables, and yet historically, statistical analysis has focussed on comparisons between treatment groups, one variable at a time. This approach not only tends to inflate the type I error rate as a result of large numbers of statistical analyses, but neglects the fact that changes may occur at the level of the interaction within a system of variables that cannot be detected in individual variables (Liu et al., 2010; Smith et al., 2013).
Consequently, in areas such as the analysis of gene microarray data, protein interaction and medical diagnostics, multivariate statistical analyses and data mining approaches are now being employed in an attempt to understand complex interactions between systems of variables (e.g., Pang et al., 2006; Krafczyk et al., 2006; Ryan et al., 2011; Brandt et al., 2012; Smith et al., 2013). The process of ageing is associated with major neurophysiological and neurochemical changes, some of which result in neurological deficits such as memory loss and impaired motor control. A biochemical pathway responsible for l-arginine metabolism is critically involved in the production of several neurochemicals that are necessary for communication between neurons (e.g. the neurotransmitters glutamate, nitric oxide, and GABA, which is synthesised from glutamate) and is involved in their maintenance or degeneration (e.g., l-ornithine and the polyamines, spermine, spermidine and putrescine) (e.g., Liu et al., 2003a,b, 2004a,b, 2005, 2008a,b). The neurochemicals that make up this system interact in a complex, non-linear way that may have multiple positive and negative feedback loops. Although Fig. 1 summarises what is currently known of this system, there are almost certainly other interactions that occur. Therefore, it is not possible to provide a simple linear causal model of how one neurochemical affects another. The traditional approach has been to measure the concentrations of many of the variables in the biochemical system (e.g., agmatine, putrescine,

spermidine, spermine, l-arginine, l-ornithine, l-citrulline, glutamate and GABA) and then relate individual changes in them to ageing. Such studies have demonstrated that age-related neurological impairment is associated with, and probably caused by, changes in these neurochemical variables (e.g., Liu et al., 2003a,b, 2004a,b, 2005, 2008a,b); however, some of these studies have used multiple univariate analyses and may have been undermined by an escalating type I error rate (Quinn and Keough, 2006). More recent studies have used multivariate statistical analyses involving conventional approaches such as MANOVA and linear discriminant analyses (Liu et al., 2010). However, these statistical methods, which are part of the GLM, require that certain assumptions be met. Therefore, the aim of this study was to compare the results of a conventional multivariate approach using MLR with RFR, which does not involve the assumptions of the GLM. MLR was found to be generally effective in predicting any one of the 9 neurochemicals from the other 8. However, for the RFRs, the proportion of variance explained values were lower than the R² values for the MLRs in 7/9 cases, although the largest differences were for l-ornithine, agmatine and putrescine. The proportion of variance explained values were still >0.6 for 5/9 RFRs. However, in terms of the R² and proportion of variance explained values, as well as in terms of the RSE values, the MLRs seemed to be superior to the RFRs for this set of data. In a previous study (Liu et al., 2010), linear discriminant functions were highly successful (100% accuracy for the VNC, based on 6/9 variables, Wilks' Λ significant at P = 0.000; 90% for the CE, based on only 2/9 variables, Wilks' Λ significant at P = 0.000) in predicting the age of the animals on the basis of a subset of the 9 neurochemical variables. Likewise, a MANOVA showed that age was a very important factor in determining the concentrations of these variables. Consistent with this result, the MLR analyses showed that age was an important predictor for 4/9 neurochemical variables (putrescine, spermidine, agmatine and l-ornithine). By comparison, the RFR analyses showed that age was an important predictor for only 1/9 neurochemical variables (l-ornithine).

In summary, in the case of this data set, MLR appeared to be superior to RFR in terms of its explanatory value and error. This result suggests that MLR may have advantages over RFR for prediction in neuroscience with this kind of data set, but that RFR can still have good predictive power in some cases.

References

Brandt T, Strupp M, Novozhilov S, Krafczyk S. Artificial neural network posturography detects the transition of vestibular neuritis to phobic postural vertigo. J Neurol 2012;259:182–4. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. 1st Ed. Boca Raton: CRC Press; 1984. Hastie T, Tibshirani R, Friedman J. Elements of statistical learning: data mining, inference and prediction. 2nd Ed. Heidelberg: Springer Verlag; 2009. Krafczyk S, Tietze S, Swoboda W, Valkovic P, Brandt T. Artificial neural network: a new diagnostic posturographic tool for disorders of stance. Clin Neurophysiol 2006;117:1692–8. Liu P, Smith PF, Appleton I, Darlington CL, Bilkey D. Nitric oxide synthase and arginase expression and activity in the rat hippocampus and the entorhinal,

perirhinal, postrhinal and temporal cortices: regional variations and effects of aging. Hippocampus 2003a;13:859–67. Liu P, Smith PF, Appleton I, Darlington CL, Bilkey D. Regional variations and age-related changes in nitric oxide synthase and arginase in the subregions of the hippocampus. Neuroscience 2003b;119:679–87. Liu P, Smith PF, Appleton I, Darlington CL, Bilkey DK. Potential involvement of nitric oxide synthase and arginase in age-related behavioural impairments. Exp Gerontol 2004a;39:1207–22. Liu P, Smith PF, Appleton I, Darlington CL, Bilkey DK. Age-related changes in nitric oxide synthase and arginase in prefrontal cortex. Neurobiol Aging 2004b;25:547–52. Liu P, Smith PF, Appleton I, Darlington CL, Bilkey DK. Hippocampal NOS and arginase and age-associated behavioural deficits. Hippocampus 2005;15:642–55. Liu P, Chary S, Devaraj R, Jing Y, Darlington CL, Smith PF, et al. Effects of aging on agmatine levels in memory-associated brain structures. Hippocampus 2008a;18:853–6. Liu P, Smith PF, Darlington CL. Glutamate receptor subunit expression in memory-associated brain structures: regional variations and effects of aging. Synapse 2008b;62:834–41. Liu P, Zhang H, Devaraj R, Ganesalingam G, Smith PF. A multivariate analysis of the effects of aging on glutamate, GABA and arginine metabolites in the rat vestibular nucleus. Hear Res 2010;269:122–33. Manly BFJ. Multivariate statistical analysis. A primer. 3rd Ed. London: Chapman and Hall/CRC; 2005.


Marcoulides GA, Hershberger SL. Multivariate statistical methods. A first course. Mahwah, New Jersey: Lawrence Erlbaum Assoc; 1997. Marsland S. Machine learning. An algorithmic perspective. Boca Raton: CRC Press; 2009. Mori M, Gotoh T. Arginine metabolic enzymes, nitric oxide and infection. J Nutr 2004;134, 2820S-2028S. Olson AK, Eadie BD, Ernst C, Christie BR. Environmental enrichment and voluntary exercise massively increase neurogenesis in the adult hippocampus via dissociable pathways. Hippocampus 2006;16:250–60. Pang H, Lin A, Holford M, Enerson BE, Lu B, Lawton MP, et al. Pathway analysis using random forests classification and regression. Bioinformatics 2006;22:2028–36. Quinn GP, Keough HJ. Experimental design and data analysis for biologists. Cambridge: Cambridge University Press; 2006. Ryan TP. Modern regression methods. New Jersey: Wiley; 2009. Ryan M, Mason-Parker SE, Tate WP, Abraham WC, Williams JM. Rapidly induced gene networks following induction of long term potentiation at perforant synapses in vivo. Hippocampus 2011;21:541–53. Smith PF, Haslett SJ, Zheng Y. A multivariate statistical and data mining analysis of spatial memory-related behavior following bilateral vestibular deafferentation in the rat. Behav Brain Res 2013;246:15–23. Tabachnick BG, Fidell LS. Using multivariate statistics. 5th Ed. Boston: Pearson Education Inc.; 2007. Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. Regression methods in statistics: linear, logistic, survival and repeated measures models. New York: Springer; 2005.