Computer Methods and Programs in Biomedicine, 31 (1990) 207-213
Elsevier COMMET 01084 Section I. Methodology
CURT: a randomization test for statistical comparison between experimental curves
Maurizio Rocchetti ¹ and Giuseppe De Nicolao ²
¹ Istituto di Ricerche Farmacologiche "Mario Negri", Biomathematics and Biostatistics Unit, 20157 Milan, Italy, and ² Centro Teoria dei Sistemi - CNR, c/o Dipartimento di Elettronica, Politecnico di Milano, 20133 Milan, Italy
In this paper, a new nonparametric method for testing the difference between two groups of time series is considered. A difference index between the two groups, based on the mathematical notion of norm, is introduced. Then, the statistical significance of the observed difference is assessed by means of a randomization procedure. The percentages of α and β errors are evaluated by means of computer simulation. The method is also applied to a set of experimental data and the results are compared with those obtained by means of Student's t-test.
Keywords: Time series; Randomization test; Statistical test
Correspondence: Maurizio Rocchetti, Istituto di Ricerche Farmacologiche 'Mario Negri', Biomathematics and Biostatistics Unit, 20157 Milan, Italy.
0169-2607/90/$03.50 © 1990 Elsevier Science Publishers B.V. (Biomedical Division)
1. Introduction
A possible result of a biological experiment is the production of two sets of time series (or, roughly speaking, 'curves'), resulting from differently treating two groups of subjects. In such a context one has to face the problem of determining whether there is a statistically significant difference between the two groups of curves. When it is possible to describe the experimental data by means of a suitable mathematical model, the usual procedure consists in applying consolidated statistical methods to the estimated model parameters. In fact, these parameters often have a biological and physiological interpretation that can be useful to verify whether and how the different experimental conditions between the groups affected the results. Pharmacokinetic and pharmacodynamic studies are typical examples of this approach.
However, in many situations it is impossible to apply mathematical models with a general physiological meaning. This can be due to low signal-to-noise ratio in the measurements, poor knowledge of the physiological mechanisms, inappropriate experimental design, etc. The statistical analysis of this kind of data is a cause of major misunderstanding between statisticians and biomedical researchers and occasionally offers outstanding examples of misuse of statistical methods in biology. Without going deeper into the problem of the use and abuse of statistics, it is worth noticing that the low number of curves usually available in these studies raises further problems. In this situation, as it is impossible to verify the severe assumptions on which classical statistical methods are based, the use of Student's t-test, analysis of variance and other parametric tests should be avoided. Moreover, even nonparametric tests, such as point-to-point comparisons between the two mean curves, can give strange and incomprehensible results due to the different variability. For example, it may happen that the significance of the differences varies from point to point in an erratic way [1-3].
Although there is not a universal recipe for testing the difference between two groups of curves, in this paper a general-purpose nonparametric test is proposed which is based on a randomization technique and the mathematical notion of norm. The first key idea behind our test is to use the notion of norm in order to obtain a measure of the 'distance' between two curves. This distance is then used to work out an index of the difference between two sets of curves. The second key idea consists in assessing the statistical significance of the latter index by means of a randomization procedure. The test is entirely nonparametric, in that no assumption is made about the statistical distribution of the data. It is only required that the time series to be compared consist of the same number of points.
The paper is organized as follows. In Section 2, we concisely recall the notion of norm, by means of which, in Section 3, a suitable index of difference between groups of curves is introduced. Section 4 is devoted to the explanation of the randomization procedure, which completes the test. In Section 5, the new technique is applied to simulated and real data. We end the paper with some concluding remarks (Section 6) and the hardware and software specifications concerning the computer program that performs the test (Section 7).
2. 'Distance' between curves: the notion of Norm
Two time series, $x(t_i)$ and $y(t_i)$, $i = 1, \ldots, n$, can be represented as two n-dimensional vectors of data:
$$x = [x(t_1), x(t_2), \ldots, x(t_n)]', \qquad y = [y(t_1), y(t_2), \ldots, y(t_n)]'.$$
In the following, we will employ the term 'curve' when referring to such sequences of ordered data. Since $x, y \in R^n$, the 'distance' between x and y can be measured by means of any norm in $R^n$.
We recall that the norm $\|x\|$ is a real functional, defined on any element x of $R^n$, such that:
(a) $\|x\| \ge 0$, $\forall x$, and $\|x\| = 0$ if and only if $x = 0$.
(b) $\|a x\| = |a| \cdot \|x\|$, where a is any real number and $|a|$ denotes the absolute value of a.
(c) $\|x + y\| \le \|x\| + \|y\|$.
From point (a) it follows that $\|x - y\| \ge 0$, $\forall x, y$, and $\|x - y\| = 0$ if and only if $x = y$. Moreover, point (c) is the abstract formulation of the well known triangular inequality, which holds between the sides of any triangle. Since the norm can be seen as an extension to n-dimensional spaces of the geometric notion of distance, it is natural to consider $\|x - y\|$ as a measure of the 'distance' between the curves x and y. Note that a parameter such as the difference between the means of two curves does not share the properties of a norm and should not be used to measure the difference between two curves. In fact, it is very well known that curves having completely different shapes may have the same mean, i.e. the difference between the means does not satisfy point (a).
Various norms can be defined in $R^n$. In the sequel, the following three norms will be considered:
(1) 1-Norm:
$$\|x\|_1 = \sum_{k=1}^{n} |x_k|$$
(2) 2-Norm (Euclidean Norm):
$$\|x\|_E = \sqrt{\sum_{k=1}^{n} x_k^2}$$
(3) Infinity Norm (Maximum Norm):
$$\|x\|_M = \max_{k} |x_k|$$
The differences between these norms will be illustrated below by means of two simple examples. For the sake of intelligibility, curves consisting of only two points will be considered.
2.1. Example 1
Let $x = [1\ 1]'$, $y = [2\ 2]'$ and $z = [3\ 1]'$. Since $\|y - x\|_1 = 2$, $\|z - x\|_1 = 2$ and $\|y - x\|_E = \sqrt{2}$, $\|z - x\|_E = 2$, it follows that, by the 1-Norm, the distance between x and y is equal to the distance between x and z. In contrast, by the Euclidean Norm, y is closer to x than z is. In general, the 1-Norm gives more weight to differences distributed on a number of points than to differences concentrated on few points.
2.2. Example 2
Let $x = [1\ 1]'$, $y = [4\ 5]'$ and $z = [1\ 5]'$. Then, $\|y - x\|_1 = 7$, $\|z - x\|_1 = 4$, $\|y - x\|_E = 5$, $\|z - x\|_E = 4$ and $\|y - x\|_M = 4$, $\|z - x\|_M = 4$. Therefore, by the Maximum Norm, the distance between x and y is equal to the distance between x and z, while, by the 1-Norm and the 2-Norm (though to a different degree), y is farther from x than z is. It is apparent that, being based on the maximum difference, the Maximum Norm emphasizes a local phenomenon and does not take into account the global behaviour. Thus, the Maximum Norm should be used when one wants to stress the presence and location of one or a few peaks in the curves to be analyzed.
The above considerations are summarized in Fig. 1. In a two-dimensional space, where two-element curves are represented by points in the plane, the loci on which each of the three norms is equal to 1 are plotted.
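The norm values quoted in Examples 1 and 2 can be checked numerically. The following is a minimal Python sketch, not part of the original program; the function names `norm_1`, `norm_e`, `norm_m` and `difference` are chosen here purely for illustration.

```python
import math


def norm_1(v):
    """1-Norm: sum of the absolute values of the components."""
    return sum(abs(c) for c in v)


def norm_e(v):
    """2-Norm (Euclidean Norm): square root of the sum of squares."""
    return math.sqrt(sum(c * c for c in v))


def norm_m(v):
    """Infinity Norm (Maximum Norm): largest absolute component."""
    return max(abs(c) for c in v)


def difference(a, b):
    """Pointwise difference a - b of two curves sampled at the same time points."""
    if len(a) != len(b):
        raise ValueError("curves must have the same number of points")
    return [p - q for p, q in zip(a, b)]


# Example 1: equal distances by the 1-Norm, different by the Euclidean Norm.
x, y, z = [1, 1], [2, 2], [3, 1]
print(norm_1(difference(y, x)), norm_1(difference(z, x)))  # 2 2
print(norm_e(difference(y, x)), norm_e(difference(z, x)))  # sqrt(2) and 2.0

# Example 2: equal distances by the Maximum Norm only.
x, y, z = [1, 1], [4, 5], [1, 5]
for norm in (norm_1, norm_e, norm_m):
    print(norm.__name__, norm(difference(y, x)), norm(difference(z, x)))
```

Running the sketch on other two-point curves is a quick way to see which norm reacts to a single large deviation and which to several small ones.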
3. Difference index between two sets of curves
In this Section two sets of curves $X = \{x_i \mid i = 1, 2, \ldots, N_x\}$ and $Y = \{y_j \mid j = 1, 2, \ldots, N_y\}$ will be considered. In the following, it will be assumed that all the curves of the two sets are defined on the same time points $\{t_1, t_2, \ldots, t_n\}$: therefore $x_i = [x_i(t_1), x_i(t_2), \ldots, x_i(t_n)]'$. Now let us define an index D, which measures the difference between the two sets X and Y:
$$D(X, Y) = \frac{1}{N_x N_y} \sum_{i,j} \|x_i - y_j\|,$$
where $\|\cdot\|$ is any of the norms introduced in the previous Section. The index D is the average of the distances between all the pairs $(x_i, y_j)$, with $x_i \in X$ and $y_j \in Y$. From the definition it follows that $D \ge 0$. Moreover, if at least two curves $x_i \in X$ and $y_j \in Y$ exist such that $x_i \ne y_j$, then $D > 0$. Note that, since the two sets X and Y do not necessarily have the same number of curves, in general it is not possible to consider X and Y as elements of a suitable normed space. Therefore, D is not a norm, but only a 'difference index'.
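As an illustration of the definition of D, the following Python sketch averages the Euclidean distance over all pairs of curves taken one from each set. It is a hedged example rather than the authors' implementation; the names `euclidean_distance` and `difference_index` are assumptions introduced here, and any of the three norms could be passed in place of the Euclidean one.

```python
import math


def euclidean_distance(a, b):
    """Euclidean Norm of the difference between two curves of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def difference_index(X, Y, distance=euclidean_distance):
    """Average of distance(x, y) over all pairs with x in X and y in Y."""
    if not X or not Y:
        raise ValueError("both sets must contain at least one curve")
    total = sum(distance(x, y) for x in X for y in Y)
    return total / (len(X) * len(Y))


# Two curves in X, one curve in Y (two time points each).
X = [[1, 1], [2, 2]]
Y = [[3, 1]]
print(difference_index(X, Y))  # (2 + sqrt(2)) / 2
print(difference_index(X, X))  # sqrt(2) / 2, not zero
```

In this sketch D(X, X) is nonzero whenever X contains distinct curves, which is in line with the remark that D is a 'difference index' and not a norm.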
Fig. 1. Two-dimensional vectors $x = [x_1\ x_2]'$ are considered. Each vector can be associated to a point with co-ordinates $x_1$ and $x_2$. The figure shows the three loci of points such that $\|x\|_E = 1$, $\|x\|_1 = 1$ and $\|x\|_M = 1$.
4. Randomization test on the distance index D
The randomization test can be regarded as the parent of nonparametric tests. Its simple logic makes this test an example of the power and elegance of statistical techniques. A test of significance is, in general terms, a calculation by means of which one is able to accept or reject, with a known probability, a specific hypothesis (usually called null hypothesis) about a population. The value of this probability gives the significance level of the test. The term null hypothesis generally means that the hypothesis being tested is of no difference, with reference to a chosen descriptive index (mean, median, and so on), between the experimental groups. In such a situation, the classical procedure is to suppose that the null hypothesis is true and calculate the probability of chance being solely responsible for the given experimental result.
Now, consider our two groups X and Y of curves and denote by $D_0$ the value taken by the distance index D. If the null hypothesis is true, there is no difference between the two groups, i.e. the ensemble of all their curves constitutes a homogeneous group. In such a case, any observed difference between the groups would be due to subject variability and measurement error only. This means that any partition of the ensemble, obtained by randomly assigning $N_x$ subjects to group X and $N_y$ subjects to Y, is equally likely. Therefore, it is possible to consider the subjects of the two starting groups as belonging to the same population and generate all possible configurations of the two groups. Now, by calculating the index D for each generated combination, a distribution table of frequencies of D-values can be obtained. The probability P of chance being responsible for the experimental result found (i.e. the probability, under the null hypothesis, of obtaining a D-value as large as $D_0$) is given by:
$$P = N_d / N_t,$$
where $N_d$ is the number of configurations with D-value as large as $D_0$ and
$$N_t = \frac{(N_x + N_y)!}{N_x!\, N_y!}$$
is the total number of combinations of $(N_x + N_y)$ curves in two groups of $N_x$ and $N_y$ elements, respectively. Note that $N_t$ grows very rapidly with the size of the groups (for $N_x = N_y = 7$, $N_t = 3{,}432$; for $N_x = N_y = 8$, $N_t = 12{,}870$). As a consequence, the randomization test, in spite of its elegance, power and robustness, has rarely been used because of the large computational requirements. However, the power of modern computers makes it feasible, at least in simple experimental designs. Moreover, a Monte Carlo version of this test has been developed in order to limit computer time requirements [4]: a number of combinations are generated at random among all possible combinations. In [4] it has been shown that such a 'simulated randomization test' turns out to be computationally advantageous without considerably weakening the efficacy of the test.
In conclusion, the computer program that performs the test has to carry out the following tasks:
(a) given the two sets of curves, all combinations of $(N_x + N_y)$ curves in two groups of $N_x$ and $N_y$ elements are generated (alternatively, if the total number of combinations is too great, a suitable number of combinations are generated at random);
(b) for each combination, the index D ('distance' between the two groups) is computed;
(c) the significance level P is evaluated.
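Tasks (a) to (c) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the CURT program itself: the function names, the threshold `max_exact`, the number of random draws `n_random`, the `seed` and the small floating-point tolerance are all hypothetical choices made here, "as large as $D_0$" is read as "greater than or equal to $D_0$", and the random branch simply draws regroupings independently, which may differ in detail from the procedure of [4].

```python
import math
import random
from itertools import combinations


def euclidean_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def difference_index(X, Y):
    """Average Euclidean distance over all pairs (x, y), x in X, y in Y."""
    return sum(euclidean_distance(x, y) for x in X for y in Y) / (len(X) * len(Y))


def randomization_test(X, Y, max_exact=20000, n_random=2000, seed=0):
    """Return (D0, number of regroupings with D >= D0, regroupings examined, P)."""
    pooled = list(X) + list(Y)
    n, nx = len(pooled), len(X)
    d_obs = difference_index(X, Y)
    n_total = math.comb(n, nx)  # (Nx + Ny)! / (Nx! Ny!)

    # Task (a): enumerate every regrouping, or draw some at random if too many.
    if n_total <= max_exact:
        index_sets = combinations(range(n), nx)
        n_used = n_total
    else:
        rng = random.Random(seed)
        index_sets = (rng.sample(range(n), nx) for _ in range(n_random))
        n_used = n_random

    n_extreme = 0
    for idx in index_sets:
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(n) if i not in chosen]
        # Task (b): index D for this regrouping (tolerance guards against rounding).
        if difference_index(group_x, group_y) >= d_obs * (1.0 - 1e-12):
            n_extreme += 1

    # Task (c): significance level P = Nd / Nt (estimated if sampled at random).
    return d_obs, n_extreme, n_used, n_extreme / n_used


# Hypothetical data: five curves near 0 and three curves near 10 (three points each).
X = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [0.1, 0.1, 0.0]]
Y = [[10.0, 10.0, 10.0], [10.1, 10.0, 10.0], [10.0, 10.1, 10.0]]
d0, nd, nt, p = randomization_test(X, Y)
print(f"D0 = {d0:.2f}, cases >= D0: {nd}, total cases: {nt}, P = {p:.4f}")
```

For these clearly separated hypothetical groups only the original grouping reaches $D_0$, so the exhaustive branch examines 56 regroupings and returns P = 1/56.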
5. Results
5.1. Simulated data
Several simulations have been performed in order to test the proposed method and to study its behaviour in comparison with the use of Student's t-test. First, the errors of type α (false-positive errors: the test detects a difference that does not exist) were considered. Two sets of experimental curves, each set consisting of five curves of 30 points, were generated by adding to the same straight line a random error extracted from a normal distribution with zero expectation and given standard deviation. In this way, the difference between the two sets of curves was due to random error only. The standard deviation of the random error ranged from 0.5 to 2.5, with steps of 0.5, and, for each value, the experiment was repeated 1000 times. The randomization test (using the Euclidean Norm) as well as Student's t-test on the means of the two groups at each time point were performed. In both cases, p = 0.05 and p = 0.01 significance levels were considered. The results are summarized in Tables 1 and 2. The randomization test shows a good agreement with the nominal 5% and 1% values of the significance level (see Table 1). On the other hand, using Student's t-test at p = 0.05 and p = 0.01, in 78% and 26% of total cases there is at least one point which shows a significant t-value. Note that in 1% and 5% of cases, there are about 2-3 and more than 4 points, respectively, showing a significant t-value (see Table 2).
Another simulation was performed in order to investigate the percentage of errors of type β (false-negative errors: the test does not detect an existing difference). In this study, the curves of the two groups were generated by adding synthetic noise to two distinct parallel straight lines. In this case, the two sets were really different and our test was expected to detect this difference. Various numbers of curve points (30, 40, 50) and 20 values of standard deviation (ranging from 1 to 5) were used. For each situation, 1000 data sets were generated and the number of times the randomization test turned out to be significant at p = 0.05 and p = 0.01 was counted. The ratio of significant tests to total cases can be regarded as an estimate of the power (defined as 1 − P(F−), where P(F−) denotes the probability of a false-negative error) of the test. The power as a function of the standard deviation of the random error is presented in Fig. 2 for different numbers of points (30, 40, 50) and significance levels (p = 0.05 and p = 0.01). Since the difference between the straight lines (the signal) was chosen equal to 1, the abscissa can be regarded as a noise-to-signal ratio index. For example, an abscissa equal to 2 represents a situation in which the noise has a standard deviation that is twice the value of the signal, i.e. the mean difference between the curves of the two groups.
TABLE 1
Percentage of false-positive (F+) errors of the randomization test (1000 simulations)
Random error (SD)   F+ errors (%), p ≤ 0.05   F+ errors (%), p ≤ 0.01
0.5                 4.9                       0.4
1.0                 4.7                       1.1
1.5                 4.9                       0.7
2.0                 5.2                       1.0
2.5                 4.9                       0.5
Mean                4.92                      0.74
TABLE 2
Percentage of false-positive (F+) errors of point-to-point Student's t-test (1000 simulations). All entries are F+ errors (%).
Random error    c = 1 (a)           c = 2               c = 3               c = 4
(SD)            p ≤ 0.05  p ≤ 0.01  p ≤ 0.05  p ≤ 0.01  p ≤ 0.05  p ≤ 0.01  p ≤ 0.05  p ≤ 0.01
0.5             78.3      26.2      45.8      3.2       19.7      0.2       6.5       0.1
1.0             76.7      26.5      42.4      3.2       18.5      0.3       6.2       0.0
1.5             78.5      25.3      44.3      3.6       17.7      0.7       5.4       0.2
2.0             79.8      27.0      44.1      4.3       18.5      0.6       6.7       0.0
2.5             78.3      26.1      42.9      3.6       19.2      0.4       5.4       0.0
Mean            78.32     26.22     43.90     3.58      18.72     0.44      6.04      0.06
(a) c is a threshold: a difference between the two groups of curves is detected when the number of significant t-values is equal to or greater than c (in the experiment, each simulated curve consists of 30 points).
Fig. 2. The figure shows the power (= 1 minus the probability of a false-negative error) of the randomization test as a function of the SD, the simulated measurement error. Each curve corresponds to a different combination of number of points (30, 40, 50) and significance level (5%, 1%). The abscissa (Error SD) can also be interpreted as a noise-to-signal ratio index (see Section 5).
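The false-positive rates of the point-to-point t-test in Table 2 are close to what a simple binomial calculation predicts. The sketch below is a hedged illustration, not part of the original analysis: it assumes that the 30 pointwise tests are independent and that each one is significant with probability exactly equal to the nominal level when there is no true difference, and the function name `prob_at_least` is introduced here for convenience.

```python
from math import comb


def prob_at_least(c, n_points, alpha):
    """P(at least c of n_points independent tests are significant at level alpha)."""
    below = sum(
        comb(n_points, k) * alpha ** k * (1 - alpha) ** (n_points - k)
        for k in range(c)
    )
    return 1.0 - below


# Thresholds c = 1..4 of Table 2, with 30 points per curve.
for c in (1, 2, 3, 4):
    at_5 = 100 * prob_at_least(c, 30, 0.05)
    at_1 = 100 * prob_at_least(c, 30, 0.01)
    print(f"c = {c}: {at_5:.1f}% at p = 0.05, {at_1:.1f}% at p = 0.01")
```

Under these assumptions the c = 1 values come out at roughly 78.5% and 26.0%, close to the means of 78.32% and 26.22% reported in Table 2.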
5.2. Experimental data
[Fig. 3: three frequency histograms of the Euclidean norm distance index, as printed by the program; only the summary lines are reproduced here.]
A: distance between files DNF_10.DAT vs. SALINA.DAT; actual distance (A.d.) = 136.12; cases >= A.d.: 1; tot. cases: 56; prob.: 0.0179
B: distance between files LNF_10.DAT vs. SALINA.DAT; actual distance (A.d.) = 93.28; cases >= A.d.: 1; tot. cases: 56; prob.: 0.0179
C: distance between files LNF_10.DAT vs. DNF_10.DAT; actual distance (A.d.) = 124.82; cases >= A.d.: 2; tot. cases: 20; prob.: 0.1000
Fig. 3. The program CURT has been used to perform the randomization test on some groups of experimental curves (see Section 5). The figures show the frequency histograms of the differences between the sets obtained by generating all the possible combinations of the curves in two groups. The actual distance (A.d.), denoted by an arrow, indicates the value of the distance index actually observed between the experimental groups. Prob. provides the most stringent significance level in correspondence of which the test is satisfied. A: Saline (five curves) vs. d-fenfluramine (four curves). B: Saline (five curves) vs. l-fenfluramine (four curves). C: d-Norfenfluramine (three curves) vs. l-norfenfluramine (three curves).
The proposed method has also been applied to published experimental data [5]. In [5], two sets of three curves representing the time courses of a brain monoamine, after administration of two isomers of a drug, are compared with a set of five curves of animals treated with saline. The data range from 0 to 180 min and are sampled every 6 min. Statistical analysis was performed using Student's t-test applied at each time point (for details, see [5, Fig. 2B and Fig. 1C]). Results of our method, using the Euclidean Norm (analogous results are obtained by the Maximum Norm and the 1-Norm), are shown in Fig. 3A and Fig. 3B. The output of our program supplies the distribution table of the distance index D and the exact significance of the test. The results obtained by our method are in agreement with those obtained in [5] by using Student's t-test. However, only a significance level of 10% is obtained (see Fig. 3C) in the comparison between the two treated groups, whereas Student's t-test gives a significance level of 5% for time points in the range between 30 and 60 min. This disagreement is only apparent. In fact, the low number of curves in one of the two groups (three treated animals) allows only 20 total possible combinations, when randomizing, and a maximum attainable significance level of 1/20 (p = 0.05). In such a situation, a cautious judgement has to be used and the significance level of 10% obtained with the randomization test appears to be more realistic than the Student's 5% obtained only over a subset of data from 30 to 60 min (corresponding to seven samples out of 31).
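The effect of small groups on the attainable significance level can be illustrated by direct enumeration. The sketch below is an illustration under stated assumptions, not the CURT program: it uses hypothetical, widely separated one-point 'curves', takes the absolute difference as the distance, and counts how many regroupings give a D-value at least as large as that of the original grouping. The function name `smallest_attainable_p` is introduced here for the example.

```python
from itertools import combinations


def smallest_attainable_p(nx, ny):
    """Return (cases >= D0, total cases, P) for two maximally separated groups."""
    # Hypothetical one-point curves: group X near 0, group Y near 1000.
    pooled = [float(i) for i in range(nx)] + [1000.0 + i for i in range(ny)]
    n = nx + ny

    def d_index(idx):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(n) if i not in chosen]
        total = sum(abs(a - b) for a in group_x for b in group_y)
        return total / (len(group_x) * len(group_y))

    values = [d_index(idx) for idx in combinations(range(n), nx)]
    d_obs = values[0]  # combinations() yields the original grouping (0..nx-1) first
    n_extreme = sum(1 for v in values if v >= d_obs)
    return n_extreme, len(values), n_extreme / len(values)


print(smallest_attainable_p(5, 3))  # five curves vs. three curves
print(smallest_attainable_p(3, 3))  # three curves vs. three curves
```

With five curves against three, only the original grouping reaches $D_0$ and the result is 1/56 ≈ 0.0179, the value printed in Fig. 3A and Fig. 3B. With three curves against three, exchanging the labels X and Y leaves D unchanged, so in this sketch two of the 20 regroupings reach $D_0$ and the smallest value obtained is 2/20 = 0.10, which is the value printed in Fig. 3C.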
6. Concluding remarks
Nowadays, the wide availability of powerful micro and personal computers has made possible the use of statistical tests based on randomization techniques. In this paper, by resorting to the notion of norm and a randomization technique, a new method for the statistical comparison of two groups of curves has been presented. The choice of the kind of norm (Euclidean Norm, 1-Norm, Maximum Norm) allows one to specify alternative criteria: the Maximum Norm stresses disagreements due to differently located peaks, whereas the Euclidean Norm (and, to a greater degree, the 1-Norm) gives more weight to distributed differences. It seems that in most practical applications the use of the Euclidean Norm would prove satisfactory. While the point-to-point Student's t-test produces as many results (possibly contradictory) as the number of points, our test provides a single, unambiguous result. The performance and applicability of the test have been evaluated by means of both simulated and experimental data.
7. Hardware and software specifications
CURT, the program that implements the test, has been written in C and is currently available on DEC VAX systems under the VMS operating system, SUN 3 computers under Unix BSD 4.2, and IBM-PC compatibles under MS-DOS.
Acknowledgements
The authors are greatly indebted to Dr. M.G. De Simoni (Istituto di Ricerche Farmacologiche 'Mario Negri'): she called their attention to the problem considered in this paper and provided several experimental data. They would also like to acknowledge the fact that the computer program was written by M. Lena.
References
[1] R.E. Kirk, Experimental Design: Procedures for the Behavioral Sciences (Brooks/Cole, Belmont, MA, 1968).
[2] D.S. Salsburg, The religion of statistics as practiced in medical journals, Am. Stat. 39 (1985) 220-223.
[3] E.R. Burns, A critique of the practice of comparing control data obtained at a single time point to experimental data obtained at multiple time points, Cell Tissue Kinet. 14 (1981) 219-224.
[4] M. Recchia and M. Rocchetti, The simulated randomization test, Comput. Programs Biomed. 15 (1982) 111-116.
[5] M.G. De Simoni, Z. Juraszczyk, F. Fodritto, A. De Luigi and S. Garattini, Different effects of fenfluramine isomers and metabolites on extracellular 5-HIAA in nucleus accumbens and hippocampus of freely moving rats, Eur. J. Pharmacol. 153 (1988) 295-299.