Fuzzy Sets and Systems 225 (2013) 23 – 38 www.elsevier.com/locate/fss
Change point analysis of imprecise time series
Carmela Cappelli (a,∗), Pierpaolo D’Urso (b), Francesca Di Iorio (a)
(a) Università Federico II di Napoli, Italy
(b) Sapienza Università di Roma, Italy
Received 20 July 2012; received in revised form 28 February 2013; accepted 1 March 2013 Available online 13 March 2013
Abstract
In this paper we describe how to conduct a change-point analysis when dealing with time series imprecisely or vaguely observed, i.e. time ordered observations whose values are not known exactly, such as interval or ordinal time series (imprecise time series). In order to treat such time series, we propose to employ a fuzzy approach, i.e. data are parameterized in the form of fuzzy variables. Then, to detect the number and location of change points we employ a deviation measure for fuzzy variables in the framework of Atheoretical Regression Trees (ART). We present simulation results pertaining to the behavior of the proposed approach as well as two empirical applications to real imprecise time series.
© 2013 Elsevier B.V. All rights reserved.
Keywords: Imprecise time-varying evaluations; Fuzzy time series; Atheoretical Regression Trees (ART); Deviation of fuzzy time series; Bayesian Information Criterion (BIC)
1. Motivation
Change point analysis comprises various statistical tools which are employed for determining if and when a change in a data set has occurred. In the last two decades it has emerged as a relevant research topic in both the statistics and the econometrics literature (for a review see [18,19]). The detection of change points is relevant for several reasons. First, it can reveal a behavior of the time series that could otherwise be misunderstood and modeled inadequately; a well known example is the confusion between long memory and occasional breaks in the mean, which may lead to an erroneous identification of an integrated or fractionally integrated process (see e.g. [16,27]). Second, in the context of forecasting, detecting change points makes it possible to improve the quality of the forecasts, especially in the case of long series covering extended periods of time. Finally, the identification of breaks might isolate shorter periods between longer ones, revealing the presence of outliers and thus the need for adjusting the data (see for example [6]). Since the seminal paper by Andrews [1], which addressed the case of a single break occurring at an a priori unknown break date, the focus has been on detecting multiple breaks at unknown dates, which represents the most challenging task. In this context the undisputed contribution is due to Bai and Perron, who in various papers [2–4] have presented a comprehensive discussion of the issue, providing estimation methods, testing procedures and confidence intervals for multiple structural changes in the linear model framework.
∗ Correspondence to: Dipartimento di Scienze Politiche, Via L. Rodinò n. 22, 80138, Italy. Tel.: +39 3498010494; fax: +39 0812537466.
0165-0114/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.fss.2013.03.001
In the case of multiple changes in the mean, Cappelli et al. [6] have proposed a method called Atheoretical Regression Trees (ART) that employs Least Square Regression Trees (henceforth LSRT) to estimate the number and location of multiple change points. Extensive simulation studies, comparisons with current methods and applications to various real time series have provided evidence of the usefulness of the approach (see [25]). Also, in Cappelli et al. [7] a straightforward extension of ART called Theoretical Regression Trees (TRT) is introduced. It employs the recursive partitioning principle of regression trees for locating changes in the coefficients of a parametric model, considering the general framework of the linear model. In general, change point analysis is performed on numerical time series measured by exact (crisp) values. In this paper we focus on time series which are imprecisely or vaguely observed (henceforth denoted imprecise time series). In particular we describe how to employ ART to conduct change-point analysis of this type of time series, where the imprecision is assumed to be represented by means of fuzzy sets, giving rise to fuzzy time series. Indeed, in many real life and research situations we meet time ordered data whose values are not known exactly. For example, they may assume several values at a given time, as in a time series of daily temperatures measured on an hourly basis, or there may be uncertainty in representing the value at a given time, as in a time series of prognostic judgments of risk for a disease expressed by linguistic terms. In either case, the use of the average or extreme (highest or lowest) temperature during the daily cycle, or of a numerical coding for the linguistic terms, entails loss of information and inaccuracies. On the contrary, a natural way to deal with and account for imprecision is to consider fuzzy logic [29], which is by definition imprecise.
In the context of change point analysis the imprecise time series is converted into a fuzzy time variable and, in order to estimate the number and location of change points, we propose to employ, in the framework of ART, a deviation measure decomposition for fuzzy variables [13] based on the Yang–Ko metric [28]. The remainder of the paper is organized as follows. In Section 2 we introduce the notion of imprecise data, discussing how and why imprecision can be fruitfully represented by means of fuzzy sets and providing the basic fuzzy formalization of imprecise data; in Section 3 we address the main issue of the paper, i.e. the estimation of multiple change points in imprecise time series; in particular we introduce the ART method, illustrating how it can be employed to detect change points in imprecise time series upon a proper fuzzy parameterization. In Section 4 we present the results of a large simulation study pertaining to the behavior of the proposed approach considering imprecise continuous-valued time series, whereas in Section 5 we report two empirical examples that illustrate its applicative effectiveness. Final remarks follow in Section 6.

2. Imprecise information and fuzziness
In the process of knowledge acquisition both data (empirical information) and theoretical assumptions (theoretical information) may be affected by imprecision, uncertainty or vagueness, stemming from several sources. In the specific case of statistical reasoning, various features of uncertainty may be considered [8]: (i) the vagueness connected with the use of linguistic terms in the description of the real world (e.g. when analyzing ordinal qualitative data); (ii) the imprecision deriving from the granularity of the terms utilized in the description of the physical world [30] (e.g.
in a sociological investigation we may observe or analyze the variable “age of a person” in terms of granules consisting of single years, or intervals of five years, or ordered classes such as young, middle age, old; an increasing uncertainty is associated with these different granulations); (iii) the imprecision in measuring the empirical phenomena; (iv) the uncertainty related to the link between the observed data and the universe of possible data; (v) the (partial or total) ignorance concerning the values of a phenomenon in a specific observational instance or the validity of a given theoretical assumption (e.g. when adopting a Gaussian model for a stochastic quantity). In this paper, we focus on cases (i)–(iii) and we consider the situations in which the empirical time ordered data are imprecisely or vaguely observed. The imprecise time series are expressed by ordinal qualitative/linguistic time series codified either by suitable scaling processes or represented by time series intrinsically imprecise, where the imprecision is connected to the granularity concept or to the (non-probabilistic) measurement error. The imprecision is assumed to be represented by means of fuzzy sets giving rise to fuzzy time series, i.e. parameterized in the form of fuzzy time series.
To this purpose, two interpretations of data fuzziness can be considered: the “ontic” interpretation looks at fuzziness as being an intrinsic property of the investigated phenomenon, while the “epistemic” interpretation assumes that it is due to subjective ignorance about an underlying crisp (non-fuzzy) value. The relevance of the fuzzy approach in the representation of imprecise data is underlined in several contributions in the literature. A relevant example is Manski and Tamer [23], who discuss the case of interval data. Indeed researchers often deal with interval data, i.e. fuzzy data with uniform membership function, on variables that could be measured more precisely. This is the case of the Health and Retirement Study (HRS), where respondents are asked to report their health, in particular whether it falls within a sequence of brackets, yielding a health interval for each respondent. A further use of fuzzy values is illustrated by Phillis and Kouikoglou [24], who argue that fuzzy values are appropriate to express how humans extract qualitative information from numerical, categorical or linguistic data as well as the way they use this information to make decisions and assessments. González-Rodríguez et al. [17] point out that several types of data such as evaluations, medical diagnoses or ratings cannot be described by means of numerical values and are usually classified as either nominal or ordinal. The most common example is that of Likert scales, whose categories are labeled with numerical values. The statistical analysis of these scales is rather limited because several techniques cannot be directly used and/or the interpretation and the reliability of the results are notably reduced. Moreover the transition from one category to another is arbitrary and the various categories may not be perceived in the same manner by different respondents, so that both variability and accuracy cannot always be well captured.
Fuzzy values provide a new and simple representation of such data that is more expressive and accurate than classical ordinal scales. Many of the usual statistical techniques, measures and models have been extended; the transition from one category to another can be rendered gradually and the variability, accuracy and subjectiveness can be incorporated in the data. In the same vein, Sinova et al. [26] remark that fuzzy values make it possible to cope with the imprecision associated with a wide set of imprecise data such as data from surveys, ratings, etc. Rather than treating these types of data as either numerical or categorical, a fuzzy scale combines many of the advantages of both scales, such as the manageability of the numerical scale and the interpretability of the categorical one. In the specific case of imprecision associated with the use of linguistic term-based scales in the evaluation process we distinguish two fuzzy approaches:
1. Fuzzy rating scale: This kind of scale is a modification of traditional approaches to measurement that allows respondents to indicate both a preferred point on a scale and latitudes of acceptance on either side, and it represents a way to elicit imprecise valuations/ratings that captures the inherent imprecision of these valuations as well as their diversity, variability and subjectivity [20,15].
2. Likert or associated fuzzy conversion scales: They are the most commonly used scales to elicit imprecise responses in describing ratings, evaluations, perceptions and judgments of non-numerically measurable attributes. In this case the inherent imprecision is reflected, although the diversity, variability and subjectivity may not be captured properly [15].
The applications shown in Section 5 are devoted to types of imprecision (i)–(iii); thus, the fuzziness associated with ordinal qualitative time data (case (i)) is expressed by means of a fuzzy conversion scale, i.e.
we employ a fuzzy version of Likert-type scales; the fuzziness connected to the granularity (case (ii)) or to the measurement of the empirical phenomena (case (iii)) is suitably formalized in an analytical manner. Following a fuzzy approach, an imprecise variable can be formalized in terms of fuzzy numbers. In particular we consider fuzzy numbers of type LR; thus the imprecise variable $\tilde{Y}$ is represented as an LR fuzzy variable $\tilde{Y} \equiv (c, l, u)_{LR}$ with the following membership function:

$$
\mu_{\tilde{Y}}(y) =
\begin{cases}
L\left(\dfrac{c - y}{l}\right), & y \le c \ (l > 0) \\[2ex]
R\left(\dfrac{y - c}{u}\right), & y \ge c \ (u > 0);
\end{cases}
$$

where $c$ denotes the center and $l$ and $u$ the left and right spreads, respectively; $L$ (and similarly $R$) is a decreasing shape function from $\mathbb{R}^+$ to $[0, 1]$ with $L(0) = 1$, $L(z) < 1$ for all $z > 0$, $L(z) > 0$ for all $z < 1$, and $L(1) = 0$ (or $L(z) > 0$ for all $z$ and $L(+\infty) = 0$) (see [11,12,14,32]). Note that LR fuzzy numbers are particularly appropriate in the context of time series because the left and right spreads make it possible to account for the variability within the given time unit. For example, when analyzing the daily prices of a financial asset, the estimated volatility can be incorporated through the spreads.
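The membership function above can be evaluated directly. The following is a minimal sketch, assuming a linear (triangular) shape function for both L and R; the function names `triangular` and `lr_membership` and the temperature example are illustrative assumptions, not part of the original paper.

```python
from typing import Callable


def triangular(z: float) -> float:
    """Linear shape function (assumed here): 1 at z = 0, reaching 0 at z = 1."""
    return max(0.0, 1.0 - z)


def lr_membership(
    y: float,
    c: float,
    l: float,
    u: float,
    L: Callable[[float], float] = triangular,
    R: Callable[[float], float] = triangular,
) -> float:
    """Membership degree of the crisp value y in the LR fuzzy number (c, l, u)."""
    if l <= 0 or u <= 0:
        raise ValueError("both spreads must be strictly positive")
    if y <= c:
        return L((c - y) / l)  # left branch: y <= c
    return R((y - c) / u)      # right branch: y > c


# Hypothetical daily temperature: center 20, left spread 4, right spread 2.
for y in (16.0, 18.0, 20.0, 21.0, 25.0):
    print(y, lr_membership(y, c=20.0, l=4.0, u=2.0))
```

With unequal spreads the membership function is asymmetric: it falls to zero four units below the center but only two units above it.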
Fig. 1. A split in a tree diagram.
In order to analyze statistical aspects of LR fuzzy variables, many useful statistical measures and metrics for fuzzy variables can be found in the literature. In particular, in the following sections we will refer to the Yang–Ko [28] metric approach.

3. Change point analysis of imprecise time series via ART
In this section we introduce the issue of estimating multiple change points from a general point of view and we show how, in the case of changes in the mean, the problem can be addressed using the ART method, illustrating in particular how it can be employed to detect change points in imprecise time series. Let $Y_t$ be a time series characterized by $m + 1$ regimes and $m$ change points, so that $t = T_{j-1} + 1, \ldots, T_j$ and $j = 1, \ldots, m + 1$ (we adopt the convention that $T_0 = 0$ and $T_{m+1} = T$, where $T$ is the length of the series). A common estimation method for the set of unknown break dates is based on the least squares principle, i.e. the estimated break points $(\hat{T}_1, \ldots, \hat{T}_m)$ are such that

$$
(\hat{T}_1, \ldots, \hat{T}_m) = \arg\min_{(T_1, \ldots, T_m)} SSR(T_1, \ldots, T_m) \tag{1}
$$

where $SSR(T_1, \ldots, T_m)$ denotes the sum of squared residuals of the partition, which in the case of multiple changes in mean is given by

$$
SSR(T_1, \ldots, T_m) = \sum_{j=1}^{m+1} \sum_{t=T_{j-1}+1}^{T_j} (Y_t - \mu_j)^2
$$

with $\mu_j$ the mean of regime $j$.
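As a concrete reading of objective (1), the sketch below computes the sum of squared residuals of a given partition of a crisp series and finds the best single break by exhaustive search. The function names are illustrative assumptions, and each regime mean is estimated by the sample mean of its segment.

```python
from typing import Sequence


def ssr_partition(y: Sequence[float], breaks: Sequence[int]) -> float:
    """SSR of the partition defined by break dates T_1 < ... < T_m.

    Each break T_j is the (1-based) last observation of regime j, so regime j
    covers the Python slice y[T_{j-1}:T_j].
    """
    bounds = [0, *breaks, len(y)]
    total = 0.0
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = y[start:end]
        mu = sum(segment) / len(segment)  # sample mean of the regime
        total += sum((value - mu) ** 2 for value in segment)
    return total


def best_single_break(y: Sequence[float]) -> int:
    """Break date minimizing the SSR when exactly one change is allowed."""
    return min(range(1, len(y)), key=lambda t1: ssr_partition(y, [t1]))


y = [0.0] * 5 + [2.0] * 5
print(ssr_partition(y, []))   # no break: one common mean
print(ssr_partition(y, [5]))  # break at the true date
print(best_single_break(y))
```

For several breaks the exhaustive search grows quickly with the number of candidate partitions, which is the motivation for the faster tree-based procedure discussed next.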
To detect the presence of such changes Cappelli et al. [6] have proposed a procedure based on LSRT [5], which are piecewise-constant models: a node $h$, i.e. a subsample of statistical units, is split into its left and right descendants $h_l$ and $h_r$ to reduce the deviance of the observed dependent variable $y$, fitting to each node the mean of the corresponding $y$ values. The algorithm selects the split, i.e. the binary division, that minimizes the sum of squared residuals:

$$
SSR(h_l) + SSR(h_r) = \sum_{g \in \{l, r\}} \sum_{y \in h_g} (y - \hat{\mu}(h_g))^2 \tag{2}
$$

where $\hat{\mu}(h_g)$ is the mean of the $y$ values in node $h_g$ ($g \in \{l, r\}$). Note that the splitting criterion (2) corresponds to the objective function (1) computed for a binary partition and it is based on the deviation decomposition property. Fig. 1 displays the procedure for a single split in a binary tree diagram. Once the binary partition of a node is performed, the splitting process is recursively applied to each subnode until either the subnodes reach a minimum size or no improvement of the criterion can be achieved. LSRT provide a practical tool for detecting multiple changes in the mean occurring at unknown dates. Tree regressing an observed time series $y_t$ on a sequence of completely ordered numbers $k = 1, \ldots, T$ gives rise to a partition of the series into contiguous segments such that $\hat{\mu}_j \neq \hat{\mu}_{j+1}$; the partition is represented as a binary tree diagram whose split points identify candidate change points, whereas tree pruning together with model selection criteria gives their actual number. The method, called ART (Atheoretical Regression Trees), mimics Bai and Perron’s estimation method of the change points. The latter is a global minimizer of the objective function (1), whereas ART is a local minimizer; for this reason ART is much faster (at any node $h$ the complexity is of order $O(n(h))$, where $n(h)$ is the cardinality of node $h$) while providing comparable results (for details on the method, simulation studies, comparison with Bai and Perron’s and other current methods and applications to various real time series, see [25]).

Fig. 2. Examples of membership functions for particular values of $\lambda$ and $\rho$.

As discussed in Section 2, an imprecise time series can be parameterized in the form of an LR fuzzy time series $\tilde{Y}_t \equiv (c_t, l_t, u_t)_{LR}$, where $c_t$ denotes the center at time $t$ and $l_t$ and $u_t$ the left and right spreads, respectively, with the following membership function:
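$$
\mu_{\tilde{Y}_t}(y_t) =
\begin{cases}
L\left(\dfrac{c_t - y_t}{l_t}\right), & y_t \le c_t \ (l_t > 0) \\[2ex]
R\left(\dfrac{y_t - c_t}{u_t}\right), & y_t \ge c_t \ (u_t > 0).
\end{cases}
$$

Then, based on the above-mentioned metric approach by Yang and Ko [28], we can define the deviation of the fuzzy time series over the entire sample period ($t = 1, \ldots, T$) as

$$
SS(\tilde{Y}_t) = 3\sum_{t=1}^{T}(c_t - \bar{c})^2 - 2\lambda\sum_{t=1}^{T}(c_t - \bar{c})(l_t - \bar{l}) + \lambda^2\sum_{t=1}^{T}(l_t - \bar{l})^2 + 2\rho\sum_{t=1}^{T}(c_t - \bar{c})(u_t - \bar{u}) + \rho^2\sum_{t=1}^{T}(u_t - \bar{u})^2, \tag{3}
$$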
where $\lambda = \int_0^1 L^{-1}(\omega)\,d\omega$ and $\rho = \int_0^1 R^{-1}(\omega)\,d\omega$ are parameters which summarize the shape of the left and right tails of the membership function and $\bar{c}$, $\bar{l}$, $\bar{u}$ are the mean values of $c$, $l$ and $u$. The deviation (3) is built on the Yang–Ko (squared) Euclidean distance between the pair of LR fuzzy variables $\tilde{Y}_t \equiv (c_t, l_t, u_t)_{LR}$ and $\bar{Y} \equiv (\bar{c}, \bar{l}, \bar{u})_{LR}$, summed over $t$. Note that $SS(\tilde{Y}_t)$ weighs the centers and spreads differently by means of $\lambda$ and $\rho$. These weights are generally lower than one, except when the membership function gives a relatively greater importance to points away from the center. However, as is reasonable to expect, the weights of the centers are larger than the weights pertaining to the spreads. By means of $\lambda$ and $\rho$ we can define a suitable criterion to weigh the centers and spreads of the fuzzy variable; therefore $SS(\tilde{Y}_t)$ explicitly takes into account the shape of the membership function of the fuzzy variable. We observe that, as in the subjectivistic approach to probability, the choice of the membership functions is also subjective. In fact, the membership functions are context-sensitive. Furthermore, the functions are not determined in an arbitrary way, but are based on a sound psychological/linguistic foundation. It follows that the choice of the membership function should be made in such a way that the function captures the approximate reasoning of the person involved. In this respect, the elicitation of a membership function requires a deep psychological understanding [8]. In Fig. 2 we show the shape of the membership function for various values of $\lambda$ and $\rho$ in the case of fuzzy variables with symmetrical membership function. D’Urso and Santoro [13] proved that deviation (3) satisfies the decomposition property; thus, we can apply ART to detect changes in the mean of an observed fuzzified time series $y_t$ using the above introduced deviation measure in the computation of the splitting criterion.
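Deviation (3) can be checked numerically against a three-term form of the squared Yang–Ko distance, in which the center, the point `c - lam * l` and the point `c + rho * u` are each compared with their counterparts for the mean fuzzy variable. This three-term form is our reading of [28], not something stated in the text above; expanding the three squares reproduces (3) term by term. The sketch below is illustrative only and its function names are assumptions.

```python
from typing import Sequence, Tuple

Triple = Tuple[float, float, float]  # (center, left spread, right spread)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def fuzzy_deviation(c: Sequence[float], l: Sequence[float], u: Sequence[float],
                    lam: float, rho: float) -> float:
    """Deviation (3): expanded form with weights lam and rho."""
    dc = [x - mean(c) for x in c]
    dl = [x - mean(l) for x in l]
    du = [x - mean(u) for x in u]
    return (3 * sum(a * a for a in dc)
            - 2 * lam * sum(a * b for a, b in zip(dc, dl))
            + lam ** 2 * sum(b * b for b in dl)
            + 2 * rho * sum(a * b for a, b in zip(dc, du))
            + rho ** 2 * sum(b * b for b in du))


def yang_ko_sq(x: Triple, y: Triple, lam: float, rho: float) -> float:
    """Squared distance in three-term form (assumed reading of [28])."""
    (cx, lx, ux), (cy, ly, uy) = x, y
    return ((cx - cy) ** 2
            + ((cx - lam * lx) - (cy - lam * ly)) ** 2
            + ((cx + rho * ux) - (cy + rho * uy)) ** 2)


# Made-up fuzzy observations, triangular shape (lam = rho = 1/2).
c = [1.0, 2.0, 4.0]
l = [0.5, 1.0, 0.5]
u = [1.0, 1.0, 2.0]
lam = rho = 0.5
center = (mean(c), mean(l), mean(u))
direct = sum(yang_ko_sq(obs, center, lam, rho) for obs in zip(c, l, u))
print(abs(fuzzy_deviation(c, l, u, lam, rho) - direct) < 1e-9)
```

As for the weights themselves: for the linear shape $L(z) = 1 - z$ the inverse is $L^{-1}(\omega) = 1 - \omega$, so $\lambda = 1/2$, while the parabolic shape $L(z) = 1 - z^2$ gives $L^{-1}(\omega) = \sqrt{1 - \omega}$ and $\lambda = 2/3$.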
In the present case the best split of a generic node $h$ minimizes

$$
SSR(h_l) + SSR(h_r) = \sum_{g \in \{l, r\}} \Bigg[\, 3\sum_{t=1}^{T(h_g)} (c_t - \bar{c}(h_g))^2 - 2\lambda \sum_{t=1}^{T(h_g)} (c_t - \bar{c}(h_g))(l_t - \bar{l}(h_g)) + \lambda^2 \sum_{t=1}^{T(h_g)} (l_t - \bar{l}(h_g))^2 + 2\rho \sum_{t=1}^{T(h_g)} (c_t - \bar{c}(h_g))(u_t - \bar{u}(h_g)) + \rho^2 \sum_{t=1}^{T(h_g)} (u_t - \bar{u}(h_g))^2 \Bigg] \tag{4}
$$
where $\bar{c}(h_g)$, $\bar{l}(h_g)$ and $\bar{u}(h_g)$ are the mean values of the centers and of the left and right spreads in node $h_g$ ($g \in \{l, r\}$). After a large tree is grown we employ classical cost-complexity pruning [5] to generate a sequence of subtrees, i.e. of nested partitions that are alternative change point models of various dimension (number of changes and regimes). A common procedure to select the preferable subtree (model) among the competing ones is to consider an information criterion; in particular we use the Bayesian Information Criterion (BIC) defined as $BIC(m) = \ln \hat{\sigma}^2(m) + p \ln(T)/T$, where $\hat{\sigma}^2(m) = T^{-1} SSR_{y_t}(\hat{T}_1, \ldots, \hat{T}_m)$, $SSR_{y_t}(\hat{T}_1, \ldots, \hat{T}_m)$ is the sum of squared residuals of the $m$-partition of the fuzzified time series, and $p = (m + 1) \times (k + 1)$ with $k = 3$ because in each regime three parameters are estimated, i.e. the mean values of the centers and of the left and right spreads, respectively.

4. Simulation experiments
In this section we present the results of simulation experiments carried out to evaluate the proposed approach, considering the number of structural change points detected (cp) and the rate of correct identification of the change points (ci) (either exact identifications or short intervals around the true value) as performance indicators. Three cases have been analyzed:
1. Changes only in the centers.
2. Changes only in the spreads.
3. Changes in both centers and spreads.
In the simulations we have considered continuous-valued data. The data generating process (henceforth denoted DGP)
of the centers and spreads are $c_t \sim N(\mu_t, \sigma^2)$, $l_t \sim U(a_t, b_t)$ and $u_t \sim U(a_t, b_t)$, respectively. Throughout the simulations $\sigma^2 = 1$ and $m = 2$, thus two changes occur in the data; the change points are $T_1 = 100$ and $T_2 = 200$, the length of the series is $T = 300$ and 1000 Monte Carlo replications are generated.

4.1. Case study I
We start with the case where changes occur only in the centers and thus the presence of the spreads should not affect their identification. Indeed this is a base case to assess the behavior of the method, which we expect to perform as standard ART for exact data, providing similar results. The parameter values of the Normal distribution of the centers and of the continuous Uniform of the spreads are

$$
\begin{cases}
\mu_t = 0 & \text{if } t \le 100 \\
\mu_t = 1 & \text{if } 100 < t \le 200 \\
\mu_t = 2 & \text{if } t > 200 \\
a_t = 0,\ b_t = 1 & \text{if } t = 1, \ldots, 300 \quad \text{(left spreads)} \\
a_t = 1,\ b_t = 2 & \text{if } t = 1, \ldots, 300 \quad \text{(right spreads)}
\end{cases}
$$

For illustrative purposes, Fig. 3 displays one of the simulated series of the centers with the upper and lower bounds defined by the spreads. Although the graphical inspection of the series clearly suggests the presence of two changes, their size is small and, in general, their shape, characterized by increasing steps, represents a difficult case where most procedures fail to select both the right number of changes and their location. We have applied ART employing the deviation measure defined in (4) and considering alternative values for the parameters $\lambda$ and $\rho$ that control the shape of the membership function; finally, to avoid the identification of changes in the tails, we have set a minimum segment length of 20 observations. Table 1 reports the results averaged over the 1000 Monte Carlo replications. Starting with the mean number of detected change points (cp), the method identifies the right number of changes and only occasionally detects an additional spurious break. The rates of correct identifications (ci) are, as usual, low in the case of exact identifications but they become adequate and decidedly high as we consider larger intervals around the true date. Note that in a further experiment, whose complete results we do not report for the sake of brevity, we have doubled the size of the shifts (the mean increases by two at each step) and, as expected, the rates of correct identifications increased considerably, reaching 89% and 94% within ±4 observations for the first and second change respectively. A further remark concerns the values of the parameters $\lambda$ and $\rho$, which affect neither the mean number of changes detected nor the rates of correct identifications. Indeed these parameters reflect the shape of the membership function, which involves the spreads, and the spreads are not subject to change. Finally, since in this case only the centers are subject to change, for comparison purposes we have applied standard ART to the series of the centers employing deviation measure (2); the results are summarized in Table 2. We see that both the number of identified changes and the rates of correct identifications are similar to the previous experiment (occasionally slightly higher correct identifications). Based on these results we conclude that, although the spreads are not subject to change and thus the data could be seen and treated as crisp data, the use of fuzzy values and correspondingly of a fuzzy deviation measure does not affect the analysis and the standard method does not outperform the fuzzy approach.

Fig. 3. Changes only in the centers, one simulated series.

Table 1. Changes only in the centers: simulation results averaged over the 1000 MC replications.

| λ = ρ | cp | Break 1 (T1 = 100): ci | ci ± 1 | ci ± 2 | ci ± 4 | Break 2 (T2 = 200): ci | ci ± 1 | ci ± 2 | ci ± 4 |
|---|---|---|---|---|---|---|---|---|---|
| 1/3 | 2.09 | .24 | .47 | .57 | .71 | .27 | .46 | .60 | .75 |
| 1/2 | 2.10 | .25 | .47 | .58 | .72 | .27 | .47 | .60 | .75 |
| 2/3 | 2.12 | .25 | .47 | .58 | .72 | .27 | .47 | .60 | .76 |

Table 2. Standard ART applied to the center series: simulation results averaged over the 1000 MC replications.

| cp | Break 1 (T1 = 100): ci | ci ± 1 | ci ± 2 | ci ± 4 | Break 2 (T2 = 200): ci | ci ± 1 | ci ± 2 | ci ± 4 |
|---|---|---|---|---|---|---|---|---|
| 2.11 | .26 | .47 | .60 | .76 | .25 | .47 | .60 | .76 |
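The data generating process of case study I and the ci indicator can be sketched as follows. This is a minimal illustration using only the standard library, not the authors' simulation code. It assumes that the first pair of Uniform bounds refers to the left spreads and the second to the right spreads, and that a replication counts as a correct identification when at least one detected break falls within the chosen window around the true date; the function names and the `made_up` detections are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def simulate_case_one(seed: int, T: int = 300,
                      breaks: Tuple[int, int] = (100, 200),
                      sigma: float = 1.0
                      ) -> Tuple[List[float], List[float], List[float]]:
    """One replication: mean shifts 0 -> 1 -> 2 in the centers, stable spreads."""
    rng = random.Random(seed)
    centers, left, right = [], [], []
    for t in range(1, T + 1):
        if t <= breaks[0]:
            mu = 0.0
        elif t <= breaks[1]:
            mu = 1.0
        else:
            mu = 2.0
        centers.append(rng.gauss(mu, sigma))
        left.append(rng.uniform(0.0, 1.0))   # assumed: bounds of the left spreads
        right.append(rng.uniform(1.0, 2.0))  # assumed: bounds of the right spreads
    return centers, left, right


def hit_rate(estimates: Sequence[Sequence[int]], true_break: int, window: int) -> float:
    """Share of replications with a detected break within +/- window of the truth."""
    hits = sum(1 for found in estimates
               if any(abs(k - true_break) <= window for k in found))
    return hits / len(estimates)


centers, left, right = simulate_case_one(seed=1)
print(len(centers), min(left) >= 0.0, max(right) <= 2.0)

# Hypothetical detections from three replications (not results from the paper).
made_up = [[100, 200], [98, 203], [120]]
print(hit_rate(made_up, true_break=100, window=0))
print(hit_rate(made_up, true_break=100, window=2))
```

Widening the window can only increase the rate, which matches the pattern of the ci ± 1, ci ± 2 and ci ± 4 columns in the tables.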
4.2. Case study II
This case study, i.e. changes occurring only in the spreads, is particularly relevant because, if a standard method that does not take into account the imprecise nature of the data, such as classical ART or Bai and Perron’s, is applied to the series of the centers (as representative point data of the imprecise series), no change points would be detected. On the contrary, we expect the use of the fuzzy deviation measure to enable the detection of the changes. We have considered two alternative settings characterized by changes of increasing magnitude; for the first one the parameter values employed in the DGP of the centers and spreads are

$$
\begin{cases}
\mu_t = 0 & \text{if } t = 1, \ldots, 300 \\
a_t = 0,\ b_t = 1 & \text{if } t \le 100 \quad \text{(left spreads)} \\
a_t = 1,\ b_t = 2 & \text{if } 100 < t \le 200 \quad \text{(left spreads)} \\
a_t = 1.5,\ b_t = 2.5 & \text{if } t > 200 \quad \text{(left spreads)} \\
a_t = 1,\ b_t = 2 & \text{if } t \le 100 \quad \text{(right spreads)} \\
a_t = 1.5,\ b_t = 2.5 & \text{if } 100 < t \le 200 \quad \text{(right spreads)} \\
a_t = 2.5,\ b_t = 3 & \text{if } t > 200 \quad \text{(right spreads)}
\end{cases}
$$

One of the simulated series of the centers, with upper and lower bounds based on the spreads, is plotted in Fig. 4. The plot shows slight evidence of a single change. Indeed the changes in the spreads are quite mild and, in particular, given that the centers do not change, the second change point ($T_2 = 200$) is very difficult to identify because its size is small and moreover the Uniform distributions that define the left and right spreads in the second segment overlap. Thus, this is a case in which a change can be regarded as dubious. In Table 3 the results averaged over the Monte Carlo replications are summarized.

Fig. 4. Changes only in the spreads, one simulated series.

Table 3. Changes only in the spreads, setting 1: simulation results averaged over the 1000 MC replications.

| λ = ρ | cp | Break 1 (T1 = 100): ci | ci ± 1 | ci ± 2 | ci ± 4 | Break 2 (T2 = 200): ci | ci ± 1 | ci ± 2 | ci ± 4 |
|---|---|---|---|---|---|---|---|---|---|
| 1/3 | 1.26 | .24 | .46 | .60 | .75 | .08 | .20 | .29 | .44 |
| 1/2 | 1.22 | .26 | .47 | .61 | .77 | .11 | .25 | .36 | .50 |
| 2/3 | 1.23 | .28 | .50 | .63 | .79 | .15 | .31 | .44 | .57 |
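The point of this case study can be illustrated on a noise-free toy series in which the centers never change and only the spreads shift. The sketch below compares the gain of the best split under the crisp criterion (2), applied to the centers alone, with the gain under the fuzzy criterion (4). It uses a naive search that recomputes each node from scratch rather than the O(n(h)) updating mentioned in Section 3; the function names and the toy data are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def crisp_ss(c: Sequence[float]) -> float:
    """Within-node sum of squares of the centers only, as in criterion (2)."""
    m = mean(c)
    return sum((x - m) ** 2 for x in c)


def fuzzy_ss(c, l, u, lam: float, rho: float) -> float:
    """Within-node fuzzy deviation, as in the bracket of criterion (4)."""
    dc = [x - mean(c) for x in c]
    dl = [x - mean(l) for x in l]
    du = [x - mean(u) for x in u]
    return (3 * sum(a * a for a in dc)
            - 2 * lam * sum(a * b for a, b in zip(dc, dl))
            + lam ** 2 * sum(b * b for b in dl)
            + 2 * rho * sum(a * b for a, b in zip(dc, du))
            + rho ** 2 * sum(b * b for b in du))


def best_split(n: int, cost: Callable[[int, int], float],
               min_size: int) -> Tuple[int, float]:
    """Split position k and gain cost(0, n) - cost(0, k) - cost(k, n)."""
    parent = cost(0, n)
    best_k, best_gain = min_size, float("-inf")
    for k in range(min_size, n - min_size + 1):
        gain = parent - cost(0, k) - cost(k, n)
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain


# Toy series: constant centers, spreads shifting after observation 100.
n = 200
c = [0.0] * n
l = [0.5] * 100 + [1.5] * 100
u = [1.0] * 100 + [2.0] * 100
lam = rho = 0.5

crisp = best_split(n, lambda a, b: crisp_ss(c[a:b]), min_size=20)
fuzzy = best_split(n, lambda a, b: fuzzy_ss(c[a:b], l[a:b], u[a:b], lam, rho),
                   min_size=20)
print("crisp gain:", crisp[1])
print("fuzzy split and gain:", fuzzy)
```

On this toy series no split of the centers yields any gain, whereas the fuzzy criterion places the split at observation 100.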
Table 4 Changes only in the spreads, setting 2—simulation results averaged over the 1000 MC replications. Break 1 (T1 = 100) ci ci ± 1
ci ± 2
ci ± 4
Break 2 (T2 = 200) ci ci ± 1
ci ± 2
ci ± 4
.24
.45
.56
.73
.20
.39
.52
.69
2.05
.28
.48
.61
.76
.23
.42
.56
.71
= = 2/3 2.03
.33
.54
.65
.81
.28
.48
.62
.76
cp
= = 1/3 2.07
= = 1/2
As expected the rates of correct identifications of the second (dubious) change are very low and, consistently, the number of changes is underestimated. Also, in this case, the values of the parameters and related to the shape of the membership function affect the rates of correct identifications, in particular = = 2/3 that corresponds to a parabolic shape (higher variability around the centers that are not subject to change) induces an increase of the rates of correct identifications of the second break that nevertheless remain quite low. In the second setting we have slightly increased the size of the second change in such a way that the Uniform distributions that define the segments of the left and right spreads do not overlap. The parameter values are ⎧ at = 0, bt = 1 if t ≤ 100 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ at = 1, bt = 2 if 100 < t ≤ 200 ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ at = 2, bt = 3 if t > 200 ⎪ at = 1, bt = 2 if t ≤ 100 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ at = 2, bt = 3 if 100 < t ≤ 200 ⎪ ⎪ ⎪ ⎪ ⎩
at = 3, bt = 4 if t > 200 In Table 4 the results averaged over the 1000 Monte Carlo replications are summarized. We see that the rates of correct identifications of the second change are considerably increased and the number of changes detected is no longer underestimated. As in the previous case the values of the parameters and affect the rates of correct identifications achieving the best performance when = = 2/3. 4.3. Case study III Eventually we have analyzed the most comprehensive case where both the centers and spreads are subject to changes. The means of the Normal distributions of the centers have been set as in case study I whereas the spreads have been generated using the parameter values of the two settings defined in case study II. For the first setting one simulated series is depicted in Fig. 5, we see that the presence of changes both in the centers and spreads, although not very strong, leads to a graphical evidence of two changes of increasing magnitude. Table 5 reports the results of the simulation experiment. The rates of correct identification of the second change point (T2 = 200) are decidedly higher than in case study II and in general they increase for both change points as parameters and increase achieving the overall best performance when = = 2/3 also, in this latter case, the procedure detects the right number of changes. For the sake of completeness we have run the simulation experiment considering also the second parameter setting defined for the spreads in case study II, the results are summarized in Table 6. Not surprisingly increasing the size of the changes provides higher rates of correct identification especially for the second change point and the value of parameters and affects either the rates of correct identifications or the number of detected changes that, in this case, is underestimated when = = 1/3. It is worth reminding that the procedure is
32
C. Cappelli et al. / Fuzzy Sets and Systems 225 (2013) 23 – 38
Fig. 5. Changes in the centers and spreads, one simulated series.
Table 5 Changes in centers and spreads, setting 1—simulation results averaged over the 1000 MC replications. Break 1 (T1 = 100) ci ci ± 1
ci ± 2
ci ± 4
Break 2 (T2 = 200) ci ci ± 1
ci ± 2
ci ± 4
.20
.37
.46
.60
.20
.37
.49
.66
1.42
.34
.56
.65
.76
.24
.43
.56
.72
= = 2/3 1.93
.35
.59
.69
.80
.28
.48
.61
.76
cp
= = 1/3 1.05
= = 1/2
Table 6 Changes in centers and spreads, setting 2—simulation results averaged over the 1000 MC replications. Break 1 (T1 = 100) ci ci ± 1
ci ± 2
ci ± 4
Break 2 (T2 = 200) ci ci ± 1
ci ± 2
ci ± 4
.28
.47
.55
.68
.30
.46
.57
.71
1.76
.37
.58
.67
.78
.38
.57
.68
.81
= = 2/3 2.04
.36
.59
.69
.80
.38
.59
.70
.82
cp
= = 1/3 1.05
= = 1/2
based on a recursive binary partitioning algorithm thus, once a split is performed and the corresponding change point is identified, the search is repeated separately on the subsegments defined by the change point. For this reason the method is able to handle the case of multiple (more than two) change points as shown by a further simulation experiment that we have run for case study III considering 3 change points. As in the previous experiments the length of the series is T = 300 whereas the changes occur at times T1 = 75, T2 = 150 and T3 = 225. The MC replications are simulated
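Table 7. Case of three changes in centers and spreads: simulation results averaged over the 1000 MC replications. Break 1: T1 = 75; Break 2: T2 = 150; Break 3: T3 = 225. Rows give the common value of the two shape parameters.

| Shape parameters | cp | Break 1: ci | Break 1: ci ± 2 | Break 1: ci ± 4 | Break 2: ci | Break 2: ci ± 2 | Break 2: ci ± 4 | Break 3: ci | Break 3: ci ± 2 | Break 3: ci ± 4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1/3 | 2.22 | .14 | .36 | .51 | .17 | .44 | .60 | .29 | .54 | .79 |
| 1/2 | 2.50 | .25 | .53 | .68 | .22 | .54 | .68 | .40 | .62 | .88 |
| 2/3 | 2.73 | .29 | .61 | .76 | .26 | .60 | .74 | .47 | .71 | .90 |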
Fig. 6. Case of 3 changes in the centers and spreads, one simulated series.
setting the parameters as follows:

\[
\begin{cases}
\mu_t = 0 & \text{if } t \le 75 \\
\mu_t = 1 & \text{if } 75 < t \le 150 \\
\mu_t = 2 & \text{if } 150 < t \le 225 \\
\mu_t = 0 & \text{if } t > 225 \\
a_t = 0,\ b_t = 1 & \text{if } t \le 75 \\
a_t = 1,\ b_t = 2 & \text{if } 75 < t \le 150 \\
a_t = 1.5,\ b_t = 2.5 & \text{if } 150 < t \le 225 \\
a_t = 0,\ b_t = 1 & \text{if } t > 225 \\
a_t = 1,\ b_t = 2 & \text{if } t \le 75 \\
a_t = 1.5,\ b_t = 2.5 & \text{if } 75 < t \le 150 \\
a_t = 2.5,\ b_t = 3 & \text{if } 150 < t \le 225 \\
a_t = 1,\ b_t = 2 & \text{if } t > 225
\end{cases}
\]

One of the simulated time series, depicted in Fig. 6, shows that after two changes the series switches back to the first regime, which is a common situation in change point analysis. The results, reported in Table 7, are consistent with the previous ones: the larger the size of the change (in this case the one occurring at time T3 = 225) and the values of the shape parameters, the higher the rates of correct identification of the change points, which reach very high levels when both parameters are equal to 2/3. The number of change points detected is also affected by the values of the shape parameters.
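The experiment above can be illustrated with a minimal Python sketch that simulates one series and then searches for change points by recursive binary partitioning. Several ingredients are assumptions made only for illustration and are not taken from the paper: the centers are drawn with unit standard deviation, the spreads are drawn uniformly on [a_t, b_t], the first group of (a_t, b_t) values is read as the left spreads and the second as the right spreads, the deviation is the sum of squared deviations of the centers, of the lower bounds c − lam·l and of the upper bounds c + rho·r, and the `min_gain` threshold is a hypothetical stopping rule standing in for the BIC-based selection. All function and parameter names are hypothetical.

```python
import random

# Regime parameters transcribed from the setting above (breaks after t = 75, 150, 225).
MU = [0.0, 1.0, 2.0, 0.0]
AB_LEFT = [(0.0, 1.0), (1.0, 2.0), (1.5, 2.5), (0.0, 1.0)]    # assumption: first (a_t, b_t) group
AB_RIGHT = [(1.0, 2.0), (1.5, 2.5), (2.5, 3.0), (1.0, 2.0)]   # assumption: second (a_t, b_t) group


def simulate_series(rng, T=300):
    """Draw one series of (centers, left spreads, right spreads)."""
    c, l, r = [], [], []
    for t in range(1, T + 1):
        k = (t > 75) + (t > 150) + (t > 225)   # regime index 0..3
        c.append(rng.gauss(MU[k], 1.0))        # assumption: unit standard deviation
        l.append(rng.uniform(*AB_LEFT[k]))     # assumption: uniform on [a_t, b_t]
        r.append(rng.uniform(*AB_RIGHT[k]))
    return c, l, r


def _ss(x):
    """Sum of squared deviations from the mean."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x)


def fuzzy_deviation(c, l, r, lam, rho):
    """Assumed deviation: centers, lower bounds c - lam*l, upper bounds c + rho*r."""
    lower = [ci - lam * li for ci, li in zip(c, l)]
    upper = [ci + rho * ri for ci, ri in zip(c, r)]
    return _ss(c) + _ss(lower) + _ss(upper)


def find_changes(c, l, r, lam=0.5, rho=0.5, min_size=5, min_gain=0.1,
                 offset=0, total=None):
    """Recursive binary partitioning; returns the sorted change points,
    each given as the 1-based index of the last observation before the change."""
    parent = fuzzy_deviation(c, l, r, lam, rho)
    if total is None:
        total = parent
    best = None
    # Exhaustive search for the split minimising the summed child deviation.
    for s in range(min_size, len(c) - min_size + 1):
        dev = (fuzzy_deviation(c[:s], l[:s], r[:s], lam, rho)
               + fuzzy_deviation(c[s:], l[s:], r[s:], lam, rho))
        if best is None or dev < best[1]:
            best = (s, dev)
    # Hypothetical stopping rule (stand-in for the BIC-based selection).
    if best is None or total == 0 or (parent - best[1]) / total < min_gain:
        return []
    s = best[0]
    args = (lam, rho, min_size, min_gain)
    # Repeat the search separately on the two subsegments.
    return (find_changes(c[:s], l[:s], r[:s], *args, offset, total)
            + [offset + s]
            + find_changes(c[s:], l[s:], r[s:], *args, offset + s, total))
```

A call such as `find_changes(*simulate_series(random.Random(0)), lam=2/3, rho=2/3, min_size=10)` returns the detected break dates for one simulated series. The result depends on the seed and on the assumed stopping threshold, so it should not be expected to reproduce the rates in Table 7.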
The simulation studies have shown that using the fuzzy deviation measure within the ART procedure is a useful way to perform change point analysis of imprecise time series, in terms of both the number of change points identified and the rates of correct identification of the true change points. In particular, when changes are present only in the centers, the standard procedure does not outperform the proposed approach; when changes are present only in the spreads, the fuzzy deviation measure enables the identification of the changes, whereas no changes can be identified if any standard method is applied to the time series of the centers (as representative point data of the imprecise series); lastly, when changes are present in both centers and spreads, the procedure achieves the overall best performance. A further relevant finding concerns the role of the parameters that reflect the shape of the membership function. Indeed, the choice of their values is crucial when changes are present in the spreads, with or without changes in the centers; in particular, according to our results, the parabolic shape provides the best performance in terms of both the rates of correct identification of the true change points and the number of change points detected. Moreover, the method, which uses a recursive partitioning algorithm, is able to detect multiple change points. A final remark concerns the notable efficiency of the method: the mean CPU time spent executing the various simulation experiments on a Pentium Dual Core (processor E6500, 2.94 GHz, RAM 4.00 GB) was 1 min and 10.45 s.

5. Empirical applications

In this section we present two empirical applications that illustrate the effectiveness of the proposed approach in detecting change points in real time series, considering first a time series measured on an ordinal scale and then a real-valued time series.

5.1. Change point analysis of Fitch ratings of the Italian sovereign debt

The time series of the ratings assigned by Fitch to the Italian sovereign debt covers the period 1988–2012. As discussed in Section 2, when a time series arises from human judgments, perceptions or evaluations, whatever the quality scale employed by the expert, the corresponding items are intrinsically accompanied by (non-probabilistic) vagueness and imprecision. In such cases, instead of treating the data as either numerical or categorical, the adoption of a fuzzy scale is more expressive and accurate, and it is a proper way to take this vagueness into account. In particular, international rating agencies employ linguistic ordinal scales to provide an opinion on the relative ability of an entity to meet financial commitments and, as stated by Fitch: “ratings are relative measures of risk; as a result, the assignment of ratings in the same category may not fully reflect small differences in the degrees of risk” (www.fitchratings.com). This remark also holds for the transition from one rating to another; thus, it is a case where fuzzification seems appropriate. The rating scale employed by Fitch ranges from AAA = highest credit quality to D = default, where the categories AAA to BBB are defined as investment grade and indicate relatively low to moderate credit risk, while ratings in the categories BB to D, called speculative grade, signal either a higher level of credit risk or that a default has already occurred; for the sake of brevity we do not provide the complete list of categories (details can be found on the agency website). Despite repeated downgrades, Italy’s ratings have always been in the investment grade categories; thus, we have fuzzified the series employing the five-point fuzzy coding (fuzzy Likert-based scale): A− = (3, 3, 1.5), A+ = (4, 1.5, 1.5), AA− = (6, 1, .5), AA = (8, 1.75, .25), AA+ = (10, 2, 0) [21]. Fig.
7 depicts the time series of the fuzzified ratings with the upper and lower bounds based on the right and left spreads. We applied ART to the fuzzified series, setting the minimum number of observations per segment to 5 years to avoid the identification of a break at the end of the series. Whatever the values of the shape parameters, we found evidence of a single break (minimum BIC partition) occurring in 1992, when the Italian sovereign debt lost the AA+ rating, which it has never regained. A second break is also located (but not chosen) in 2007, when a period of AA− ratings began, leading to further downgrades in the last two years. Indeed, 2007 is the year in which several experts date the beginning of the recent financial crisis. It is worth noting that our approach identifies the same change point as a tree-based procedure that treats the response variable as ordinal, i.e. a tree method applied to the linguistic ordinal scale, whereas Bai and Perron’s method applied to a five-point numeric scale, being a global minimizer of the sum of squared
Fig. 7. Time series of the fuzzified Italy’s Fitch ratings with the upper and lower bounds.
residuals, identifies and chooses a five-segment partition, i.e. it separates all the sub-periods characterized by different ratings.

5.2. Change point analysis of temperatures in Rome

In this empirical application we have considered a data set of meteorological variables collected in Rome during the year 1999. In particular, since our goal is to detect change points, we have analyzed the temperature variable, which is likely to contain some, focusing on the subperiod 1 July–30 October; thus, the length of the time series is T = 122. The original data set provides hourly values of the temperature, whose conversion into a single value for each day entails loss of information and inaccuracy. To overcome this drawback, following Coppi et al. [8], we have generated a fuzzy variable with LR membership function, defining the centers as the daily mean temperatures, whereas the left and right spreads are obtained by averaging the hourly deviations of the values lower and higher than the mean, respectively. The time series of the centers and of the upper and lower bounds based on the right and left spreads are depicted in Fig. 8. The graph of the series suggests the presence of two breaks. Indeed, setting a minimum segment length of 15 observations, ART detects two changes, on 1 October and 29 August; a third change (identified but not chosen) is located on 4 August. Neither the dates of the changes nor their number are affected by the shape of the membership function. A nice feature of ART is the graphical representation of the tree partition corresponding to the change points, reported in Fig. 9, whereas Table 8 reports some stylized facts for the entire series and for the subperiods defined by the detected change points. The tree shows that the first change point identified by the procedure occurs at the beginning of October, i.e. it separates July, August and September from October. This change is the strongest one and it is associated with decreasing temperature and lower variability of the temperatures both above and below the center (mean), as we can see from the values in Table 8. The second change point, occurring at the end of August, separates July and August, the hottest months, from September, which is milder than the previous months but characterized by higher variability of the temperatures above the mean. Finally, standard ART applied to the time series of the centers identifies different change points: in particular, it locates the first change point three days later and the second one two days earlier.

The results of both applications suggest that the proposed approach is a useful tool for investigating the presence of change points in imprecise time series with different types of impreciseness (see Section 2).
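The conversion of hourly temperatures into daily fuzzy observations described in Section 5.2 can be sketched in Python as follows. This is only one possible reading of the construction: the left (right) spread is taken as the mean absolute deviation of the hours colder (warmer) than the daily mean, hours exactly at the mean are ignored, and the lower and upper bounds are taken as center minus left spread and center plus right spread. The function names are hypothetical.

```python
def fuzzify_day(hourly):
    """Return (center, left spread, right spread) for one day of hourly readings."""
    center = sum(hourly) / len(hourly)
    below = [center - h for h in hourly if h < center]   # deviations of colder hours
    above = [h - center for h in hourly if h > center]   # deviations of warmer hours
    left = sum(below) / len(below) if below else 0.0
    right = sum(above) / len(above) if above else 0.0
    return center, left, right


def fuzzify_series(days):
    """Apply fuzzify_day to each day and add the lower and upper bounds."""
    out = []
    for hourly in days:
        c, l, r = fuzzify_day(hourly)
        out.append({"center": c, "left": l, "right": r,
                    "lower": c - l, "upper": c + r})
    return out
```

For instance, a day with readings 18, 20, 22 and 28 has center 22, left spread 3 (the average of 4 and 2) and right spread 6, so the asymmetry of the daily temperature profile is retained instead of being collapsed into a single value.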
Fig. 8. Fuzzy time series of the temperatures in Rome, sample period 1 July–30 October, year 1999.
Fig. 9. Tree diagram of the temperature change points.

Table 8. Stylized facts of the entire series and of the subperiods identified by the change points.

| Period | c̄ | l̄ | ū | min | max |
|---|---|---|---|---|---|
| Entire series: 1 July–30 October | 24.0 | 1.45 | 2.60 | 9.2 | 40.4 |
| Regime: 1 July–29 August | 26.9 | 1.50 | 2.60 | 18.7 | 40.4 |
| Regime: 30 August–1 October | 23.5 | 1.47 | 2.73 | 16.6 | 36.3 |
| Regime: 2 October–30 October | 18.7 | 1.33 | 2.48 | 9.2 | 31.2 |
6. Conclusion

Change point analysis is a useful tool for monitoring and control, and in the last decades it has emerged as a relevant research topic. Various methods proposed in the literature consider the case of numerical time series measured by exact (crisp) values, whereas this paper has addressed the problem of detecting change points in imprecise time series by proposing, in the framework of Atheoretical Regression Trees, the use of a fuzzy deviation measure. Indeed, the parameterization of imprecise time series in the form of LR fuzzy time series makes it possible to employ the usual statistical measures and techniques while capturing the imprecision, subjectivity and vagueness of such data.
Simulation studies as well as two empirical applications to real imprecise time series have been carried out. According to the results of the simulation experiments, the method selects the number of changes and their location accurately. In particular, when changes are present only in the centers, so that the data seem crisp from the point of view of change point analysis, the use of the fuzzy deviation measure, which involves the spreads, favors the correct identification of the change points. When changes occur only in the spreads, our proposal provides a tool to detect change points that would otherwise be missed by classical methods for exact (crisp) data, which are typically applied to the series of the centers. In this case the choice of the shape of the membership function, i.e. of the values of the parameters that reflect this shape, is relevant: especially for a fairly large change, such as the second one in our study, the parabolic function, which accounts for a higher variability around the centers, increases the rates of correct identification. When both centers and spreads are subject to change, this choice is also relevant for the selection of the right number of breaks. Moreover, the method, which employs a recursive partitioning algorithm, is able to handle more than two change points. As a further benchmark we have presented two applications to real time series, considering the fuzzification of an ordinal linguistic time series and the generation of a real-valued fuzzy time series. The results confirm that the suggested fuzzy approach to change point analysis is particularly useful when the time series are imprecise or vague and their impreciseness is connected either to ordinal qualitative evaluations (ordinal linguistic time series), suitably codified in fuzzy terms by means of a fuzzy rating scale, a Likert scale or associated fuzzy conversion scales, or to granularity (non-probabilistic measurement error).
Indeed, both the simulation studies and the applications to real data have shown the benefits of our fuzzy approach in observational contexts affected by imprecision. We underline that, in the suggested fuzzy approach, the fuzziness concerns only the so-called empirical information (the data). In the future, following the Informational Approach [8], we will also investigate the case in which the fuzziness concerns the theoretical component of the information (theoretical information), i.e. the method/model for analyzing the imprecise time-varying data is also fuzzy. Finally, the procedure can be easily implemented in any software that provides the classification and regression tree methodology, and it offers a quick and flexible tool that, owing to its simplicity, is particularly useful for applied time series analysis.

Acknowledgment

The authors wish to thank the Editor and the referees for their useful comments and suggestions, which helped to improve the quality and presentation of this manuscript.

References

[1] D.W.K. Andrews, Tests for parameter instability and structural change with unknown change point, Econometrica 61 (1993) 821–856.
[2] J. Bai, P. Perron, Estimating and testing linear models with multiple structural changes, Econometrica 66 (1998) 47–78.
[3] J. Bai, P. Perron, Computation and analysis of multiple structural change models, J. Appl. Econ. 18 (2003) 1–22.
[4] J. Bai, P. Perron, Multiple structural change models: a simulation analysis, in: M. Broy, E. Dener (Eds.), Econometric Theory and Practice: Frontiers of Analysis and Applied Research, Cambridge University Press, 2006, pp. 212–237.
[5] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth & Brooks, Monterey, CA, 1984.
[6] C. Cappelli, R. Penny, W. Rea, M. Reale, Detecting multiple mean breaks at unknown points in official statistic, Math. Comput. Simulation 78 (2008) 351–356.
[7] C. Cappelli, P. D’Urso, F. Di Iorio, Multiple structural-change model analysis via theoretical regression trees, in: Proceedings of 8-th Meeting of the Classification and Data analysis Group of SIS, e-book, Pavia University Press, 2011.
[8] R. Coppi, P. Giordani, P. D’Urso, Component models for fuzzy data, Psychometrika 71 (2006) 733–761.
[9] R. Coppi, P. D’Urso, P. Giordani, A. Santoro, Least squares estimation of a linear regression model with LR fuzzy response, Comput. Stat. Data Anal. 51 (2006) 267–286.
[11] D. Dubois, H. Prade, Fuzzy Sets and Systems: Theory and Applications, Academic Press, New York, 1980.
[12] D. Dubois, H. Prade, Possibility Theory. An Approach to Computerized Processing of Uncertainty, Plenum Press, New York, 1988.
[13] P. D’Urso, A. Santoro, Goodness of fit and variable selection in the fuzzy multiple linear regression, Fuzzy Sets Syst. 157 (2006) 2627–2647.
[14] P. D’Urso, Clustering of fuzzy data, in: J.V. de Oliveira, W. Pedrycz (Eds.), Advances in Fuzzy Clustering and its Applications, J. Wiley and Sons, 2007, pp. 155–192.
[15] S. de la Rosa de Sáa, M.A. Gil, G. González-Rodríguez, M.A. Lubiano, Fuzzy rating scale-based questionnaire: some drawbacks and statistical benefits, in: V European Congress of Methodology, July 17–20, 2012, Santiago de Compostela, Spain.
[16] C.W.J. Granger, J. Hyung, Occasional structural breaks and long memory with an application to the S&P 500 absolute stock returns, J. Empirical Finance 11 (2004) 399–421.
[17] G. González-Rodríguez, A. Colubi, M.A. Gil, Fuzzy data treated as functional data: a one-way ANOVA test approach, Comput. Stat. Data Anal. 56 (2012) 943–955.
[18] B. Hansen, Testing for parameter instability in linear models, J. Policy Modeling 14 (1992) 517–533.
[19] B. Hansen, The new econometrics of structural change: dating breaks in US labor productivity, J. Econ. Perspect. 15 (2001) 117–128.
[20] T. Hesketh, R. Pryor, B. Hesketh, An application of a computerized fuzzy graphic rating scale to the psychological measurement of individual differences, Int. J. Man-Mach. Stud. 29 (1988) 21–35.
[21] W.L. Hung, M.S. Yang, Fuzzy clustering on LR-type fuzzy numbers with an application in Taiwanese tea evaluation, Fuzzy Sets Syst. 150 (2005) 561–577.
[23] C.F. Manski, E. Tamer, Inference on regressions with interval data on a regressor or outcome, Econometrica 70 (2002) 519–546.
[24] Y.A. Phillis, V.S. Kouikoglou, Fuzzy Measurement of Sustainability, Nova Science Publishers, New York, 2009.
[25] W. Rea, M. Reale, C. Cappelli, J.A. Brown, Identification of changes in mean with regression trees: an application to market research, Econometric Rev. 29 (2010) 754–777.
[26] B. Sinova, M.A. Gil, A. Colubi, S. Van Aelst, The median of a random fuzzy number. The 1-norm distance approach, Fuzzy Sets Syst. 200 (2012) 99–115.
[27] A. Smith, Long memory and the illusion of long memory in economic time series, J. Bus. Econ. Stat. 23 (2005) 321–335.
[28] M.S. Yang, C.H. Ko, On a class of fuzzy c-numbers clustering procedures for fuzzy data, Fuzzy Sets Syst. 84 (1996) 49–60.
[29] L. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[30] L.A. Zadeh, Toward a generalized theory of uncertainty (GTU): an outline, Inf. Sci. 172 (2005) 1–40.
[32] H.J. Zimmermann, Fuzzy Set Theory and its Applications, Kluwer, Boston, 2001.