Seasonal clustering technique for time series data


European Journal of Operational Research 175 (2006) 376–384 www.elsevier.com/locate/ejor

Stochastics and Statistics

Seasonal clustering technique for time series data

Tasha R. Inniss *

Department of Mathematics, Spelman College, 350 Spelman Lane SW, Box 320, Atlanta, GA 30314-4399, USA

Received 24 June 2004; accepted 29 March 2005
Available online 10 August 2005

Abstract

In data mining, the unsupervised learning technique of clustering is a useful method for ascertaining trends and patterns in data. Most general clustering techniques do not take into consideration the time-order of data. In this paper, mathematical programming and statistical techniques and methodologies are combined to develop a seasonal clustering technique for determining clusters of time series data. We apply this technique to weather and aviation data to determine probabilistic distributions of arrival capacity scenarios, which can be used for efficient traffic flow management. In general, this technique may be used for seasonal forecasting and planning. © 2005 Elsevier B.V. All rights reserved.

Keywords: Partitional clustering; Seasonal clustering; Set partitioning integer program; Empirical distribution function; Mean square ratio

* Tel.: +1 404 270 5829; fax: +1 404 270 5836. E-mail address: [email protected]

0377-2217/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.ejor.2005.03.049

1. Motivation for development of technique

Domestic air traffic has greatly increased over the last couple of decades and is predicted to continue to increase at a rate of 3–5% over the next 10 years [1]. With this great increase in air traffic comes a large increase in the demand for airspace and airport resources. Unfortunately, airspace and airport capacities are not increasing at a rate necessary to meet this rising demand. It is vital that new methodologies and tools be developed to address the inevitable rise in congestion. Because of this surge in air traffic and the limited capacity of airports, air traffic flow management is becoming an increasingly difficult task. When an airport's capacity (number of flights able to land) is reduced during "peak demand hours", demand for an airport's resources exceeds the capacity at which the airport can meet this demand. During instances of capacity–demand imbalances, air traffic flow management (TFM) in an efficient and safe manner is of paramount importance. The overall goal of TFM is to strategically plan and manage entire flows of air traffic, provide the greatest and most equitable access to airspace resources, mitigate congestion effects from severe weather, and ensure the overall efficiency of the system without compromising safety.

Managing traffic flow during capacity–demand imbalances is known as the traffic flow management problem (TFMP). Ground-holding procedures are a principal tool used to address the TFMP. The two main ground-holding procedures employed are ground stops and ground delay programs (GDPs). In a ground stop, flights are held on the ground at their departure airports until it is determined that the capacity–demand imbalance has abated. In a GDP, flights in a certain time window or at a certain distance away from a congested airport are assigned delay to be taken on the ground until a time that they can safely land at their terminal airports with little to no airborne delay.

To solve the TFMP during GDPs, stochastic ground-holding models can be used. There is a class of stochastic ground-holding models that are solved using probabilistic distributions of "scenarios" or possible realizations of airport arrival capacity as input (along with other inputs) [3]. These probabilistic distributions of capacity scenarios, which will be referred to as Capacity Probabilistic Distribution Functions (CPDFs), can be derived or estimated by creating relative frequency histograms of empirical data of interest. An overall CPDF can be decomposed into daily, monthly or seasonal CPDFs. The type that will ultimately be utilized depends on the operational preferences of the specialists at the Federal Aviation Administration's (FAA) Air Traffic Control Systems Command Center (ATCSCC), as well as other factors. See Inniss [7] for a more detailed discussion. In this paper, the focus will be on estimating seasonal CPDFs. Decomposing an overall CPDF into groupings of months (seasons) based on some measure of similarity (dissimilarity) can be modeled as a clustering problem.

The data used for this research are time ordered, thus standard clustering techniques will not be sufficient to address and solve the aforementioned problem. Most general clustering techniques do not take into consideration the time-order of data. With improved capabilities for data warehousing, an increasing amount of time series data is being collected. Unfortunately, the classic unsupervised learning techniques are not able to cluster data in such a way that the sequential order of time series data is maintained. A method is needed to perform clustering that is embedded in a time series. The resulting clusters must be contiguous and homogeneity should exist within the clusters. This type of clustering will be referred to as seasonal clustering.

In this paper, mathematical programming and statistical theories and techniques will be used to develop a clustering technique that can be applied to time series data. This seasonal clustering methodology will be applied to GDP and fog weather data from San Francisco International Airport (SFO). The GDP data consist of the lengths (durations) of ground delay programs during inclement weather at SFO. The fog weather data were obtained from the National Climatic Data Center (NCDC) and include the duration of low ceiling and visibility weather conditions on those days that a GDP was run at SFO. The results of the analysis presented in this paper could be used to aid in solving the TFMP by providing the required probabilistic distributions of airport arrival capacity scenarios for the stochastic ground-holding models. In general, this approach can be applied to any data that are sequential and require that the time-order of the data be maintained.

2. Overview of clustering

In data mining, the unsupervised learning technique of clustering is a useful method for ascertaining trends and patterns in data, when there are no pre-defined classes. There are two main types of clustering, hierarchical and partitional. In hierarchical clustering, each data point is initially in its own cluster and then clusters are successively joined to create a clustering structure. This is known as the agglomerative method. In partitional clustering, the number of clusters must be known a priori. The partitioning is done by minimizing a measure of dissimilarity within each cluster and maximizing the dissimilarity between different clusters. See Everitt [4] and Hartigan [5] for more information on cluster analysis.


The determination of the best or most appropriate measure of dissimilarity is the subject of many research papers. A measure of dissimilarity can also be thought of as a cost (objective) function. In Hwang [6] and Anily and Federgruen [2], it was determined that the optimal partition comes from a sequential ordering of the data. Joseph and Bryson [8] refer to such problems as the "linearly ordered clustering problem" or the "sequential clustering problem". According to [8], the sequential clustering problem must satisfy three assumptions: (1) data is sequential and the partition must maintain sequential order of data; (2) the objective function is the sum of the costs of the clusters in the partition; and (3) the set of possible clusters is restricted due to the time-order of data. Our problem is reminiscent of partitional clustering and can be thought of as a sequential clustering problem. The problem can be modeled in different ways, one of which is an integer programming set partitioning problem (see [11]). This topic will be explored and discussed in the next section.

3. Modeling seasonal clustering problem

Given the twelve months in a year, the goal is to partition the year into groupings of contiguous months that contain the most similar weather conditions. The problem of determining the optimal partitions (seasons) can be formulated as a set covering/partitioning integer programming problem. The goal of the set covering integer program is to "cover" the whole year by a finite number of covers or seasons with the smallest total cost and the goal of the set partitioning IP is to cover the whole year by a finite disjoint set of seasons in a least cost fashion. The set partitioning IP has the following formulation (see [9]):

\[
\begin{aligned}
\text{Minimize} \quad & \sum_{s=1}^{n} C_s x_s \\
\text{subject to} \quad & \sum_{s=1}^{n} a_{js} x_s = 1 \quad \text{for each month } j, \\
& x_s \in \{0, 1\}.
\end{aligned}
\]

Here $A = [a_{js}]$ is a 0–1 incidence matrix with $a_{js} = 1$ if $j \in M_s$ (month j is in candidate season $M_s$), and 0 otherwise; $\{M_s\}_{s=1}^{n}$ corresponds to the set of candidate seasons; n is the number of candidate seasons; $C_s$ is the cost of including $M_s$ in the cover; and $x_s$ is a binary variable with value 1 when $M_s$ is included in the cover, and 0 otherwise. It should be noted that the formulation for set covering has ≥ constraints instead of equality constraints. Because the set covering formulation has an inequality constraint, it is possible for over-covering to occur.

In the case of assigning months to seasons, the columns of A (i.e. the set of $M_s$) can be efficiently enumerated since a season is characterized by a start month and an end month and the months must be contiguous. The possible seasons can be enumerated according to the number of contiguous months. A season could contain (exactly) 1 month, thus there are 12 1-month clusters (seasons). A season could contain (exactly) 2 contiguous months, e.g. Jan/Feb or Apr/May, thus there are 12 2-month clusters (seasons). Continuing in this manner, there are 12 clusters of each length from 3 to 11 contiguous months, and only one season consisting of all 12 contiguous months. It should be noted that the length of a possible season is exactly 1 month, or exactly 2 months, and so forth. If we enumerated the seasons in such a way that a given season could have at most a certain number of months (or length), then a greater number of possible combinations would result. All permutations of contiguous months for the season that contains all months are considered the same season. In other words, it does not matter whether you begin in January and end in December or begin in April and end in March.

Since there are 12 months, there are 12 different seasons for each possible season length except for the season of length 12 (only 1 way). Thus, there are 12 × 11 + 1 = 133 possible seasons. For practical purposes, we assume no weather season lasts more than 5 months. In this case, with 12 months and 5 different possible season lengths, enumerating the seasons yields 12 × 5 = 60 possible seasons in the candidate season set. The results given in this paper will be for 60 candidate seasons, thus n = 60.
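The enumeration and the partitioning step can be illustrated with a short Python sketch. It is not the paper's implementation: `enumerate_seasons` builds the candidate set, and `best_partition` searches all partitions of the year exhaustively rather than calling an integer programming solver. The function names, the `max_seasons` cap, and the representation of a season as a tuple of month indices (0 = Jan) are illustrative assumptions.

```python
from itertools import combinations


def enumerate_seasons(max_len=12, n_months=12):
    """Return candidate seasons as tuples of contiguous month indices (0 = Jan).

    Seasons may wrap around the year end (e.g. Nov-Mar). The full-year
    season is counted once, whatever its start month.
    """
    seasons = []
    for length in range(1, min(max_len, n_months) + 1):
        starts = [0] if length == n_months else range(n_months)
        for start in starts:
            seasons.append(tuple((start + k) % n_months for k in range(length)))
    return seasons


def best_partition(cost, max_seasons, max_len=5, n_months=12):
    """Exhaustive search over partitions of the cyclic year (an assumption:
    this replaces the IP solver with brute force for illustration).

    `cost` maps each candidate season (tuple of month indices) to its cost.
    Returns (total_cost, seasons) for the cheapest partition that uses at
    most `max_seasons` seasons, or None if no partition is feasible.
    """
    best = None
    for k in range(1, max_seasons + 1):
        # Choosing k start months on the cycle fixes a partition into k seasons.
        for starts in combinations(range(n_months), k):
            seasons = []
            for i, s in enumerate(starts):
                nxt = starts[(i + 1) % k]
                length = (nxt - s - 1) % n_months + 1  # k == 1 gives the full year
                seasons.append(tuple((s + j) % n_months for j in range(length)))
            if any(len(m) > max_len or m not in cost for m in seasons):
                continue
            total = sum(cost[m] for m in seasons)
            if best is None or total < best[0]:
                best = (total, seasons)
    return best
```

With the defaults, `len(enumerate_seasons())` is 133 and `len(enumerate_seasons(max_len=5))` is 60, matching the counts in the text. Exhaustive search is practical here only because a year has 12 months.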

4. Seasonal clustering approach

We will explore different versions of our problem. The objective of our model is to minimize the total sum of costs of seasons chosen. Since the seasons are chosen in a least costly fashion, a cost of a season must be defined and determined. Conceptually, the cost ($C_s$) of a season $M_s$ is the "difference" between a month's CPDF and a season's CPDF. In plain language, the cost of a season is the difference between a season's (collection of months) distribution and the distributions of the months contained in the given season. A distribution is characterized by a mean and variance. One version of our problem is an approach that can be used when it is known a priori that there exists a mean–variance relationship of the data. The other version assumes no mean–variance relationship and determines the cost of a season using empirical distribution functions.

4.1. Determining seasonal clusters using GDP data

In this section, seasonal clusters will be determined using GDP data. It should be noted that it was empirically found that there exists a direct (increasing) relationship between the mean and the variance of GDP durations; hence the data contains a mean–variance relationship. In this analysis, the cost function will be based on a difference in means. While this clearly represents an approximation, a cost function based on comparing means should also capture differences in variances, in this case.

Several cost functions are possible for comparing seasonal and individual monthly means. The following cost functions are considered: (i) sum of squared deviations (SoSqs), (ii) normalized sum of squared deviations, and (iii) seasonal variances. The cost functions were chosen because they measure the difference between a season's mean and the means of the months contained in the season. The first cost is the sum of squared deviations between a season's value (average GDP duration) and the values of the months contained within that season. The cost, normalized sum of squared deviations, is the sum of squared deviations divided by the number of months contained in that season. This cost function is chosen due to the possibility of a longer season being penalized by having a larger value for SoSqs. A seasonal variance is deemed appropriate because actual daily ground delay durations are considered. A seasonal variance is determined by calculating the variance of all daily GDP durations from the overall seasonal average.

Table 1 gives the formulas for the three different clustering criteria. Here $\bar{X}_{.j}$ is the average over all days i in month j, $X_{ij}$ is the GDP duration on day i in month j, and m is the number of months in the season. The (overall) seasonal average, traditionally denoted by $\bar{X}_{..}$ since it is averaging over all days i and all months j, is denoted by $\bar{X}_s$ in Table 1.

The candidate set of seasons is enumerated as described previously and input into the set covering/partitioning model. Each season has a value, which is the average duration of a GDP in that season. For example, the value for January is the average of the Jan95 average GDP duration, the Jan96 average GDP duration, and the Jan97 average GDP duration. The value for the Jan/Feb season is the average of all GDP average durations for Jan95, Jan96, Jan97, Feb95, Feb96 and Feb97. Since it is possible that the set covering/partitioning procedure, under certain seasonal clustering criteria, could choose all seasons of length 1, we add the following constraint, which limits the number of seasons chosen:

\[
\sum_{s=1}^{n} x_s \le N
\]

Table 1
Seasonal clustering criteria (cost functions, $C_s$)

Sum of squared deviations (SoSqs):   $\sum_{j=1}^{m} (\bar{X}_{.j} - \bar{X}_s)^2$
Normalized SoSqs:                    $\frac{1}{m} \sum_{j=1}^{m} (\bar{X}_{.j} - \bar{X}_s)^2$
Seasonal variances:                  $\frac{1}{m} \sum_{j=1}^{m} \sum_{i} (X_{ij} - \bar{X}_s)^2$
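The three criteria of Table 1 can be sketched in Python as follows. This is an illustrative sketch, not the paper's code: the function name `season_costs`, the input layout (one list of daily durations per month), the use of the pooled daily mean as the seasonal average, and the divisor m in the seasonal variance are all assumptions.

```python
from statistics import mean


def season_costs(daily_by_month):
    """Compute three candidate costs for one season.

    `daily_by_month` holds one inner list per month in the season; each
    inner list contains that month's daily GDP durations in hours.
    """
    m = len(daily_by_month)                                 # months in the season
    month_means = [mean(days) for days in daily_by_month]   # monthly averages
    all_days = [x for days in daily_by_month for x in days]
    # Assumption: the seasonal average is the mean over all days in the season.
    season_mean = mean(all_days)

    sosq = sum((mj - season_mean) ** 2 for mj in month_means)
    normalized_sosq = sosq / m
    # Assumption: the daily squared deviations are also divided by m.
    seasonal_variance = sum((x - season_mean) ** 2 for x in all_days) / m
    return {
        "sosq": sosq,
        "normalized_sosq": normalized_sosq,
        "seasonal_variance": seasonal_variance,
    }
```

For example, `season_costs([[2.0, 4.0], [6.0, 8.0]])` has monthly means 3 and 7 around a seasonal mean of 5, so the sum of squared deviations is 8 and its normalized version is 4.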


In the constraint limiting the number of seasons, N is the maximum number of covers or seasons. The model was solved using the CPLEX Mixed Integer Programming Solver (V. 6.0) on a SUN Sparc10 Station. Table 2 gives the set covering solutions in terms of seasons for n = 60 and N = 4. (Recall that n refers to the number of enumerated candidate seasons given the restriction that no season lasts more than 5 months and N is the maximum number of resulting seasons.) Observe that "over-covering" occurs with the seasonal variance cost function. Recall that set covering allows for over-covering due to the inequality constraint. It is interesting to note that the seasons determined by the SoSqs seasonal clustering criterion correspond to the boxed seasons in the time plot of the monthly average GDP lengths averaged over all 3 years, 1995, 1996 and 1997 (see Fig. 1).

Table 2
Set covering solutions for GDP seasons

SoSqs      Normalized SoSqs   Seasonal variances
Apr–Jun    Feb–Jun            Apr–Aug
Jul/Aug    Jul–Sep            Jul–Oct
Sep/Oct    Oct/Nov            Jul
Nov–Mar    Dec/Jan            Nov–Mar

If set partitioning is used, then the resulting seasons for N = 3, 4, and 5 are all disjoint (see Table 3 for results for n = 60). Thus, this approach seems more appropriate for a seasonal "clustering" method since the set partitioning model ensures disjoint clusters.

Table 3
Results from set partitioning of GDP data

N = 3      N = 4      N = 5
Apr/May    Apr–Jun    Apr–Jun
Jun–Oct    Jul/Aug    Jul/Aug
Nov–Mar    Sep–Jan    Sep–Nov
           Feb/Mar    Dec/Jan
                      Feb/Mar

In this section, for fixed seasonal boundaries, the overall CPDF for GDP data was partitioned into seasonal clusters or subunits. The same procedures could be applied using the weather data from NCDC. In the next section, the clustering criterion developed will be based on differences in distributions rather than differences in means and applied to fog weather data. This method is the one to be utilized when there does not necessarily exist a mean–variance relationship. Thus, it can be applied in a more general case.

4.2. Determining seasonal clusters using fog weather data

In the previous section, CPDFs were based only on means (of GDP data) due to the relationship between the means and variances. Since it is possible to have two distributions that have the same mean, but are different, the cost function (in this section) will be based on differences in distributions instead of differences in means. When we refer to a "distribution", we mean an empirical distribution function (EDF), which is completely determined by observed values of a random variable, and is used to estimate an underlying cumulative distribution function (cdf) of a group of observations or empirical data. For any given season in the candidate season set, an EDF is calculated for each month j in the given season according to

\[
F_j(y) = \frac{1}{m_j} \sum_{i=1}^{m_j} I\{y_i \le y\}, \qquad j = 1, \ldots, 12,
\]

where $m_j$ is the number of data points (days of fog conditions) in month j, $y_i$ is the ith data point and $I\{exp\}$ is 1 if expression exp is true and 0 if it is false. For each real number y, $F_j(y)$ calculates the proportion of data that is less than or equal to that point y. The average of the monthly EDFs, known as the pooled EDF, gives the EDF for the season. The pooled EDF, $F(y)$, is computed by

\[
F(y) = \frac{1}{m} \sum_{j} m_j F_j(y), \qquad \text{where } m = \sum_{j} m_j.
\]

Fig. 1. Average of monthly average GDP durations over 1995, 1996, 1997. (Plot of the 3-year average GDP length, in hours, by month from Jan to Dec.)

To measure the difference between a season's EDF (pooled EDF) and the months contained in that season, the Kolmogorov–Smirnov (KS) statistic can be used (see [10] for more detailed information on EDFs and the KS statistic). The KS statistic is appropriate for measuring the difference between a season's EDF and the EDFs of the months contained in that season since the KS statistic measures the maximum deviation between the EDFs within classes and the pooled EDF. Therefore, the KS statistic will be used as the cost of a season in the cost function of the set partitioning formulation and is calculated by

\[
\max_{y} \sqrt{\sum_{j} \frac{m_j}{m} \left[ F_j(y) - F(y) \right]^2}, \qquad y = 1, 2, \ldots, m.
\]

A season whose KS statistic is small implies that the maximum deviation of any month's EDF from the seasonal EDF is small. Hence, the objective of the set partitioning formulation is to minimize the maximum deviation of the months' EDFs from the seasonal EDF or minimize the KS statistic for a given season. Note that the KS statistic requires two or more classes (months) in order to be calculated, thus no single month seasons are allowed using this cost criterion.

Table 4 contains the resulting seasons for n = 60 candidate seasons using weather data. Observe that there are different sets of seasons resulting from using different seasonal clustering criteria (cost functions) developed in this and the previous sections. In the next section, we develop a method for assessing the quality of a set of seasons and a way to determine which set is the "best" set.

Table 4
Seasons resulting from partitioning weather data

N = 3      N = 4      N = 5
Mar–Jun    Mar/Apr    Mar/Apr
Jul–Sep    May/Jun    May/Jun
Oct–Feb    Jul–Sep    Jul–Sep
           Oct–Feb    Oct/Nov
                      Dec–Feb
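The EDF-based cost of a single candidate season can be sketched in Python as follows. This is a hedged illustration rather than the paper's code: the names `edf` and `ks_cost` are hypothetical, the maximum is taken over the observed data values (where the EDFs jump), and each month's squared deviation is assumed to be weighted by its share $m_j/m$ of the observations.

```python
from math import sqrt


def edf(sample, y):
    """Empirical distribution function: share of `sample` values <= y."""
    return sum(1 for v in sample if v <= y) / len(sample)


def ks_cost(monthly_samples):
    """KS-type cost of a season, given one list of durations per month."""
    if len(monthly_samples) < 2:
        raise ValueError("the statistic needs at least two months")
    m = sum(len(s) for s in monthly_samples)
    # The EDFs are step functions, so it is enough to check the data points.
    grid = sorted({v for s in monthly_samples for v in s})
    worst = 0.0
    for y in grid:
        f_month = [edf(s, y) for s in monthly_samples]
        # Pooled EDF: monthly EDFs weighted by their sample sizes.
        pooled = sum(len(s) * f for s, f in zip(monthly_samples, f_month)) / m
        # Assumption: squared deviations are weighted by m_j / m.
        dev = sqrt(sum(len(s) / m * (f - pooled) ** 2
                       for s, f in zip(monthly_samples, f_month)))
        worst = max(worst, dev)
    return worst
```

Months with identical samples give a cost of zero, while two months with completely separated values, such as `[[1, 1], [2, 2]]`, give 0.5 under this weighting.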


5. Evaluating sets of seasons (computational results)

In previous sections, different cost functions yielded different sets of seasons. The question now is: which set of seasons is the "best" set? This section will discuss different methods for evaluating the quality of a given set of seasons.

One way to evaluate a given set of seasons is by comparing the means of the different seasons to ascertain if they are statistically different from each other. This can be done using the method of multiple comparisons in a single-factor analysis of variance (ANOVA). Single-factor ANOVA is used to test whether there do indeed exist statistically significant differences in the means of the months. This must be performed before multiple comparisons because if there does not exist a difference in means (null hypothesis not rejected), then there is no need to determine where the differences are (see [10]). For the results of the F-test of a single-factor ANOVA to be valid, assumptions, such as normality of residuals and homoscedasticity, must be checked. It should be noted that there is a departure from normality if the model with daily GDP durations is used. If the model is based on average GDP durations instead of daily GDP durations, the assumptions are more nearly satisfied. The averaging of the daily GDP durations removes enough variability in the data to give constant error variance and normality of residuals.

An F-test is used to determine if there are statistically significant differences among the means of the months. Using the model involving the average GDP durations, the F-test tested the hypothesis that all factor level (monthly) means are equal and resulted in a p-value of 0.0030, which implies that there does exist some linear function of parameters that is significantly different from 0. In other words, there does exist a significant difference in means of the months. The sets of seasons from the previous sections give some idea of where the differences are.

The procedure of multiple comparisons can be used to evaluate a given set of seasons. Multiple comparisons, also known as mean separation tests, test for equality between two or more factor level means. Single-factor ANOVA with multiple comparisons was implemented for the GDP seasons from Table 2. Table 5 lists the different contrasts (in terms of seasons), their corresponding p-values and standard errors that resulted from using the Scheffé multiple comparisons procedure. Based on the p-values, there are statistically significant differences between the means of the following pairs of seasons: Jul/Aug and Nov–Mar; Jul/Aug and Sep/Oct; Jul/Aug and Apr–Jun; Nov–Mar and Sep/Oct; and Nov–Mar and Apr–Jun. It should be noted that the hypothesis that there is no difference between the mean of the Sep/Oct season and the mean of the Apr–Jun season could not be rejected. This is due to the fact that they essentially have the same mean, but cannot be placed in the same season because they are separated by the Jul/Aug season, which has a mean that statistically differs from both seasons.

Caution should be taken with this method since the assumptions on the error terms were violated for the actual daily GDP durations. In practice, these assumptions are not often satisfied. To avoid the issue of whether the assumptions needed for the F-test to be valid are satisfied, the mean square ratio (better known as the F-value if normality is satisfied) can be used to evaluate a given set of seasons. The mean square ratio is the ratio of the mean square between groups (seasons) and the mean square within groups (seasons) and is calculated according to the formula below

\[
\frac{\sum_{s} n_s (\bar{Y}_{.s} - \bar{Y}_{..})^2 / (k - 1)}{\sum_{s} \sum_{j} (Y_{js} - \bar{Y}_{.s})^2 / (n - k)}.
\]

Here k is the number of groups (seasons) being compared, $n_s$ is the number of observations in season s, and n is the total number of observations.

It is desired to have seasons that exhibit homogeneity within seasons and variability between seasons. A mean square ratio that is large confirms that this is the case. The mean square ratios are computed for pairwise contiguous seasons. If the minimum of these values is greater than some large constant, e.g. 10, then the set of seasons is valid. The set of seasons resulting from the set partitioning procedure that minimized the KS statistic using weather data (Table 4, N = 3) satisfies the mean square ratio criterion (see Table 6).
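The mean square ratio check can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's code: the function names are hypothetical, each season is passed as a list of observations, seasons are given in calendar order, and the last season is treated as contiguous with the first.

```python
from statistics import mean


def mean_square_ratio(groups):
    """Between-group mean square divided by within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    between = sum(len(g) * (gm - grand) ** 2
                  for g, gm in zip(groups, group_means)) / (k - 1)
    within = sum((x - gm) ** 2
                 for g, gm in zip(groups, group_means) for x in g) / (n - k)
    return between / within


def passes_criterion(ordered_seasons, threshold=10.0):
    """True if every pair of contiguous seasons has a ratio above `threshold`.

    Assumption: the year is cyclic, so the last season is also compared
    with the first.
    """
    neighbours = ordered_seasons[1:] + ordered_seasons[:1]
    ratios = [mean_square_ratio([a, b])
              for a, b in zip(ordered_seasons, neighbours)]
    return min(ratios) > threshold
```

For instance, `mean_square_ratio([[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]])` is 54.0: the between-group mean square is 54 and the within-group mean square is 1. The default threshold of 10 follows the example constant mentioned in the text.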

Table 5
Results of single-factor ANOVA with multiple comparisons

Contrasts               p-Values   Standard errors
Jul/Aug vs. Apr–Jun     0.0288     1.801
Nov–Mar vs. Apr–Jun     0.0015     2.401
Nov–Mar vs. Jul/Aug     0.0001     2.750
Nov–Mar vs. Sep/Oct     0.0053     2.750
Sep/Oct vs. Jul/Aug     0.0388     1.315

Table 6
Mean square ratio values for weather

Contiguous seasons      Mean square ratio
Mar–Jun vs. Jul–Sep     14.06
Jul–Sep vs. Oct–Feb     24.39
Oct–Feb vs. Mar–Jun     11.65

Fig. 2. Schematic for seasonal clustering technique. (Diagram labels: intra-season homogeneity, from the set partitioning IP; inter-season variability; transition period; non-contiguous seasons can be similar.)


Many cost functions were given in previous sections to determine sets of seasons, and the post-analysis in this section is used to determine the "best" set of seasons, where "best" refers to a set of seasons with as much intra-season homogeneity and as much inter-season variability as possible.

6. Conclusion

In general, clustering refers to the grouping of objects that are similar. Partitional clustering refers to the decomposition of a data set into a partition or set of disjoint clusters through the minimization of a distance (cost) function. Thus, based on the properties of partitional clustering, a seasonal clustering technique (clustering with an embedded time series) should have the properties that the clusters chosen are the results of minimizing some cost function, they are disjoint, and they contain points that are contiguous.

Usually in a partitional clustering procedure such as K-means, there is a process of searching through the set of all possible clusterings (partitions) to determine the best partition of the data. The set of all possible partitions can be so large that a local optimization method is needed. In K-means, a data set of M points is partitioned into K clusters and the cost criterion is the average squared distance of the observations from their nearest center location. The nearest center location is usually determined by a standard Euclidean distance function. The search procedure is an iterative procedure of assigning data points to clusters that contain their nearest centers, recomputing the center locations and then reassigning the observations until the centers change by a small ($\varepsilon$) amount.

In our seasonal clustering technique, there is restricted partitioning because the number of possible partitions is reduced due to the restriction that time-order of observations must be maintained. Thus, the set of all possible partitions can be efficiently enumerated, as in a set partitioning procedure. Hence, set partitioning will be used as the search procedure (determination of clusters/seasons) for the seasonal clustering technique. It is known that this IP, whose A matrix is an interval matrix and hence totally unimodular, can be solved efficiently [9]. Thus, the seasonal clustering problem can be solved efficiently.

As in partitional clustering, the determination of the best cost (objective) function and the appropriate number of clusters are difficult tasks. The cost (objective) function is based only on within season interaction. In the seasonal clustering approach we have proposed, alternative sets of clusters/seasons were generated by the set partitioning model based only on within season interactions. A post-processing step then took into account the between season interactions using the mean square ratio criterion (see Fig. 2).

In this paper, our seasonal clustering technique is modeled as a set partitioning integer programming problem and resulting clusterings are evaluated using the mean square ratio criterion. The resulting seasonal distributions, which have satisfied the mean square ratio criterion, can be used for the required inputs (distributions of airport arrival capacity scenarios) into stochastic ground-holding models. In combination, the results would give the optimal number of flights to ground in a ground delay program to aid more efficient traffic flow management.

References

[1] Aircraft Operating Costs by Stage of Flight and Associated Air Traffic Control Delay Costs Based on a Typical Flight, Air Transport Association, Washington, DC, 1999.
[2] S. Anily, A. Federgruen, Structured partitioning problems, Operations Research 39 (1991) 130–149.
[3] M.O. Ball, R. Hoffman, A. Odoni, R. Rifkin, A stochastic integer program with dual network structure and its application to the ground-holding problem, Operations Research 51 (2003) 167–171.
[4] B.S. Everitt, Cluster Analysis, Halsted Press, John Wiley and Sons, New York, 1993.
[5] J.A. Hartigan, Clustering Algorithms, John Wiley and Sons, New York, 1975.
[6] F. Hwang, Optimal partitions, Journal of Optimization Theory and Application 34 (1981) 1–10.
[7] T.R. Inniss, Stochastic models for the estimation of airport arrival capacity distributions, Ph.D. Dissertation, University of Maryland, 2000.
[8] A. Joseph, N. Bryson, Parametric linear programming and cluster analysis, European Journal of Operational Research 111 (1998) 582–588.


[9] G.L. Nemhauser, L.A. Wolsey, Integer and Combinatorial Optimization, John Wiley and Sons, Inc., New York, 1988.
[10] J. Neter, W. Wasserman, Applied Linear Statistical Models, Richard D. Irwin, Inc., Homewood, IL, 1974.
[11] M. Rao, Cluster analysis and mathematical programming, Journal of the American Statistical Association 66 (1971) 622–626.
