Omega 41 (2013) 41–47
DEA with streaming data

J.H. Dulá (a), F.J. López (b)

(a) School of Business, Virginia Commonwealth University, Richmond, VA 23284, United States
(b) School of Business Administration, Macon State College, Macon, GA 31206, United States
Available online 18 February 2012

Keywords: Data envelopment analysis (DEA); Linear programming (LP)

Abstract

DEA can be interpreted as a tool for the identification of "frontier outliers" among data points. These are points that are potentially interesting because they exhibit extreme properties in that the values of their attributes, either alone or combined, are at the upper or lower limits of the data set to which they belong. A real challenge for this type of frontier analysis arises when data stream in at high rates and the DEA analysis needs to be performed quickly. This paper extends DEA into this dynamic data environment. The purpose is to propose a formal theoretical framework to handle streaming data and to answer the question of how fast data can be processed using this new framework. Potential applications involving large data sets include auditing, appraisals, fraud detection, and security. In such settings the situation is likely to be dynamic, with the data domain constantly changing as new entities arrive in the course of time. New specialized tools to adapt DEA to deal with streaming data will be explored.
1. Introduction

As an illustrative and motivating example, consider a hypothetical model for identifying credit card fraud. A fraudster in possession of an invalid credit card will try to maximize his benefit from it, while remaining concerned that the fraud will eventually be discovered. The person is aware that exceeding certain limits in specific dimensions might expose his bad intentions. For example, one particularly expensive transaction could signal to the company that something may be wrong. Other dimensions could include the frequency of use of the card, the number of purchases of particular items, or derived quantities such as the relation between the physical distance and the time elapsed between transactions. It is reasonable to assume that a sophisticated fraudster will try to stay within the limits of any single dimension, but it is still in his interest to extract as much benefit from the card as possible. Therefore, he may reveal himself by the fact that some combination of these dimensions shows extreme values.

Information about charges streams into a credit card company at high rates. The values for the relevant dimensions of a charge represent a data point located somewhere relative to a polyhedral hull defined by all transactions up to that moment. In Dulá [13] it was shown that such hulls are equivalent to Variable Returns to Scale (VRS) production possibility sets. Presumably, transactions well in the interior of
the hull are less suspicious to the credit card company than those on the frontier or those that redefine it. This verification needs to be performed quickly as data about charges stream in, and the frontier must be updated when necessary. The challenge for the company is to process and identify these data quickly enough to react in a timely fashion to possible fraudulent activity.

In the scenario above, the problem becomes one of classifying new points with respect to an available VRS hull: either as part of the hull or external to it. In the model above, external points would be of particular interest to the firm for their potential to identify fraud and would elicit additional scrutiny. External points also redefine the hull if incorporated into the data set. The analysis of the impact of the inclusion of a new point on a DEA data set is a geometrical problem. There are two cases when a new entity is incorporated into the data domain that defines a VRS hull. In the first case, the arriving point is internal or on the frontier of the hull; its inclusion does not modify the hull. In the second case, the new point is exterior to the hull, and the hull changes with its inclusion.

The focus and scope of this paper is to extend DEA so that it can be applied in a new type of dynamic data environment we call "streaming data". Our purpose is to propose a formal theoretical framework to handle streaming data and to answer the question of how fast data can be processed using this new framework. Our treatment assumes that it is desirable to maintain all of DEA's functionality, including classifying all the points as efficient or inefficient, handling different returns to scale, providing benchmarking, and identifying reference sets. We also
suggest that the methodology can be applied everywhere DEA is currently used. Note that these functions are not all simultaneously available in other techniques such as support vector machines, artificial neural networks, or cluster analysis. Analytical comparisons between these techniques and DEA have already been performed; see for example [21–23] or [2].

The new data environment we are introducing to DEA is different from the dynamic scenarios already in the literature. Most of the current methods measure the performance of the same DMUs using snapshots at different times, mainly by means of the Malmquist index; e.g., [5] or [19]. Other dynamic techniques consider problems in two or more phases or stages; see for example [6]. In the proposed framework, DMUs arrive one at a time and each one is immediately incorporated into the data. We discuss the issues and challenges that arise when data stream in and their impact on the VRS hull needs to be processed and analyzed quickly. We present theoretical results and introduce two procedures, FullStream and FrameStream, the latter reflecting the state of the art in algorithms for speeding up this type of processing. We report and discuss the results of a computational analysis using large data sets, since these are likely in realistic applications.

The rest of the paper is organized as follows. The notation, assumptions, and terminology used in the remainder of the article are introduced in Section 2. Section 3 analyzes the possible consequences, theoretical and geometrical, of the arrival of a new streamer. Section 4 formally presents the two procedures, FullStream and FrameStream, developed for this project to process streaming data. The computational results of implementing the two procedures appear in Section 5, where the performance of the two approaches is analyzed and compared. A section with conclusions wraps up the paper.
2. Notation, assumptions, and terminology

The data consist of an initial point set with $n$ points $a^j$, $j = 1,\dots,n$, each with $m$ dimensions. The set $A$ collects these data points; i.e., $A = \{a^1,\dots,a^n\}$. As data stream in, $\bar a$ will denote a streamer. A sequence of streamers will be indicated by superscripts; e.g., $\bar a^1, \bar a^2, \bar a^3, \dots$. A streamer may be a brand new entity or an existing entity which is re-cycled in the sense that it is re-incorporated into the analysis with updated observations. Both algorithms presented below process streamers in the sequence in which they arrive, and streamers may be incorporated into the point set depending on their classification. This results in sequences of nested point sets $A^1 \subseteq A^2 \subseteq A^3 \subseteq \cdots$. In Dulá [12] it was shown that the data points in the set $A$ that attain the maximum of some weighted sum of their attributes among all entities using the same weights are points on the frontier of a polyhedral set $\mathcal{P}(A)$, where

$$\mathcal{P}(A) = \left\{ z \in \mathbb{R}^m \;\middle|\; z \le \sum_{j=1}^{n} a^j \lambda_j \ \text{ s.t. } \sum_{j=1}^{n} \lambda_j = 1,\ \lambda_j \ge 0,\ \forall j \right\}.$$
The set $\mathcal{P}(A)$ is the familiar VRS production possibility set in DEA, which has full dimension. The correspondence is maintained when we consider two types of components for every point: those for which the focus is on higher magnitudes (equivalent to attributes that are desirable outputs) and those for which smaller magnitudes are of interest (i.e., inputs). Using the credit card illustration, the dimensions for transaction value, special purchases, and distance from home would correspond to outputs, while a measure such as time between purchases would be an input. In this model desirable attributes are positive and undesirable attributes are negative. These polyhedral sets will be referred to as the VRS hull of the data.

We further assume that any data point on the boundary of the polyhedron $\mathcal{P}(A)$ is an extreme point. As practitioners well know, data points on the boundary that are not extreme efficient are rare (see [7, p. 448]). This assumption is referred to as the "nondegeneracy" assumption in Ottmann et al. [20]; refer to that work for directions on how to achieve it, the basic idea being to perturb the data using small random changes. All point sets used here will be non-degenerate in this sense and will not contain duplicates.

An important concept in this work is that of the frame, $\mathcal{F}$, of $A$ [9]. The frame is the smallest subset of a point set $A$ such that $\mathcal{P}(\mathcal{F}) = \mathcal{P}(A)$. Under the variable returns to scale assumption in DEA, the frame is the set of extreme efficient DMUs; that is, the set of points that define the efficient frontier. Here again, superscripts will be used to denote sequences of these sets.

Additionally, we will use the following concepts and notation. A hyperplane with normal vector $p$ and level value $\beta$ is denoted $H(p,\beta) = \{y \mid \langle p, y\rangle = \beta\}$, where $\langle\cdot,\cdot\rangle$ indicates the inner product of two vectors. A supporting hyperplane of a VRS hull is a hyperplane $H(p,\beta)$ such that $\langle p, z\rangle \le \beta$ for all $z \in \mathcal{P}(A)$ and $\langle p, z\rangle = \beta$ for at least one $z \in \mathcal{P}(A)$.

3. Processing streaming data in DEA

A new arrival $\bar a$ can have several consequences on the classification of the DMUs in the set $A$.

Case 1. The point $\bar a$ is interior to $\mathcal{P}(A)$. In this case $\mathcal{P}(A) = \mathcal{P}(A \cup \{\bar a\})$ and none of the classifications of the points in $A$ changes. A direct way of testing this is, of course, by solving one of many available DEA LPs (a code sketch appears after Result 1). Preprocessors such as Dominator in Dulá and López [12] may identify such points quickly and without having to solve an LP. Dominator is based on the idea that a point $\hat a$ dominates another point $\tilde a$ if each of the coordinates of $\hat a$ is better (larger in the case of desirable attributes and smaller in the case of undesirable attributes) than the corresponding coordinate of $\tilde a$.

Case 2. The new arrival is exterior to $\mathcal{P}(A)$ but none of the classifications of the points is affected. This can happen when $\bar a$ is relatively close to a facet of $\mathcal{P}(A)$. In this situation, $\bar a$ becomes an extreme point of $\mathcal{P}(A \cup \{\bar a\})$ and all other extreme points retain their classification. Next we provide a sufficient condition to test whether a new arrival $\bar a$ is in this category.

Result 1. Let $\bar a$ be an exterior point to $\mathcal{P}(A)$. If there exists a unique hyperplane that supports $\mathcal{P}(A)$ at a facet and separates $\mathcal{P}(A)$ from $\bar a$, then $\bar a$ is extreme for $\mathcal{P}(A \cup \{\bar a\})$ and the classification of all the other points remains unchanged.

Proof. Recall that a facet is a full-dimensional face of a polyhedral set. Suppose there is a unique hyperplane, $H(\hat p, \hat\beta)$, that supports $\mathcal{P}(A)$ at a facet and separates it from $\bar a$. Then any other hyperplane, $H(\tilde p, \tilde\beta)$, that supports $\mathcal{P}(A)$ at a facet is such that $\langle \tilde p, \bar a\rangle < \tilde\beta$. So the inclusion of $\bar a$ into the data set does not affect the supporting relation of any $H(\tilde p, \tilde\beta)$ with the hull, which means that each such hyperplane keeps its original support set and the extreme points in the support set remain extreme. Because the production possibility set has full dimension, every extreme efficient DMU belongs to at least $m$ different facets, each the support set of a different supporting hyperplane. With the inclusion of $\bar a$, an original extreme point will still belong to at least $m-1$ supporting hyperplanes different from $H(\hat p, \hat\beta)$, guaranteeing that it retains its classification. Notice that the categorization of inefficient (i.e., interior) DMUs is unaffected. □
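To make the test in Case 1 concrete, the following is a minimal sketch, not the authors' implementation, of one of the "many available DEA LPs" in Python with NumPy/SciPy. It maximizes a uniform slack $\theta$ over convex combinations of the points in $A$; with all attributes oriented so that larger is better (the sign convention of Section 2), the streamer belongs to $\mathcal{P}(A)$ exactly when the optimal $\theta^{*} \ge 0$. The function names and the specific formulation are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def classify_streamer(A, a_bar):
    """Test whether the streamer a_bar lies in the VRS hull P(A).

    A     : (m, n) array whose columns are the points a^j, every attribute
            oriented so that larger is better (undesirable ones negated).
    a_bar : (m,) array, the streamer.
    Solves  max theta  s.t.  sum_j a^j lam_j >= a_bar + theta * 1,
                             sum_j lam_j = 1,  lam >= 0,
    and returns theta*: >= 0 means a_bar is in P(A) (> 0: strictly
    interior); < 0 means a_bar is exterior and would redefine the hull.
    """
    m, n = A.shape
    c = np.concatenate([np.zeros(n), [-1.0]])            # minimize -theta
    A_ub = np.hstack([-A, np.ones((m, 1))])              # -A lam + theta 1 <= -a_bar
    b_ub = -a_bar
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]  # convexity row
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]            # lam >= 0, theta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def dominates(a_hat, a_tilde):
    """Dominator-style screen: under the same orientation, a_hat dominates
    a_tilde when it is at least as good in every coordinate."""
    return bool(np.all(a_hat >= a_tilde) and np.any(a_hat > a_tilde))
```

A streamer dominated by any existing point can be declared a member of the hull without solving an LP, which is the idea behind the Dominator preprocessor mentioned above.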
Case 3. The new arrival $\bar a$ is exterior to $\mathcal{P}(A)$ and this makes some extreme points of $\mathcal{P}(A)$ interior in $\mathcal{P}(A \cup \{\bar a\})$. Note that the status (i.e., inefficient classification) of points interior to $\mathcal{P}(A)$ remains unchanged in $\mathcal{P}(A \cup \{\bar a\})$ (although the inefficiency score may change). The classification of current extreme points may remain the same or may change to "interior". Even though there are many ways in which preprocessing can quickly identify changes in the status of some extreme points, a conclusive determination of the status of all the extreme points requires direct scoring of the DMUs with LPs.

It is arguable whether testing for the conditions in Case 2 is worth the effort, and it is not necessary to check for this condition directly. The approach we take in this paper is to check whether the streamer is internal or external to the current hull. The procedures we present identify the situation and, if the latter applies, proceed to reclassify the points.

It is important to emphasize that we will not consider any sort of preprocessors in the design of our procedures. Preprocessors affect the procedures in comparable ways and detract from understanding and comparing the performance of the underlying algorithms. This is not to say that preprocessors are not important (see the study on preprocessors by Dulá and López [12]); in fact, this is a topic for a different work. The rest of the paper is dedicated to developing two new procedures for processing streamers: one adapting the traditional static DEA computational practice, and the other exploiting the dynamic environment and employing new frame algorithms for DEA by Dulá [14].

4. A framework for streaming DEA

A DEA analyst familiar with standard DEA tools may approach the problem in streaming DEA by treating it as a temporary static situation each time there is a new arrival. This would entail incorporating the new streamer into the data set and processing the DMUs one at a time to reclassify them. This approach can be improved in several ways. First, it is not necessary to process all the DMUs if the streamer turns out to be interior. Additionally, the analyst can use many powerful accelerators and enhancements; for example, Restricted Basis Entry (RBE) [1] and "hot starts". RBE consists of removing from the LPs the data corresponding to interior (inefficient) entities as they are discovered. Such inefficient DMUs are not necessary since they will never be basic at optimality. Implementing RBE causes the LPs to get progressively smaller: every time an inefficient DMU is found, the next LP solved has one fewer column (in the DEA envelopment form) or one fewer row (in the DEA multipliers form).

A procedure, FullStream, based on adapting the static approach to streaming DEA, is presented next; a schematic sketch in code follows the listing.

Procedure FullStream.
INPUT: Data set $A^0$ and a sequence of streamers $\bar a^j$, $j = 1,2,3,\dots$.
OUTPUT: For each $j$, classification of all DMUs in $A^j = A^0 \cup \bigcup_{k=1}^{j} \{\bar a^k\}$.
For $j = 1,2,3,\dots$
  1.0 Determine if $\bar a^j$ is interior or exterior to $\mathcal{P}(A^{j-1})$.
    1.1 If $\bar a^j$ interior: do nothing (no change in DMU classifications); go to Step 2.0.
    1.2 Else ($\bar a^j$ exterior): classify all DMUs in $A^{j-1} \cup \{\bar a^j\}$.
    End If
  2.0 $A^j = A^{j-1} \cup \{\bar a^j\}$.
  3.0 Next $j$.
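Under the assumptions of the earlier sketch, FullStream can be outlined as follows, reusing classify_streamer from Section 3. This is a bare illustration: it omits RBE and hot starts, which the actual implementation uses (see the notes below), and it relies on the nondegeneracy assumption, under which a point is extreme (efficient) exactly when it is exterior to the hull of the remaining points.

```python
def full_stream(A0, streamers):
    """Sketch of Procedure FullStream (no RBE, no hot starts).

    A0        : (m, n) array of starter points, as columns.
    streamers : iterable of (m,) arrays, processed in arrival order.
    Yields (streamer, status, classifications-or-None) per arrival.
    """
    A = A0.copy()
    for a in streamers:
        exterior = classify_streamer(A, a) < 0            # Step 1.0
        A = np.hstack([A, a[:, None]])                    # Step 2.0: form A^j
        if not exterior:                                  # Step 1.1
            yield a, 'interior', None
            continue
        # Step 1.2: reclassify every DMU with one "full-sized" LP each.
        labels = ['extreme' if classify_streamer(np.delete(A, j, axis=1),
                                                 A[:, j]) < 0 else 'interior'
                  for j in range(A.shape[1])]
        yield a, 'exterior', labels
```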
Notes on Procedure FullStream.

1. In an application of streaming DEA such as fraud detection, a piece of information of immediate importance is whether or not a new transaction is external to the current frontier. Presumably, such a transaction will trigger special scrutiny. For this reason, procedure FullStream (and later procedure FrameStream) provides DMU classification as interior or exterior. Recall that the DMU classification depends only on the production possibility set (i.e., the returns to scale assumption) and is independent of orientation or other such modeling considerations.

2. Step 1.2 is where most of the work is performed in this procedure. Without enhancements or accelerators, it requires solving as many LPs as there are data points in $A^{j-1}$, all of them "full-sized" in the sense that the LP coefficient matrix contains all the data (the streamer is by now already classified). Enhancements and accelerators can greatly reduce the computational burden in this step. It is known from Barr and Durchholz [3] and Dulá [11] that an enhancement such as RBE can typically reduce times by more than one half, and that accelerators such as LP "hot starts" (schemes for incorporating advanced basis information from a previous optimal solution at the start of the simplex algorithm) can reduce times by almost one half. The savings of the two schemes, when combined, can be as much as one order of magnitude [11].

3. Procedure FullStream was implemented using RBE and hot starts. The reports of this implementation appear in the next section.

4. At each iteration, procedure FullStream requires the solution of at least one full-size LP and, if the streamer is exterior, as many additional LPs as the current cardinality of the point set $A^{j-1}$. The size of these LPs, however, is reduced every time an interior DMU is identified if RBE is employed.

The dynamic nature of streaming DEA provides a structure that can be exploited to produce a procedure different from FullStream. Since all points interior to $\mathcal{P}(A)$ remain interior in $\mathcal{P}(A \cup \{\bar a\})$, the difference in the classification of the DMUs before and after the arrival of a streamer affects only the frame. Checking whether the streamer is interior or exterior can also be performed using the frame. This means that the only relevant information needed to process streaming data is the sequence of frames of the data sets; the process becomes one of updating and using frames. This represents an important potential computational advantage, especially if, as in practice, the frame is a small proportion of the data set. Moreover, the process can further benefit from recent developments in DEA for finding frames of VRS hulls, such as Dulá's [14] BuildHull algorithm. This suggests a new procedure, FrameStream, formally described next; a sketch in code appears after the listing.

Procedure FrameStream.
INPUT: Data set $A^0$ and a sequence of streamers $\bar a^j$, $j = 1,2,3,\dots$.
OUTPUT: For each $j$, classification of all DMUs in $A^j = A^0 \cup \bigcup_{k=1}^{j} \{\bar a^k\}$.
0.0 Initialization: Calculate $\mathcal{F}^0$, the frame of $A^0$.
For $j = 1,2,3,\dots$
  1.0 Use $\mathcal{F}^{j-1}$ to determine if $\bar a^j$ is interior or exterior to $\mathcal{P}(\mathcal{F}^{j-1})$.
    1.1 If $\bar a^j$ interior: $\mathcal{F}^j = \mathcal{F}^{j-1}$; go to Step 2.0.
    1.2 Else ($\bar a^j$ exterior): calculate $\mathcal{F}^j$, the frame of $\mathcal{F}^{j-1} \cup \{\bar a^j\}$.
    End If
  2.0 Next $j$.
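A comparable sketch of FrameStream follows, under the same assumptions as before. Since BuildHull is not reproduced here, a brute-force frame routine stands in for it, applying the same exterior test to each point against the hull of the others; the function names are ours. The gains reported later come from replacing this stand-in with BuildHull and from the frame being small relative to the data.

```python
def naive_frame(A):
    """Frame of A by brute force: keep the points exterior to the hull of
    the remaining ones.  A slow stand-in for the BuildHull algorithm."""
    keep = [j for j in range(A.shape[1])
            if classify_streamer(np.delete(A, j, axis=1), A[:, j]) < 0]
    return A[:, keep]

def frame_stream(A0, streamers):
    """Sketch of Procedure FrameStream: only the frame is carried forward."""
    F = naive_frame(A0)                                    # Step 0.0: F^0
    for a in streamers:
        if classify_streamer(F, a) < 0:                    # Step 1.0: exterior?
            F = naive_frame(np.hstack([F, a[:, None]]))    # Step 1.2: new F^j
        yield a, F                                         # F^j = F^{j-1} if interior
```

Note that every LP in this loop involves only the frame, never the whole data set, which is the source of the computational advantage discussed in the notes that follow.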
Notes on Procedure FrameStream.

1. Both the initialization (Step 0.0) and the condition in Step 1.2 require the calculation of a frame. The fastest way currently to find the frame of a point set is to apply the algorithm BuildHull, mentioned above, which finds the frame by building a sequence of partial production possibility sets. BuildHull belongs to a family of output-sensitive algorithms for finding frames of finitely generated polyhedral hulls (Dulá and López [10]). They are output-sensitive in the sense that their performance improves as the density of frame elements in a point set (the percentage of extreme points) decreases. In this sense, processing DEA data with traditional methods and RBE is also output-sensitive, since the effect of this enhancement is to progressively reduce the dimension of the LPs, and the rate at which this happens is directly impacted by the frame density. The results of implementing FrameStream using BuildHull are presented in the next section.

2. The numerical complexity of procedure FrameStream is determined by the amount of work required to calculate the frame of a point set. Algorithm BuildHull requires as many LPs as the cardinality of the point set; these LPs, however, typically begin with one or two columns and end with as many columns as the cardinality of the frame. An additional LP with dimension defined by the cardinality of the frame is required in Step 1.0. The frame algorithm needs to be applied at the initialization and whenever a point is exterior in Step 1.2.
The two algorithms differ in the way they react to the arrival of an external streamer. Procedure FullStream proceeds as if a new DEA problem had to be solved each time a new external streamer is detected, processing the entities one by one until they are classified. Procedure FrameStream uses the frame of the current hull and applies a specialized frame-detection procedure to determine how the frame is modified by the arrival of the external streamer.

A relevant measure of performance for any procedure for streaming DEA is the streamer processing rate. Uncertainty in the streamers' arrivals can cause congestion even if processing is deterministic; only when the processing rate is greater than the arrival rate is there hope that the procedure will be suitable for real-time processing. The processing rate is a performance measure we report for the tests comparing FullStream and FrameStream in the next section.

5. Implementations

Implementations of FullStream and FrameStream for streaming DEA were coded and tested. One objective of the study is to understand the performance of the two procedures relative to the dimension of the point set and the frame density. Another objective is to determine baseline processing rates that will provide an idea of the kinds of applications that may benefit from the new framework for DEA.

We employ artificially generated DEA data sets created as follows (the first step is sketched in code below). For a desired dimension, a set of points is randomly generated on the boundary of a unit hypersphere. The set of points is partitioned into subsets depending on the number of dimensions. Membership of the points in the subsets is determined using a measure of proximity based on calculating all pairwise inner products. The unit sphere is then distorted by uniformly scaling the points in each subset, using different random factors for different subsets. The final desired frame density is obtained by trial and error, cycling through the sequence of finding the frame, verifying the density, and judiciously applying random expansions or contractions to the points. The same data sets have been used in previous studies [11,12,14,17,18] and are available at www.people.vcu.edu/~jdula/LargeScaleDEAdata/. For more information about how the points were generated, refer to Dulá and López [12].

The test suite contains data files with 5, 10, 15, and 20 dimensions and with 1%, 13%, 25%, and 50% frame density, all with a cardinality of 5000 points, for a total of 16 data files. Testing for streaming DEA was carried out by partitioning each data set into a first part, the "starters", which corresponds to the starting point set ($A^0$), and the rest, which are treated as streamers by processing them one at a time. We fixed the partition at 75-25 starters vs. streamers; that is, for the 5000-point data sets used here, the first 3750 points are treated as starters and the remaining 1250 as streamers. Such a partition provides a relatively large starters' set, which better models a process in steady state after all initializing effects have been overcome. Note that the point at which steady state is attained is problem and application specific; in an actual application, specific tests must be performed to establish when there is enough data for the algorithm to progress in a stable and predictable manner. The procedure, though, can start with a single observation: this observation and the DEA free-disposability region (the recession cone of the production possibility set) create a full-dimensional production possibility set, and it is clearly possible to test whether a new arrival is interior or exterior to it. It is expected that, due to the random nature of the data, potential effects due to the order of arrival ("$\bar a^1$ then $\bar a^2$" vs. "$\bar a^2$ then $\bar a^1$") will even out, so the results represent an "average" or "expected" performance of the procedures. It was considered unnecessary to test different point-set cardinalities since we can expect the insights obtained to extrapolate.

The computational platform for the tests was a dedicated Pentium Centrino PC running at 2.0 GHz with 3.0 GB of RAM. An important aspect of the implementations of FullStream and FrameStream is that they were coded in C. The LPs were solved with CPLEX 12.2, a commercial product sold by IBM [16], accessed through its callable library. At the time of this writing, CPLEX 12.2 provides one of the fastest and most stable LP solvers available; the results of this study therefore reflect the state of the art in the ability to process streaming DEA.

The results from the implementations of the two procedures on the 16 data sets appear in Table A1 in the Appendix. The table records the computational times to process each data set using the last 25% of the points as streamers.
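As an illustration of the first step of the data generator described above, the following sketch places random points uniformly on the unit hypersphere. The subsequent partitioning and density-tuning steps of the published generator are omitted, and the function name is ours.

```python
def sphere_points(m, n, seed=None):
    """n random points (as columns) uniform on the unit (m-1)-sphere:
    normalized standard Gaussian vectors are uniformly distributed there."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, n))
    return X / np.linalg.norm(X, axis=0)
```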
´, F.J. Lo ´pez / Omega 41 (2013) 41–47 J.H. Dula
8.00E+03
1.00E+03
FrameStream
FullStream
Time (CPU secs.)
Time (CPU secs.)
1.20E+03 8.00E+02 6.00E+02 4.00E+02 2.00E+02 0.00E+00
45
5
10
15
20
FrameStream
6.00E+03
FullStream
4.00E+03 2.00E+03 0.00E+00
0%
10%
Dimension
20% 30% Frame Density
40%
50%
Fig. 1. FullStream vs FrameStream: (a) impact of dimension; (b) impact of frame density.
for each dimension. This explains the nearly linear increase in processing times seen in Fig. 1a for each of the procedures as the dimension increases, a behavior observed long ago by Dantzig [8] and Gass [15]. In every case, however, FullStream is slower to process the streamers than FrameStream. The linear increase in times for both procedures as dimension increases implies a constant speedup of FrameStream with respect to FullStream, and this is what is observed when calculating the ratios of processing times between the two procedures for each dimension: for the data sets in Fig. 1a, the ratio of FrameStream's time to FullStream's ranges narrowly between 0.22 and 0.27. The variation can be attributed to natural differences among the point sets.

Frame density has a more dramatic impact on the performance of the two procedures. Two effects are at play in determining the impact of frame density on computational times. First, as frame density increases, the number of external streamers also increases; recall that every external streamer requires either a full scoring of the DMUs in FullStream or a full calculation of the frame in FrameStream. There is, therefore, an effect due to the increase in the number of LPs that must be solved as the frame density increases. The second effect is due to the fact that these LPs also become larger, because external streamers join the coefficient matrix of the LPs immediately after they are identified in both FullStream and FrameStream. (External streamers may subsequently become internal points with the arrival of new streamers, although we witnessed little of this in this problem suite.) It is known that computing times in DEA grow quadratically with the size of the data set [10], and computing times are directly proportional to the number of LPs. Since the dominating effect is quadratic, the relationship between computing times and frame density is quadratic. Fig. 1b illustrates well the impact of frame density on the two algorithms and confirms that FrameStream also dominates along this data set characteristic.

It was noted above that the two procedures are output-sensitive with respect to frame density in the sense that their performance depends on this data attribute. This sensitivity, however, differs between the two procedures. Procedure FullStream is based on a standard DEA approach with RBE, and procedure FrameStream is based on BuildHull. Results from the static case obtained in Dulá and López [10, see Fig. 6(a)] show that frame-based algorithms such as BuildHull derive greater benefit from lower frame densities than standard approaches do from RBE.

[Fig. 2. Proportion of the time to process streamers by FullStream and FrameStream, by frame density, at five dimensions.]

The relations we see between FullStream and FrameStream in Fig. 2 are consistent with what was observed in the static case in Dulá and López [10]. In this figure we compare the relative effort (in terms of computational time) of FrameStream vs. FullStream to process streamers in the case of five dimensions, the lowest in this analysis. From the figure we see that at a low frame density of 1%, FullStream takes almost one order of magnitude more effort to process streamers. This advantage, however, erodes as the frame density increases until
the case of 50% frame density, where the two procedures take a comparable amount of effort to process streamers.

The plots in Figs. 1 and 2 reveal another type of information of special interest for this work: the processing rates of the procedures. Note that the number of streamers in our analysis is always 25% of 5000, that is, 1250. We know that CPU times increase both when the dimension and when the frame density increase; processing rates for streaming data therefore fall as these two data characteristics grow. Processing rates for all data sets appear in the last column of Table A1. The fastest rates for processing streamers in this study were observed at five dimensions and 1% frame density, where the processing rate of procedure FrameStream was 587.96 DMUs per second. At this rate it is possible to envision a real-time application for this procedure, although this performance is not sustained as the data characteristics change. Improvements on these rates are, of course, possible with faster machines. Given that the latest LP solving technology has been used and that the fastest known algorithms for finding frames are applied, these results reflect the state of the art in processing streaming data in DEA.

There are potential pitfalls with the procedures in applications such as fraud detection. If all external streamers are incorporated into the production possibility set, it will grow and the procedure will become less discriminating. In terms of fraud, this means that fraudulent records will be identified once, but similar events later on might not be detected because the production possibility set will have grown to encompass them. To address this problem, a credit card company would have to check whether an external streamer is a case of fraud. If not, the point should be made part of the production possibility set; otherwise it should simply be discarded. This way, the next time a similar point arrives it will be external and will be duly handled by the fraud department. This presents interesting geometrical challenges and is an idea worth pursuing, especially if the procedures are to be used in
practice. Also, a credit card fraudster may act strategically and generate sequences of small transactions that do not result in external points, thus remaining "under the radar," at least for a while, with respect to the mechanics of the procedure. Our procedures would not be able to detect this type of strategic behavior, which may be considered a drawback of our algorithms. The response can be a new dimension such as, say, a measure of the frequency of transactions. As long as fraudsters are motivated in part by greed, they will be drawn towards excesses. There is hope that there will be dimensions that somehow capture this, and there are good geometrical reasons to believe that a frontier-based procedure will eventually detect them. Certainly, such a procedure would be one of several tools at the analyst's disposal in the effort to detect and, hopefully in the future, discourage fraud.
6. Conclusions

The concept of polyhedral frontier analysis, formalized and made practical by DEA in the context of efficiency measurement, can be extended to more general applications in the area of data mining. DEA studies are typically performed on data that are static or cross-sectional, using the same set of DMUs. One generalization of this approach to frontier analysis, and one that is necessary as applications for frontier analyses are extended, is the case when data stream in. Frontier analysis with streaming data presents a host of theoretical and computational challenges; this study has explored both. The results show that DEA with streaming data should be approached directly as a new type of problem and not as a simple generalization of the static case.

A new framework for processing streaming data in frontier analysis, based on the role played by frames, was developed, resulting in the specialized procedure FrameStream. This procedure was compared with a generalization of the static approach, FullStream. The results show that the new procedure, FrameStream, is the faster approach, especially when the dimension and frame density are small. Both implementations use the state of the art in LP solving technology, and FrameStream applies fast algorithms to find frames; the results therefore provide a useful assessment of the current capacity to process streaming data. Our results show that streaming rates decrease as the dimension of the data points and the frame density increase. The fastest processing rate achieved was 588 streamers per second, with FrameStream at five dimensions and 1% frame density; the slowest, with FullStream at the opposite ends of the dimension and frame density ranges, was 0.1343 streamers per second, which means over seven seconds per streamer. It remains to be determined for which real-world applications, if any, such processing rates are effective. This study is the first of its kind, so the results will serve as benchmarks for future work.

Faster streaming is immediately possible with faster computers, and a faster LP solver may come along and help increase the performance of the procedures. Ultimately, however, faster procedures will come from new theoretical developments. In future work we will adapt frontier analysis specifically for anomaly detection. At that point, we will have to shed claims of full DEA functionality for our methods and focus on the relevant metrics, such as speed and accuracy, that will allow head-to-head comparisons with competing methodologies.
Acknowledgments The authors are grateful to two anonymous referees for their comments and criticisms. The result of addressing them has been a better defined and more focused paper.
Appendix

See Table A1.

Table A1
Implementation results of procedures FullStream and FrameStream on a problem suite. Number of data points: 5000. Number of streamers: 1250.

Procedure      Dim   Frame density (%)   Total CPU time (s)   Processing rate (streamers/s)
FullStream      5      1                    31.86                39.23
FrameStream     5      1                     2.13               587.96
FullStream      5     13                   510.60                 2.45
FrameStream     5     13                   112.40                11.12
FullStream      5     25                  1345.00                 0.93
FrameStream     5     25                   481.10                 2.60
FullStream      5     50                  3957.00                 0.32
FrameStream     5     50                  2201.00                 0.57
FullStream     10      1                    46.08                27.13
FrameStream    10      1                     3.36               372.13
FullStream     10     13                   683.90                 1.83
FrameStream    10     13                   165.70                 7.54
FullStream     10     25                  1814.00                 0.69
FrameStream    10     25                   706.70                 1.77
FullStream     10     50                  6013.00                 0.21
FrameStream    10     50                  3410.00                 0.37
FullStream     15      1                    49.27                25.37
FrameStream    15      1                     4.28               291.92
FullStream     15     13                   782.90                 1.60
FrameStream    15     13                   213.00                 5.87
FullStream     15     25                  2206.00                 0.57
FrameStream    15     25                   898.30                 1.39
FullStream     15     50                  8011.00                 0.16
FrameStream    15     50                  4392.00                 0.28
FullStream     20      1                    69.59                17.96
FrameStream    20      1                     6.86               182.24
FullStream     20     13                   960.00                 1.30
FrameStream    20     13                   261.50                 4.78
FullStream     20     25                  2580.00                 0.48
FrameStream    20     25                  1064.00                 1.17
FullStream     20     50                  9305.00                 0.13
FrameStream    20     50                  5297.00                 0.24
References

[1] Ali I. Streamlined computation for data envelopment analysis. European Journal of Operational Research 1993;64:61-7.
[2] Athanassopoulos AD, Curram SP. A comparison of data envelopment analysis and artificial neural networks as tools for assessing the efficiency of decision making units. Journal of the Operational Research Society 1996;47:1000-16.
[3] Barr RS, Durchholz ML. Parallel and hierarchical decomposition approaches for solving large-scale data envelopment analysis models. Annals of Operations Research 1997;73:339-72.
[5] Chang S-J, Hsiao H-C, Huang L-H, Chang H. Taiwan quality indicator project and hospital productivity growth. Omega 2011;39(1):14-22.
[6] Cook WD, Liang L, Zhu J. Measuring performance of two-stage network structures by DEA: a review and future perspective. Omega 2010;38(6):423-30.
[7] Cooper WW, Ruiz JL, Sirvent I. Choosing weights from alternative optimal solutions of dual multiplier models in DEA. European Journal of Operational Research 2007;180:443-58.
[8] Dantzig GB. Linear programming and extensions. Princeton, NJ: Princeton University Press; 1963.
[9] Dulá JH, Thrall RM. A computational framework for accelerating DEA. Journal of Productivity Analysis 2001;16:63-78.
[10] Dulá JH, López FJ. Algorithms for the frame of a finitely generated unbounded polyhedron. INFORMS Journal on Computing 2006;18:97-110.
[11] Dulá JH. A computational study of DEA with massive data sets. Computers and Operations Research 2008;35:1191-203.
[12] Dulá JH, López FJ. Preprocessing DEA. Computers and Operations Research 2009;36:1204-20.
[13] Dulá JH. A geometrical approach for generalizing the production possibility set in DEA. Journal of the Operational Research Society 2009;60:1546-55.
[14] Dulá JH. An algorithm for data envelopment analysis. INFORMS Journal on Computing 2011;23(2):284-96.
[15] Gass S. Linear programming. New York: McGraw-Hill; 1958.
[16] IBM. CPLEX 12.2. <http://www-01.ibm.com/software/integration/optimization/cplex-optimizer>; 2010.
[17] Korhonen PJ, Siitari PA. Using lexicographic parametric programming for identifying efficient units in DEA. Computers and Operations Research 2007;34:2177-90.
[18] Korhonen PJ, Siitari PA. A dimensional decomposition approach to identifying efficient units in large-scale DEA models. Computers and Operations Research 2009;36:234-44.
[19] Simon J, Simon C, Arias A. Changes in productivity of Spanish university libraries. Omega 2011;39(5):578-88.
[20] Ottmann T, Schuierer S, Soundaralakshimi S. Enumerating extreme points in higher dimension. Nordic Journal of Computing 2001;8:179–92. [21] Po R-W, Guh Y-Y, Yang M-S. A new clustering approach using data envelopment analysis. European Journal of Operational Research 2009;199:276–84. [22] Poitier K, Cho S. Estimation of true efficient frontier of organisational performance using data envelopment analysis and support vector machine learning. International Journal of Information and Decision Sciences 2011;3(2):148–72. [23] Yeh C-C, Chi D-J, Hsu M-F. A hybrid approach of DEA, rough set and support vector machines for business failure prediction. Expert Systems with Applications 2010;37(2):1535–41.