
Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 20 (2013) 362 – 367

Complex Adaptive Systems, Publication 3
Cihan H. Dagli, Editor in Chief
Conference Organized by Missouri University of Science and Technology
2013 - Baltimore, MD

An OpenCL Framework for Fuzzy Associative Classification and Its Application to Disease Prediction

Erhan Guven*, Anna L. Buczak

Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Rd, Laurel, MD 20723-6099, USA

Abstract

Recently, the broad availability of online and soft real-time data has been attracting corporations and researchers to data analytics. Although newer and faster algorithms are being developed, computational processing has been falling behind as available dataset sizes grow exponentially. The trend in computational resources is toward multiple cores and parallel processing, while CPU clock speed improvements are slowing down. An important computational resource is the multi-core Graphics Processing Unit (GPU), now surpassing 2,000 processing elements in one processing unit and operating at almost 5 teraflops, such as the recently introduced GeForce GTX Titan. In this study, an Open Computing Language (OpenCL) parallel processing framework for fuzzy associative classification is described. A hybrid CPU-GPU implementation is developed and employed for the prediction of infectious disease outbreaks, specifically influenza, using environmental and disease data readily available online. The implementation is compared with another in-house developed Fuzzy Association Rule Mining operator, FARM, and its performance is measured on four distinct parallel processing environments: a four-processor, 64-thread Opteron server; a two-processor, 24-thread Xeon server; a GeForce GTX 680 GPU card; and a Radeon HD 7950 GPU card. The advantages and disadvantages of the OpenCL implementation on parallel processors are discussed.

© 2013 The Authors. Published by Elsevier B.V. Selection and peer-review under responsibility of Missouri University of Science and Technology.

Keywords: Fuzzy associative classification; disease prediction; rule mining; OpenCL; GPU; parallel-processing

1. Introduction

In data mining, the associations between independent and dependent variables can be discovered by a well-researched methodology called association rule mining [1]. The same approach can also be used to predict the unknown categories of new observations of a dependent variable. When the independent variables are fuzzified and used as an input to the classification, the nominal prediction method is called fuzzy associative classification [2]. Discovering the fuzzy association rules from a dataset can be a daunting task, as the number of possible combinations of association rules increases exponentially. Several algorithms, such as Apriori, Max-Miner, and FP-growth, have been proposed to extract complete rules quickly [3,4,5].

* Corresponding author. E-mail address: [email protected].

1877-0509 © 2013 The Authors. Published by Elsevier B.V. Selection and peer-review under responsibility of Missouri University of Science and Technology doi:10.1016/j.procs.2013.09.287



In the literature there are reports where such implementations are realized as parallel programs on Open Computing Language (OpenCL) platforms for speed-up [6,7]. OpenCL is a framework that supports computing on heterogeneous platforms including Central Processing Units (CPUs), Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), etc. Among the core benefits of the framework is its support for both task-parallel and data-parallel processing [8,9]. To our knowledge, the literature to date does not report fuzzy association rule mining on parallel processing environments. In this work, an OpenCL based framework for fuzzy associative classification, its implementation, and its real-world performance analysis using online available disease data are presented.

2. Fuzzy Associative Classification

Subject matter experts prefer to work with fuzzified data (i.e. linguistic variables) rather than with exact numbers. Fuzzification is, in a way, a quantization of the numerical attributes into categories directly understandable by the experts, closing the gap between the data analyst and the domain expert. The quantization of each attribute is also weighted by a degree of fuzzy membership in the related Membership Function (MF). Methods to generate these fuzzy membership functions can be fully automated, semi-supervised, or fully manual [10]. Associative classification of the data engages independent predictor variables (also called explanatory variables) and a scalar dependent variable to be predicted by means of the association between them [2]. A representation of this association consists of antecedents (fuzzified independent variables) and consequents (the categorized dependent variable), as follows. Given a set of M data points X and the pre-labeled binary categories Y,

$$X = \{x_1, x_2, \ldots, x_i, \ldots, x_M\}, \quad x_i \in \mathbb{R}^t$$

$$Y = \{y_1, y_2, \ldots, y_i, \ldots, y_M\}, \quad y_i \in \{C_1, C_2\}$$

Each xi is called a feature vector of size t or a vector of attributes in the form of,

$$x_i = \left[\, a_1 \;\; a_2 \;\; \ldots \;\; a_j \;\; \ldots \;\; a_t \,\right]^T, \quad a_j \in \mathbb{R}$$

Given a set of fuzzy membership functions $f_{lj}(\cdot)$ for each dimension $a_j$ (or attribute) of $x_i$, the fuzzified data point $D_i$ is

$$x'_i = \left[\, f_{11}(a_1) \;\; f_{12}(a_1) \;\; \ldots \;\; f_{1e_1}(a_1) \;\; f_{21}(a_2) \;\; f_{22}(a_2) \;\; \ldots \;\; f_{lj}(a_j) \;\; \ldots \;\; f_{te_t}(a_t) \,\right]^T, \quad e_j \in \mathbb{Z}$$

Each ej represents the cardinality of the set of membership functions for the attribute aj. Note that each designed set of MFs might have a different size. Define each antecedent A as one of the elements from Di,

$$A_k = f_{lj}(a_j), \quad 1 \le k \le N, \quad N = |D_i|$$

Then a rule mined from Di is defined as,

$$A_j, \ldots, A_k, \ldots, A_r \Rightarrow C, \quad C \in \{C_1, C_2\}, \quad j < \ldots < k < \ldots < r, \quad 1 \le j, \ldots, k, \ldots, r \le N$$

where each $A_k$ corresponds to the output of one of the fuzzy membership functions of attribute $a_j$ (e.g., the fuzzified value "mean dew point is high"), and there are Q antecedents (i.e. $1 \le Q \le N$) in the rule. The support and confidence measures of a single antecedent $A_k$ and of a single rule (antecedent(s) with a consequent) over the given dataset X are defined as,

$$\mathrm{Supp}(A_k) = \frac{1}{M}\sum_{i=1}^{M} A_{ik}, \qquad \mathrm{Supp}(A_k \Rightarrow C_{1,2}) = \frac{1}{M}\sum_{\substack{i=1 \\ y_i \in C_{1,2}}}^{M} A_{ik}, \qquad \mathrm{Conf}(A_k \Rightarrow C_{1,2}) = \frac{\mathrm{Supp}(A_k \Rightarrow C_{1,2})}{\mathrm{Supp}(A_k)}$$

The support of a rule with more than one antecedent is defined as,

$$\mathrm{Supp}(A_j, \ldots, A_k, \ldots, A_r \Rightarrow C_{1,2}) = \frac{1}{M}\sum_{\substack{i=1 \\ y_i \in C_{1,2}}}^{M} g(A_{ij}, \ldots, A_{ik}, \ldots, A_{ir})$$

where $g(\cdot)$ is the membership inclusion function [10] (a t-norm), selected here as the minimum (i.e. fmin) function.
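As a concrete illustration of these definitions, the sketch below computes the support and confidence of one rule over a row-major fuzzified data matrix using the minimum t-norm. This is a minimal example written for this exposition, not the paper's code; the function name, argument layout, and class encoding (0 = C1, 1 = C2) are assumptions.

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical sketch: fuzzy support and confidence of a rule with Q antecedents
       over the fuzzified M x N matrix D (row-major), using the minimum t-norm for g(.).
       cols[] holds the antecedent column indices, y[i] the consequent class (0 or 1). */
    void rule_supp_conf(const float *D, const int *y, size_t M, size_t N,
                        const size_t *cols, size_t Q, int target_class,
                        float *supp, float *conf)
    {
        float s_rule = 0.0f, s_ante = 0.0f;
        for (size_t i = 0; i < M; i++) {
            float m = D[i * N + cols[0]];
            for (size_t q = 1; q < Q; q++)            /* g(.) = min over the antecedents */
                m = fminf(m, D[i * N + cols[q]]);
            s_ante += m;                              /* contributes to Supp(antecedents) */
            if (y[i] == target_class) s_rule += m;    /* restrict to the rule's consequent */
        }
        *supp = s_rule / (float)M;                         /* Supp(A_j,...,A_r => C) */
        *conf = (s_ante > 0.0f) ? s_rule / s_ante : 0.0f;  /* Conf = Supp(rule)/Supp(antecedents) */
    }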



Generating all existing non-fuzzy or fuzzy association rules from a given dataset is a hard problem. The Apriori algorithm is a fast method that determines the rules based on the upward closure property of association rule mining [3]. Another algorithm, Max-Miner, uses the set-enumeration tree search framework, a strategy to enumerate all possible rules efficiently, hence speeding up the Apriori algorithm [4]. In this work, the rules are generated using the Apriori and set-enumeration tree algorithms without any pruning of the generated rule tree. The dataset D is pre-processed and exported as an M by N matrix, representing M data points (rows) and N fuzzified variables (columns). As an example, if five membership functions (e.g. none, low, medium, high, and missing) are defined for each independent variable, then N is equal to 5 times the number of variables. Another frequently cited (non-fuzzy) association rule generation algorithm, FP-growth, is not especially effective for fuzzy associative classification because the data matrix D may not be as highly sparse as the algorithm requires for efficiency [5]. Assuming each variable is fuzzified into 5 membership functions and at least one membership per variable is non-zero, the maximum sparseness of D is 80%. On average, D will be 60% sparse, when two memberships are non-zero; in the worst case, D will be less than 50% sparse, when more than two memberships are non-zero (i.e. several membership functions overlap with each other frequently). In addition, each membership degree can take different numerical values, unlike in regular association rule mining on which FP-growth is based, forcing the FP-growth algorithm to store a large portion of the data matrix D in the FP-tree data structure and making the process equivalent to the one described in this study. Once the fuzzy association rules are generated, a classifier is determined by selecting the rules with the highest confidence, support, etc., and selecting the rules that encompass the given data points X [12].

3. The Dataset for Influenza Incidence Rate Prediction

In this work, the data for the independent variables is collected from online resources (see Table 1) and then fuzzified manually by a human expert into membership functions, such as triangular, trapezoidal, etc.

Table 1. Predictor variables

Variable Name | Time Resolution | Geographical Location/Resolution | Source
Rate of Influenza Like Illnesses (ILI) Reported | Weekly | Region 3 (Baltimore) | CDC website, http://www.cdc.gov/flu/weekly/index.htm
Rate of Hospitalization due to Influenza | Weekly | Flu-Net (entire country) | CDC website, http://www.cdc.gov/flu/weekly/index.htm
Rate of Positive Lab Tests for Influenza | Weekly | Region 3 (Baltimore) | CDC website, http://www.cdc.gov/flu/weekly/index.htm
Pneumonia and Influenza (P&I) Mortality Rate | Weekly | Region 3 (Baltimore) | CDC website, http://www.cdc.gov/flu/weekly/index.htm
Dew Point (mean, max) | Daily | Reagan Airport Station | NOAA, NCDC Quality Controlled Local Climatological Data (QCLCD), http://www.ncdc.noaa.gov/land-based-station-data/quality-controlled-local-climatological-data-qclcd
Derived Relative Humidity (mean, max) | Daily | Reagan Airport Station | NOAA, NCDC QCLCD (as above)
Derived Specific Humidity (mean, max) | Daily | Reagan Airport Station | NOAA, NCDC QCLCD (as above)
Temperature (day, night) | Daily | National Capital Region | NASA USGS Land Processes Distributed Active Archive Center, https://lpdaac.usgs.gov/
Rainfall | Daily | National Capital Region | NASA Tropical Rainfall Measuring Mission (TRMM), http://mirador.gsfc.nasa.gov
Week number | NA | NA | NA

The disease incidence rate (Rate of Influenza Like Illnesses Reported in Table 1) is to be predicted weekly, using past values of the collected predictor variables. The daily variables are converted to weekly resolution. The geographical location of the data is the National Capital Region (of the United States). There are a total of 13 disease and climate variables, and for each, 11 past values are mapped to columns, representing the past 11 weeks. Each of these variables is fuzzified into 5 membership functions. In addition, the week number is employed and fuzzified into 13 door functions. Hence, the total number of columns N is equal to 728 (13 × 11 × 5 + 13). After the pre-processing, the number of data points M is equal to 525, representing a feature vector of size 728 for each of 525 weeks. The matrix D holds 32-bit floating point numbers (membership values) occupying a memory space of 1493 KB. The size of the matrix D is extremely important because the rule generation kernel operates entirely on it, and kernel performance depends on where D resides in memory and on that memory's type and speed. If the matrix fits in the local L1 cache of the computing unit, the computations can be more than 10 times faster [14].
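As a quick check (simple arithmetic, not an additional result from the paper), these dimensions reproduce the stated memory footprint:

$$N = 13 \times 11 \times 5 + 13 = 728, \qquad M \times N \times 4\ \text{bytes} = 525 \times 728 \times 4 = 1{,}528{,}800\ \text{bytes} \approx 1493\ \text{KB}$$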



The consequent is the disease variable, the Rate of ILI Reported, set to the incidence rate 5 weeks into the future; it has two categorical values, low and high, determined by a threshold of the disease incidence rate (mean + 1.5 standard deviations) [11]. The association between the antecedents and consequents is determined by the support and confidence measures computed from the data matrix D. Each set of antecedent(s) and a consequent generated from D is called a rule, as described in the previous section. When a rule has high enough support (i.e. greater than a user-defined threshold), it can be used to classify unknown data. From the large number of rules extracted from the data, there are methods to pick certain rules to build the classifier [12]. If a rule cannot be found for a particular data point to be classified, a default rule is employed (e.g. based on the prior probability of the consequents). The proof-of-concept implementation supports rules with up to 4 antecedents, though it is easily expandable to 5 or more. The kernel support4 (see the code in the following section) generates support and confidence measures for 4-antecedent rules.

4. OpenCL based Fuzzy Associative Classification Framework

The Open Computing Language [6] framework is used to generate the fuzzy association rules on a parallel computing platform. OpenCL supports several elements for efficient mathematical operations, such as loops, vectors, synchronization, threads, memory types, and built-in functions [14]. The code is also highly portable, as the same executable can run on different hardware platforms, such as a GPU or a multi-CPU server, without recompilation. One drawback is that the implementation has to be fine-tuned according to hardware-specific capabilities, such as the number of computing elements, local cache size, shared global memory size, etc. The fuzzy associative classification is realized by first generating the fuzzy association rules and then building a classifier. The implementation follows the Apriori breadth-first candidate generation with a set-enumeration tree strategy. At each level (i.e. L, the number of antecedents), rule candidates are generated as follows, given the data matrix D and the support and confidence thresholds (a host-side sketch of the level-2 candidate generation appears after the list):

1. Generate level-1 rules L1 = {Ai}, i.e., all individual columns of D. Compute the support and confidence measures for L1 with the kernel support1, and keep the rules that satisfy the thresholds.
2. Generate level-2 rules L2 using L1: for every Ai in L1, append Aj if j > i and Aj is in L1. Compute the support and confidence measures, then keep the rules in L2 that satisfy the user thresholds.
3. Generate level-3 rules L3 using L2 and L1: for every Aij in L2, append Ak from L1 if k > i, j. Compute the support and confidence measures, then keep the rules in L3 that satisfy the user thresholds.
4. Generate level-4 rules L4 using L3 and L1: for every Aijk in L3, append Al from L1 if l > i, j, k. Compute the support and confidence measures, then keep the rules in L4 that satisfy the user thresholds.
5. Follow the same strategy for each successive level (5, 6, etc.).
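The following is a minimal, hypothetical host-side sketch (not the paper's code) of step 2, generating level-2 candidates from the surviving level-1 columns under the j > i ordering; the names and the assumption that l1 holds the surviving column indices of D in ascending order are illustrative only.

    typedef struct { int a[4]; } Rule;   /* up to 4 antecedent column indices; -1 marks unused slots */

    /* Build all level-2 candidates from the n1 surviving level-1 columns (sorted ascending). */
    static int gen_level2(const int *l1, int n1, Rule *out)
    {
        int count = 0;
        for (int i = 0; i < n1; i++) {
            for (int j = i + 1; j < n1; j++) {   /* append A_j only when its column index exceeds A_i's */
                out[count].a[0] = l1[i];
                out[count].a[1] = l1[j];
                out[count].a[2] = -1;
                out[count].a[3] = -1;
                count++;
            }
        }
        return count;                            /* number of candidate rules to score on the device */
    }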

The OpenCL kernel that computes the level-4 support and confidence measures is given in the following listing.

    __kernel void support4(__global float* antes_g, __constant int* conseq_g,
                           int M, int N, __global int4* rules, __global float4* result)
    {
        int gid, j, n, r0, r1, r2, r3;
        float mf, sp0 = 0, sp1 = 0, spt, km = 1.0f / M, conf0, conf1;

        gid = (int)get_global_id(0);
        /* column indices of the 4 antecedents of the rule assigned to this work item */
        r0 = rules[gid].x; r1 = rules[gid].y; r2 = rules[gid].z; r3 = rules[gid].w;

        /* assumes D (antes_g) is stored row-major: M rows of N membership values */
        for (j = 0; j < M; j++) {
            n = j * N;
            /* minimum t-norm g(.) over the 4 antecedent membership values of row j */
            mf = fmin(fmin(antes_g[n + r0], antes_g[n + r1]),
                      fmin(antes_g[n + r2], antes_g[n + r3]));
            if (conseq_g[j] == 0) sp0 += mf; else sp1 += mf;
        }

        if (sp0 + sp1 > 0) {
            spt = 1.0f / (sp0 + sp1);
            conf0 = sp0 * spt;
            conf1 = sp1 * spt;
        } else {
            conf0 = 0; conf1 = 0;
        }
        sp0 *= km; sp1 *= km;

        result[gid].x = sp0;   result[gid].y = sp1;   /* Supp(rule => C1), Supp(rule => C2) */
        result[gid].z = conf0; result[gid].w = conf1; /* Conf(rule => C1), Conf(rule => C2) */
    }

As can be seen from the kernel, the D matrix and the consequent vector are placed in the GPU global memory and shared by all GPU threads. Each rule list entry for the 4-antecedent kernel is a 4-element integer vector. Each result entry is a 4-element floating point vector storing the two computed support measures and two confidence measures for C1 and C2. As an example, the support4 kernel requires 36 bytes of memory for each rule (4 antecedents, 1 consequent; 4-byte integers each) and result (2 supports, 2 confidences; 4-byte floating point numbers each).
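To make the data flow explicit, the following is a minimal, hypothetical host-side sketch (not part of the paper's framework; the wrapper function and variable names are assumptions). It creates the buffers described above, binds them to support4, launches one work item per candidate rule, and reads back the support/confidence vectors; it assumes an OpenCL context, command queue, and compiled kernel already exist.

    #include <CL/cl.h>

    void run_support4(cl_context ctx, cl_command_queue q, cl_kernel k,
                      const float *D, const int *conseq, int M, int N,
                      const cl_int4 *rules, cl_float4 *results, size_t n_rules)
    {
        cl_int err;
        cl_mem d_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      (size_t)M * N * sizeof(float), (void *)D, &err);
        cl_mem c_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      (size_t)M * sizeof(int), (void *)conseq, &err);
        cl_mem r_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      n_rules * sizeof(cl_int4), (void *)rules, &err);
        cl_mem o_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                      n_rules * sizeof(cl_float4), NULL, &err);

        clSetKernelArg(k, 0, sizeof(cl_mem), &d_buf);   /* __global float* antes_g   */
        clSetKernelArg(k, 1, sizeof(cl_mem), &c_buf);   /* __constant int* conseq_g  */
        clSetKernelArg(k, 2, sizeof(int), &M);
        clSetKernelArg(k, 3, sizeof(int), &N);
        clSetKernelArg(k, 4, sizeof(cl_mem), &r_buf);   /* __global int4* rules      */
        clSetKernelArg(k, 5, sizeof(cl_mem), &o_buf);   /* __global float4* result   */

        /* one work item per candidate rule */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n_rules, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, o_buf, CL_TRUE, 0, n_rules * sizeof(cl_float4),
                            results, 0, NULL, NULL);

        clReleaseMemObject(d_buf); clReleaseMemObject(c_buf);
        clReleaseMemObject(r_buf); clReleaseMemObject(o_buf);
    }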


Table 2. Computing architectures (source: online, manufacturer websites)

Computing Environment | Processor | Parallel Compute Units | Work Item Size | Local memory size (KB) | Global mem. cache size (KB) | Global mem. size (MB) | Processor Clock (GHz) | Memory Bandwidth (GB/s) | Approximate Cost (USD)
GPU-1 | Radeon HD 7950 (single) | 28 | 256 | 32 | 16 | 2048 | 0.925 | 240.0 | 1,500
GPU-2 | GeForce GTX 680 (single) | 8 | 1024 | 48 | 128 | 2048 | 1.0 | 192.0 | 1,500
Server-1 (OpenCL) | Xeon E5-2630 (dual) | 24 | 1024 | 32 | 256 | 131072 | 2.3 | 85.2 | 5,000
Server-2 (OpenCL) | Opteron 6282 SE (quad) | 64 | 1024 | 32 | 2048 | 131072 | 2.6 | 102.4 | 10,000
Server-3 (FARM) | Opteron 6282 SE (quad) | 64 | NA | NA | NA | 524288 | 2.6 | 102.4 | 20,000

The implemented proof-of-concept framework is tested on four distinct platforms: a four-processor, 64-thread Opteron server; a two-processor, 24-thread Xeon server; a GeForce GTX 680 GPU card; and a Radeon HD 7950 GPU card. The hardware architectures of these platforms are described in Table 2. The experiments are conducted on the four OpenCL platforms without recompiling the source code; only the thread allocation is specified, which is determined by hardware characteristics such as the number of computing elements. These results and performances are compared to the in-house developed Fuzzy Association Rule Mining program, FARM, a highly optimized operator running in the open source (Java) RapidMiner data mining environment [13]. The performances of the implementations are compared in Figure 1. As seen from this graph, though the OpenCL implementation is primitive, its rule generation is faster due to the raw processing power of the GPU platforms. The OpenCL implementation uses around 108 GB of host (CPU) and 3 MB of target (GPU) memory for 3 billion rules in one batch, and it is less memory limited than the Java program, which uses around 500 GB for the same number of rules. As can be seen from the graphs, the OpenCL implementations on the CPU platforms (Server-1 and Server-2) remain slower than the FARM program (Server-3).
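This host-memory figure is consistent with the 36-byte per-rule footprint given in the previous section:

$$3 \times 10^{9}\ \text{rules} \times 36\ \text{bytes/rule} = 1.08 \times 10^{11}\ \text{bytes} \approx 108\ \text{GB}$$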

[Figure 1 plots elapsed time (sec) against the number of 4-antecedent rules generated (millions, up to 3000) for GPU-1, GPU-2, Server-1, Server-2, and Server-3 (FARM).]

In Figure 1, rules are generated using a confidence threshold of 0.55 for both the low and high consequents. The support threshold for the high consequent is fixed at 0.001 (since the prior probability of high is 35/525), and the support threshold for the low consequent is varied between 0.25 and 0.001, resulting in 17M to 3085M generated rules. Lower-than-usual support thresholds are required to generate a high number of processed rules so that the scalability of the implementation can be demonstrated; in practice, support thresholds are typically higher than the ones used in this experiment.

Fig. 1. Fuzzy association rule generation

Though the FARM operator is considerably optimized and completes the job quickly, its memory and processor requirements carry a high server cost. Note that the GPU platform, which has a single GPU card, can be expanded with three more GPU cards (500 USD each) for a theoretically 4 times higher performance than the reported one, while still keeping the server cost 8 times lower than the FARM environment. As clearly seen in Figure 1, OpenCL is designed to be scalable, and the elapsed time grows linearly with the number of rules generated.



On the other hand, a clear advantage of the RapidMiner environment is its user-friendly Graphical User Interface (GUI), in which a data engineer can design a bottom-up data mining pipeline. A hybrid solution may be the most beneficial, where a portion of the associative classification is handled by the GPU as a coprocessor.

5. Conclusions and Future Work

In this work, an OpenCL framework for fuzzy associative classification is described. A proof-of-concept OpenCL implementation is realized and applied to the prediction of influenza from related climate data. A comparison to the existing Java based parallel-threaded implementation reveals that, even in its primitive form, the OpenCL approach has several benefits in terms of raw processing power and reduced memory dependence. Though both the FARM and OpenCL implementations are highly scalable, the latter has higher limits for determining rules with a greater number of antecedents (i.e. 5 or more), as memory requirements, Java garbage collection, etc. place constraints on the Java program. There are some disadvantages to the OpenCL framework. First, the development language is C, a low-level language that is hard to maintain. Second, OpenCL platforms are not yet as mature as the Java development environments; OpenCL is perhaps still more of a research environment than an industry-grade computing environment. In addition, the computation kernels (e.g. support4) have to be optimized according to the target device memory and computing architecture. As an example, if the matrix D can be fit into the GPU L1 cache memory, the performance will be significantly higher, probably by an order of magnitude or more. Future work will include coalesced memory reads and writes, partitioning the dataset D so that it can be placed in the local L1 cache, and integrating the implementation as a coprocessor into the existing Java platform. In addition, the framework can be extended to utilize multiple CPU cores on the server concurrently while the GPU(s) work on the rule generation, using the synchronization and host-target communication features of the OpenCL framework.

References

1. R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, ACM SIGMOD Record, vol. 22, no. 2, 207-216, 1993.
2. K. Kianmehr, M. Kaya, A. M. ElSheikh, J. Jida, and R. Alhajj, Fuzzy association rule mining framework and its application to effective fuzzy associative classification, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 6, 477-495, 2011.
3. R. Agrawal and R. Srikant, Fast algorithms for mining association rules, Proc. 20th Int. Conf. on Very Large Data Bases, 487-499, 1994.
4. R. J. Bayardo, Efficiently mining long patterns from databases, ACM SIGMOD Record, vol. 27, no. 2, 85-93, 1998.
5. J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation, ACM SIGMOD Record, vol. 29, no. 2, 1-12, 2000.
6. A. Munshi, The OpenCL specification, Khronos OpenCL Working Group, 2009.
7. J. E. Stone, D. Gohara, and G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering, vol. 12, no. 3, 66, 2010.
8. W. Fang, M. Lu, X. Xiao, B. He, and Q. Luo, Frequent itemset mining on graphics processors, Proc. of the Fifth International Workshop on Data Management on New Hardware, 34-42, 2009.
9. H. Li and N. Zhang, Mining maximal frequent itemsets on graphics processors, Proc. of the Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 3, 1461-1464, 2010.
10. M. Delgado, N. Marín, D. Sánchez, and M.-A. Vila, Fuzzy association rules: general model and applications, IEEE Transactions on Fuzzy Systems, vol. 11, no. 2, 214-225, 2003.
11. A. L. Buczak, P. T. Koshute, S. M. Babin, B. H. Feighner, and S. H. Lewis, A data-driven epidemiological prediction method for dengue outbreaks using local and remote sensing data, BMC Medical Informatics and Decision Making, vol. 12, no. 1, 124, 2012.
12. B. Liu, W. Hsu, and Y. Ma, Integrating classification and association rule mining, Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, 80-86, 1998.
13. I. Mierswa, et al., YALE: Rapid prototyping for complex data mining tasks, Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2006.
14. D. Kirk, NVIDIA CUDA software and GPU parallel computing architecture, Proc. of the 6th International Symposium on Memory Management, 103-104, 2007.