J. Parallel Distrib. Comput. 73 (2013) 664–676
Fractal self-similarity measurements based clustering technique for SOAP Web messages

Dhiah Al-Shammary a,*, Ibrahim Khalil a, Zahir Tari a, Albert Y. Zomaya b

a School of Computer Science & IT, RMIT University, Melbourne, Australia
b School of Information Technologies, University of Sydney, Sydney, Australia
Article history: Received 13 June 2012; Received in revised form 4 January 2013; Accepted 15 January 2013; Available online 8 February 2013.

Keywords: SOAP; Fractal; Clustering; Web services
Abstract

The significant increase in the usage of Web services has resulted in bottlenecks and congestion on bandwidth-constrained network links. Aggregating SOAP messages can be an effective solution that could potentially reduce the large amount of generated traffic. Although pairwise SOAP aggregation, that is, grouping only two similar messages, has demonstrated significant performance improvement, additional improvements can be achieved by including similarity mechanisms that cluster several SOAP messages with a high degree of similarity. This paper proposes a fractal self-similarity model that provides a novel way of computing the similarity of SOAP messages. Fractal is proposed as an unsupervised clustering technique that dynamically groups SOAP messages. Various experiments have shown good performance results for the proposed fractal self-similarity model in comparison with well-known clustering models, consuming only 31% of the clustering time required by K-Means and 23% of that required by principal component analysis (PCA) combined with K-Means. Furthermore, the proposed technique has shown better quality clustering, as the aggregated SOAP messages have a much smaller size than their counterparts.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction

SOAP is a common XML-based communication protocol that provides interoperability between network nodes. This protocol has dramatically accelerated its dominance over the Internet [1,4]. It is generally considered to be the preferred Web development protocol for the Internet, and its adoption increased by 300% during the year 2010 [18]. Although Web services provide significant benefits to the network, there are some serious performance limitations in comparison with other communication technologies such as CORBA and Java-RMI [9,15,19]. SOAP Web services have inherited the disadvantage of XML of consuming considerably large amounts of network resources due to high network traffic [21,25]. Moreover, the increasing demand for Web services as a means of sharing information around the world has resulted in performance bottlenecks and congestion, slowing down message transport [9].

1.1. Motivation

Compression [20], caching [5] and textual aggregation [2] models have been proposed to significantly minimize network traffic
and thus improve the performance of Web services. Aggregation is a modern and effective model to reduce network traffic by aggregating SOAP messages, multicasting them to Web clients, and later splitting them at the closest routers [3]. Fig. 1 illustrates how several SOAP responses can be combined into one compact packet by aggregation. Web services usually suffer from bottlenecks and congestion as a result of high network traffic caused by Web applications like stock quote services [14,3]. In our previous work [2,4], we introduced a new SOAP message aggregation strategy based on compression concepts. This aggregation model is strengthened by the redundancy-awareness feature of compression, used as an alternative similarity measurement, to aggregate messages into one compact structure. The developed aggregation model consists of two main activities: transforming the XML trees of the SOAP messages into a minimized SOAP textual expression and then encoding it with either fixed-length or Huffman encoding techniques. Although this aggregation technique can aggregate as many messages as requested by the Web server, advanced cluster-based similarity measurements are still required in order to find out which sets of SOAP messages are the best candidates to be aggregated, as an alternative to the traditional pair-based SOAP similarity measurement. Generally, standard clustering techniques such as K-Means and the vector space model [17,26] could serve as alternatives to the SOAP similarity measurements. However, they do not represent an efficient similarity measurement because of the following drawbacks.
Fig. 1. Clustering and aggregation support for stock market quote cloud Web services over the Internet.
• High complexity: They have high complexity, which can result in long clustering times. This is caused by the heavy computations of the iterative methods used for cluster finalization [13].
• Inefficient prediction of clusters: A fixed number of clusters usually results in inefficient prediction of clusters. Messages are likely to be clustered without having a high degree of similarity with the other messages in the same cluster. Moreover, fixed-cluster-number models do not work efficiently with non-globular clusters such as SOAP clusters with high redundancy [13,22].
• Inaccurate and inefficient centroid selection: Most clustering models start with initial partitions that might be selected randomly, usually resulting in inaccurate and inefficient clustering of messages.

Unsupervised and dynamic clustering models are effective techniques to solve the problem of SOAP clusters with high common redundancy.

1.2. Contribution

In this paper, a novel clustering model based on the fractal self-similarity principle is proposed for SOAP traffic. Fractal, as a mathematical model, provides powerful self-similarity measurements for the fragments of regular and irregular geometric objects in their numeric representations [16,23]. The partitioned iterated function system (PIFS) represents the power of fractals in depicting the similarity of smaller parts within the same numeric object [6]. PIFS explains the dynamics of creating fractals by uniting several copies of the same object at different scales, where every copy is made up of smaller scaled copies of itself. In comparison with traditional XML similarity measurements, fractal can measure the similarity of complete parts (sets of features) of objects at once rather than investigating XML features separately. With the aim of providing efficient clustering predictions, this paper investigates fractal fragments in SOAP messages, as their XML tree can be segmented into several fractal objects. Fig. 2 shows SOAP fractal segments in comparison with the Mandelbrot fractal set. Thus, this paper presents an extended version of our previous work in [3] by providing an advanced and detailed analysis of
fractal similarity measurements for SOAP messages. Fig. 3 shows the main components of the proposed clustering technique. The main contributions made in this paper are as follows:
• Efficient prediction: Fractal mathematical parameters are introduced to compute SOAP message similarities based on the numeric representation of the SOAP messages. The proposed technique aims to create clusters with a very high degree of similarity by dynamically grouping messages together, and it does not require a predefined number of clusters. Experimental results show significantly better predictions of similar SOAP messages by the proposed technique in comparison with K-Means and PCA combined with K-Means. The accurate predictions of the proposed technique enable better compression ratios than those of the other clustering models [17,12].
• Low complexity clustering: SOAP fractal similarities are used to devise a new unsupervised auto-clustering technique. These similarity measurements are based on computing fractal coefficients of the numeric fragments that construct a single numeric object [6]. The proposed technique provides low-complexity clustering and requires substantially less clustering time than iterative clustering models [17,12]: fractal clustering requires only 30% and 19% of the time required by K-Means and by PCA combined with K-Means respectively.
• Efficient dataset for accurate clustering: The proposed dataset of SOAP messages is a set of numeric vectors showing the local and global loads of XML items. These vectors are broken up into equally sized blocks. The fractal coefficients of the vector blocks represent the similarity parameters that are compared with the blocks of other vectors and are the key metric for clustering SOAP messages. The proposed dataset structure accurately reflects the features of the SOAP messages and enables the clustering technique to efficiently measure their similarities.
1.3. Evaluation strategy In order to evaluate the performance of the proposed fractal clustering technique, the compression-based aggregation model
Fig. 2. Fractal similarity of smaller parts of graphical objects like Mandelbrot fractal and fractal similarity of smaller parts of SOAP message tree branches.
developed in our previous work [2] is used to compute the achievable SOAP message size reduction after aggregating the messages clustered by the proposed fractal technique, and by both K-Means and PCA combined with K-Means, for relative performance comparison. The evaluation showed that the fractal clustering technique enables the compression-based aggregation model [2] to achieve higher message size reduction than the other techniques. Furthermore, the local error (i.e. error rates between every single message and the center message of the same cluster) and the global error (i.e. error rates between every center message and the centers of the other clusters) have been evaluated. Moreover, the processing time of the proposed model is investigated and compared with the processing time of the other techniques, where it is found to be significantly lower.

1.4. Organization of the paper

The rest of this paper is structured as follows. First, the related work is described in Section 2. Section 3 describes the proposed technique: (1) it explains the process of computing the XML document dataset and how to represent XML messages as numeric vectors in the generated dataset, (2) it discusses the fractal mathematical models and how they can be utilized in clustering SOAP messages, and (3) it explains the computation of the fractal coefficients of SOAP messages in a separate algorithm and the fractal root mean square error criterion. Section 4 describes the experimental evaluation of the proposed clustering technique. Finally, the conclusions and future work are presented in Section 5.

Fig. 3. Main model components.

2. Related work
With the aim to develop efficient clustering techniques, a number of studies have proposed several clustering models for text and XML documents [17,26,12,8,10]. Most of these clustering models have been based on exploiting the structural similarities of XML documents that mainly concentrate on measuring XML trees edit distance. On the other hand, other models consider the content of XML and text documents. Liu et al. 2004 [17] developed a new XML clustering approach using principal component analysis (PCA) technique. The proposed technique first extracts features from XML documents by constructing ordered and labeled XML trees and then transforming them into vectors. The generated vectors contain the occurrences of the considered features in the XML documents. PCA was applied to minimize the dimensions of the dataset vectors by summarizing all the considered features and generating new reduced dimension vectors. Then, the K-Means clustering technique was used to cluster the XML documents based on the minimized features. In order to evaluate the performance of the proposed approach, two sets of XML documents are considered as input datasets. The performance of the developed PCA technique is compared with the performance of the K-Means technique without reducing the dimensionality of the dataset vectors. The outcome of the experiments showed that PCA has significantly improved the accuracy of K-Means clustering. Flesca et al. (2005) [10] introduced a new clustering approach for XML messages that mainly investigates the structural similarities of XML documents in a generated time series. The basic strategy of the proposed technique consists of linearizing the structure of XML documents by encoding the XML tags into signal pulses in
Fig. 4. SOAP message responses to the requests getStockQuote(X) and getQuoteAndStatistic(Y).
order to transform the XML document into a numeric form. Then, the XML documents are distributed into clusters according to the analysis of their numerical sequences. The Discrete Fourier Transform (DFT) is proposed to effectively compare the encoded XML documents (in the frequency domain). The experimental results showed the effectiveness of the proposed technique in comparison with tree-edit clustering techniques. Cui et al. (2005) [8] proposed the Particle Swarm Optimization (PSO) technique as a fast and high quality clustering algorithm for text documents. In this research, PSO is implemented in addition to the K-Means clustering technique in order to give an accurate comparison of the performance of the proposed technique with a well-known clustering method. The dataset was generated using the vector space model (VSM). Although the experiments show that PSO is significantly better than K-Means in terms of execution time, K-Means is still more efficient for large documents than PSO. Based on these results, a hybrid PSO is presented that takes advantage of K-Means to replace the refinement stage of the proposed PSO technique. The K-Means, PSO, and hybrid PSO techniques were applied on four generated datasets, each with a different number of documents. The hybrid PSO showed the best results in comparison with the other clustering techniques. In 2007, Hwang and Gu [12] implemented a clustering approach for XML documents based on the weight of the frequent structures in the considered XML trees of the XML documents. The proposed approach is mainly based on the concept of recognizing the large items in each XML document and clustering the documents according to similar large items. The clustering metrics include the path information of the XML tags and the data items in the XML tree. The basic strategy of the proposed technique is to compute the average of the accumulated frequency of structures in the XML tree and then distribute the new XML documents based on that average. To evaluate the performance of the proposed technique, it is compared with the Hierarchical Agglomerative Clustering (HAC) and K-Means techniques. The experiments show that the proposed technique is more efficient than both the HAC and K-Means techniques. Yongming et al. (2008) [26] developed a novel method for computing the similarities of XML documents and clustering them based on both structure and content. The proposed approach is an extension of the traditional vector space model (VSM), as it includes the structural similarities as a part of the clustering process. The dataset of the considered XML documents is generated by extracting features of the XML trees, such as recording the paths of leaf nodes and nested elements. Furthermore, VSM is used to extract another type of features by computing the weight
of the XML tags and data elements. The proposed technique has been evaluated using two different datasets, and both entropy and purity have been investigated. The experiments showed that the performance of the extended vector space model is higher than that of the basic VSM, because it has higher purity and lower entropy.

3. Proposed technique

Fig. 3 shows the main components of the proposed technique. Preparing the XML dataset is the main target of the first three components in the proposed model, resulting in a numeric dataset. The final three components are completely related to the fractal functions, starting with computing the fractal coefficients and finishing with dynamically clustering the dataset numeric objects based on the histograms of the fractal root mean square error metric.

3.1. Building XML dataset

Clustering techniques are usually developed to work on a specific dataset format that represents the considered XML documents. The XML tree is the main structure of XML messages. Therefore, the first step of the proposed XML document preparation is to build the XML tree for all the XML messages. The generated XML trees for the SOAP messages of Fig. 4 are shown in Fig. 5. Level-order traversal is used to traverse all the generated XML trees and build the matrix form of the XML messages (see Fig. 6). The matrix form is the basic format of the transformed XML messages required to convert them into time series representations (see Eq. (5)). With the aim of transforming the XML document from the textual domain to the frequency domain (time series), preparation of the XML dataset starts with computing the vector template, which holds a unique copy of every single XML item in the XML documents. The frequencies of the XML items represent the time series attributes of the XML document. The dataset vectors contain the frequencies of the XML items, ordered in the same way as their distinctive textual contents in the composed vector template. Eq. (1) shows the general format of the vector template:

V_template = [Nd_1, Nd_2, Nd_3, ..., Nd_n]    (1)
where Nd_i is the node content of the ith XML item of the generated XML tree. Eq. (2) represents the dataset system that consists of the generated time series items of the XML documents:

V_1 = [X_1, X_2, X_3, ..., X_n]
V_2 = [X_1, X_2, X_3, ..., X_n]
...
V_m = [X_1, X_2, X_3, ..., X_n]    (2)
Fig. 5. XML messages trees of SOAP messages (S1, S2, S3, and S4).
Fig. 6. Generated matrix form of SOAP messages (S1, S2, S3, and S4).
where
• V_i is the frequency vector of the ith XML message in the dataset.
• X_i is the frequency of the ith XML item of the XML message.
• n is the total number of distinct XML items in the generated XML matrix forms.
• m is the total number of XML messages in the dataset.

In most clustering approaches, the dataset is generated as a set of vectors V = {x_1, x_2, ..., x_n}, where every element x_i refers to a single item representing a single feature of the document. In this research, Term Frequency with Inverse Document Frequency (TF–IDF) weighting [12] is used to generate the dataset of the XML documents in order to prepare them for the fractal clustering technique. The XML tag content is formalized as a frequency that shows the weight of the corresponding XML item in a two-dimensional space consisting of a number of frequency vectors. In other words, in every single vector V = {x_1, x_2, ..., x_n} of the dataset, x_i = w_i, where w_i (i = 1, 2, ..., n) represents the weight of the XML item for the term t_i in the XML document. This set of weights shows the significance of these terms in their XML documents. The TF–IDF scheme reflects the weight of XML items within the XML document as well as across the entire set of vectors (i.e. the transformed documents). In other words, the significance of an XML item is determined by both local (within the XML document) and global (entire set of vectors) factors. XML documents have great similarity with other documents in the dataset if they share similar frequencies of their XML items. The weight of the XML item w_i in the XML document d is computed as follows:

w_i(d) = tf_i × log(D / df_i)    (3)

where
• tf_i is the XML item frequency in the document d (local information).
• df_i is the number of documents containing the ith XML item.
• D is the total number of XML documents in the dataset.

Eq. (4) represents the generated vector template of the SOAP messages (S1, S2, S3, and S4), and Eq. (5) represents the generated dataset of the same SOAP messages (assuming a base-10 logarithm, an item occurring once in a single document weighs log(4/1) ≈ 0.6, and an item occurring once in two of the four documents weighs log(4/2) ≈ 0.3). The process of generating the dataset vectors is summarized in Algorithm 1: occurrences of every single feature are counted locally in the message and globally across the other messages, and Eq. (3) is then applied to compute the final weight of each feature.

V_template = [StockQuoteResponse, StockQuote, Company, QuoteInfo, AFI, Price, LastUpdated, 20.06, 01/09/2010, AMI, 31.52, QuoteAndStatisticResponse, QuoteAndStatistic, Statistic, Symbol, Change, OpenPrice, Holden, 24.54, +0.50, 24.02, Ford, 28.56, −0.10, 28.66]    (4)

V_1 = [.3, .3, .3, 0, .6, 0, .3, .6, .6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
V_2 = [.3, .3, .3, 0, 0, 0, .3, 0, 0, .6, .6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
V_3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, .3, .3, .3, .3, .3, .3, .6, .6, .6, .6, 0, 0, 0, 0]
V_4 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, .3, .3, .3, .3, .3, .3, 0, 0, 0, 0, .6, .6, .6, .6]    (5)
Algorithm 1 Build the XML Dataset
// Notation:
// D: number of XML documents (vectors) in the dataset
// V[D][n]: weights (frequencies) of XML items
// Vt[n]: node contents of the XML items (vector template)
// n: total number of distinct XML items in the vector template
// Mx: matrix form of the current XML document
i ← 0                                      // counter initialization
repeat
    Mx ← load the matrix form of the ith XML document
    for j = 0 to n − 1 do                  // all node contents in Vt
        Nd ← Vt[j]                         // current node in the vector template
        F ← Count(Nd, Mx)                  // occurrences of Nd in Mx (local frequency)
        G ← Count(Nd, D)                   // number of documents containing Nd
        V[i][j] ← F × log(D / G)           // TF–IDF weight, Eq. (3)
    end for
    i ← i + 1
until i = D
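As a concrete illustration of this dataset-construction step (Eqs. (1)–(3) and Algorithm 1), the following minimal Python sketch builds TF–IDF vectors from level-order item lists. It is not part of the original system: the function name build_dataset, the first-seen ordering of the vector template, and the base-10 logarithm (chosen because it matches the sample weights in Eq. (5)) are our own assumptions.

import math

def build_dataset(messages):
    """Build TF-IDF vectors (Eq. 3) from level-order item lists.

    messages: list of lists; each inner list holds the node contents
              (tags and values) of one SOAP/XML message in level order.
    Returns the vector template and the list of frequency vectors.
    """
    # Vector template: one entry per distinct XML item, in first-seen order (Eq. 1).
    template = []
    for items in messages:
        for item in items:
            if item not in template:
                template.append(item)

    D = len(messages)                      # total number of documents
    # Document frequency: number of messages containing each template item.
    df = [sum(1 for items in messages if nd in items) for nd in template]

    dataset = []
    for items in messages:
        vector = []
        for nd, dfi in zip(template, df):
            tf = items.count(nd)           # local frequency of the item
            # Eq. (3): w_i(d) = tf_i * log(D / df_i); zero when the item is absent.
            vector.append(tf * math.log10(D / dfi) if tf else 0.0)
        dataset.append(vector)
    return template, dataset

# Toy usage with hypothetical level-order item lists:
template, dataset = build_dataset(
    [["StockQuote", "Company", "AFI", "Price", "20.06"],
     ["StockQuote", "Company", "AMI", "Price", "31.52"]])

Applied to the four sample messages of Fig. 6, this procedure would reproduce the template of Eq. (4) and vectors of the form shown in Eq. (5).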
3.2. Fractal for Web services

A fractal is defined as a fragmented geometric shape that can be divided into several parts, each of which is approximately a smaller copy of the whole shape [23]. Fractals have been found to form a mathematical description for enormous and irregular object shapes [24]. The term ''fractal'' was coined by Mandelbrot, who derived it from the Latin fractus, an adjective for irregular and fragmented objects [7]. The standard Mandelbrot set shows that the fractal pattern of the whole object is the same fractal pattern found in many particular regions of the object [7]. Fig. 2 shows the Mandelbrot set and the fractal similarities within its smaller objects. These particular regions are simply smaller copies, and the pattern repeats in the same way from the largest scales down to the smallest. In other words, fractals are the repetition of the same structural form. Geometric shapes are represented by fractals in a numeric form or by geometric mathematical models, and fractal models are applied to geometric shapes in their numeric forms. In other words, a fractal can be defined as the repetition of the same, or approximately the same, structural form within a numeric object. Fractals can be applied in Web service applications using the fractal self-similarity principle and other fractal characteristics that could be used in a variety of applications. In this research, the proposed fractal model utilizes fractal characteristics in Web services after creating their time series representation. Fractal is proposed to compute SOAP message similarities in order to cluster the messages in their frequency domain. This is suitable when a large number of messages must be clustered quickly and accurately.

3.2.1. XML fractal self-similarity

Self-similarity is the basic principle of fractals and it is the key to most fractal applications [11]. Fractals can be classified according to the type of self-similarity. There are three types of self-similarity found in fractals:
• Exact self-similarity: This is the strongest type of self-similarity where fractal appears identical at different scales. Fractals defined by iterated function systems often display exact selfsimilarity. • Quasi-self-similarity: This is a loose form of self-similarity where fractal appears approximately (but not exactly) identical at different scales. Quasi-self-similar fractals contain small copies of the entire fractal in distorted and degenerated forms. Fractals
defined by recurrence relations are usually quasi-self-similar but not exactly self-similar.
• Statistical self-similarity: This is the weakest type of self-similarity, where the fractal has numerical or statistical measures that are preserved across scales. Most reasonable definitions of ''fractal'' trivially imply some form of statistical self-similarity.

In the proposed technique, fractal self-similarity is applied on the numeric form of SOAP Web service messages, manipulating every single message as a numeric segment. A numeric object is constructed from all the considered segments, and fractal self-similarity is investigated across all the numeric segments in order to cluster them according to their similarity values.

3.2.2. Fractal iterated function system

A fractal is made up of the union of several copies of itself, each copy being transformed by a function. An iterated function system (IFS) fractal is made up of several possibly overlapping smaller copies of the same object, each of which is made up of copies of itself [6]. Traditionally, IFS fractals are computed in 2D, but they can have any number of dimensions; the 3D Sierpinski triangle is a well-known example showing the self-similarity of objects in three dimensions. Fractal coding takes advantage of the fact that real-life objects are to a great extent self-similar [24]. In other words, many parts of the object can be approximated by transforming another part of the same object through some affine transformation, which is usually linear. Based on fractal theory, for a given object P, the fractal process tries to find a partitioned iterated function system (PIFS), F = {f_i : i = 1, ..., k}, defined over non-overlapping tiles usually called range blocks of the object, where each ''tile'' is formed by applying an affine transformation f_i on a section of P:

F(P) = ⋃_{i=1}^{k} f_i(d_i)    (6)

where k is the number of range blocks and d_i is an arbitrary section of the numeric object, called the domain. The ''tile'' approximated by f_i(d_i) is referred to as the range, or r_i. Each transformation f_i(d_i) gives the best possible approximation of r_i.

3.2.3. Fractal mathematical form

The general form of the fractal transformation is

R' = S × D + O    (7)

where R' is the approximated range value, D is a part of the same object (usually called the domain), and S and O are the scaling and shifting (offset) factors. The PIFS formula is applied to compute the fractals of all parts (p_i) of the object:

d' = S × d(p_i) + O    (8)

where d' is equivalent to the approximated range block and d(p_i) is a part of the domain section. The optimal values of the coefficients can be obtained by calculating

S = [ n Σ_{i=1}^{n} d(p_i) r(p_i) − Σ_{i=1}^{n} d(p_i) Σ_{i=1}^{n} r(p_i) ] / [ n Σ_{i=1}^{n} d(p_i)² − ( Σ_{i=1}^{n} d(p_i) )² ]    (9)

and

O = (1/n) [ Σ_{i=1}^{n} r(p_i) − S Σ_{i=1}^{n} d(p_i) ]    (10)
RMS = sqrt( (1/n) [ Σ_{i=1}^{n} r(p_i)² + S ( S Σ_{i=1}^{n} d(p_i)² − 2 Σ_{i=1}^{n} d(p_i) r(p_i) + 2 O Σ_{i=1}^{n} d(p_i) ) + O ( n O − 2 Σ_{i=1}^{n} r(p_i) ) ] )    (11)

Box I.
where n is the number of values in the object fragment, d(p_i) is the value of the ith item in numeric object d, and r(p_i) is the value of the ith item in numeric object r (see Eq. (11) given in Box I). A given object is typically partitioned into k vectors. Eq. (11) is the criterion used to identify similar vectors in order to assign them to clusters. Similar messages are found by computing the scale and offset fractal factors for all the considered vectors derived from the XML messages; the vectors are then clustered based on their fractal similarity, reflected by a small root mean square error (RMS) inside the same group of XML messages.

3.3. Fractal coefficients and RMS

Fractal mathematical models are well known to be time-consuming techniques [23]. With the aim of reducing the required computations of the fractal technique, the recurring fractal coefficients are calculated in advance. As a result, the required processing time for the fractal clustering algorithm is reduced significantly, since most of the major coefficients have already been computed and buffered. This process is summarized in Algorithm 2.

Algorithm 2 Fractal Coefficients Computation
// Notation:
// FBS: fractal block size
// FBn: number of blocks per XML vector
// Sn: number of XML vectors in the dataset
// Vs: number of frequencies per vector
// V[Sn][Vs]: vectors of the generated dataset
// Flg[Sn][FBn]: flags marking ignored (all-zero) blocks
for i = 0 to Sn − 1 do                          // all vectors
    for j = 0 to FBn − 1 do                     // all blocks in the vector
        Flg[i][j] ← False                       // flag initialization
        FSL ← j × FBS                           // start location of the block
        for co = 0 to FBS − 1 do                // all frequencies in the block
            if V[i][FSL + co] ≠ 0 then
                Flg[i][j] ← True                // block is not ignored
                break
            end if
        end for
        if Flg[i][j] = True then                // block is not ignored
            R[i][j] ← 0, Rs[i][j] ← 0           // initialization
            for co = 0 to FBS − 1 do
                R[i][j] ← R[i][j] + V[i][FSL + co]           // Σ r_i coefficient
                Rs[i][j] ← Rs[i][j] + Sqr(V[i][FSL + co])    // Σ r_i² coefficient
            end for
            D[i][j] ← R[i][j]                   // Σ d_i coefficient
            Ds[i][j] ← Rs[i][j]                 // Σ d_i² coefficient
            Das[i][j] ← Sqr(R[i][j])            // (Σ d_i)² coefficient
        end if
    end for
end for
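To make the fitting step concrete, the following short Python sketch computes the scale, offset, and RMS of Eqs. (9)–(11) for one range/domain block pair from the per-block sums that Algorithm 2 precomputes. The helper name fractal_match is ours, not the paper's, and the guard against a zero denominator is likewise our own assumption.

import math

def fractal_match(r, d):
    """Least-squares fractal fit of domain block d onto range block r.

    Returns (S, O, RMS) following Eqs. (9), (10) and (11).
    r and d must have the same length n (the fractal block size).
    """
    n = len(r)
    sum_r  = sum(r)                              # Σ r(p_i)
    sum_d  = sum(d)                              # Σ d(p_i)
    sum_r2 = sum(x * x for x in r)               # Σ r(p_i)²
    sum_d2 = sum(x * x for x in d)               # Σ d(p_i)²
    sum_rd = sum(x * y for x, y in zip(d, r))    # Σ d(p_i) r(p_i)

    denom = n * sum_d2 - sum_d ** 2
    S = (n * sum_rd - sum_d * sum_r) / denom if denom else 0.0   # Eq. (9)
    O = (sum_r - S * sum_d) / n                                   # Eq. (10)

    # Eq. (11): RMS error of the approximation S*d + O against r
    # (max() only guards against tiny negative values from rounding).
    rms = math.sqrt(max(0.0, (sum_r2
                              + S * (S * sum_d2 - 2 * sum_rd + 2 * O * sum_d)
                              + O * (n * O - 2 * sum_r)) / n))
    return S, O, rms

# The two best-matched blocks of the first message in Fig. 7:
print(fractal_match([0.3, 0.3, 0.3, 0, 0.6], [0.3, 0.3, 0.3, 0, 0]))

Applied to this block pair, the sketch yields an RMS of about 0.19, consistent with the 0.189 reported in Fig. 7.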
After investigating the main fractal Eqs. (9)–(11), five major fractal coefficients are selected to be computed in advance, as listed below:

• Σ r_i : the sum of the ith range block of the considered vector in the generated dataset.
• Σ r_i² : the sum of the squared values of the ith range block of the XML vector.
• Σ d_i : the sum of the ith domain block of the considered vector in the generated dataset.
• Σ d_i² : the sum of the squared values of the ith domain block of the XML vectors.
• (Σ d_i)² : the square of the sum of the ith domain block.

Another strategic step of the proposed fractal clustering model is to use the same range blocks of the generated XML vectors as domain blocks, excluding the currently selected range block. Therefore, every single block in the generated dataset vectors has the same fractal coefficients as range and as domain block. Technically, the coefficients Σ r_i and Σ r_i² are equal to Σ d_i and Σ d_i² respectively, so only one set needs to be computed, as the domain coefficients are duplicates of the range block coefficients.

As previously stated, the resultant frequencies in the generated dataset vectors represent the actual properties of the XML messages. According to the proposed fractal strategy, which breaks these vectors up into equally sized blocks, some blocks contain only zeros, since some XML items do not exist in their XML messages. In Algorithm 2, these zero-frequency blocks are identified and flagged as ignored blocks, as they have no impact on the final clustering distribution of the XML messages. Flagging these blocks and removing them from the computation of the fractal coefficients can potentially minimize the processing time (see Fig. 7).

Fig. 7 shows the fractal similarities inside the SOAP message numeric particles and explains the fractal-similarity-based clustering process. The fractal factors (scale, offset, and RMS) are computed between the currently selected feature block and all other feature blocks located in the same column, in order to find the closest matching block, i.e. the one yielding the smallest RMS. Each block is then assigned the message index of the most similar block in the other messages.

The fractal root mean square error (RMS) is the basic metric of the proposed clustering technique that determines block similarities, as all the generated vectors in the dataset are broken up into equally sized blocks. The computation of the RMS metric is based on comparing the resultant RMS values of blocks located in the same column of the dataset but in different vectors (blocks in the same column reflect the same features). The smallest RMS value with respect to the considered block indicates the highest similarity of their template features (XML items). Algorithm 3 computes the RMS metric and the fractal similarity. It creates a decision matrix that refers to the closest XML sample for every single block. The generated decision matrix is the key to finalizing the allocation of messages, based on the maximum histogram of sample indexes in every single vector. Algorithm 4 computes the histogram of all existing sample indexes for every single vector in the generated dataset and distributes the vectors according to the maximum histogram of these sample indexes in the decision matrix. For example, in Fig. 7, the vector blocks of the first indexed message ([0.3, 0.3, 0.3, 0, 0.6] and [0, 0.3, 0.6, 0.6, 0]) are checked against the other vector blocks located in the same columns, and the best matched blocks are [0.3, 0.3, 0.3, 0, 0] and [0, 0.3, 0, 0, 0.6] respectively, from the second indexed message. The smallest RMS values for the best matches are 0.189 and 0.222 respectively, and therefore both messages are clustered together based on their resultant histograms of similar indexes.
Fig. 7. Fractal similarity of SOAP messages inside the numeric dataset and the final clusters. The fractalization process starts after building the numeric form of the SOAP messages, with a one-dimensional numeric vector for each SOAP message allocated as one row of the final two-dimensional matrix (dataset) that represents all messages. The major steps are: 1. break the vectors into smaller blocks; 2. flag all-zero blocks as ignored, excluded from the fractal computations because their features do not exist in the message (non-feature blocks); 3. compare every block with all other feature blocks located in the same column using the fractal factors (scale, offset, and RMS) to find the most similar one, i.e. the one yielding the smallest RMS; 4. assign to the block the message index of that most similar block; 5. histogram the assigned indexes (indexes represent message references) for every single vector and cluster each message with the message whose index appears most often in the histogram.
Algorithm 3 Fractal RMS metric
// Notation:
// FBS: fractal block size
// FBn: number of blocks per XML vector
// Sn: number of XML vectors in the dataset
// Vs: number of frequencies per vector
// V[Sn][Vs]: vectors of the generated dataset
// Flg[Sn][FBn]: flags marking ignored blocks
// S[Sn][FBn]: decision matrix of closest message indexes
for i = 0 to Sn − 1 do                          // all vectors
    for j = 0 to FBn − 1 do                     // all blocks in the vector
        if Flg[i][j] = True then                // block is not ignored
            RMSo ← 500000                       // initialize the RMS error with a high value
            FSL ← j × FBS
            for k = 0 to Sn − 1 do
                if k ≠ i then
                    RD ← 0                      // initialization
                    for co = 0 to FBS − 1 do
                        RD ← RD + V[i][FSL + co] × V[k][FSL + co]
                    end for
                    Scale ← (FBS × RD − R[i][j] × D[k][j]) / (FBS × Ds[k][j] − Das[k][j])   // Eq. (9)
                    Offset ← (R[i][j] − Scale × D[k][j]) / FBS                              // Eq. (10)
                    RMSn ← Sqrt((Rs[i][j] + Scale × (Scale × Ds[k][j] − 2 × RD + 2 × Offset × D[k][j]) + Offset × (FBS × Offset − 2 × R[i][j])) / FBS)   // Eq. (11)
                    if RMSn < RMSo then
                        S[i][j] ← k
                        RMSo ← RMSn
                    end if
                end if
            end for
        end if
    end for
end for

Algorithm 4 Histogram Vectors Distribution
// Notation:
// FBn: number of blocks per XML vector
// Sn: number of XML vectors in the dataset
// S[Sn][FBn]: decision matrix produced by Algorithm 3
// Hist[Sn]: histogram of similar sample indexes
// Flg[Sn][FBn]: flags marking ignored blocks
// FClust[Sn]: final sample (cluster) distribution
for i = 0 to Sn − 1 do                          // all vectors
    Hist ← [0, ..., 0]                          // reset the histogram for every vector
    for j = 0 to FBn − 1 do                     // all blocks in the vector
        if Flg[i][j] = True then                // count only non-ignored blocks
            SIndex ← S[i][j]
            Hist[SIndex] ← Hist[SIndex] + 1
        end if
    end for
    MaxIndex ← 0
    for j = 0 to Sn − 1 do
        if Hist[j] > MaxIndex then
            MaxIndex ← Hist[j]
            SIndex ← j
        end if
    end for
    FClust[i] ← SIndex
end for
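The overall matching and clustering pass (Algorithms 3 and 4) can then be sketched in Python as below. This is a simplified rendering under our own naming: it reuses the hypothetical fractal_match helper from the previous sketch and omits the coefficient buffering of Algorithm 2, trading speed for brevity, so it should be read as an illustration rather than the paper's implementation.

def fractal_cluster(dataset, block_size):
    """Cluster dataset vectors by fractal block similarity (Algorithms 3 and 4).

    dataset: list of equal-length numeric vectors (the TF-IDF dataset);
             the vector length is assumed to be a multiple of block_size.
    block_size: fractal block size FBS (number of frequencies per block).
    Returns, for each vector, the index of its cluster's reference message.
    """
    # fractal_match(r, d) -> (S, O, RMS) is the least-squares helper sketched above.
    n_vec = len(dataset)
    n_blocks = len(dataset[0]) // block_size
    # Decision matrix: index of the closest message for block j of vector i.
    decision = [[None] * n_blocks for _ in range(n_vec)]

    for i in range(n_vec):
        for j in range(n_blocks):
            r = dataset[i][j * block_size:(j + 1) * block_size]
            if not any(r):                       # ignored (all-zero) block
                continue
            best_rms, best_k = None, None
            for k in range(n_vec):               # same block column in other vectors
                if k == i:
                    continue
                d = dataset[k][j * block_size:(j + 1) * block_size]
                _, _, rms = fractal_match(r, d)  # Eqs. (9)-(11)
                if best_rms is None or rms < best_rms:
                    best_rms, best_k = rms, k
            decision[i][j] = best_k

    # Algorithm 4: histogram the assigned indexes and keep the most frequent one.
    clusters = []
    for i in range(n_vec):
        hist = [0] * n_vec
        for k in decision[i]:
            if k is not None:
                hist[k] += 1
        clusters.append(max(range(n_vec), key=lambda k: hist[k]))
    return clusters

Each vector is thus labelled with the index of the message whose blocks it matches most often, mirroring FClust in Algorithm 4.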
4. Experiments and discussion

With the aim of giving an accurate assessment of the proposed fractal clustering technique, the experimental evaluation has considered a wide variety of SOAP message sizes. These samples include real small messages (as small as 140 bytes) as
Fig. 8. Compressed size based on 10% fractal block size clustering of small, medium, large, and very large aggregated messages, with 40 messages per group.
well as very large messages that can be as large as 53 KB. The messages of the dataset are built based on the stock quote WSDL (Web Service Description Language) at http://www.w3.org. Moreover, the compression-based aggregation model of SOAP Web messages [2] is used as a tool to demonstrate the superior ability of the proposed fractal clustering model when compared with other standard clustering techniques. The basic testing criterion is to apply the compression-based aggregation model on the resultant clusters, measure the achievable compression ratios on these clusters, and then compare them across techniques: the technique that achieves the higher compression ratio is the one that achieves the better clustering. Furthermore, the local error rate (RMS) inside clusters, in addition to the global error rate, which reflects the similarity level between clusters, is measured for every single cluster. A testbed of 160 SOAP messages has been configured, consisting of four groups, each with 40 messages. The testing SOAP messages are allocated to these groups based on their size, being classified as small, medium, large, and very large messages, with size ranges of 140–800, 800–3000, 3000–20 000, and 20 000–55 000 bytes respectively. Both K-Means and PCA combined with K-Means [17] are implemented in this research to evaluate the proposed technique, by comparing the compression ratios that can be achieved on their resultant SOAP message clusters in addition to their required processing time. K-Means and PCA combined with K-Means are applied on the generated dataset with different cluster numbers, starting from just two and going up to ten clusters. On the other hand, the fractal model is developed to work on a fractal block size that can be pre-determined by the developer. In this work, the fractal block sizes are 10%, 20%, 25%, 50%, and 100% of the overall vector size in the generated dataset of the XML messages. All techniques showed significant results by enabling the compression-based aggregation tool to achieve potentially high
compression ratios on the clustered SOAP messages. Table 1 summarizes the performance of all the clustering techniques on the SOAP message groups. It shows the resultant average compression ratios of all the experiments with different numbers of clusters and fractal block size percentages. Furthermore, it states the average time to cluster the 40 SOAP messages of each group in the generated dataset. In this table, the fractal model displays better performance than the other models in terms of enabling the aggregation model to achieve higher compression ratios. Moreover, the table shows that the processing time required to cluster SOAP messages is substantially reduced by the fractal-based clustering technique in comparison with the other standard techniques. Fig. 8 depicts the significant reduction in the overall size of the clustered SOAP messages. This is achieved by aggregating the messages of each cluster after clustering them using the fractal clustering technique with a 10% fractal block size, which is the smallest block size used in the experiments. Furthermore, Fig. 9 shows the reduction in the aggregated message size with the maximum fractal block size (100%) used in the experiments. Figs. 10–12 illustrate the detailed average compression ratios achieved by the compression-based aggregation tool after clustering the SOAP messages using K-Means, PCA combined with K-Means, and the proposed technique respectively, with different numbers of clusters and block sizes. It is clear that the smaller the number of clusters, the higher the compression ratio that can be achieved. Both Huffman and fixed-length based aggregation [2] are applied on the resultant clusters of all the clustering techniques, and Huffman encoding clearly showed better performance by achieving higher compression ratios than fixed-length encoding. On the other hand, the fractal block size has a significant impact on the overall performance of the fractal clustering technique, as decreasing the block size achieves better aggregation (i.e. higher compression ratios). The
Fig. 9. Compressed size based on 100% fractal block size clustering of small, medium, large, and very large aggregated messages, with 40 messages per group.
Table 1
Average compression ratios and clustering time of K-Means, PCA + K-Means, and fractal clustering models for small, medium, large, and very large messages.

                                 K-Means      PCA + K-Means   Fractal
40 Small messages
  Fixed-length average cr.       3.920295     3.850353        4.002463
  Huffman average cr.            3.820822     3.7136433       3.92449
  Av. clustering time (ms)       50.8831      65.3333         15.6249
40 Medium messages
  Fixed-length average cr.       6.766314     6.797503        7.275127
  Huffman average cr.            7.715699     7.843987        7.985582
  Av. clustering time (ms)       52.3342      62.8888         15.8441
40 Large messages
  Fixed-length average cr.       12.943645    12.815021       13.100785
  Huffman average cr.            16.020012    16.279294       16.633852
  Av. clustering time (ms)       54           68.1111         15.7241
40 Very large messages
  Fixed-length average cr.       15.109293    15.127478       15.334334
  Huffman average cr.            20.163554    20.253857       21.7001
  Av. clustering time (ms)       53.6231      70.4444         15.6383
basic explanation of this fact is that when the XML messages are broken up into more blocks, more object features can be captured precisely, leading to better fractal similarity measurements and thus to clusters whose messages are more similar to one another. The fractal clustering technique has shown better performance in terms of supporting the aggregation model to reduce the network traffic significantly in comparison with the other techniques. Furthermore, the fractal-clustering-based compression ratios of aggregating small, medium, large, and very large messages have been investigated with 10% and 100% fractal block sizes, as shown in Figs. 13 and 14. The experimental results clearly demonstrate that the proposed technique has clustered SOAP messages with a high level of
Fig. 10. Average compression ratios of the aggregated SOAP messages based on the K-Means clustering technique.
similarity by selecting messages with the smallest error (RMS) with respect to the centroid of the cluster. Tables 2 and 3 show the minimum, maximum, and average error values (RMS) in order to investigate both local similarities (within the same cluster) and global similarities (with other clusters). The results show higher RMS values for the global similarities than for the local measurements, as messages have a lower error rate inside their own clusters. With the aim of evaluating the processing time of the technique, the clustering time has been investigated and compared with both K-Means and PCA combined with K-Means. Fig. 15 shows the average processing time of all the considered clustering techniques in detail for small, medium, large, and very large XML messages. PCA combined with K-Means requires more processing time than K-Means as it requires more computations to implement the PCA
Table 2
Clusters local RMS errors and resultant cluster numbers and sizes.

Messages size   Block size percent (%)   Min. RMS   Max. RMS    Average RMS   Clusters number
Small           10                       0          0.504675    0.26045       6
Small           20                       0          0.919078    0.453063      11
Small           25                       0.302      0.987644    0.52268       12
Small           50                       0.258      0.97368     0.541722      12
Small           100                      0          0.818122    0.53623       12
Medium          10                       0          1.631742    0.885071      11
Medium          20                       0          1.881752    1.081786      11
Medium          25                       0          2.139269    1.243711      13
Medium          50                       0          1.821       1.182979      12
Medium          100                      0          1.844825    1.183659      8
Large           10                       0          4.482418    2.469384      12
Large           20                       0          5.046171    2.87157       8
Large           25                       0          5.264756    3.044822      8
Large           50                       0          4.829986    3.067312      10
Large           100                      0          4.529678    2.920643      7
Very large      10                       0          6.913658    4.609075      10
Very large      20                       0          10.07866    5.726257      11
Very large      25                       3.504      8.724271    5.815706      10
Very large      50                       3.364      8.862357    5.851158      10
Very large      100                      0          7.808793    5.563749      11

Fig. 11. Average compression ratios of the aggregated SOAP messages based on K-Means with the PCA clustering technique.
for the SOAP messages. However, the proposed technique shows significantly lower processing time in comparison with the other techniques, as it requires only about 15 ms to cluster 40 SOAP messages while the other techniques require between 50 and 70 ms. The processing time for the dataset generation is also investigated, to give an accurate evaluation of the overall requirements of the clustering models; it clearly grows with the size of the SOAP messages, as larger messages require more processing time (see Fig. 16).

5. Conclusion and future work
Fig. 12. Average compression ratios of the aggregated SOAP messages based on the fractal clustering technique.
Web scenarios and applications using the aggregation of SOAP messages are significantly strengthened by clustering models as potential alternatives to the traditional simple similarity measurements.
Fig. 13. Compression ratio based on 10% fractal block size clustering of small, medium, large, and very large aggregated messages, with 40 messages per group.
Fig. 14. Compression ratio based on 100% fractal block size clustering of small, medium, large, and very large aggregated messages, with 40 messages per group.
Table 3
Clusters global RMS errors.

Messages size         Block size percent (%)   Min. RMS    Max. RMS    Average RMS
Small messages        10                       0.187966    0.514816    0.330745
Small messages        20                       0.228095    0.906718    0.488379
Small messages        25                       0.188495    0.84212     0.546586
Small messages        50                       0.261493    1.040522    0.581153
Small messages        100                      0.247343    0.979189    0.652142
Medium messages       10                       0.640369    1.637905    0.960197
Medium messages       20                       0.845921    2.086727    1.290225
Medium messages       25                       0.809719    2.191491    1.301323
Medium messages       50                       1.115513    2.503145    1.656588
Medium messages       100                      1.307663    1.916698    1.614943
Large messages        10                       1.586618    5.678355    3.393421
Large messages        20                       2.389237    5.422692    3.444342
Large messages        25                       2.114001    6.160388    4.097186
Large messages        50                       3.102103    8.144498    4.965433
Large messages        100                      3.299164    5.342282    4.21083
Very large messages   10                       3.339759    8.11057     5.561411
Very large messages   20                       3.475809    10.19773    6.441515
Very large messages   25                       4.211633    9.365102    7.110114
Very large messages   50                       5.469511    11.3655     7.910134
Very large messages   100                      5.203311    9.729922    7.023571
Fig. 15. The average clustering time of K-Means, PCA combined with K-Means and fractal model of small, medium, large, and very large SOAP messages.
Fig. 16. The required processing time to generate the dataset vectors of small, medium, large, and very large SOAP messages.
Network traffic would be greatly reduced by aggregating large numbers of SOAP messages. Fractal coefficients are introduced
as efficient similarity measurements for SOAP messages, utilizing the self-similarity principle of the fractal mathematical model. The proposed clustering model is developed to compute the fractal similarity of the proposed numeric form of SOAP messages. The experimental results have shown that the proposed technique outperforms other well-known clustering techniques such as K-Means and PCA combined with K-Means. Finally, the improvement in the performance of SOAP would immensely support the growth of Web services and also promote their adoption in low-bandwidth environments operating with devices like PDAs and regular phones. In the future, we are planning to conduct simulation experiments to measure the performance and scalability of the approach in large-scale inter-cloud scenarios.

References

[1] N. Abu-Ghazaleh, M. Lewis, Differential deserialization for optimized SOAP performance, in: Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing 2005), Nov. 2005, p. 21.
[2] D. Al-Shammary, I. Khalil, Compression-based aggregation model for medical web services, in: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Aug. 31–Sept. 4, 2010, pp. 6174–6177.
[3] D. Al-Shammary, I. Khalil, Dynamic fractal clustering technique for SOAP web messages, in: IEEE International Conference on Services Computing (SCC), July 2011, pp. 96–103.
[4] D. Al-Shammary, I. Khalil, Redundancy-aware SOAP messages compression and aggregation for enhanced performance, Journal of Network and Computer Applications 35 (1) (2012) 365–381.
[5] D. Andresen, D. Sexton, K. Devaram, V. Ranganath, LYE: a high-performance caching SOAP implementation, in: International Conference on Parallel Processing (ICPP 2004), Aug. 2004, vol. 1, pp. 143–150.
[6] Z. Baharav, Fractal arrays based on iterated functions system (IFS), in: IEEE Antennas and Propagation Society International Symposium, Santa Clara, CA, USA, vol. 4, July 1999, pp. 2686–2689.
[7] A.K. Bisoi, J. Mishra, Enhancing the beauty of fractals, in: Proceedings of the Third International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'99), New Delhi, India, Sept. 1999, pp. 454–458.
[8] X. Cui, T. Potok, P. Palathingal, Document clustering using particle swarm optimization, Jun. 2005, pp. 185–191.
[9] D. Davis, M. Parashar, Latency performance of SOAP implementations, in: 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, May 2002, p. 407.
[10] S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Fast detection of XML structural similarity, IEEE Transactions on Knowledge and Data Engineering 17 (2) (2005) 160–175.
[11] J.C. Hart, Fractal image compression and recurrent iterated function systems, IEEE Computer Graphics and Applications 16 (1996) 25–33.
[12] J.H. Hwang, M.S. Gu, Clustering XML documents based on the weight of frequent structures, Nov. 2007, pp. 845–849.
[13] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 264–323.
[14] K.A. Phan, Z. Tari, P. Bertok, Similarity-based SOAP multicast protocol to reduce bandwidth and latency in web services, IEEE Transactions on Services Computing 1 (2) (2008) 88–103.
[15] J.-S. Kim, H. Andrade, A. Sussman, Principles for designing data-/compute-intensive distributed applications and middleware systems for heterogeneous environments, Journal of Parallel and Distributed Computing 67 (7) (2007) 755–771.
[16] W.Z.-S. Xu Liang, W. Wen-bing, Fractal linear arrays, Chinese Physics Letters 15 (2) (1998) 140–142.
[17] J. Liu, J. Wang, W. Hsu, K. Herbert, XML clustering by principal component analysis, Nov. 2004, pp. 658–662.
[18] S. Orrin, The SOA/XML threat model and new XML/SOA/Web 2.0 attacks & threats, Intel Corp., http://www.defcon.org/images/defcon-15/dc15presentations/dc-15-orrin.pdf, downloaded June 2011.
[19] R. Prodan, T. Fahringer, ZENTURIO: a grid middleware-based tool for experiment management of parallel and distributed applications, Journal of Parallel and Distributed Computing 64 (6) (2004) 693–707.
[20] M.-C. Rosu, A-SOAP: adaptive SOAP message processing and compression, in: Proceedings of the IEEE International Conference on Web Services, Salt Lake City, Utah, USA, 2007, pp. 200–207.
[21] J. Salceda, I. Diaz, J. Tourio, R. Doallo, A middleware architecture for distributed systems management, Journal of Parallel and Distributed Computing 64 (6) (2004) 759–766.
[22] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, in: KDD Workshop on Text Mining, 2000.
[23] P.R. Massopust, Fractal Functions, Fractal Surfaces, and Wavelets, Academic Press, San Diego, California, 1994.
[24] Yu Tao, E.C.M. Lam, Y.Y. Tang, Extraction of fractal feature for pattern recognition, in: Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain, vol. 2, Sept. 2000, pp. 527–530.
[25] C. Werner, C. Buschmann, Compressing SOAP messages by using differential encoding, in: Proceedings of the IEEE International Conference on Web Services, July 2004, pp. 540–547.
[26] G. Yongming, C. Dehua, L. Jiajin, Clustering XML documents by combining content and structure, vol. 1, 2008, pp. 583–587.
Dhiah Al-Shammary received his M.Sc. in Computer Science as the top student in his department for the year 2005 from the College of Science at Al-Nahrain University, Baghdad, Iraq. In 2002, Dhiah was recognized as the top student in computer science in Iraq after his participation in the annual scientific competition exam for the top bachelor students. He is currently a Ph.D. student at the School of Computer Science and Information Technology at RMIT University, Melbourne, Australia. His research interests include performance modeling (of computer systems), Web services, compression and encoding techniques, and distributed systems. He has worked as a lecturer at a number of Iraqi universities in the areas of software engineering, computer systems and machine language. Dhiah has several publications in the areas of improving the performance of Web services and encoding techniques.

Ibrahim Khalil is a senior lecturer in the School of Computer Science & IT, RMIT University, Melbourne, Australia. Ibrahim completed his Ph.D. in 2003 at the University of Berne, Switzerland. He has several years of experience in Silicon Valley based companies working on large network provisioning and management software. He also worked as an academic in several research universities. Before joining RMIT, Ibrahim worked for EPFL and the University of Berne in Switzerland and Osaka University in Japan. Ibrahim's research interests are quality of service, wireless sensor networks and remote healthcare.
Zahir Tari received the honors degree in operational research from the Universite des Sciences et de la Technologie Houari Boumediene (USTHB), Algiers, Algeria, the masters degree in operational research from the University of Grenoble I, France, and the Ph.D. degree in artificial intelligence from the University of Grenoble II, France. He is the head of the Distributed Systems and Networking Discipline, School of Computer Science and Information Technology, RMIT University, Melbourne. His current research has a special focus on the performance of Web servers and SOAP-based systems, SCADA system security, and Web services, including protocol verification and service matching. He has organized more than 12 international conferences as either a general cochair or a PC cochair. He regularly publishes in reputable journals, such as ACM and IEEE Transactions. He is a senior member of the IEEE.
Albert Y. Zomaya is currently the Chair Professor of High Performance Computing & Networking and Australian Research Council Professorial Fellow in the School of Information Technologies, The University of Sydney. He is also the Director of the Centre for Distributed and High Performance Computing which was established in late 2009. Professor Zomaya is the author/coauthor of seven books, more than 370 papers, and the editor of nine books and 11 conference proceedings. He is the Editor in Chief of the IEEE Transactions on Computers and serves as an associate editor for 19 leading journals. Professor Zomaya is the recipient of the Meritorious Service Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society. He is a Chartered Engineer (CEng), a Fellow of the AAAS, the IEEE, the IET (UK), and a Distinguished Engineer of the ACM.