SMCA: An efficient SOAP messages compression and aggregation technique for improving web services performance



Journal of Parallel and Distributed Computing 133 (2019) 149–158

Contents lists available at ScienceDirect

J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc


Nassima Haroune-Belkacem a,∗, Fouzi Semchedine a,b, Ahmed Al-Shammari c, Djamil Aissani a

a LaMOS Research Unit, Faculty of Exact Sciences, University of Bejaia, 06000, Algeria
b Institute of Optics and Precision Mechanics (IOMP), University of Setif, Algeria
c Department of Computer Science and Software Engineering, Faculty of Science Engineering and Technology, Swinburne University of Technology, Melbourne, Australia

highlights

• This technique aggregates and compresses SOAP messages to improve the performance of web services.
• We propose the SMCA technique for aggregating all the messages into a single compressed format.
• The SMCA technique eliminates the redundancy of tags and indexed paths between all the messages.
• The proposed technique outperforms a well-known existing technique in terms of the number of groups, the compression ratio and the execution time, according to the types of datasets.

article info

Article history: Received 1 December 2017; received in revised form 22 April 2019; accepted 7 July 2019; available online 12 July 2019.

Keywords: Web services, SOAP, XML, Aggregation, Compression

abstract

The Simple Object Access Protocol (SOAP) is an eXtensible Markup Language (XML) based messaging protocol that is widely used over the Internet. It supports interoperability by connecting users and service providers on the same or different platforms. However, the huge number and the large size of the exchanged SOAP messages cause congestion and bottlenecks. Existing techniques based on grouping XML messages have shown shortcomings in terms of execution time and compression ratio. Therefore, in this paper, we propose a new technique, called SMCA, for efficiently compressing and aggregating SOAP messages. The proposed technique requires only one pass over all the XML messages to perform both the aggregation and the compression processes. With SMCA, the XML data that share the same path are regrouped in one container. The experimental results on a real XML dataset verify the efficiency and the effectiveness of the proposed technique.

© 2019 Elsevier Inc. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (N. Haroune-Belkacem).
https://doi.org/10.1016/j.jpdc.2019.07.001
0743-7315/© 2019 Elsevier Inc. All rights reserved.

1. Introduction

A Web service is defined as a middleware that performs two primary tasks: exchanging and storing data over the Internet [14]. SOAP is a main communication protocol in Web services [12,22,35]. The syntax of SOAP [26] is based on the eXtensible Markup Language (XML) [1,49]. The XML messages are transferred via HTTP (Hyper Text Transfer Protocol) and TCP (Transmission Control Protocol) [20,39]. In general, SOAP inherits some advantages of XML, such as flexibility and interoperability [15,30]; however, the redundancy in XML documents results in large document sizes [49]. The large size and the huge number of XML messages cause high latency and bottlenecks on the Internet [12,21]. For example, Fig. 1 shows a text message and its representation in XML format. The message in XML format (377 bytes) is about six times larger than the text format (63 bytes).

Fig. 1. Example of a message in text format (a) and its representation in XML format (b).

Several compression and aggregation techniques have been proposed to overcome this problem [9,36,40,41,51]. Existing techniques for the aggregation and compression of XML messages depend on grouping models to determine the similarities. The objective of grouping is to make the messages in the same group share a high degree of similarity while being very dissimilar to the messages of other groups. The messages of each group are then aggregated and compressed. However, these techniques are inefficient because of their high computational cost.

Therefore, in this paper, we propose a new technique, called SMCA, that reduces the number and the size of the messages sharing the same destination. The number of messages is reduced by aggregating all the messages having the same destination into a single message, and their size is reduced by using a compression technique. The proposed technique is not based on a grouping model, and similarities are detected in the analysis phase. Our technique requires only one pass over all the messages and, through aggregation, produces only one aggregated format. It saves the names of the tags and attributes of all the XML messages in the same dictionary, and the data of the same paths in the same container.

The rest of the paper is organized as follows. Section 2 overviews some previous works and their classification. Section 3 presents our proposed technique (SMCA) and explains the process of analyzing each XML message and distributing each token to its container, as well as the compression process. This section also defines the decompression process that returns the compressed format to the original messages' format. Section 4 discusses the performance of our technique compared to some well-known existing works, and Section 5 concludes the work.

2. Related work

This section summarizes the related works on the compression and clustering of XML messages. The traditional techniques are broadly classified into two main groups: (1) XML compression techniques and (2) XML clustering and aggregation techniques. The details of these techniques are given below.

2.1. XML compression

XML messages can be compressed using text compressors such as gzip [25], PPM [18] and bzip [45]. However, these do not outperform the specialized XML compressors that have been proposed [32]. Existing XML compression techniques are classified into two main classes: non-queriable and queriable. Non-queriable compression techniques include those that support the compression of a single message [4,16,29,31,46,48] and those that support the compression of several messages [7,8]. Queriable XML compression techniques [13,17,23,36,50], in contrast, support only the compression of a single message and allow query execution on the compressed format.
Table 1 shows some XML message compression techniques and their classification. XMILL [32] is the first technique proposed for XML message compression. It separates the structure from the content data, which are stored in different containers. Each container is compressed with the gzip or bzip text compressor. Bzip gives a better compression ratio than gzip, but it is slower. XMILL gives a better compression ratio than other techniques, but it is not queriable, and the accumulated compressed size of the XML messages is large [8]. Although XMILL is a one-pass technique, it does not consider aggregation.

XGRIND [47] is the first queriable compression technique for XML documents. It maintains the structure of the XML message in the compressed format. Tags and contents are compressed, respectively, by dictionary-based encoding and semi-adaptive Huffman coding [27]. XGRIND is time consuming because it requires two passes over the XML message to perform the compression process.

The authors of [43] modified their previous work RFX [42,44] to allow the execution of all types of queries. They used containers instead of the data layer. The loader analyzes the XML message and saves the elements and the attributes in sequential order. The data of elements or attributes with the same ID are inserted in the same container, and each container is compressed separately with a text compressor. The compressed message size of this technique is large compared to other techniques.

2.2. XML clustering and aggregation

Advanced solutions are needed to achieve a high compression ratio. The authors of [8] have verified that XML grouping and aggregation compression techniques achieve significantly higher compression ratios for similar XML messages than compressing them separately. To the best of our knowledge, the proposed XML compression and aggregation techniques are based on the clustering concept [3,9]. Technically, XML grouping and aggregation techniques apply two passes over all the XML messages to perform the grouping and compression processes. In the first pass, redundancies are located to determine the messages of each group. Then, a second pass is applied to all the messages in the groups to aggregate and compress them separately by using a text compressor. Consequently, the duration of grouping and aggregation is relatively long, because the calculation of frequencies and similarities in messages is a complex problem [5,11,28]. These techniques have three disadvantages: (1) the compression ratio, (2) the duration of grouping, and (3) the number of aggregated formats. Similar messages are assigned to different groups, so the same tags are stored several times.
This increases the size of the compressed format. For example, in [3], messages 12, 19, 27, 32 and 36 have the same root and many repeated tags, yet they belong to five different groups. Besides, the grouping and aggregation processes require a long execution time. Moreover, the number of aggregated formats equals the number of groups, which depends on the degree of similarity. If there are no similarities between the messages, the number of aggregated messages equals the number of messages that share the same destination path, which delays the grouping and also increases the size of the compressed formats.

Existing techniques for the aggregation and compression of XML messages use grouping models to determine similarities. The basic idea of clustering is to assign each XML message to a group whose members share a high degree of similarity while being very dissimilar to the messages of other groups. The messages of each group are then aggregated and compressed separately.

Table 1
Classification and comparison of some XML aggregation and compression techniques.

| Technique | Time | Passes | Queriable | Number of messages | Number of compressed formats | Based on grouping |
|---|---|---|---|---|---|---|
| Xmill | Fast | One | No | 01 | 01 | No |
| XGrind | Long | Two | Yes | 01 | 01 | No |
| QRFXFreeze | Fast | One | Yes | 01 | 01 | No |
| [24] | Long | Two | No | Many | Many | Yes |
| [45] | Long | Two | No | Many | Many | Yes |
| [10] | Long | Two | No | Many | Many | Yes |
| [37] | Long | Two | No | Many | Many | Yes |
| Dynamic grouping | Long | Two | No | Many | Many | Yes |
| Our technique (SMCA) | Fast | One | No | Many | 01 | No |

The authors of [24] proposed an aggregation technique for XML messages. The messages are represented as labeled trees and transformed into linear vectors. According to the values of these vectors, the messages are distributed into different groups. The Discrete Fourier Transform (DFT) is used to compare the frequencies of the encoded XML documents. The DFT technique is efficient compared to the tree-edit grouping technique. However, it requires two passes over all the XML documents, so the grouping time is relatively long.

In [3], the authors proposed an aggregation and compression technique that requires two passes over all the messages. They used a dynamic model for grouping XML messages. In the first pass, all the messages are traversed breadth-first. Each message is transformed into a linear vector that contains the redundancy frequencies of each tag and content. Then, the Euclidean distance [19] is applied to the matrix of these vectors to determine which messages belong to each group. Finally, the Huffman encoding technique is used to reduce the size of the XML messages in each group. The number of compressed formats equals the number of groups.

The authors of [10] proposed a dynamic grouping method for aggregating and compressing XML messages. In the first pass, an XML tree is created for each XML message in order to remove the end tags. Then, the Term Frequency–Inverse Document Frequency (TF–IDF) scheme is applied to determine the frequencies of each element (tags and contents) in the same message and in all the other messages. The output is a two-dimensional matrix. The Euclidean distance is then applied to separate the messages dynamically into groups. In the second pass, the Huffman algorithm is applied to all the groups. Then, each group is aggregated and compressed in one output.
The number of outputs equals the number of groups.

In [37], the authors deal with Web services that use the SOAP protocol for client–server communication. They indicate that when the number of messages increases, the load on the Web services increases and creates bottlenecks. They present a technique for the dynamic grouping of XML messages by structural similarity, based on Graph Matching algorithms [33]. The proposal roughly consists of linearizing the structure of each XML message by representing it as a numerical sequence, and then comparing such sequences through the analysis of their frequencies. Moreover, the theory of the discrete Fourier Transform [24] is used to compare the encoded documents. This model supports Huffman compression based on an aggregation tool. The two grouping models proposed in [3,37] show a 30% reduction of the size of XML Web messages in comparison with the vector space model [7] and the dynamic fractal model [8].

A new version of the dynamic grouping of XML messages, based on a compression and aggregation model [3], is proposed in this paper. This model includes two main components: (1) grouping, and (2) compression and aggregation. First, the grouping process, which is based on [6], generates a rooted tree for each XML message in the dataset. The XML tree has two types of nodes: structure nodes and content nodes, which refer to the elements and to the data values of the elements, respectively.

The depth-first search algorithm is used to traverse and index the nodes of the XML message. Then, a vector is generated for each message by associating its structure and content vectors. A weighting scheme is used to calculate the weight of each term in the vector. To distribute the messages into groups, the pairwise distances between every two messages are first sorted. The pair with the minimum distance is checked first to see whether its distance is below a threshold. Furthermore, the Euclidean pairwise distance between the centroids of every two groups is computed and stored in increasing order. Finally, the XML messages of each group are aggregated and compressed with the Huffman encoding technique.

All the aggregation and compression techniques proposed in the literature require two passes over all the messages sharing the same destination path. These passes are used to group, aggregate and compress the messages. This results in a very long execution time, and the grouping process does not remove all the redundancies between messages. The existing techniques thus show serious limitations. Therefore, the main objective of this paper is to propose a new technique that efficiently aggregates and compresses a set of XML documents using only one pass.

3. The proposed solution

In this section, we present our proposed technique, called SMCA, for the aggregation of XML messages. SMCA addresses the network traffic problem in Web services: it reduces the number and the size of the XML messages in a short time. All the messages that share the same destination are processed as a single message. The messages share the tag dictionary, the path dictionary, the structure container, and the data containers. All these dictionaries and containers are compressed with a text compressor, and the compressed formats are regrouped in a single aggregated and compressed message.

3.1. Aggregation and compression architecture

Fig. 2 shows the different steps for aggregating and compressing the XML messages. First, several XML messages (M1, M2, M3, etc.) are handed to the organizer, which passes each message Mi (i = 1, ..., n) to a parser. The SAX (Simple API for XML) [34] parser transforms the XML document into a set of tokens that are sent to the processor. The tokens can be the start of a document, the start of a tag, the name of an attribute, the content, the end of a tag, the end of a document, and so on. Then, according to the type of each token, the processor determines the appropriate processing and the destination container.

Fig. 2. Architecture of SMCA technique for the aggregation and the compression of XML messages.

There are four types of containers: the tags and attributes container, the path container, the structure container, and the content containers. The tags and attributes container holds the tag dictionary. The path container holds the dictionary of the indexed paths that occur in all the XML documents; each path is saved only once. The structure container saves the structure of all the XML messages to facilitate the decompression process. The content containers hold all the data of the XML messages. All these containers are then compressed with a text compressor. Finally, the compressed formats are accumulated and saved in a single aggregated and compressed message. In the following, we use the example of Fig. 1 to better explain the processing of each type of token.

• Token type is a start tag: the tag name is added to the tag dictionary. Tag names are saved only once in this dictionary. The code of each start tag is its position in the dictionary, and it is used for indexing the tag names in the path dictionary. If the tag contains attributes, the name of each attribute is also saved in the tag dictionary. Attribute names are likewise saved only once, and the code of an attribute name is its position in the tag dictionary. The tag dictionary corresponding to the example of Fig. 1 is: departure, departureDate, EmiratesDeparture, Airport, QantasDeparture, EtihadDeparture and arriving, and their codes are, respectively, 0, 1, 2, 3, 4, 5 and 6. The algorithm corresponding to the start tag token is shown in Algorithm 1.
• Token type is a content: the content covers the attribute values and the character values of each path. In this case, the path corresponding to the content is added to the path dictionary. The tag names of each path are indexed with the tag codes. For example, the paths departing/departureDate, departing/EmiratesDeparture, departing/Airport, departing/QantasDeparture, departing/EtihadDeparture and departing/arriving are indexed, respectively, by 0/1, 0/2, 0/3, 0/4, 0/5 and 0/6. These codes correspond to the positions of the tags in the tag dictionary. Each path is added only once to the path dictionary, and the code of each path is its position in the path dictionary. Each time a path is added to the dictionary, a new content container is also created. The path codes are used in the structure container, and they also determine the container number of each path. Each time a content is inserted into a container, the path code is inserted into the structure container. This code is followed by @, / or C if the content is, respectively, an attribute, followed by an end tag, or followed by a start tag. The structure container preserves the original structure of the XML documents and facilitates the decompression process. The algorithm corresponding to the content token is shown in Algorithm 2.
• Token type is an end tag: end tags are not saved in the aggregated format; each one is replaced by the single character / in the structure container.

Algorithm 1 Start tag token
Begin
  If tag name exists in tag dictionary then
    Add position (pos) of tag in dictionary to path: path = path + "/" + pos,
  else
    Add tag name to tag dictionary,
    Add position (pos) of tag to path: path = path + "/" + pos,
  end if
  while there are attributes do
    If name of attribute exists in tag dictionary then
      Search path position (pos) in dictionary,
    else
      Add name of attribute to tag dictionary,
      Search path position (pos) in dictionary;
    end if
    If path exists in path dictionary then
      Add contents of attribute to container number (pos),
    else
      Add path (path + "/" + pos) to path dictionary,
      Create a new container,
      pos = size (path dictionary),
      Add contents of attribute to container number (pos),
    end if
  end while
  Add path pos and @ character in structure container,
End

Algorithm 2 Content token
  If (path) exists in path dictionary then
    Search path position (pos) in path dictionary,
    Add content to container number (pos),
  else
    Add path (path + "/" + pos) to path dictionary,
    Create a new container,
    pos = size (path dictionary),
    Add content to container number (pos),
  end if
  If type of token followed this content is start tag then
    Add character C to structure container,
  end if
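The token processing described above can be illustrated with a minimal Python sketch built on the standard `xml.sax` parser. It is not the authors' implementation (which is written in Visual Basic): the class name `SmcaAggregator`, the helper `aggregate`, the list-based dictionaries, and the exact order in which codes and markers are written to the structure container are all assumptions made for illustration, and the sketch does not claim that the resulting structure container is sufficient for lossless reconstruction.

```python
import xml.sax


class SmcaAggregator(xml.sax.ContentHandler):
    """Illustrative single-pass aggregator; all names here are hypothetical."""

    def __init__(self):
        super().__init__()
        self.tags = []        # tag/attribute dictionary: code = position
        self.paths = []       # path dictionary of indexed paths, e.g. "0/1"
        self.containers = []  # one content container per path code
        self.structure = []   # path codes and the markers "@", "C", "/"
        self._stack = []      # tag codes of the currently open elements
        self._text = []       # character data buffered for the open element

    def _code(self, name):
        # Each tag or attribute name is stored once; its code is its position.
        if name not in self.tags:
            self.tags.append(name)
        return self.tags.index(name)

    def _store(self, path, value, marker):
        # Each path is stored once; a new path also creates a new container.
        if path not in self.paths:
            self.paths.append(path)
            self.containers.append([])
        pos = self.paths.index(path)
        self.containers[pos].append(value)
        self.structure.append(str(pos))
        if marker:
            self.structure.append(marker)

    def _flush(self, marker):
        # Move buffered character data into the container of the current path.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self._store("/".join(map(str, self._stack)), text, marker)

    def startElement(self, name, attrs):
        self._flush("C")  # pending content is followed by a start tag
        self._stack.append(self._code(name))
        for attr, value in attrs.items():
            codes = self._stack + [self._code(attr)]
            self._store("/".join(map(str, codes)), value, "@")

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush(None)
        self.structure.append("/")  # end tags are kept only as "/"
        self._stack.pop()


def aggregate(messages):
    """Parse every message once with the same shared dictionaries."""
    agg = SmcaAggregator()
    for message in messages:
        xml.sax.parseString(message.encode("utf-8"), agg)
    return agg
```

Because the same handler instance is reused for every message, a tag or path that already occurred in an earlier message is not stored again, which is the redundancy elimination the technique relies on.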

3.2. Decompression process

The decompression technique restores the original formats of all the XML messages. First, all the dictionaries (the tag dictionary and the indexed path dictionary) and the containers (the content and structure containers) are decompressed with the same text compressor that was used for the compression. Then, the structure container is processed and, according to the returned code (/, C, @, etc.), the main processor determines the type of token (end of tag, content followed by a start tag, attribute, etc.) to be constructed for the XML message.
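As a rough illustration of the first decompression step, the following Python sketch compresses each dictionary and container with bzip2 (the standard `bz2` module), packs them into one aggregated byte string, and then recovers them. The framing used here (a 4-byte header length followed by a small JSON header and the compressed payload) and the function names `pack` and `unpack` are assumptions introduced for this example; the paper does not specify the layout of the aggregated format.

```python
import bz2
import json


def pack(parts):
    """Compress each named part with bzip2 and join them into one blob.

    `parts` maps a hypothetical part name (e.g. "tags", "structure") to a
    JSON-serializable list.
    """
    blob = bytearray()
    for name, items in parts.items():
        data = bz2.compress(json.dumps(items).encode("utf-8"))
        header = json.dumps({"name": name, "size": len(data)}).encode("utf-8")
        # Assumed framing: header length, header, then the compressed data.
        blob += len(header).to_bytes(4, "big") + header + data
    return bytes(blob)


def unpack(blob):
    """Inverse of pack(): decompress every part of the aggregated blob."""
    parts, offset = {}, 0
    while offset < len(blob):
        header_len = int.from_bytes(blob[offset:offset + 4], "big")
        offset += 4
        header = json.loads(blob[offset:offset + header_len])
        offset += header_len
        data = blob[offset:offset + header["size"]]
        offset += header["size"]
        parts[header["name"]] = json.loads(bz2.decompress(data))
    return parts
```

After `unpack`, the structure container can be scanned code by code to rebuild the tokens of each message, as described in the paragraph above.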


3.3. Complexity analysis

In this section, we calculate the time complexity of the proposed technique. We use three notations to compute the complexity of the SMCA technique: (1) n, the number of XML messages, which is not fixed; (2) m, the number of characters; and (3) k, the number of characters to be compressed for all the XML messages, which determines the cost of bzip. In general, the complexity of bzip is less than O(k); in the worst case, it is less than O(k^2). Modifying the bzip compressor can significantly reduce this high computational cost. Therefore, the complexity of the SMCA technique is O(nm + k).

Fig. 3. Compression and decompression size of the SMCA technique according to the number and the size of subsets S_i, i = 1, ..., 16.

4. Simulations and discussions

In this section, we present the experiments conducted to evaluate the performance of the proposed technique.

4.1. XML datasets

To validate the efficiency of our technique, we used real XML messages. The messages are classified by size as small (from 140 to 800 bytes), medium (from 800 to 3000 bytes), large (from 3000 to 20 000 bytes) and very large (from 20 000 to 55 000 bytes). These messages are produced using the Web Service Description Language (WSDL) [2]. The dataset is the same as that used in [3]. The messages are divided into 4 subsets (small, medium, large and very large), each containing 40 messages. Our technique is implemented in Visual Basic on an Intel(R) Xeon(R) E5-1630 v3 CPU @ 3.70 GHz with 16 GB of RAM. For the compression, we used the bzip text compressor.

From these subsets of XML messages, we created 16 subsets S1, S2, ..., S16. Subsets S1, S2, S3 and S4 contain, respectively, the first 10, 20, 30 and 40 messages of the small subset. Subsets S5, S6, S7 and S8 contain the 40 messages of the small subset plus, respectively, the first 10, 20, 30 and 40 messages of the medium subset. Subsets S9, S10, S11 and S12 contain the 40 messages of the small subset and the 40 messages of the medium subset plus, respectively, the first 10, 20, 30 and 40 messages of the large subset. Finally, subsets S13, S14, S15 and S16 contain the 40 messages of the small subset, the 40 messages of the medium subset and the 40 messages of the large subset plus, respectively, the first 10, 20, 30 and 40 messages of the very large subset. Each subset Si thus contains i × 10 XML documents; for example, subset S16 contains 160 messages. The total size of these subsets ranges from 4672 bytes to 2 072 247 bytes, as presented in Table 2.

4.2. Evaluation methodology

The compression ratio, the compression time and the decompression time are the three measures used to demonstrate the effectiveness of our technique. The compression ratio is the percentage gained in terms of the size of the compressed messages, which helps to avoid congestion in Web services. This metric is considered one of the most important measures when evaluating performance. It is calculated according to Eq. (1). The compression time (in milliseconds) is the total time needed to aggregate and compress all the messages. The decompression time is the time required to restore the original formats of all the messages from the aggregated and compressed format.

$$\text{Compression ratio} = 100 \times \left(1 - \frac{\text{compressed size}}{\text{original size}}\right) \qquad (1)$$
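As a quick check of Eq. (1), the short Python sketch below evaluates the gain percentage for two rows of Table 2. The function name `compression_ratio` is chosen here for illustration; the input sizes (in bytes) are the total and compressed sizes reported for subsets S8 and S16.

```python
def compression_ratio(original_size, compressed_size):
    """Gain percentage of Eq. (1): 100 * (1 - compressed / original)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (1.0 - compressed_size / original_size)


# Total and compressed sizes (bytes) of subsets S8 and S16 from Table 2.
ratio_s8 = compression_ratio(90784, 6880)
ratio_s16 = compression_ratio(2072247, 89698)
print(round(ratio_s8, 2), round(ratio_s16, 2))
```

Rounded to two decimals, both values agree with the compression ratios listed for S8 and S16 in Table 2 (92.42% and 95.67%).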

Fig. 4. Compression ratio of the SMCA technique according to the number and the size of subsets S_i, i = 1, ..., 16.

4.2.1. Compression ratio

Fig. 3 illustrates the size of the original messages and the size of their aggregated and compressed format, obtained with the proposed technique, for the different subsets S_i, i = 1, ..., 16. The figure shows that the total message size grows with the number of messages in each subset: it increases slightly for subsets S1, S2, ..., S8, and considerably for the remaining subsets. After aggregating each subset S_i, the compressed size of the subset is very small compared to its total size. This results from the efficiency of the SMCA technique, which removes the redundancy of the structure and the data in the compressed format.

The values presented in Fig. 4 indicate that the compression ratio of the SMCA technique improves as the number of messages in the subsets S_i increases. Whenever the number and the size of the messages in a subset increase, the gain percentage is relatively better. This is because the SMCA technique eliminates the structure and data redundancies in all the XML messages that share the same destination path. These messages can be of any size and can contain any structure and content.

4.2.2. Compression and decompression time

Fig. 5 shows the compression and decompression times, in milliseconds, of the SMCA technique for the subsets S_i, i = 1, ..., 16. These subsets contain different types of XML messages: small, medium, large and very large. The number of messages in each subset varies from 10 to 160. The aggregation and compression time increases slightly when the number and the size of the XML messages increase, which shows the efficiency of SMCA in terms of compression time. The decompression time of the SMCA technique also increases with the number and the size of the messages in the subsets. The decompression time is smaller than the compression time, since the compression process requires checking the dictionaries of tags and paths.

Table 2
Number of messages, total size, compressed size, compression ratio, compression time and decompression time of SMCA technique according to subsets S_i, i = 1, ..., 16.

| Datasets | Number of XML messages | Total size | Compressed size | Compression ratio (%) | Compression time (ms) | Decompression time (ms) |
|---|---|---|---|---|---|---|
| S1 | 10 | 4672 | 1148 | 75.47 | 13 | 13 |
| S2 | 20 | 8566 | 1572 | 81.67 | 15 | 14 |
| S3 | 30 | 12 146 | 1946 | 83.99 | 20 | 15 |
| S4 | 40 | 16 474 | 2377 | 85.58 | 27 | 16 |
| S5 | 50 | 33 562 | 3575 | 89.35 | 45 | 17 |
| S6 | 60 | 50 715 | 4700 | 90.74 | 62 | 30 |
| S7 | 70 | 71 527 | 5898 | 91.76 | 76 | 33 |
| S8 | 80 | 90 784 | 6880 | 92.42 | 78 | 45 |
| S9 | 90 | 211 794 | 12 432 | 94.13 | 91 | 76 |
| S10 | 100 | 336 197 | 17 839 | 94.69 | 138 | 93 |
| S11 | 110 | 456 941 | 22 954 | 94.98 | 187 | 110 |
| S12 | 120 | 555 799 | 27 037 | 95.14 | 219 | 123 |
| S13 | 130 | 939 149 | 42 773 | 95.45 | 311 | 157 |
| S14 | 140 | 1 339 833 | 59 316 | 95.57 | 405 | 188 |
| S15 | 150 | 1 717 315 | 75 180 | 95.62 | 500 | 235 |
| S16 | 160 | 2 072 247 | 89 698 | 95.67 | 655 | 300 |

Fig. 5. Compression and decompression time of SMCA technique according to the number and the size of subsets S_i, i = 1, ..., 16, in milliseconds.

Fig. 6. Average compression ratio of different techniques according to the size of XML messages.

4.3. Comparative analysis

Based on our survey of the literature, and to the best of our knowledge, several techniques for grouping XML documents have been proposed. These techniques are used in many domains, such as data mining, data analysis, pattern recognition, and image segmentation [38]. However, the recently proposed techniques for the aggregation and compression of XML documents are those of Nikhil et al. [37] and Abbas et al. [3]. These techniques provide the same performance compared to other techniques [7,8] in terms of compression ratio and grouping time constraints [3,37]. Unfortunately, the authors of [37] did not give details on the XML messages they used or the exact values of their experiments.

In this section, we compare our technique with three works. The first technique [3] uses a dynamic TF–IDF grouping model and a Huffman compression based aggregation tool. The second technique [8] uses a Vector Space grouping model and a Huffman compression based aggregation tool. The third technique [7] uses a fractal dynamic grouping model and a Huffman compression based aggregation tool. The compression ratio results of these techniques are taken from [3], which used the same datasets for its simulations.

In this part, the performance of the proposed SMCA technique is evaluated in terms of compression ratio, execution time and number of groups. In addition, we were able to recover the source code of a new technique, called the Dynamic grouping technique, implemented in Visual Basic, and we implemented our proposal in the same language to make a fair comparison. The Dynamic grouping technique uses the grouping model proposed by Al-Shammari et al. [6] and the Huffman compression technique.

4.3.1. Compression ratio

Table 3 shows the total size, the compressed size and the average compression ratio of the different techniques for the subsets small, medium, large and very large. The total size is the sum of the sizes of the XML messages in each subset. Each subset contains 40 messages. The compressed size is the size of the XML messages after aggregating and compressing all the messages in each subset. The average compression ratio is calculated by Eq. (2):

$$\text{Average compression ratio} = \frac{\text{Size of original message}}{\text{Size of compressed message}} \qquad (2)$$
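To make the relation between Eq. (2) and the gain percentage of Eq. (1) concrete, the following Python sketch evaluates both on the small subset. The function names are illustrative; the total size of 16 474 bytes is the value reported for S4 in Table 2 (S4 holds exactly the 40 small messages), and 2375 bytes is the SMCA compressed size for the small subset in Table 3.

```python
def average_compression_ratio(original_size, compressed_size):
    """Eq. (2): original size divided by compressed size (higher is better)."""
    return original_size / compressed_size


def gain_percentage(ratio):
    """Equivalent gain of Eq. (1) expressed from a ratio r: 100 * (1 - 1/r)."""
    return 100.0 * (1.0 - 1.0 / ratio)


# Small subset: total size from Table 2 (S4), SMCA compressed size from Table 3.
ratio = average_compression_ratio(16474, 2375)
print(round(ratio, 2), round(gain_percentage(ratio), 2))
```

A ratio of roughly 7 therefore corresponds to a gain of about 85%, so the two metrics rank techniques in the same order and differ only in scale.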

Fig. 6 shows the average compression ratio of the SMCA technique in comparison with the implemented techniques for the different dataset sizes, namely small, medium, large and very large. Higher values in the figure indicate better results. The proposed SMCA technique clearly outperforms all the implemented techniques (dynamic grouping, dynamic TF–IDF, dynamic fractal and VSM) for all the datasets. The implemented techniques handle the suppression of redundancies between all the XML messages poorly, whereas our technique removes redundancy in XML messages by using the same tags’ dictionary and paths’ dictionary for all the messages.

Fig. 7. Average compression ratio of SMCA and the modified dynamic grouping technique according to the number and the size of datasets.

Fig. 8. Execution time of SMCA and the modified dynamic clustering technique by varying the number and the size of datasets.

By varying the number of messages from 10 to 40 in the four datasets, we obtain the results summarized in Fig. 7, which compares the average compression ratio of SMCA and the modified dynamic grouping technique according to the number of messages. The figure shows that the compression ratio of each technique increases progressively with the number of messages. The average compression ratio of the SMCA technique is considerably higher than that of the modified dynamic grouping technique. This is because the SMCA technique removes redundancy in both the structure and the data of all the messages.
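The idea of sharing one tags’ dictionary and one paths’ dictionary across all messages can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the SMCA implementation: it parses each message with `xml.etree.ElementTree`, ignores attributes and namespaces, and the names `aggregate`, `tags`, `paths` and `containers` are introduced here for the example.

```python
import xml.etree.ElementTree as ET


def aggregate(messages):
    """Traverse each XML message once and return (tags, paths, containers)."""
    tags = {}        # tag name -> integer id, shared by all messages
    paths = {}       # tuple of tag ids -> path id, shared by all messages
    containers = {}  # path id -> list of (message index, text value)

    def walk(elem, prefix, msg_idx):
        tag_id = tags.setdefault(elem.tag, len(tags))
        path = prefix + (tag_id,)
        path_id = paths.setdefault(path, len(paths))
        text = (elem.text or "").strip()
        if text:
            # Values found under the same path go to the same container.
            containers.setdefault(path_id, []).append((msg_idx, text))
        for child in elem:
            walk(child, path, msg_idx)

    for i, xml_text in enumerate(messages):
        walk(ET.fromstring(xml_text), (), i)
    return tags, paths, containers


msgs = [
    "<order><id>1</id><item>pen</item></order>",
    "<order><id>2</id><item>ink</item></order>",
]
tags, paths, containers = aggregate(msgs)
print(tags)        # each tag name is stored once for both messages
print(containers)  # data values regrouped by path
```

In this toy example the second message adds no new entry to either dictionary; only its data values are appended to the existing containers, which is the kind of structural redundancy the paragraph above refers to.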


Fig. 9. The number of generated groups using SMCA and the modified dynamic clustering technique by varying the number and the size of datasets.

Table 3. Compressed size of the different techniques according to subsets: small, medium, large, and very large.

Datasets     SMCA    Dynamic grouping   Dynamic TF–IDF [3]   Dynamic fractal [7]   VSM [8]
Small        2375    5601               5601                 4197                  5314
Medium       5701    8862               8862                 9305                  10041
Large        22512   26786              26786                27955                 29618
Very large   64911   76239              76241                69882                 83735
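The percentages quoted in the conclusion (42.40%, 56.77%, 76% and 77.51%) are not defined explicitly in the text. A quick check against Table 3 suggests that they coincide with the compressed size of SMCA expressed as a percentage of the largest compressed size among the compared techniques, truncated to two decimals. The Python sketch below reproduces that arithmetic; this reading is an assumption inferred from the numbers, and the function name is illustrative.

```python
import math

# Compressed sizes copied from Table 3 (units as reported in the paper).
TABLE3 = {
    "Small": {"SMCA": 2375, "Dynamic grouping": 5601, "Dynamic TF-IDF": 5601,
              "Dynamic fractal": 4197, "VSM": 5314},
    "Medium": {"SMCA": 5701, "Dynamic grouping": 8862, "Dynamic TF-IDF": 8862,
               "Dynamic fractal": 9305, "VSM": 10041},
    "Large": {"SMCA": 22512, "Dynamic grouping": 26786, "Dynamic TF-IDF": 26786,
              "Dynamic fractal": 27955, "VSM": 29618},
    "Very large": {"SMCA": 64911, "Dynamic grouping": 76239, "Dynamic TF-IDF": 76241,
                   "Dynamic fractal": 69882, "VSM": 83735},
}


def smca_size_vs_largest(row):
    """SMCA size as a percentage of the largest competing size, truncated to 2 decimals."""
    largest = max(size for name, size in row.items() if name != "SMCA")
    return math.floor(row["SMCA"] / largest * 10000) / 100


for dataset, row in TABLE3.items():
    print(dataset, smca_size_vs_largest(row))
```

Under this reading, the largest competing output comes from dynamic grouping and dynamic TF–IDF for the small dataset and from VSM for the other three.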

4.3.2. Execution time

By varying the number of messages in the four datasets, we obtain the results summarized in Fig. 8, which compares the execution time (in milliseconds) of SMCA and the modified dynamic grouping technique according to the number of messages. The figure shows that the execution time of each technique increases progressively with the number of messages. The execution time of the SMCA technique is considerably lower than that of the modified dynamic grouping technique, especially for the large and very large sizes. Indeed, for the dynamic grouping technique, when the messages are large, new structures and data values appear and, consequently, the grouping time increases considerably. We estimate that for the Large (respectively Very large) datasets, the SMCA technique is 12 (respectively 45) times faster than the modified dynamic clustering technique.

4.3.3. Number of compressed formats

By varying the number of messages in the experimental datasets, we obtain the results summarized in Fig. 9, which compares the number of groups produced by SMCA and the modified dynamic grouping technique according to the number of messages. The figure shows that the SMCA technique outperforms the modified dynamic grouping technique on all the subsets of each dataset. The number of groups obtained by the dynamic grouping technique varies dynamically: it depends on the frequencies, which lead to assigning the XML messages to a smaller number of groups.

5. Conclusion

Compression techniques for XML messages that rely on aggregation models are used to remove the data and structure redundancies between multiple XML messages sharing the same destination. However, all these techniques are based on grouping and have shortcomings in terms of time and aggregation ratio when trying to reduce bottlenecks and congestion. In this paper, we have proposed a new technique called SMCA for compressing and aggregating XML messages to improve the performance of Web services. The technique aggregates all the messages into a single compressed format and eliminates the redundancy of tags and indexed paths between all the messages. To our knowledge, our technique is the first to use a single pass to aggregate and compress all the messages into one output. Regarding the compression ratio, the proposed technique (SMCA) achieved an improvement of about 42.40%, 56.77%, 76% and 77.51%, respectively, for the small, medium, large and very large datasets. Moreover, our proposed technique outperformed a well-known existing technique in terms of the number of groups, the compression ratio and the execution time according to the types of datasets.

As a more general extension of this paper, we intend to apply semantic compressors, such as gzip, PPM and XZ, to the different containers. We will also test our technique on a dataset containing more than 160 XML messages in order to provide an effective technique for minimizing network congestion. Moreover, we will improve our technique by parallelizing our algorithm in order to speed up the aggregation and compression processes.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.jpdc.2019.07.001.


References

[1] XML, 2017, https://www.w3.org/XML/.
[2] W3C, Web service description language (WSDL), 2017, Available from: http://www.w3.org.
[3] A.M. Abbas, A.A. Bakar, M.Z. Ahmad, Fast dynamic clustering SOAP messages based compression and aggregation model for enhanced performance of web services, J. Netw. Comput. Appl. 41 (2014) 80–88.
[4] J. Adiego, P. de la Puente, G. Navarro, Merging prediction by partial matching with structural contexts model, in: Data Compression Conference (DCC), Proceedings, IEEE, 2004, p. 522.
[5] C.C. Aggarwal, N. Ta, J. Wang, J. Feng, M. Zaki, Xproj: a framework for projected structural clustering of xml documents, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2007, pp. 46–55.
[6] A. Al-Shammari, C. Liu, M. Naseriparsa, B.Q. Vo, T. Anwar, R. Zhou, A framework for clustering and dynamic maintenance of XML documents, in: International Conference on Advanced Data Mining and Applications, Springer, 2017, pp. 399–412.
[7] D. Al-Shammary, I. Khalil, Dynamic fractal clustering technique for soap web messages, in: IEEE International Conference on Services Computing (SCC), 2011, pp. 96–103.
[8] D. Al-Shammary, I. Khalil, Redundancy-aware SOAP messages compression and aggregation for enhanced performance, J. Netw. Comput. Appl. 35 (1) (2012) 365–381.
[9] D. Al-Shammary, I. Khalil, Z. Tari, A distributed aggregation and fast fractal clustering approach for soap traffic, J. Netw. Comput. Appl. 41 (2014) 1–14.
[10] D. Al-Shammary, I. Khalil, Z. Tari, A.Y. Zomaya, Fractal self-similarity measurements based clustering technique for SOAP web messages, J. Parallel Distrib. Comput. 73 (5) (2013) 664–676.
[11] A. Algergawy, M. Mesiti, R. Nayak, G. Saake, XML data clustering: An overview, ACM Comput. Surv. 43 (4) (2011) 25.
[12] D. Andresen, D. Sexton, K. Devaram, V.P. Ranganath, LYE: a high-performance caching SOAP implementation, in: International Conference on Parallel Processing, ICPP, IEEE, 2004, pp. 143–150.
[13] A. Arion, A. Bonifati, G. Costa, S. D’Aguanno, I. Manolescu, A. Pugliese, XQueC: Pushing queries to compressed XML data, in: Proceedings of the 29th International Conference on Very Large Data Bases-Volume 29, VLDB Endowment, 2003, pp. 1065–1068.
[14] D. Booth, H. Haas, F. McCabe, E. Newcomer, M. Champion, C. Ferris, D. Orchard, Web services architecture, 2004, http://www.w3.org/TR/ws-arch/.
[15] D.A. Chappell, T. Jewell, Java web services, Tecniche Nuove, 2002.
[16] J. Cheney, XMLPPM: XML-conscious PPM compression, 2011, See http://www.cs.cornell.edu/People/jcheney/xmlppm/xmlppm.html.
[17] J. Cheng, W. Ng, XQzip: Querying compressed XML using structural indexing, in: International Conference on Extending Database Technology, Springer, 2004, pp. 219–236.
[18] J. Cleary, I. Witten, Data compression using adaptive coding and partial string matching, IEEE Trans. Commun. 32 (4) (1984) 396–402.
[19] P.-E. Danielsson, Euclidean distance mapping, Comput. Graphics Image Process. 14 (3) (1980) 227–248.
[20] D. Davis, M.P. Parashar, Latency performance of SOAP implementations, in: 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, IEEE, 2002, p. 407.
[21] K. Devaram, D. Andresen, SOAP optimization via parameterized client-side caching, in: Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), 2003, pp. 785–790.
[22] M.D. Dikaiakos, D. Katsaros, P. Mehra, G. Pallis, A. Vakali, Cloud computing: Distributed internet computing for IT and scientific research, IEEE Internet Comput. 13 (2009).
[23] P. Ferragina, F. Luccio, G. Manzini, S. Muthukrishnan, Compressing and searching XML data via two zips, in: Proceedings of the 15th International Conference on World Wide Web, ACM, 2006, pp. 751–760.
[24] S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Fast detection of XML structural similarity, IEEE Trans. Knowl. Data Eng. 17 (2) (2005) 160–175.
[25] J. loup Gailly, M. Adler, The gzip home page, 2003, Available from: http://www.gzip.org.
[26] M. Gudgin, SOAP Version 1.2 part 1: Messaging framework (second edition) w3c recommendation, 2017, http://www.w3.org/.
[27] D.A. Huffman, A method for the construction of minimum-redundancy codes, Proc. IRE 40 (9) (1952) 1098–1101.
[28] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (3) (1999) 264–323.
[29] C. League, K. Eng, Schema-based compression of XML data with relax NG, J. Comput. Phys. 2 (10) (2007) 9–17.
[30] W. Lee, K. Lee, S. Lee, Intermediary based architecture for mobile web services, in: The 8th International Conference on Advanced Communication Technology, ICACT, Vol. 3, IEEE, 2006, p. 6.


[31] G. Leighton, J. Diamond, T. Muldner, AXECHOP: a grammar-based compressor for XML, in: Data Compression Conference, Proceedings, IEEE, 2005, p. 467.
[32] H. Liefke, D. Suciu, Xmill: an efficient compressor for XML data, in: ACM Sigmod Record, Vol. 29, ACM, 2000, pp. 153–164.
[33] G. Lin, Y. Bie, G. Wang, M. Lei, A novel clustering algorithm based on graph matching, JSW 8 (4) (2013) 1035–1041.
[34] D. Megginson, Simple API for XML (SAX), 1998, http://www.saxproject.org/.
[35] M. Nakagawa, K. Nozaki, S. Shimojo, Web-based distributed simulation and data management services for medical applications, in: 19th IEEE International Symposium on Computer-Based Medical Systems, CBMS, IEEE, 2006, pp. 125–130.
[36] W. Ng, W.-Y. Lam, P.T. Wood, M. Levene, XCQ: A queriable XML compression system, Knowl. Inf. Syst. 10 (4) (2006) 421–452.
[37] R. Nikhil, C. Sankarram, Dynamic clustering SOAP messages based on compression and aggregation model to improve web services performance, Int. J. Emerging Technol. Eng. Res. 5 (2017) 37–42.
[38] R. Omidvar, A. Eskandari, N. Heydari, F. Hemmat, M. Feyli, An improved SSPCO optimization algorithm for solve of the clustering problem, J. Adv. Comput. Res. 9 (1) (2018) 1–16.
[39] M.-C. Rosu, A-soap: Adaptive soap message processing and compression, in: IEEE International Conference on Web Services, ICWS, IEEE, 2007, pp. 200–207.
[40] S. Sakr, XML compression techniques: A survey and comparison, J. Comput. System Sci. 75 (5) (2009) 303–322.
[41] J. Salceda, I. Díaz, J. Touriño, R. Doallo, A middleware architecture for distributed systems management, J. Parallel Distrib. Comput. 64 (6) (2004) 759–766.
[42] R. Senthikumar, S.P. Varshinee, S. Manipriya, M. Gowrishankar, A. Kannan, Query optimization of RFX compact storage using strategy list, in: 16th International Conference on Advanced Computing and Communications, ADCOM, IEEE, 2008, pp. 427–432.
[43] R. Senthilkumar, G. Nandagopal, D. Ronald, QRFXFreeze: Queryable compressor for RFX, Sci. World J. (2015).
[44] R. Senthilkumar, P. Varshinee, A. Kannan, Designing and querying a compact redundancy free XML storage, Open Inf. Syst. J. 3 (1) (2009) 98–107.
[45] J. Seward, The bzip2 and libbzip2 official homepage, 2000, http://sourceware.cygnus.com/bzip2/.
[46] P. Skibinski, S. Grabowski, J. Swacha, Fast transform for effective XML compression, in: 9th International Conference-the Experience of Designing and Applications of CAD Systems in Microelectronics. CADSM’07, IEEE, 2007, pp. 323–326.
[47] P.M. Tolani, J.R. Haritsa, XGRIND: A query-friendly XML compressor, in: 18th International Conference on Data Engineering, Proceedings, IEEE, 2002, pp. 225–234.
[48] V. Toman, Compression of XML Data, Charles University, Prague, 2003.
[49] C. Werner, C. Buschmann, Compressing SOAP messages by using differential encoding, in: IEEE International Conference on Web Services, IEEE, 2004, pp. 540–547.
[50] R.K. Wong, F. Lam, W.M. Shui, Querying and maintaining a compact XML storage, in: Proceedings of the 16th International Conference on World Wide Web, ACM, 2007, pp. 1073–1082.
[51] G. Yongming, C. Dehua, L. Jiajin, Clustering XML documents by combining content and structure, in: International Symposium on Information Science and Engineering, Vol. 1, IEEE, 2008, pp. 583–587.

Nassima Belkacem is a Ph.D. student at the Faculty of Exact Sciences, University of Bejaia, Algeria. She is a member of the LaMOS Research Unit (Modeling and Optimization of Systems). Her research interests include XML compression, querying the compressed format, aggregation of SOAP messages, clustering XML documents and image compression.

Fouzi Semchedine is currently a Professor at the University of Sétif, Algeria. He received his Ph.D. in computer science in 2011 from the University of Béjaïa (Algeria). He works in the area of routing, security and quality of service aspects in wireless sensor networks, vehicular networks and hybrid sensor and vehicular networks. He is also interested in the localization problem and the compression of data in wireless networks. He currently heads the research project ‘‘Security and Reliability of Wireless Networks (Ad hoc and Sensors)’’, code: B*00620130027.


Ahmed Al-Shammari received the B.S. degree in computer science from the University of Al-Qadisiyah, Iraq, in 2010, and the M.S. degree with an excellent grade in information technology from the National University of Malaysia in 2013. He is currently a Ph.D. candidate (2015 to 2019) and a member of the Web and Data Engineering (WDE) research group at the Swinburne University of Technology, Australia. His research interests include Web data mining, machine learning, databases, health informatics, and service computing. He serves as a senior reviewer of the Journal of Network and Computer Applications (JNCA), Elsevier. He is also a member of the Australian Computer Society (ACS) and IEEE.

Djamil Aïssani was born in 1956 in Biarritz (Basque Country, France). He started his career at the University of Constantine (Algeria) in 1978. He received his Ph.D. in November 1983 from Azerbaidjan State University (Bakou) and Kiev State University (Soviet Union). He has been at the University of Bejaia since its opening in 1983/1984. Director of Research, Head of the Faculty of Science and Engineering Science (1999–2000), Director of the research unit LaMOS (Modeling and Optimization of Systems), and Scientific Head of the Computer Science Doctorate School ReSyD, he has taught in many universities (USTHB Algiers, Annaba, Rouen, Dijon, ENITA, INPS Ben Aknoun, Boumerdes, Tizi Ouzou, Sétif, EHESS Paris, ...). He has supervised more than 20 Ph.D. theses. He has published many papers on Markov chains, queuing systems, reliability theory, performance evaluation and their applications in industrial areas such as electrical systems, telecommunication networks and computer systems.