Steganalysis against substitution-based linguistic steganography based on context clusters


Computers and Electrical Engineering 37 (2011) 1071–1081


Steganalysis against substitution-based linguistic steganography based on context clusters ☆

Zhili Chen a,b,*, Liusheng Huang a,b, Haibo Miao a,b, Wei Yang a,b, Peng Meng a

a NHPCC, Depart. of CS. & Tech., University of Science and Technology of China, Hefei 230027, China
b Suzhou Institute for Advanced Study, USTC, Suzhou 215123, China

Article info

Article history:
Received 30 January 2011
Received in revised form 30 June 2011
Accepted 6 July 2011
Available online 4 August 2011

Abstract

Linguistic steganalysis has attracted increasing interest during the past few years, stimulated by the emerging research area of linguistic steganography. However, due to the limitations of computer natural language processing capability, linguistic steganalysis is a challenging task. Existing steganalysis methods are inefficient at analyzing most substitution-based linguistic steganography methods, which preserve the syntactic and semantic correctness of cover texts. This paper provides a new steganalysis scheme against substitution-based linguistic steganography based on context clusters. In this scheme, we introduce context clusters to estimate context fitness and show how to use the statistics of context fitness values to distinguish between normal texts and stego texts. Finally, under this scheme, we present a steganalysis method for synonym substitution-based linguistic steganography. Our experimental results show that the proposed steganalysis method can analyze synonym substitution-based linguistic steganography efficiently, with a steganalysis accuracy as high as 98.86%.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

As an effective way to hide information in natural language texts, Substitution-based Linguistic Steganography (SLS) embeds information by substituting certain parts of natural language texts, such as words, phrases, and even sentences, with semantically equivalent peer parts according to the hidden information. Such information hiding is widely used in information transmission and storage, including by malicious parties, posing a potential threat to information security concerning privacy, society and nation. Nevertheless, linguistic steganalysis is a challenging task for researchers and there is still little research on it.

In general, linguistic steganalysis uses statistical methods to differentiate between normal texts and stego texts. However, because SLS methods make only a few modifications to the stego texts, existing statistical steganalysis methods against SLS are not efficient enough [1]. Although there have been recent research efforts on analyzing SLS methods using semantic methods [2,3], it is still hard for them to meet practical accuracy requirements. In short, the performance of previous steganalysis methods needs to be improved.

In this paper, we propose a new steganalysis scheme to analyze SLS. As SLS methods cannot embed hidden information without affecting the context fitness of the substituted parts, changes in context fitness provide a critical clue for SLS steganalysis. We investigate the estimation of context fitness by introducing context clusters. Here, a context cluster is a composition comprising a substitution element (a part of the text being substituted) and some of the context elements

☆ Reviews processed and approved for publication to Khan.

* Corresponding author at: Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou 215123, China. E-mail address: [email protected] (Z. Chen).
0045-7906/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.compeleceng.2011.07.004


(parts indicating the context of the text), where the elements have a strong correlation. We introduce the definition of the Context Cluster Score (CCS) to measure the strength of this correlation. A substitution element may have several related context clusters, and their average CCS value is used to indicate its context fitness. We then present the steganalysis scheme against SLS, making use of the context fitness values of substitution elements in texts. Finally, as an instance, we examine steganalysis against synonym substitution-based linguistic steganography using this scheme. Experimental results show that the steganalysis is fairly promising.

2. Related work

According to substitution elements, SLS can be classified into methods based on synonym substitution [4–11], synonymous rule substitution [12], synonymous sentence substitution [13], machine translation [14–16] and so on. Among these, Synonym Substitution-based Linguistic Steganography (SSLS) is most widely used. In an SSLS system, hidden messages are embedded by substituting a word with one of its synonyms. The stego text preserves the same meaning before and after synonym substitution. In this section, we introduce the T-Lex system, one of the few implemented SSLS systems, review the previous steganalysis methods against this system and discuss their weaknesses.

2.1. T-Lex system

The key problem that an SSLS system faces is how to define synonym sets. In natural language, words often have many meanings in different contexts. Determining the exact meaning in a certain context is a hard problem known as semantic disambiguation in the Natural Language Processing (NLP) area. The definition of synonym sets must guarantee that all synonym sets are mutually disjoint in order not to cause semantic disambiguation problems. However, the multi-sense property of words makes this definition difficult.
For example, words a and b may be synonyms in one context and words b and c synonyms in another, while words a and c have very different meanings in any context. Winstein proposed a solution for synonym set definition in the T-Lex system. He used WordNet [17] to select synonyms with correct senses. WordNet is a large lexical database of English, developed under the direction of Professor George A. Miller at Princeton University. In the WordNet database, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each of which expresses a distinct concept. Synsets are interlinked both conceptually-semantically and lexically. In the T-Lex system, only words that belong to exactly the same synsets in the WordNet database are grouped into a synonym set. For example, assume that words a, b, c only belong to the two synsets S1:{a, b}, S2:{a, b, c}. In this case, even though they have more than one sense, words a and b can still be interchanged semantically in all contexts. Applying the criterion described above, Winstein obtained synsets containing about 30% of the 70,803 single-word entries in WordNet as the synonym database of the T-Lex system. The average synonym set size is 2.56, while the maximum is 13 and the minimum is 2.

The T-Lex system currently hides only text messages in cover texts, but it is easy to modify it to hide any kind of message. A given text message is embedded into the cover text using the synonym sets as follows. First, the letters of the message text are Huffman coded according to letter frequencies. Then, the Huffman-coded binary string is represented in mixed-base form. For example, suppose that the binary string to be embedded is (0 1 0)₂ and the following sentences are currently being considered.

. . . A bicycle was lying upon the { roadside; 0: wayside, 1: roadside } grass . . .

. . . and he had a pair of { shrewdly; 0: shrewdly, 1: astutely, 2: sagaciously, 3: sapiently } careless boyish eyes . . .

In the two sentences, the first word in each pair of braces, with no leading number, is the original word in the cover text, and the subsequent words constitute its synonym set. In the mixed-base form, each digit has a different base. For the hidden information (0 1 0)₂ = 2, we have

4a₁ + a₀ = 2

with the constraints 0 ≤ a₀ < 4 and 0 ≤ a₁ < 2. Thus, we get a₀ = 2 and a₁ = 0. This indicates that "roadside" and "shrewdly" should be replaced by "wayside" and "sagaciously".

2.2. Previous steganalysis methods against SLS systems

In [1], two shortcomings of the T-Lex system are pointed out. One is that it sometimes substitutes words with synonyms that do not agree with correct English usage; the other is that the words after substitution do not agree with the genre and the author style of the cover text. More generally, the T-Lex system compromises context fitness when performing substitution, and most existing steganalysis methods make use of this observation.
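As a concrete illustration of the mixed-base decomposition described above, the following is a minimal Python sketch (function and variable names are ours, not T-Lex's):

```python
def to_mixed_base(value, bases):
    """Decompose value into mixed-base digits, least significant first.

    Each digit a_k satisfies 0 <= a_k < bases[k]; in T-Lex the bases
    are the sizes of the synonym sets encountered in the cover text.
    """
    digits = []
    for base in bases:
        digits.append(value % base)
        value //= base
    return digits

# The example above: hidden bits (010)_2 = 2; digit a0 has base 4
# (four synonyms of "shrewdly") and a1 has base 2 (two synonyms of
# "roadside"), so 4*a1 + a0 = 2 gives a0 = 2 ("sagaciously") and
# a1 = 0 ("wayside").
a0, a1 = to_mixed_base(2, [4, 2])
```

Decoding simply reverses the process: the receiver reads off which synonym appears in each set and reassembles the digits into the hidden value.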


Up to now, there are only a few steganalysis methods against SLS systems, most of which focus on SSLS systems. We mainly discuss the Steganalysis based on Language Model and Support Vector Machine (SALM–SVM) [1], the Steganalysis based on Synonym Pairs (SASP) [2] and the Steganalysis based on Context Information (SACI) [3].

SALM–SVM utilizes an n-gram language model to obtain classification features of testing sentences, and then uses an SVM classifier to classify them into normal sentences and stego sentences. Note that SALM–SVM analyzes sentences instead of texts, without the synonym dictionary, which is more difficult. However, although SALM–SVM is reported to have a high recall rate of 84.9%, it has low accuracy and precision [1].

SASP introduces the notion of synonym pairs and uses it to analyze Chinese texts carrying information hidden by a synonym substitution algorithm. The experimental results show that the false negative rate is approximately 4% and the false positive rate is approximately 9.8%; the steganalysis accuracy (= 1 − false negative rate − false positive rate) is about 86.2% [2].

SACI makes use of the context information of synonyms to analyze the differences between normal texts and stego texts processed by the T-Lex system. The word frequency information needed in the context measurement is obtained by querying the Google search engine instead of a specific static corpus. Experiments show that the accuracy is about 90% [3].

As we can see, although some steganalysis methods have been proposed, their performance still needs to be enhanced. The steganalysis method proposed in this paper considerably improves steganalysis performance.

3. Definitions and notation

As we aim to provide a general steganalysis scheme for SLS systems, some definitions and notation need to be introduced to aid the presentation.

3.1. Definitions

3.1.1. Substitution element
A part of a natural language text that is suitable for substitution. A substitution element may be a word, a phrase, a sentence and so on, depending on the SLS system used; e.g., in the scenario of synonym substitution, a substitution element is a synonym word.

3.1.2. Context element
A part of a natural language text that is used for the estimation of a substitution element's context. A context element is normally a peer text part of the substitution element; e.g., in the case of synonym substitution, context elements are words used for estimating the synonym word's context.

3.1.3. Context window
The set of context elements within a certain distance from the substitution element. It is assumed that only the context elements within the context window can affect the context of the substitution element.

3.1.4. Context cluster
In natural language texts, a context cluster is a composition of a substitution element and some of the context elements in its context window. Generally, there should be a strong correlation among the member elements of a context cluster.

3.1.5. Context cluster score (CCS)
A score value associated with a context cluster, indicating the correlation of the elements in the context cluster.

3.1.6. Substitution set
A set of substitution elements that can be exchanged with each other. Substitution elements are grouped into sets so that the elements in the same set are equivalent in certain aspects, such as syntax, semantics and the like.

3.1.7. Substitution dictionary (SD)
The dictionary containing the substitution sets; e.g., in the case of synonym substitution, the substitution dictionary is the synonym dictionary.

3.1.8. Substitution list
A list of substitution sets from the SD used in a certain text when applying substitution-based linguistic steganography methods.

3.1.9. Context fitness
The extent to which a substitution element fits into its context according to the statistics of a known text corpus.


3.2. Notation

We use the following notation. A testing text T is to be analyzed. The substitution dictionary is denoted by D and the substitution list by L = {S₀, S₁, …, S_{n−1}}, where n is the count of substitutions made and S_i (0 ≤ i < n) is the substitution set used. The corresponding set of context windows is denoted by C = {C₀, C₁, …, C_{n−1}}, where C_i = {c_{i,0}, c_{i,1}, …, c_{i,W−1}, c_{i,W}, c_{i,W+1}, …, c_{i,2W−1}} (0 ≤ i < n) is the context window of substitution element s_i (note that the position of s_i in C_i is between c_{i,W−1} and c_{i,W}) and W is half the context window size.

4. Steganalysis scheme against SLS

An SLS system substitutes an original substitution element in a cover text with one of the substitution elements in the same substitution set according to the hidden information, so the original element is probably replaced by another. As a result, the new substitution element may not fit the original context well. In this section, we discuss how to estimate the context fitness of a substitution element and propose a steganalysis scheme making use of this estimation.

4.1. Context cluster and context cluster score (CCS)

We now consider the estimation of the context fitness of substitution elements in texts. Usually, we estimate how well a key word fits into its context by counting the context words that are close to the word itself. However, this method treats all related context words equally, while in fact context words are often not of the same importance. Some context words are more correlated with the word considered than others, due to language usage or their distance from the word. For this reason, we consider the context fitness measurement more thoroughly. By observation, we found that some context elements occur very frequently with a certain substitution element. Therefore, we can use this property to estimate how well the substitution element fits into its context.
According to the notation in Section 3.2, we obtain the element compositions of substitution element s_i (0 ≤ i < n), which are the candidate context clusters, as follows:

ζ_{i,j} = {s_i, c_{i,i₀}, c_{i,i₁}, …, c_{i,i_{K_j−2}}}   (1)

where 0 ≤ i_k ≤ 2W − 1 (0 ≤ k ≤ K_j − 2), K_j (2 ≤ K_j ≤ 2W + 1) is the size of the element composition ζ_{i,j}, and the set {i₀, i₁, …, i_{K_j−2}} is the j-th combination from the set {0, 1, …, W−2, W−1, W, W+1, …, 2W−2, 2W−1} (here i_k should be i_{j,k}, but we omit the subscript j for clarity), with the c_{i,i_k} taken in a certain order, such as alphabetical order. We can see that the size K of an element composition (we omit the subscript j for convenience) is between 2 and 2W + 1, so the number of element compositions for a substitution element is at most 2^{2W} − 1. Suppose that the frequencies of the elements in the element composition ζ_{i,j} are f₀, f₁, …, f_{K−1} respectively and the frequency of the element composition itself is f_ζ; then we define the score of the element composition, which measures the correlation among the member elements, as follows:

V_ζ = f_ζ · K^a / Σ_{i=0}^{K−1} lg(1 + f_i)   (2)

where a is called the accelerating exponent. It compensates for the severe decrease in the frequency of an element composition as its size K increases. In our later experiments, we set it to 3 empirically. In Eq. (2), the score V_ζ increases as the frequency and size of element composition ζ increase, and decreases as the frequencies of the member elements increase. This agrees with experience and intuition: the more frequently the member elements occur on their own, the more probable the composition is by chance, and the weaker its inherent correlation.
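To make the composition scoring concrete, the following Python sketch enumerates candidate element compositions and scores one by Eq. (2); the naming is ours, and we assume lg denotes the base-10 logarithm:

```python
import math
from itertools import combinations

def composition_score(comp_freq, member_freqs, a=3):
    """Eq. (2): V = f_comp * K^a / sum_i lg(1 + f_i).

    comp_freq is the corpus frequency of the whole composition,
    member_freqs are the frequencies of its K member elements, and
    a is the accelerating exponent (set to 3 in the paper).
    """
    K = len(member_freqs)
    return comp_freq * K ** a / sum(math.log10(1 + f) for f in member_freqs)

def candidate_compositions(sub_elem, window, max_extra=3):
    """Candidate compositions: the substitution element plus any
    non-empty subset of its context window, capped here at
    max_extra context elements and sorted alphabetically."""
    for r in range(1, max_extra + 1):
        for ctx in combinations(sorted(window), r):
            yield (sub_elem,) + ctx
```

The cap on subset size mirrors the later experimental choice of small clusters; without it, the 2^{2W} − 1 candidates per substitution element would be impractical for large windows.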

Fig. 1. A context cluster and an element composition.


Among the element compositions of a substitution element, we can find those that have high score values. These element compositions have strong correlations among their member elements, so we call them context clusters. Mathematically, we need a threshold μ: an element composition is a context cluster if its score is not less than the threshold value, and the score is then its context cluster score (CCS). Fig. 1 shows a context cluster and an element composition when 2W = 6 and μ = 10. We now estimate the context fitness of a substitution element using the CCS values of the context clusters containing it. Suppose that substitution i has n_i context clusters and its context cluster set is Φ_i; the context fitness of the substitution, denoted by γ_i, is defined as follows:

γ_i = (1/n_i) · Σ_{ζ ∈ Φ_i} V_ζ   (3)

where V_ζ is the CCS value of context cluster ζ.

4.2. Context Maximum Rate (CMR) and Context Maximum Deviation (CMD)

On the basis of context fitness, we define two classification features of a text: the Context Maximum Rate (CMR) and the Context Maximum Deviation (CMD). Suppose that the context fitness of the substitution element s_i in the text and the maximum context fitness over its substitution set are denoted by γ_i and γ_{i,max}; we calculate the CMR and CMD, denoted by λ and θ respectively, as follows:



λ = (1/n) · Σ_{i=0}^{n−1} [γ_i = γ_{i,max}]   (4)

θ = (1/n) · Σ_{i=0}^{n−1} (γ_i − γ_{i,max})²   (5)

where [γ_i = γ_{i,max}] equals 1 if γ_i = γ_{i,max}, and 0 otherwise.
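A minimal sketch of the two features in Python (names are ours; CMR is the paper's λ and CMD its θ):

```python
def cmr_cmd(fitness, fitness_max):
    """Compute the CMR (Eq. 4) and CMD (Eq. 5) of a text.

    fitness[i] is the context fitness of the i-th substitution
    element; fitness_max[i] is the maximum context fitness over
    that element's substitution set.
    """
    n = len(fitness)
    cmr = sum(f == fmax for f, fmax in zip(fitness, fitness_max)) / n
    cmd = sum((f - fmax) ** 2 for f, fmax in zip(fitness, fitness_max)) / n
    return cmr, cmd
```

In a normal text most substitution elements are still the best-fitting members of their sets, so CMR stays close to 1 and CMD close to 0; substitution pushes both features in the opposite direction.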

Since the context fitness estimates how well a substitution element fits into its context, and the substitution elements in normal texts fit their contexts better than those in stego texts, we can infer that the context fitness values in normal texts are generally greater than those in stego texts. Therefore, a normal text should have a greater CMR value and a smaller CMD value than a stego text. Figs. 2 and 3 show comparisons of the λ and θ values of 100 normal texts and 100 stego texts, which demonstrate this inference well.

4.3. Steganalysis scheme

When hiding information, the SLS algorithm first scans a cover text to match the elements in the substitution dictionary. Then, if any part of the text matches an element in the dictionary, the algorithm substitutes the part by one of

Fig. 2. Comparison of λ values of 100 normal texts and 100 stego texts.


Fig. 3. Comparison of θ values of 100 normal texts and 100 stego texts.

Fig. 4. The steganalysis scheme against substitution-based linguistic steganography.

the substitution elements in the matched substitution set according to the hidden bits. This procedure is repeated until the hidden message is exhausted or the end of the cover text is reached. The information hiding process is very straightforward.

According to this process, we provide a steganalysis scheme against SLS, as shown in Fig. 4. In the steganalysis scheme, the input is a testing text T and the output is the steganalysis result. In addition, we need a training corpus to obtain the basic features of normal texts, and training text sets to yield training feature sets and then the classification model of the SVM classifier [18]. In the testing flow, the testing text T is first scanned and the Substitution Information (SI) and Context Information (CI) are extracted. Then the classification feature generator evaluates the λ and θ values using SI, CI and the basic features. Finally, the SVM classifier classifies the testing text as a normal text or a stego text according to the classification features, namely the λ and θ values.


In the classification training flow, the inputs are training texts whose types are known (either normal texts or stego texts). The evaluation of the λ and θ values is the same as in the testing flow. However, the evaluated λ and θ values of these type-known texts are collected as training feature sets and used for training the SVM classifier to produce the classification model. In the basic training flow, the basic features, such as the frequencies of substitution elements and context elements and the frequencies of the context clusters, are obtained using an SD and a large training corpus. Among the flows described above, the two training flows need not be executed every time steganalysis is performed. Once the outputs of the training flows have been generated, it is unnecessary to rerun them unless some related parameters change.

5. Steganalysis against T-Lex system

In this section, we use the steganalysis scheme described in the previous section to analyze the T-Lex system. In the case of synonym substitution, we use the names corresponding to those in the general SLS case, as shown in Table 1. In the steganalysis scheme shown in Fig. 4, we designate the basic features as the frequencies of synonym words, context words and word clusters related to the synonym words. The synonym words come from the synonym dictionary, and the context words are words that occur in the training corpus excluding synonym words. We set the context window size 2W = 10 and then get the word clusters of sizes not more than 3 (K ≤ 3) within the context window. We design the feature generator that generates the classification features, namely the λ and θ values of texts, using the notions and equations described in the previous section. The steganalysis method consists of the basic feature training algorithms and the classification feature generation algorithm, shown as Algorithms 1–3.

Algorithm 1. Basic feature training algorithm obtaining word frequencies
Step 1.
For each text T in the training corpus, do
(1) Scan text T word by word.
(2) Count the frequency of each word in text T.
Step 2. Accumulate the frequencies of each word to get its word frequency in the training corpus and obtain the Total Word Frequency Table (TWFT).

Algorithm 2. Basic feature training algorithm obtaining word cluster frequencies
Step 1. For each text T in the training corpus, do
(1) Scan text T word by word.
(2) Match words against the synonym dictionary.
(3) Count the frequencies of the word clusters of each synonym word in text T.
Step 2. Accumulate the frequencies of each word cluster over the training corpus and obtain the Word Cluster Frequency Table (WCFT).
Step 3. Use TWFT and WCFT to generate the Word Cluster Score Table (WCST) by Eq. (2).

Algorithm 3. Algorithm for classification feature generation
Step 1. For each word w in the testing text T, do
If word w is a synonym word, compute the context fitness values of the synonym words in word w's synonym set by Eq. (3) using WCST, and push the context fitness value γ_i of word w and the maximum value γ_{i,max} among them onto arrays L and Lmax.
Step 2. Calculate the CMR and CMD values λ and θ from the arrays L and Lmax by Eqs. (4) and (5).
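Algorithm 3 might be sketched as follows; the WCST is modeled here as a dict mapping each synonym word to the CCS values of its context clusters (a simplification with our own names: in the full method the cluster scores depend on the word's actual context window in the text, not on the word alone):

```python
def classification_features(words, synonym_sets, wcst):
    """Build the arrays L and Lmax of Algorithm 3.

    words: the testing text as a sequence of words.
    synonym_sets: dict mapping each synonym word to its synonym set.
    wcst: dict mapping a word to the CCS values of its context
          clusters; context fitness is their average, per Eq. (3).
    """
    def fitness(w):
        scores = wcst.get(w, [])
        return sum(scores) / len(scores) if scores else 0.0

    L, Lmax = [], []
    for w in words:
        if w in synonym_sets:
            L.append(fitness(w))
            Lmax.append(max(fitness(s) for s in synonym_sets[w]))
    return L, Lmax
```

The two arrays then feed directly into the CMR and CMD computations of Eqs. (4) and (5).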

Table 1
Name mappings.

SSLS case             General SLS case
Synonym dictionary    Substitution dictionary
Synonym word          Substitution element
Synonym set           Substitution set
Context word          Context element
Word cluster          Context cluster


6. Experiment and analysis

In our experiment, we use a corpus of 1000 classic English literature works containing thousands of text files as the basic corpus. We build the corpora B-Corpus, C-Corpus, CD-Corpus, and S-Corpus from the basic corpus. B-Corpus, C-Corpus, and S-Corpus consist of works whose authors' last names begin with "B", "C" and "S", respectively, while CD-Corpus consists of works written by Charles Dickens. The remaining works in the basic corpus compose another corpus named T-Corpus, which we use as the training corpus. Besides the training corpus, the SVM classifier requires two text sets, called the SVM training text set and the SVM testing text set, both of which have subsets named "Normal Text Set" and "Stego Text Set". The corpora and their usages are listed in Table 2. Note that in Table 2, "Normal Text Set" and "Stego Text Set" are sets of natural texts and stego texts respectively, while "[X-Corpus]" means that the set of stego texts consists of texts from X-Corpus processed by the T-Lex system.

Figs. 5 and 6 show the classification feature distributions of the texts from the SVM training and SVM testing text sets. The x-axis and y-axis represent the CMR and CMD values of a text, denoted by λ and θ respectively. We can see that the normal texts have greater λ values and smaller θ values than the stego texts. As a result, red star points representing stego texts fall in the upper-left corner and blue plus points representing normal texts fall in the lower-right corner. These distribution characteristics accord with the inference made at the end of Section 4.2.

In fact, we only make use of the first several substitutions in the steganalysis. Here, a substitution is a synonym word that can carry hidden information in a text. For example, a substitution count of 10 means that only the first 10 synonym words in each text are examined in the steganalysis.
Using the data described in Table 2 with different substitution counts, the resulting accuracies are shown in Fig. 7. From the figure, we can see that the steganalysis accuracy increases on the whole as the substitution count increases. The accuracy becomes fairly high even while the substitution count is still small: when the substitution count is only 10, 20, and 30, the steganalysis accuracy exceeds 90%, 95% and 97%, respectively. This means that even when only 10 bits of information are hidden in each cover text, resulting in 10 substitutions, the proposed method is still able to detect the stego texts with an accuracy of about 90%. In practice, the hidden information is normally far larger than 10 bits, resulting in more substitutions, so the steganalysis accuracy should be higher according to Fig. 7.

In order to strictly assess the performance of our steganalysis system, we apply the notions of precision and recall in addition to the frequently used accuracy [19]. From Fig. 7, we can see that the differences in accuracy are small when the substitution count is greater than 20, and so are the other measures. As an example, Table 3 shows the different parts of the steganalysis results of our experiment when the substitution count is 80.

Table 2
Corpus structure for the steganalysis experiment.

Type                    Subtype           From-Corpus   Count of text files
Training corpus         ——                T-Corpus      >5000
SVM training text set   Normal text set   CD-Corpus     100
                        Stego text set    [B-Corpus]    100
SVM testing text set    Normal text set   S-Corpus      220
                        Stego text set    [C-Corpus]    220

Fig. 5. Classification feature distribution of texts from training text set.


Fig. 6. Classification feature distribution of texts from testing text set.

Fig. 7. Steganalysis accuracies with different substitution counts.

Table 3
Different parts of steganalysis results when the substitution count is 80.

                  Our steganalysis system
Fact              Stego texts   Normal texts
Stego texts       217           3
Natural texts     2             218

In Table 3, the true positive is tp = 217, the false positive is fp = 2, the false negative is fn = 3 and the true negative is tn = 218, then the precision, recall and accuracy are as follows:

precision = tp / (tp + fp) = 217 / (217 + 2) = 99.09%

recall = tp / (tp + fn) = 217 / (217 + 3) = 98.64%

accuracy = (tp + tn) / (tp + fp + fn + tn) = (217 + 218) / (217 + 2 + 3 + 218) = 98.86%
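The three measures are straightforward to reproduce from the confusion matrix (a sketch; the function name is ours):

```python
def steganalysis_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Counts from Table 3 (substitution count 80).
p, r, a = steganalysis_metrics(217, 2, 3, 218)
# p ≈ 99.09%, r ≈ 98.64%, a ≈ 98.86%
```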

Table 4
Comparison results between the SACI method using the Yahoo search engine and our method.

Steganalysis methods   Accuracy (%)   Recall (%)   Precision (%)
SACI                   71.43          62.83        75.86
Our method             95.71          91.43        100

We can see that the proposed steganalysis system has high precision, recall and accuracy when the substitution count is suitably large. Additionally, we performed experiments to compare our steganalysis method with a previous one, the SACI method, under the same experimental conditions. The reasons for selecting the SACI method for comparison are as follows. First, the SALM–SVM method mainly focuses on analyzing stego sentences while ours aims to analyze stego texts, so a direct comparison makes little sense. Second, the SACI method has the highest accuracy reported in the literature. The comparison results are shown in Table 4. Since automated querying of Yahoo consumes a lot of time, we select 30 texts for training and 70 texts for testing, and only the first 30 substitutions in each text are used in these experiments. We can see that our method performs much better in accuracy, recall and precision. It should be pointed out that the SACI method originally used the Google search engine to obtain word frequency information, but in our current experiments we apply the Yahoo search engine instead, since automated querying of Google is forbidden at present. Indeed, the two internet corpora are somewhat different. The most obvious difference is that the Google corpus seems to be much larger than Yahoo's. For example, the document count of the word "a" is about 25,270,000,000 in the Google corpus, while it is 7,540,000,000 in the Yahoo corpus. It appears that the absence of the Google corpus severely impacts the performance of the SACI method; however, even compared to the results reported in the literature, our method still performs better.

7. Conclusion and future work

This paper has provided a new steganalysis scheme against substitution-based linguistic steganography based on context clusters.
In the steganalysis scheme, the notion of context clusters has been introduced to evaluate context fitness differences between normal texts and stego texts. The steganalysis scheme has then been implemented to analyze the T-Lex system, and the experimental results have shown that the steganalysis is fairly promising. The experiments have illustrated that the steganalysis under the proposed scheme has a high accuracy even when the substitution count is small. The highest accuracy of the steganalysis exceeds 98%. Furthermore, comparison with the previous steganalysis method using context information has shown that the proposed method has a great advantage. The steganalysis scheme can be applied to steganalysis of general SLS, not only of synonym substitution-based linguistic steganography. Future work is to further validate the steganalysis scheme by analyzing other SLS methods, such as linguistic steganography based on synonymous sentence substitution.

Acknowledgements

This work was supported by the Major Research Plan of the National Natural Science Foundation of China (No. 90818005), the National Natural Science Foundation of China (No. 60903217), the Natural Science Foundation of Jiangsu Province of China (No. BK2010255), and the Scientific and Technical Plan of Suzhou (No. SYG201010). The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.

References

[1] Cuneyt MT, Umut T, Mercan T, Edward JD. Attacks on lexical natural language steganography systems. In: Proc. SPIE, San Jose, CA, USA; February 2006. p. 97–105.
[2] Gang L, Xingming S, Lingyun X, Yuling L, Can G. Steganalysis on synonym substitution steganography. J Comp Res Dev 2008;45(10):1696–703.
[3] Zhenshan Y, Liusheng H, Zhili C, Lingjun L, Xinxin Z. Detection of synonym-substitution modified articles using context information. In: Proc. 2nd Int. Conf. Future Generation Communication and Networking, Sanya, Hainan Island, China; December 2008. p. 134–9.
[4] Keith W. Lexical steganography through adaptive modulation of the word choice hash. Available from: http://alumni.imsa.edu/~keithw/tlex/lsteg.ps (accessed April 2010).
[5] Atallah MJ, McDonough CJ, Raskin V, Nirenburg S. Natural language processing for information assurance and security: an overview and implementations. In: Proc. 9th ACM/SIGSAC New Security Paradigms Workshop, New York, USA; September 2000. p. 51–65.
[6] Richard B. Towards linguistic steganography: a systematic investigation of approaches, systems, and issues. Undergraduate final-year project, University of Derby; 2004.
[7] Igor AB, Alexander G. Synonymous paraphrasing using WordNet and Internet. In: Proc. 9th Int. Conf. Applications of Natural Language to Information Systems, LNCS 3136, Salford; June 2004. p. 312–23.
[8] Igor AB. A method of linguistic steganography based on collocationally-verified synonymy. In: Proc. 6th Information Hiding, LNCS 3200, Toronto, Canada; May 2004. p. 180–91.
[9] Hiram C, Igor AB. Using selectional preferences for extending a synonymous paraphrasing method in steganography. In: Proc. CIC2004: XIII Congreso Internacional de Computacion; October 2004. p. 231–42.
[10] Umut T, Mercan T, Mikhail JA. The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions. In: Proc. 8th ACM Multimedia and Security Workshop, New York, USA; 2006. p. 164–74.
[11] Yuling L, Xingming S, Can G, Hong W. An efficient linguistic steganography for Chinese text. In: Proc. IEEE Int. Conf. Multimedia & Expo (ICME), China; July 2007. p. 2094–97.
[12] Steven EH. Stegparty. Available from: http://www.fasterlight.com/hugg/projects/stegparty.html (accessed April 2010).
[13] Brian M. Syntactic information hiding in plain text. Master thesis, Trinity College Dublin; 2001. Available from: https://www.cs.tcd.ie/Brian.Murphy/publications/murphy01hidingMasters (accessed April 2010).
[14] Christian G, Krista G, Ludmila A, Ryan S, Mikhail A. Translation-based steganography. Technical Report TR 2005-39, Purdue CERIAS; 2005.
[15] Christian G, Krista G, Ludmila A, Ryan S, Mikhail A. Translation-based steganography. In: Proc. Information Hiding Workshop, LNCS 3727, Barcelona, Spain; June 2005. p. 213–33.
[16] Ryan S, Christian G, Mikhail A, Krista G. Lost in just the translation. In: Proc. 21st Annual ACM Symposium on Applied Computing, Dijon, France; April 2006. p. 338–45.
[17] WordNet: a lexical database for the English language. Available from: http://wordnet.princeton.edu/ (accessed April 2010).
[18] Chih-Chung C, Chih-Jen L. LIBSVM: a library for support vector machines; 2001. Software available from: http://www.csie.ntu.edu.tw/~cjlin/libsvm (accessed April 2010).
[19] Christopher DM, Hinrich S. Foundations of statistical natural language processing. Beijing: Publishing House of Electronics Industry; January 2005.

Zhili Chen received his Ph.D. degree in computer science from University of Science and Technology of China (USTC) in 2009. He is currently a postdoctoral research fellow of School of Computer Science and Technology at USTC. His research interests include information hiding, linguistic steganography and authorship analysis.

Liusheng Huang received his M.S. degree in computer science from University of Science and Technology of China (USTC) in 1988. He is currently a professor and Ph.D. supervisor of School of Computer Science and Technology at USTC. His research interests are in the areas of wireless sensor networks, information security, distributed computing and high performance algorithms.

Haibo Miao is currently a Ph.D. student in School of Computer Science and Technology at University of Science and Technology of China (USTC). His research interests mainly include information security, information hiding and covert channel.

Wei Yang received his Ph.D. degree in computer science from University of Science and Technology of China (USTC) in 2007. He is currently a postdoctoral research fellow of School of Computer Science and Technology at USTC. His research interests include information theory, quantum information and cryptology.

Peng Meng is currently a Ph.D. student in School of Computer Science and Technology at University of Science and Technology of China (USTC). His research interests include steganography, steganalysis, and natural language processing.