Future Generation Computer Systems 94 (2019) 272–281
Service matchmaking for Internet of Things based on probabilistic topic model

Yezheng Liu a,b, Tingting Zhu a,b, Yuanchun Jiang a,b,∗, Xiao Liu c

a School of Management, Hefei University of Technology, Hefei, Anhui 230009, PR China
b Key Laboratory of Process Optimization and Intelligent Decision Making, Ministry of Education, Hefei, Anhui 230009, PR China
c School of Information Technology, Deakin University, Melbourne, Australia
Highlights

• Proposed a probabilistic topic model for service matchmaking in IoT.
• Utilized the latent service factor to calculate the text similarity of the services.
• Proposed a new signature matchmaking algorithm that can support IoT services.
• The proposed method outperforms all the other benchmarks in terms of precision and recall.
Article info

Article history:
Received 30 August 2018
Received in revised form 1 November 2018
Accepted 26 November 2018
Available online 1 December 2018

Keywords:
IoT; Probabilistic topic model; Service matchmaking; Latent Dirichlet allocation
Abstract

The Internet of Things (IoT) is one of the most rapidly growing technologies; it enables things to interact with each other through a global network of machines and devices. Service matchmaking, which allows users to query, search and discover appropriate services, is a critical and challenging task in the IoT. Currently, semantic modeling methods are mainly used to solve service matchmaking problems. However, as the text of a service description is often short, these methods face challenges such as sparse and high-dimensional features. To address these issues, we propose a service matchmaking method based on Weighted-Word Latent Dirichlet Allocation (WW-LDA). By distinguishing and incorporating the importance of different words, WW-LDA extracts latent semantic factors as a probabilistic topic model. Based on the results of WW-LDA, we further design a new service matchmaking algorithm to discover IoT services. Experimental results on real-world datasets show that our proposed method performs much better than other existing methods in terms of precision and recall.

© 2018 Elsevier B.V. All rights reserved.
∗ Corresponding author at: School of Management, Hefei University of Technology, Hefei, Anhui 230009, PR China.
E-mail addresses: [email protected] (Y. Liu), [email protected] (T. Zhu), [email protected] (Y. Jiang), [email protected] (X. Liu).
https://doi.org/10.1016/j.future.2018.11.040

1. Introduction

As one of the most rapidly growing technologies, the Internet of Things (IoT) enables things to interact with each other through a global network of machines and devices [1,2]. IoT is regarded as one of the most vital and significant domains of future technology and has been receiving wide attention from all walks of life. In 2015, the number of sensing devices in the IoT exceeded the total number of smartphones and personal computers, and it is estimated to approach 240 billion by 2020. Given such a massive scale of connection, the Internet of Things enables the integration of the physical world with the computer world, which can bring
invaluable benefits such as improving efficiency, reducing human labor, and accelerating economic growth [3–6].

Service matchmaking is a critical and challenging task in IoT: it enables IoT service recommendation, service composition and service provisioning to implement advanced functionalities, and it allows users to query, search and discover appropriate services based on various requirements. Currently, the most commonly used approach is keyword matchmaking, which discovers matching services from the information generated in the IoT. Although this approach is very popular, it cannot fully recognize the user's needs and intentions, and it is prone to errors of semantic similarity. For example, if we use ''give me my apple'' to search for a service, a device may offer us either a piece of fruit or a phone. To solve the problem of semantic recognition, most researchers propose semantic modeling methods that extract features from the service descriptions and queries [7–9]. Semantic modeling transforms texts into structured data that preserves semantic meaning.
The most widely used semantic modeling method is the probabilistic topic model [10–12], a type of Bayesian method that seeks to extract latent semantic structures from text collections. The probabilistic topic model maps high-dimensional word count vectors into a low-dimensional representation and characterizes documents with latent semantic factors rather than individual words. The latent factors uncovered by topic models can solve the problems in service matchmaking where different words refer to the same meaning and the same word refers to different meanings. Dimensionality reduction also decreases the number of steps needed to compare services and simplifies the matchmaking process. Based on these properties, many researchers have utilized topic models for service matchmaking. Cassar et al. proposed a non-logic-based service matchmaking method that uses Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) to extract hidden structures from service description text [13]. Besides using traditional topic models, researchers also match services using hybrid methods in IoT. For instance, Cassar et al. proposed a hybrid service matchmaking method [14] that combines latent semantic analysis with a weighted-link analysis based on a special signature. Although the results show that methods based on probabilistic topic models outperform other solutions, these methods ignore an important problem: service description documents are short texts. Therefore, directly applying statistical topic models cannot effectively estimate the implicit semantic factors. To address this issue, Wei et al. used the English Wikipedia to enrich service text descriptions [15], and Liu et al. proposed a new crowdsourcing-based topic modeling method that incorporates human cognition [16].
These methods can alleviate the problems of data sparseness and high-dimensional features in IoT service matchmaking. However, although introducing external knowledge and incorporating human cognition may be effective for short texts, they inevitably increase time overheads and manpower costs. To tackle these challenges, this paper proposes a weighted-word topic modeling method for extracting latent semantic factors from service descriptions. To alleviate the problems of data sparseness and high-dimensional features, we first use TextRank to determine the importance of different words and assign different weights to these words based on their importance. Then, we propose a new topic model, Weighted-Word Latent Dirichlet Allocation (WW-LDA for short), which incorporates words with different weights to extract latent semantic factors. Finally, a service matchmaking algorithm is proposed based on the results of WW-LDA. Compared with existing service matchmaking methods, the major contributions of this paper are threefold: (1) we propose a probabilistic topic model (WW-LDA) and develop a new signature matchmaking algorithm for IoT service matchmaking; (2) we use the latent service factors to calculate the text similarity of the services, which reduces the number of service matches and accelerates the calculation of service text similarity; (3) experimental results on a real-world IoT dataset show that the proposed model outperforms the benchmarks in terms of precision and recall. The remainder of the paper is organized as follows. Section 2 reviews the related work on probabilistic topic models and service matchmaking in IoT. Section 3 illustrates the principle of the WW-LDA model and the new signature matchmaking algorithm. Experimental results are shown in Section 4, and Section 5 provides conclusions and future research directions.
2. Related works

Creating a successful and efficient service matchmaking method is a challenging task because of the complexity of semantic recognition. Here we discuss related work on probabilistic topic models and service matchmaking methods in the IoT area.

2.1. Probabilistic topic models

Probabilistic topic models have been proposed to uncover underlying semantic information from text collections. This line of work started from latent semantic analysis (LSA) [17], which decomposes the document–word matrix to reveal the major patterns in a text corpus. pLSA [18] and LDA [19] are the two most well-known basic topic models. In pLSA, a document is regarded as a mixture of topics, while a topic is defined as a combination of words. In LDA, compared with pLSA, Dirichlet priors are added to the document–topic distribution and the topic–word distribution respectively. Various extensions have been proposed based on pLSA and LDA, for example, the author–topic model [20], Bayesian nonparametric topic models [21], supervised topic models [22] and others [23,24]. All of them utilize word co-occurrences to enhance topic learning [25,26]. Due to the lack of word co-occurrences in short texts, traditional topic modeling is not effective on them. Some researchers directly used traditional topic models for short-text analysis [27,28] or aggregated short texts into lengthy pseudo-documents based on additional information before training topic models [29,30]. For example, Weng et al. proposed to aggregate the tweets released by a single user into one document before training LDA [29], and Wang et al. [31] proposed to aggregate tweets containing the same hashtags into one document. Hong et al. [32], after conducting a comprehensive empirical study on topic modeling, suggested that a new topic model for short texts is needed. One way is to make stronger assumptions about the short corpus. For example, Phelan et al. modeled each tweet as a single topic [33]. Gruber et al.
assumed that the words in each sentence come from the same topic [34]. Yan et al. assumed that each biterm comes from one topic [35], where a biterm is defined as an unordered word pair co-occurring in a short context. Although these methods perform better than LDA and pLSA, their assumptions are strong, which limits their effectiveness in practice. Another approach is to introduce external knowledge to enrich short texts. Andrzejewski et al. used a Dirichlet Forest prior, a mixture of Dirichlet tree distributions, to incorporate domain knowledge [36]. These methods are proven to be helpful in several areas. However, the necessary extra manpower or external knowledge may be difficult to acquire.

2.2. Service matchmaking in IoT

Current service matchmaking in IoT mainly relies on keyword-based solutions. Ding et al. proposed a keyword-based hybrid search engine framework for IoT service matchmaking [37]; the method has satisfactory performance for real-time searching of mass data in the IoT. Ascigil et al. used unordered keywords to locate IoT data [38], and Kurita et al. proposed a Keyword-Based Content Retrieval (KBCR) method for IoT applications [39]. All these methods are effective in solving service matchmaking problems, but they provide limited semantic information to service developers and consumers. Adding semantics to IoT service matchmaking makes it more machine-interpretable and supports IoT devices in interacting with each other. More specifically, semantic modeling is the basis for interoperation between heterogeneous systems. On account of these merits,
Fredj et al. presented an efficient method to improve the effectiveness of service matchmaking: they first apply dynamic clustering to service descriptions and then measure the semantic similarity between services and requests [7]. The probabilistic topic model is a popular semantic modeling approach, and some researchers have utilized it to solve service matchmaking problems. Cassar et al. proposed a non-logic-based service matchmaking method [13], which uses pLSA and LDA to find latent factors. Cassar et al. also proposed a hybrid method which proves to be valid in IoT service matchmaking [14]; it combines latent semantic analysis with a weighted-link analysis based on logical signature matchmaking. These methods not only reduce the complexity of the task but also keep the semantic meaning in machine-interpretable form. However, they ignore the fact that the description documents are short texts, and directly applying traditional topic models to short texts encounters problems such as data sparseness and high-dimensional features. To address these disadvantages, Wei et al. [15] used the English Wikipedia corpus to train the topic model, which produces high-quality results and semantically enriched service text descriptions; this helps the topic model extract the latent topics of services more effectively. Liu et al. [16] proposed a new crowdsourcing-based topic modeling method that incorporates human cognition, which effectively alleviates the scalability and sparseness problems in IoT service matchmaking. These methods are proven helpful in IoT service matchmaking given that service description documents are short texts. However, additional knowledge and human effort are not always available; meanwhile, introducing external knowledge and incorporating human cognition inevitably increase time overheads and manpower costs.
To tackle these challenges, we propose a weighted-word topic model to extract features and use the topic distribution to match services in IoT.

3. The proposed approach

This section introduces our IoT service matchmaking method from three aspects: (1) first, we compute the importance of words based on TextRank; (2) second, we illustrate a new probabilistic topic modeling method called Weighted-Word Latent Dirichlet Allocation (WW-LDA) and use it to extract implicit features from IoT service descriptions; (3) finally, a new matchmaking method is proposed for IoT service matchmaking. Fig. 1 shows the service matchmaking architecture for IoT based on WW-LDA. As can be seen from the figure, request devices send a request to discover proper service providers, while service devices offer a series of service descriptions. In our matchmaking architecture, we utilize the request descriptions and service descriptions to extract topic signatures and word signatures, and then incorporate both signatures into the service matchmaking method. Finally, the method selects the most related service based on the previous results and sends it to the request device. We illustrate the technical details in the following sections.

3.1. The importance of words

In this paper, we regard a service description or a request description as a single sentence. Traditional topic models use the bag-of-words model for topic extraction without distinguishing the importance of different words. However, the words in the same sentence differ in importance in real life [40,41], and effectively distinguishing the importance of words helps to improve the quality of topic extraction. In this paper, we use TextRank to calculate the importance of words in documents. TextRank is an innovative method for weighting words and extracting text keywords
from sentences [42]. TextRank differs from other methods in that it can calculate the keywords of a sentence using only that sentence, without any other documents. This effectively reduces computation time and allows the method to be applied to datasets of different sizes. Given these advantages, we choose TextRank to calculate the importance of different words. TextRank is a graph-based unsupervised algorithm. Since service descriptions and request descriptions are short texts, the degree of connectedness between words is hard to measure in a directed graph; we therefore use TextRank's undirected graph model to distinguish the importance of words. We use G = (V, E) to denote an undirected graph of the description text. It consists of a set of vertices V and a set of edges E ⊆ V × V, which are the words and the relations between words, respectively. In(V_i) is the collection of words that point to V_i and Out(V_i) is the collection of words that V_i points to. Based on the above definitions, the final score (weight) of a word V_i is determined by the following equation:
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)    (1)
where w_{ji} is the weight added to the edge connecting the two words V_i and V_j. In this equation, d is a damping factor, expressed as the probability of jumping from a given vertex to another random vertex. We set the damping factor d to 0.85 to be consistent with existing work [43]. To apply the graph-based ranking method, we build a graph that connects words through meaningful relationships in a sentence. Whatever elements are added to the undirected graph, TextRank includes the following main steps.
(1) Identify text units in sentence S and add them to the graph as vertices. The text units are all words extracted from the sentence S, and these words are added to the graph as vertices. Assuming the total number of words in sentence S is N, then N vertices are constructed in the graph model, S = \{w_1, \ldots, w_j, \ldots, w_N\}.
(2) Use the relationships identified between text units to draw edges between vertices in the graph. The relationship between text units is defined by the distance between word occurrences: if the corresponding lexical units co-occur within a window of at most K words, the two words are regarded as connected, and vice versa. The window is denoted as cw. The transition matrix M constructed between words is defined as follows:
M = \begin{bmatrix} w_{11} & \cdots & w_{1j} & \cdots & w_{1N} \\ \vdots & & \vdots & & \vdots \\ w_{N1} & \cdots & w_{Nj} & \cdots & w_{NN} \end{bmatrix}    (2)
The jth column of matrix M represents a weight distribution, and the values in each column sum to 1. Each value of the weight distribution is the probability that the jth candidate word w_j connects to another random word.
(3) Iterate the graph-based ranking algorithm until it converges. We assign each word an initial value, set to 1 in this paper. The initial scores of all words can be represented by an N-dimensional vector B_0 whose components are all equal to 1:

B_0 = (1, 1, \ldots, 1)    (3)
The ranking algorithm described in Eq. (1) runs several iterations until it converges. We choose the number of iterations to be 30 and the threshold is set to 0.0001 to be consistent with the existing work [42].
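The iteration just described can be sketched in Python as follows. This is a minimal illustration with unit edge weights; the function name, parameter defaults and the example sentence are our own illustrative choices, not code from the paper:

```python
from collections import defaultdict

def textrank_weights(words, window=2, d=0.85, max_iter=30, tol=1e-4):
    # Step (1): vertices are the unique words of the sentence.
    vocab = list(dict.fromkeys(words))
    # Step (2): undirected edges between words co-occurring within `window`.
    neighbors = defaultdict(set)
    for i, wi in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != wi:
                neighbors[wi].add(words[j])
                neighbors[words[j]].add(wi)
    # Step (3): iterate Eq. (1) with unit edge weights, starting from B0 = (1, ..., 1),
    # until the largest score change falls below the threshold.
    ws = {w: 1.0 for w in vocab}
    for _ in range(max_iter):
        new_ws = {w: (1 - d) + d * sum(ws[v] / len(neighbors[v]) for v in neighbors[w])
                  for w in vocab}
        converged = max(abs(new_ws[w] - ws[w]) for w in vocab) < tol
        ws = new_ws
        if converged:
            break
    # Step (4): rank vertices by their final score S(w).
    return dict(sorted(ws.items(), key=lambda kv: -kv[1]))

scores = textrank_weights("smart sensor reports room temperature to smart hub".split())
```

In the paper's pipeline, stop words would be removed before scoring; here the raw token list is used for brevity.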
Fig. 1. Service matchmaking architecture for IoT based on WW-LDA.
(4) Rank vertices based on the final score. We use the results of step (3) for the ranking decisions. The final iteration result is represented as a vector B = (S(w_1), \ldots, S(w_j), \ldots, S(w_N)), where S(w_j) is defined as the importance of the word w_j in the sentence, namely the weight of the word in the sentence.

3.2. WW-LDA and inference

The classic LDA model presumes that words come from a topic–word distribution and are equally important. Although conventional topic models are effective for topic extraction from normal documents, directly applying them to request and service descriptions may not work well. The main reason is that traditional topic models rely on document-level word co-occurrence patterns, and thus they suffer from data sparsity in short documents. To tackle this problem, we propose a new topic model that incorporates the word weights computed by TextRank, called Weighted-Word Latent Dirichlet Allocation (WW-LDA). Fig. 2 shows the topic mapping for service descriptions based on WW-LDA. There are three steps to accurately estimate the implied topics of a service description using the probabilistic topic model: (1) text description extraction; (2) semantic enrichment; (3) topic inference. We obtain the service text description and use TextRank to enrich it; we then infer topics through the probabilistic topic model and eventually obtain the latent topics of the service. This paper uses the importance of words as a discriminant of textual information. To increase the probability of words that are important for topics, we use the word weights to determine the word counts, assuming that the more important a word is, the more frequently it should appear in the documents. For unsubstantive words, such as ''the'' and ''so'', we use a standard stop word list to remove them.
Thus, WW-LDA is a probabilistic generative topic modeling method which integrates the importance of the words. The graphical model of WW-LDA is shown in Fig. 3. Each document d(d = 1, . . . , D) is composed of nd words. The total number of words is N in all request descriptions and service descriptions. The corpus is defined by an N-dimensional vector
Table 1
The generative process of WW-LDA.

1. For each document d \in D: draw topic mixture proportion \theta_d \sim \mathrm{Dirichlet}(\alpha).
2. For each latent topic k \in [1, K]: draw \varphi_k \sim \mathrm{Dirichlet}(\beta).
3. For each document d:
   (a) For each original word w_d \in W_d:
       i. draw topic assignment z_{w_d} \sim \mathrm{Multinomial}(\theta_d);
       ii. draw word w_d \sim \mathrm{Multinomial}(\varphi_{z_{w_d}}).
   (b) For each added word w_d^* \in W_d^*:
       i. draw topic assignment z_{w_d^*} \sim \mathrm{Multinomial}(\theta_d);
       ii. draw word w_d^* \sim \mathrm{Multinomial}(\varphi_{z_{w_d^*}}).
\boldsymbol{w} = \{w_{11}, w_{12}, \ldots, w_{di}, \ldots, w_{Dn_D}\}, where w_{di} is the ith word of document d. The vocabulary is defined by the V unique words in the corpus; each unique word is denoted by v. The distribution of topic k (k = 1, \ldots, K) over the vocabulary is denoted by a V-dimensional vector \varphi_k, where element \varphi_{kv} denotes the probability that word v belongs to topic k. Document d is a mixture of the K topics, and \theta_d is a K-dimensional vector which represents the proportions of each topic in document d. The numbers of original words in document d are denoted as (num_{w_{d1}}, \ldots, num_{w_{dn_d}}), and the numbers of added words are defined as

(num^*_{w_{d1}}, \ldots, num^*_{w_{dn_d}}) = (\lceil S(w_{d1}) \rceil, \ldots, \lceil S(w_{dn_d}) \rceil) - (num_{w_{d1}}, \ldots, num_{w_{dn_d}})    (4)

\lceil x \rceil = \min\{n \in \mathbb{Z} \mid x \le n\}    (5)

We use the importance of the words to expand the original document and obtain the expanded text data. The original word count collection is defined as W_d = \{num_{w_{d1}}, \ldots, num_{w_{dn_d}}\}, and the expanded word count collection is defined as W_d^* = \{num^*_{w_{d1}}, \ldots, num^*_{w_{dn_d}}\}. We thus obtain the word collection \{W_d, W_d^*\} for document d. The generative process of WW-LDA is shown in Table 1.
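As a toy illustration of Eqs. (4)–(5), the sketch below derives the added word counts from hypothetical original counts and TextRank weights; the counts, weights and function name are invented for illustration, and in the paper stop words are removed beforehand:

```python
import math

def expand_document(word_counts, weights):
    """Eqs. (4)-(5): for each word w with TextRank weight S(w), the number
    of added copies is ceil(S(w)) minus the original count (never negative).
    Returns (original_counts, added_counts)."""
    added = {w: max(0, math.ceil(weights[w]) - n) for w, n in word_counts.items()}
    return word_counts, added

orig_counts, added_counts = expand_document(
    {"temperature": 2, "sensor": 1, "the": 3},
    {"temperature": 3.4, "sensor": 1.9, "the": 0.2})
# "temperature" gains ceil(3.4) - 2 = 2 extra copies; low-weight "the" gains none
```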
Fig. 2. Topic mapping for service description based on WW-LDA.
Fig. 3. Graphical model of WW-LDA.
Given the generative process, the probability of observing the words given the model parameters \varnothing = \{\theta, \varphi\} is

P(W, W^* \mid \varnothing; \alpha, \beta) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{k=1}^{K} p(\varphi_k \mid \beta) \prod_{d=1}^{D} \left( \prod_{i=1}^{n_{W_d}} \sum_{z=1}^{K} p(z \mid \theta_d)\, p(w_i \mid \varphi_z) \right) \left( \prod_{j=1}^{n_{W_d^*}} \sum_{z=1}^{K} p(z \mid \theta_d)\, p(w_j \mid \varphi_z) \right)    (6)

In general, the log format of the likelihood function is easier to derive, so we take the log of Eq. (6). Owing to the complexity of the formula, it is not feasible to compute the parameters analytically; we therefore resort to the Gibbs sampling method to obtain the parameter estimates. To develop a Gibbs sampler for WW-LDA, we first compute the conditional probability of the unknown variables, which in our case is only the topic index z. The observed words are w and w^*, so the conditional probability can be denoted as p(z \mid w, w^*), and we use a Markov chain to simulate the sampler. We can derive the expression for p(z \mid w, w^*) using Bayes' theorem and conditional independence:

p(z_q = k \mid W, W^*, \boldsymbol{z}_{-q}) \propto p(z_q = k \mid \boldsymbol{z}_{-q})\, p(w_q \mid z_q = k, \boldsymbol{z}_{-q}, (W \cup W^*)_{-q})    (7)

With the generative process, we obtain the expressions for the first and second probabilities in Eqs. (8) and (9):

p(z_q = k \mid \boldsymbol{z}_{-q}) = \frac{num_{W_d}^{(k)} + num_{W_d^*}^{(k)} + \alpha - 1}{N_d + N_d^* + K\alpha - 1}    (8)

p(w_q \mid z_q = k, \boldsymbol{z}_{-q}, (W \cup W^*)_{-q}) = \frac{n_{k,-q} + \beta}{n_k + V\beta - 1}    (9)

In the above formulas, num_{W_d}^{(k)} is the number of original words assigned to topic k in document d, and num_{W_d^*}^{(k)} is the number of added words assigned to topic k in document d. N_d is the number of original words in document d, and N_d^* is the number of added words in document d. n_k is the number of words assigned to topic k. The subscript -q means that the corresponding count excludes position q; in some cases, we omit the -q subscript for simplicity. Once the sampling process is finished, we can readily obtain the model parameters:

\theta_{d,k} = \frac{num_{W_d}^{(k)} + num_{W_d^*}^{(k)} + \alpha - 1}{N_d + N_d^* + K\alpha - 1}    (10)

\varphi_{k,w_q} = \frac{n_{k,-q} + \beta}{n_k + V\beta - 1}    (11)
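A minimal collapsed Gibbs sampler along the lines of Eqs. (7)–(11) can be sketched as follows. This is our own illustrative implementation, not the authors' code: each document is assumed to already contain both its original and its added word copies (so the counts run over W ∪ W*), and removing the current assignment before sampling plays the role of the −1 terms in Eqs. (8)–(9):

```python
import random

def gibbs_wwlda(docs, K, V, alpha=0.1, beta=0.01, iters=100, seed=0):
    """docs: list of documents, each a list of word ids drawn from the
    expanded collection W ∪ W*. Returns the document-topic distributions."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    ndk = [[0] * K for _ in docs]       # topic counts per document
    nkw = [[0] * V for _ in range(K)]   # word counts per topic
    nk = [0] * K                        # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]             # remove assignment q (absorbs the "-1" terms)
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # p(z_q = k | ...) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ), per Eqs. (8)-(9)
                probs = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                         for k in range(K)]
                r = rng.random() * sum(probs)
                t, acc = K - 1, 0.0
                for k in range(K):
                    acc += probs[k]
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Eq. (10): document-topic distributions θ_d
    return [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
            for d in range(len(docs))]

theta = gibbs_wwlda([[0, 0, 1, 1, 0], [2, 3, 3, 2, 2]], K=2, V=4)
```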
3.3. Service matchmaking based on topic signatures

Most existing approaches focus on extracting word signature vectors for service matchmaking in short texts [37–39]. For a given request r, the goal of the service matchmaking model is to find the most relevant service among the candidate services to achieve the best performance. In this section, we illustrate our service matchmaking method, which uses word signatures and topic signatures at the same time.

The document–topic distribution and the words are two important signatures that can be obtained from description text. The document–topic distribution is a feature between the word level and the document level and is vital for service matchmaking. Meanwhile, words may also be of great importance, since service and request descriptions are short texts. If only word signatures are used, some semantic meanings could be missed; for example, ''apple'' may refer to a fruit or a company, but only one of these meanings survives when using word features alone. However, if we only use topic signatures, the granularity is too coarse, which can easily lead to inaccurate results. For example, consider a request r = \{apple, iphone, ipad\} with document–topic distribution \theta_r = \{p_1^{(r)} = 1\} and a service d = \{good, bad, slowly\} with document–topic distribution \theta_d = \{p_1^{(d)} = 1\}. Although \theta_r is equal to \theta_d, r and d are not similar. Thus, using topic signatures and word signatures together in an integrated service matchmaking algorithm effectively benefits the matchmaking performance.
Assume that \theta_r = \{p_1^{(r)}, \ldots, p_k^{(r)}, \ldots, p_K^{(r)}\} and \theta_d = \{p_1^{(d)}, \ldots, p_k^{(d)}, \ldots, p_K^{(d)}\} are the two document–topic distributions. They represent the signatures of request r and service d; p_k^{(r)} and p_k^{(d)} are topic k's probabilities for r and d respectively, and K is the number of topics. To measure the relevance between the service and the request, we use the cosine similarity as the indicator:

\mathrm{Similarity}(\theta_r, \theta_d) = \frac{\sum_{k=1}^{K} p_k^{(r)} p_k^{(d)}}{\sqrt{\sum_{k=1}^{K} \left(p_k^{(r)}\right)^2} \sqrt{\sum_{k=1}^{K} \left(p_k^{(d)}\right)^2}}    (12)
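Eq. (12) amounts to the following small routine (an illustrative sketch; the function name is ours, and a zero norm is mapped to similarity 0 as a defensive choice not specified in the paper):

```python
import math

def topic_similarity(theta_r, theta_d):
    # Eq. (12): cosine similarity of two document-topic distributions.
    dot = sum(pr * pd for pr, pd in zip(theta_r, theta_d))
    norm_r = math.sqrt(sum(p * p for p in theta_r))
    norm_d = math.sqrt(sum(p * p for p in theta_d))
    return dot / (norm_r * norm_d) if norm_r and norm_d else 0.0

# Identical topic signatures give similarity 1; disjoint ones give 0.
sim = topic_similarity([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```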
To extract word-level signatures, we use word2vec [44] to represent words and compute the similarities between request r and service d as shown in Eq. (13). The main goal of word2vec is to learn high-quality word vectors from huge datasets with billions of words. Each unique word is assigned with a corresponding vector having{ several hundred }dimensions. r = {x1 , . . . , xi , . . . , xI } and d = y1 , . . . , yj , . . . , yJ are the word sequences of the request r and service d. I is the number of words in request r and J is the number of words in service d. Then, we use the matrix to } vectors, r = {x1 , . . . , xi , . . . , xI } and d = {represent these word y 1 , . . . , y j , . . . , y J . xi and y j are the vector of word xi and word yj that is calculated by word2vec. Since each document consists of a group of word vectors, it is not easy to measure similarities between two documents directly. We develop a function called purity as the indicator:
⎧ ∑I ⎨ i=1 maxj cos(xi ,yj ) J > I purity (r , d) = ∑J maxI cos(x ,y ) ⎩ j=1 i i j I ≥ J J
4.2. Benchmark models for comparison We compare WW-LDA with pLSA, LDA and Mix on Yelp dataset and OWL-TC4 dataset: 1. pLSA (Probabilistic Latent Semantic Analysis): The document is regarded as a mixture of topics, while the topic is defined as a combination of words in pLSA. We use mltool4j2 to implement pLSA. 2. LDA (Latent Dirichlet Allocation): Dirichlet prior is added to document–topic distribution and topic–word distribution respectively compared with pLSA. We use jGibbLDA3 to implement LDA. 3. Mix (Mixture of Unigram) [45]: Mix presumes that each document comes from separate topic and the word is extracted from the single topic. We use jLDADMM4 to implement Mix. 4.3. Evaluation method We utilize ‘‘coherence’’ [26] as the metric to assess topic quality. This metric has a valuable insight where if a topic has a good explanatory power, some pairs of words should often appear in the documents of the corpus, namely they are associated with this topic with a high probability. Topics scoring higher on this metric are more interpretable by human and hence are of better quality. This metric is defined as follows: Ck =
T t −1 ∑ ∑
(
log
(k)
(k)
D vt , vl
)
+1 (15)
(k)
D(vl )
t =2 l=1
(13)
277
Here, γ and 1 − γ are defined as coefficient of topic signatures and word signatures, respectively. We manually set the value of γ .
The purity calculates the maximum average cosine value between the word vectors of the request description and the service description. Finally, we use the following function to calculate the similarity between a request r and a service d:

Similarity(r, d) = γ Similarity(θ_r, θ_d) + (1 − γ) purity(r, d)  (14)

4. Evaluation

4.1. Experimental data

We evaluate topic modeling on textual data from the Yelp open dataset and the OWL-TC4 dataset. The Yelp dataset includes 4.7 million reviews and 200,000 pictures covering 156,000 shops in 12 cities. We randomly select 10,000 reviews as our corpus and remove stop words; the mean review length is 38.3 words and the vocabulary contains 52,095 unique words. The OWL-TC4 service retrieval test collection1 is a public dataset consisting of 1083 services and 42 queries, each query corresponding to a set of relevant services, and it is widely used to evaluate service matchmaking in IoT.

To verify the effectiveness of WW-LDA, we conduct experiments on both datasets. The Yelp dataset provides user comments about service stores, which resembles the short textual descriptions used in IoT service matchmaking; on it we evaluate three aspects: quality of topics, document clustering, and document classification. We use the OWL-TC4 dataset to verify the effectiveness of our service matchmaking method, performing a comparative analysis with traditional topic models and service matchmaking methods.

1 http://www.semwebcentral.org/projects/owls-tc/.

In Eq. (15), V_k = (v_1^k, ..., v_T^k) is the list of the T highest-probability words in topic k, D(v) is the number of documents containing word v, and D(v, v′) is the number of documents containing at least one occurrence of both v and v′. The average coherence, (1/K) Σ_k C_k, is calculated to assess the overall quality of the topic list produced by each method.

We also conduct document clustering and classification to measure topic quality. Topic models are a type of dimension-reduction method: each document can be represented by a vector of posterior probabilities, its document–topic distribution:

d_i = [p(z_1 | d_i), ..., p(z_K | d_i)]  (16)

Then we can measure the distance between two documents by the Jensen–Shannon divergence:

dis(d_i, d_j) = (1/2) D_KL(d_i ∥ m) + (1/2) D_KL(d_j ∥ m)  (17)

where m = (1/2)(d_i + d_j) and D_KL(p ∥ q) = Σ_i p_i ln(p_i / q_i) is the Kullback–Leibler divergence. Given a set of clusters C = {C_1, ..., C_N}, we introduce two distance scores.

Average intra-cluster distance:

IntraDis(C) = (1/N) Σ_{n=1}^{N} [ Σ_{d_i, d_j ∈ C_n, i ≠ j} 2 dis(d_i, d_j) / (|C_n|(|C_n| − 1)) ]  (18)

Average inter-cluster distance:

InterDis(C) = (1/(N(N − 1))) Σ_{C_n, C_{n′} ∈ C, n ≠ n′} [ Σ_{d_i ∈ C_n} Σ_{d_j ∈ C_{n′}} 2 dis(d_i, d_j) / (|C_n| |C_{n′}|) ]  (19)

2 https://code.google.com/archive/p/mltool4j.
3 http://jgibblda.sourceforge.net.
4 https://github.com/datquocnguyen/jLDADMM.
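Eqs. (17)–(20) translate directly into code. The sketch below is our own illustration (the function names and toy document–topic distributions are not from the paper); it reproduces the formulas literally, including the factor of 2:

```python
import math

def kl(p, q):
    # Kullback–Leibler divergence D_KL(p || q); terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_dis(di, dj):
    # Eq. (17): Jensen–Shannon distance between two document–topic distributions.
    m = [(a + b) / 2 for a, b in zip(di, dj)]
    return 0.5 * kl(di, m) + 0.5 * kl(dj, m)

def intra_dis(clusters):
    # Eq. (18): average distance between documents inside each cluster.
    total = 0.0
    for c in clusters:
        pair_sum = sum(2 * js_dis(c[i], c[j])
                       for i in range(len(c)) for j in range(len(c)) if i != j)
        total += pair_sum / (len(c) * (len(c) - 1))
    return total / len(clusters)

def inter_dis(clusters):
    # Eq. (19): average distance between documents of different clusters.
    n = len(clusters)
    total = 0.0
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            s = sum(2 * js_dis(di, dj)
                    for di in clusters[a] for dj in clusters[b])
            total += s / (len(clusters[a]) * len(clusters[b]))
    return total / (n * (n - 1))

def s_score(clusters):
    # Eq. (20): higher S means tighter, better-separated clusters.
    intra, inter = intra_dis(clusters), inter_dis(clusters)
    return inter / (intra + inter)
```

For two well-separated toy clusters such as `[[[0.9, 0.1], [0.8, 0.2]], [[0.1, 0.9], [0.2, 0.8]]]`, the S score comes out close to 1, as the text's "low intra-cluster, high inter-cluster" intuition predicts.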
Y. Liu, T. Zhu, Y. Jiang et al. / Future Generation Computer Systems 94 (2019) 272–281
Table 2
Average coherence score of different methods for the Yelp dataset.

Corpus | Topic size | WW-LDA   | LDA      | pLSA     | Mix
Yelp   | Topic-10   | −3685.99 | −3738.28 | −3737.36 | −3699.23
Yelp   | Topic-20   | −3541.04 | −3614.56 | −3642.22 | −3644.22
Yelp   | Topic-30   | −3421.31 | −3571.51 | −3597.89 | −3625.46
Yelp   | Topic-40   | −3483.57 | −3544.26 | −3594.92 | −3617.55
Yelp   | Topic-50   | −3457.38 | −3512.41 | −3574.46 | −3601.74
Table 3
Average coherence score of different methods for the OWL-TC4 dataset.

Corpus  | Topic size | WW-LDA    | LDA       | pLSA      | Mix
OWL-TC4 | Topic-10   | −2422.948 | −2462.708 | −2586.593 | −2636.597
OWL-TC4 | Topic-20   | −1927.2   | −1998.343 | −2252.7   | −2473.192
OWL-TC4 | Topic-30   | −1646.49  | −1721.292 | −2005.926 | −2339.964
OWL-TC4 | Topic-40   | −1472.04  | −1526.999 | −1838.538 | −2242.013
OWL-TC4 | Topic-50   | −1407.59  | −1440.808 | −1703.277 | −2201.758
Therefore, we calculate the following ratio to evaluate the quality of a topical representation of documents:

S = InterDis(C) / (IntraDis(C) + InterDis(C))  (20)
Table 3 shows the average coherence scores of all methods for the OWL-TC4 dataset. The results show that WW-LDA outperforms all other methods: for topic sizes from 10 to 50, WW-LDA performs significantly better than LDA, LDA is better than pLSA, and Mix ranks last. The results confirm that WW-LDA can generate high-quality topics from short texts and is therefore a good way to extract topic features. Meanwhile, the coherence scores of all methods improve as the topic size increases.
4.5. Quality of document clustering
We should maximize the S score, based on the idea that good clusters have low intra-cluster distances and high inter-cluster distances. Finally, to evaluate the quality of the service matchmaking method, we use precision and recall:

precision = |A ∩ B| / |B|  (21)

recall = |A ∩ B| / |A|  (22)
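Eqs. (21) and (22) reduce to set operations over service identifiers; a minimal sketch with invented service IDs:

```python
def precision_recall(relevant, retrieved):
    # Eq. (21)/(22): A = relevant services, B = retrieved services.
    a, b = set(relevant), set(retrieved)
    hits = len(a & b)
    return hits / len(b), hits / len(a)

# Hypothetical query: 4 relevant services, top-5 retrieved list.
p, r = precision_recall({"s1", "s2", "s3", "s4"}, ["s1", "s3", "s9", "s7", "s2"])
# p = 3/5, r = 3/4: 3 of the 5 retrieved are relevant, covering 3 of 4 relevant.
```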
Fig. 4. S score of different topic modeling methods for the Yelp dataset.
where A is the set of relevant services provided by the OWL-TC4 dataset and B is the set of services retrieved by the matchmaking method, whose size is varied from 5 to 25.

4.4. Quality of topics

To evaluate the quality of the topics generated by our method, we calculate the average coherence using Eq. (15), i.e., (1/K) Σ_k C_k, for each method. As in previous studies, we select the 30 most important words of each topic as the basis for estimating the coherence values. Tables 2 and 3 give the average coherence values of all methods on the Yelp open dataset and the OWL-TC4 dataset. The number of topics K is set from 10 to 50; α is set to 50/K, β to 0.01, and c_w to 2 for all experiments, with 1000 iterations. As shown in Tables 2 and 3, WW-LDA outperforms all other methods consistently, and the improvement over LDA is significant.

Table 2 shows the average coherence scores of all tested methods on the Yelp dataset. WW-LDA outperforms the other three methods: when the topic number is 10, WW-LDA outperforms Mix slightly and LDA is similar to pLSA; for topic numbers from 20 to 50, WW-LDA remains better than all other methods. LDA dominates both Mix and pLSA on these short texts, outperforming pLSA thanks to its Dirichlet prior. Although Mix is designed for short texts, it is somewhat surprising that pLSA performs substantially better than Mix, which suggests that Mix has lower explanatory power for the Yelp dataset. In summary, the results show that WW-LDA discovers higher-quality topics than the benchmark methods, whereas traditional topic models cannot effectively extract topics from short texts.
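Eq. (15) itself is not reproduced in this excerpt; the sketch below implements the standard document co-occurrence coherence that matches the definitions of D(v) and D(v, v′) given earlier — treating the exact form used in the paper as an assumption:

```python
import math

def coherence(topic_words, docs):
    # Coherence over the top words V_k of one topic, from document
    # co-occurrence counts D(v) and D(v, v'); +1 smoothing avoids log(0).
    doc_sets = [set(d) for d in docs]
    def D(*words):
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for t in range(1, len(topic_words)):
        for l in range(t):
            score += math.log((D(topic_words[t], topic_words[l]) + 1)
                              / D(topic_words[l]))
    return score
```

Word pairs that co-occur in documents score higher (closer to 0) than pairs that never co-occur, which is why a less negative average coherence in Tables 2 and 3 indicates better topics.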
For quantitative evaluation, document clustering is another way to assess the effectiveness of a topic model. Besides the topic–word distribution, a topic model also yields the document–topic distribution, which can be regarded as a feature vector of the document; clustering documents on these feature vectors is therefore another way to measure the quality of a topic modeling method. We apply K-Means clustering to the feature vectors generated by each method and use Eq. (20) to measure the results: the higher the value of S, the better the quality of the document clustering. The number of clusters is set to 10, 20, 30, 40, and 50, the same as the number of topics, and the number of iterations is 2000. The results are shown in Figs. 4 and 5.

Fig. 4 shows the document clustering results of the different methods for the Yelp dataset. The results indicate that WW-LDA performs better than the other algorithms on the document clustering task, which confirms that WW-LDA improves topic quality. As shown in Fig. 4, Mix ranks second, and LDA is better than pLSA except when the number of clusters is 30.

Fig. 5 shows the S scores of the methods for the OWL-TC4 dataset. The results again show that WW-LDA performs better than all other methods. When the number of clusters is 10, WW-LDA is the best, Mix ranks second, and pLSA outperforms LDA. For 20 to 50 clusters, pLSA ranks second after WW-LDA; Mix outperforms LDA at 20 and 40 clusters, while LDA is better than Mix at 30 and 50.
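A minimal pure-Python K-Means over document–topic vectors, as used for the clustering evaluation; the paper does not specify its implementation, and Euclidean distance is used here for simplicity (the evaluation itself scores the resulting clusters with the JS-based Eqs. (18)–(20)):

```python
import random

def kmeans(vectors, k, iters=100, seed=0):
    # Cluster document–topic vectors; returns a cluster label per document.
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centers[c])))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

On well-separated document–topic vectors (e.g. two documents concentrated on topic 1 and two on topic 2), the two groups end up in different clusters.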
Fig. 5. S score of different topic modeling methods for the OWL-TC4 dataset.
4.6. Quality of document classification

In addition to document clustering, document classification is another task for measuring the effectiveness of a topic model. Treating topic modeling as a form of dimensionality reduction, we use the accuracy of a document classification task to measure the quality of each method. We classify documents with RandomForest and Bagging. Fig. 6 shows the 10-fold cross-validation performance of the models on the Yelp dataset and the OWL-TC4 dataset.

From the classification performance shown in Fig. 6, we can see that WW-LDA consistently dominates the three baselines. Fig. 6(a) and 6(b) show the classification performance of RandomForest and Bagging on the Yelp dataset: no matter how the number of topics changes, WW-LDA achieves the highest accuracy, with an improvement of at most 9.32% (over pLSA) with RandomForest and at least 0.24% (over Mix) with Bagging. In Fig. 6(c) and 6(d), WW-LDA again achieves the best results, while LDA is weaker than the other methods. Another important finding is that different classification methods yield different results on the same dataset, so a suitable classifier should be chosen for the dataset at hand.

4.7. Results of service matchmaking

As mentioned above, an IoT service interacts with a request by exposing a piece of text that contains the service description, and a request finds relevant services by understanding this textual information. To improve the matching quality, topic models are used to extract topic-level information and make it understandable to requests. Meanwhile, we also extract word-level
information to improve the effectiveness of the matching results. To evaluate the performance of the proposed method, we use LDA, pLSA and Mix as baseline methods. Meanwhile, we explore the influence of the weight γ in Eq. (14), which balances the word signature and the topic signature, on the service matchmaking results. Precision and recall are assessed for the service matchmaking methods with the matchmaking set size varying from 5 to 25. The results are shown in Figs. 7 and 8.

A higher precision rate means higher effectiveness of the method. Fig. 7 shows the precision rates as defined in Eq. (21), and we can see that: (1) Comparing matchmaking performance, WW-LDA shows advantages under different settings and achieves the highest precision rate in most situations; the highest precision rate reaches 0.66 when the matchmaking set size is 5. (2) Comparing different matchmaking set sizes, the precision rates of WW-LDA and LDA peak when the matchmaking set size is 5, which shows that the accuracy of WW-LDA increases as the matchmaking set size decreases. (3) The precision rate of LDA is higher than that of Mix for set sizes from 5 to 10, while Mix is better than LDA for sizes from 15 to 25. pLSA has the worst performance in all cases.

A higher recall rate means the method retrieves a larger share of the relevant services. Fig. 8 shows the recall rates as defined in Eq. (22). WW-LDA performs slightly better under most matchmaking set sizes: it achieves its highest recall rate of 32% when the matchmaking set size is 25, and its lowest of 8.8% when the size is 5. The recall rates of all service matchmaking methods peak when the matchmaking set size increases to 25. The recall rate of LDA is higher than that of Mix for set sizes from 5 to 10, while Mix becomes better than LDA for sizes from 15 to 25.
pLSA still has the worst performance in all cases. In addition, we obtain word features by word2vec and topic features by topic models, and aggregate the word signature and the topic signature into one single signature, which produces the method ''WW-LDA+Word2Vec''. Beyond comparing the precision and recall rates of the different service matchmaking methods, we also explore the influence of different weights in Eq. (14) when the matchmaking set size is 5. The weight of the topic feature, λ, is varied from 0 to 1 in increments of 0.1. Fig. 9 shows that as λ increases, the precision rate also increases: when λ is set to 0.9, the method achieves its highest precision rate of 0.70, but when λ = 1 the precision rate drops to 0.66, which shows that both signatures are important for service matchmaking. The method achieves its worst precision rate, 0.557, when λ = 0. Given the results above, we draw the following conclusions: (1) using only topic features is more effective than using only word features; (2) the best results are obtained by effectively integrating the word features with the topic features.
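The weighted combination in Eq. (14) (with λ playing the role of the topic-signature weight) can be sketched as follows; using cosine similarity for the topic-level term is our assumption, since the excerpt does not spell out that function:

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def similarity(topic_r, topic_d, word_purity, lam=0.9):
    # Eq. (14): weighted sum of topic-signature similarity and
    # word-level purity; lam = 1 ignores word features, lam = 0 ignores topics.
    return lam * cosine(topic_r, topic_d) + (1 - lam) * word_purity
```

With lam = 0.9, identical topic signatures and a word purity of 0.5 give 0.9 · 1 + 0.1 · 0.5 = 0.95, matching the paper's finding that the topic term carries most of the weight at the optimum.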
Fig. 6. Classification performance.
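The 10-fold cross-validation behind Fig. 6 can be sketched as follows; `nearest_centroid` is a deliberately simple stand-in for the RandomForest and Bagging classifiers actually used, so only the protocol, not the classifier, reflects the paper:

```python
from collections import defaultdict

def k_fold_accuracy(features, labels, train_fn, k=10):
    # Split indices into k strided folds; train on k-1 folds, test on the rest.
    n = len(features)
    folds = [list(range(i, n, k)) for i in range(k)]
    correct = 0
    for fold in folds:
        hold = set(fold)
        train = [(features[i], labels[i]) for i in range(n) if i not in hold]
        predict = train_fn(train)
        correct += sum(predict(features[i]) == labels[i] for i in fold)
    return correct / n

def nearest_centroid(train):
    # Toy classifier: predict the class whose mean feature vector is closest.
    groups = defaultdict(list)
    for x, y in train:
        groups[y].append(x)
    centroids = {y: [sum(col) / len(xs) for col in zip(*xs)]
                 for y, xs in groups.items()}
    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[y])))
    return predict
```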
Fig. 7. Precision of the service matchmaking methods.
Fig. 8. Recall of the service matchmaking methods.
5. Conclusions and future work

Service matchmaking is a critical and challenging task in IoT, enabling users to query, search and discover appropriate services. In this paper, we proposed a probabilistic topic model (WW-LDA) to extract topic signatures from IoT service description text and developed a service matchmaking algorithm based on hybrid features. The experimental results show that incorporating the importance of different words into a probabilistic topic model is an effective way to enhance topic quality, and that it also benefits the extraction of topic features from service description text. Our service matchmaking algorithm incorporates both topic signatures and word signatures, and the experiments show that it performs much better than other existing methods in terms of precision and recall on real-world datasets.

The proposed method is only one direction for improving the performance of service matchmaking; many others can be explored. One possibility is to investigate new methods such as giving each sentence a weighting coefficient. Another possible extension is to incorporate topic signatures and word signatures into a unified topic model, which may further reduce errors.

Acknowledgments

This work is supported by the Major Program of the National Natural Science Foundation of China (71490725); the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (71521001); the National Natural Science Foundation of China (71722010, 91546114, 91746302, 71501057); and the National Key Research and Development Program of China (2017YFB0803303).
Fig. 9. Different λ for the service matchmaking method based on WW-LDA+Word2Vec.
Yezheng Liu is a professor and the Dean of the School of Management at the Hefei University of Technology, China. His research interests include decision support systems, cloud computing, artificial intelligence and big data. He has published papers in journals such as Marketing Science, European Journal of Operational Research, Decision Support Systems, International Journal of Production Economics, and International Journal of Production Research.
Tingting Zhu received her B.S. degree from the College of Engineering, Nanjing Agricultural University, Nanjing, China, in 2016. She is currently a Ph.D. candidate with the School of Management, Hefei University of Technology, Hefei, China. Her main research interests include human behavior analysis and recommender systems.
Yuanchun Jiang is a professor at the School of Management in the Hefei University of Technology, China. His research interests include cloud computing, artificial intelligence and marketing analytics. He has published papers in journals such as Marketing Science, IEEE Transactions on Software Engineering, European Journal of Operational Research, International Journal of Production Economics, and International Journal of Production Research.
Xiao Liu received the master’s degree in management science and engineering from Hefei University of Technology, Hefei, China, 2007, and received the Ph.D. degree in computer science and software engineering from the Faculty of Information and Communication Technologies at Swinburne University of Technology, Melbourne, Australia, 2011. He is currently a Senior Lecturer at School of Information Technology, Deakin University, Melbourne, Australia. He has published papers in journals such as IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology and IEEE Transactions on Parallel and Distributed Systems. He is a member of the IEEE.