Engineering Applications of Artificial Intelligence 24 (2011) 1521–1531
Engineering Applications of Artificial Intelligence journal homepage: www.elsevier.com/locate/engappai
Semantic analysis and classification method for customer enquiries in telecommunication services
Motoi Iwashita a,*, Shinsuke Shimogawa b, Ken Nishimatsu c
a Chiba Institute of Technology, Tsudanuma 2-17-1, Narashino, Chiba 275-0016, Japan
b NTT Service Integration Laboratories, Midori-cho 3-9-11, Musashino, Tokyo 180-8585, Japan
c NTT East Network Business Headquarters, Nishi-Shinjuku 3-19-2, Shinjuku, Tokyo 163-8019, Japan
Article info
Article history: Received 8 May 2010; Received in revised form 24 December 2010; Accepted 13 February 2011; Available online 12 March 2011
Keywords: Semantic content; Co-occurrence; Dependency parsing; Telecom operation; Text mining; Ontology

Abstract
A variety of services have recently been provided that depend on highly developed networks and personal equipment. With these advances, connecting this equipment has become increasingly complicated. Customer enquiries about problems such as no-connection to an Internet/phone service will increase, and telecom operators will require the ability to understand such situations and act as quickly as possible. Therefore, it is important to analyze failure trends and establish, in advance, effective coping processes for complex problems conveyed in customer enquiries. We present a method for analyzing and classifying customer enquiries efficiently and precisely from the structural type of description in ontologies. Moreover, our method can reconstruct semantic content efficiently by extracting related terms through analysis and classification. This method is based on a dependency parsing and co-occurrence technique to enable classification of a large amount of unstructured data into patterns, because customer enquiries are generally stored as unstructured textual data. The validity of the method is evaluated, and the method to determine threshold values is developed, by using a large amount of customer enquiries in the real operational field.
© 2011 Elsevier Ltd. All rights reserved.
* Corresponding author. E-mail address: [email protected] (M. Iwashita).
0952-1976/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.engappai.2011.02.016

1. Introduction

Determining the cause of service problems is easy for a conventional fixed telephone service because the network structure is simple. Furthermore, telecom operators with know-how accumulated from long experience can act quickly. The penetration of fiber-to-the-home (FTTH) and asymmetric digital subscriber line (ADSL) technologies has induced the expansion of a variety of services, such as the exponential increase in use of the Internet, the provision of Voice over Internet Protocol (VoIP) and video distribution services, and security software countermeasures against virus attacks on PCs. Therefore, the end-to-end network structure has become complicated with regard to the connection of home equipment, such as a modem and its software setup. As a result, customer enquiries about problems such as no-connection to an Internet/phone service will increase, and discovering the causes becomes difficult. Customer satisfaction decreases when restoration takes a long time because the cause is difficult to discover. Therefore, it is important to analyze failure trends and establish, in advance, effective coping processes for complex problems conveyed in customer enquiries.

A customer enquiry is not classified as structured data, i.e., it is unstructured data. Let us take a customer Internet enquiry about no-connection as an example. There are several causes, such as failure of the optical fiber, modem misconfiguration, or problems with the application setup, and several work-arounds, such as switching the power supply off and on, rebooting, or dispatching the problem. Both cause and work-around situations are included in text such as "switch off and on the power supply because the modem is blinking" or "no-connection to the Internet, so reboot the PC". Therefore, it is important to classify the unstructured textual data with accuracy. Simply understanding failure trends, noting customer requirements when analyzing an enquiry, and then analyzing the word trends is not always effective. Since a telecom operator writes down information about a customer enquiry, the style of the description deeply depends on the operator. That is, the textual data contains several sentences including failure points, failure phenomena, causes, or results of work-arounds. Therefore, understanding the meaning of the sentences is essential.

There are no effective and practical methods to semantically analyze text that are applicable to telecom management. In that sense, a method with a structural type of description in syntactic, semantic and reasoning ontology levels is needed. A text classification method from a semantic point of view that considers the features of telecom services and the co-occurrence of terms with dependency parsing for classifying and analyzing the large amount of unstructured data that comprises customer
enquiries is presented. Moreover, our method can reconstruct semantic content efficiently by extracting related terms through analysis and classification. Section 2 is devoted to related works. The features of textual data in telecom services are explained in Section 3. Section 4 presents our classification method. The evaluation results are described in Section 5, and the application of the method in the real field is discussed in Section 6.
2. Related works

Text-mining techniques, such as morphological analysis, syntax analysis, and co-occurrence relation, have been reported as effective (Ohsumi, 2009; Sato et al., 2007; Sullivan, 2001; Toda et al., 2005). Text mining is applicable to customer questionnaire analyses in product development, word searches in portal sites such as Google and Yahoo, term frequency analyses in web logs (blogs) or customer-generated media, article classification by keyword in news articles, and evaluation indexes of a company's image. Morphological analysis is mainly applied in these areas to survey trends by analyzing the frequency of terms in selected text. In this analysis, a keyword is extracted as the topic of a sentence in terms of the features of the network structure (Cutting et al., 1992; Ho et al., 2001; Leuski, 2001; Masuo et al., 2001; Ohsawa et al., 1997). Clustering and co-occurrence related methods have been proposed for classifying keywords and relating them to synonymous terms (different words having the same meaning) and synonyms (words having similar meanings) (Rodoriguezd et al., 1998; Uejima et al., 2004). An improved method was proposed for synonymous term classification in fuzzy searches for the purpose of failure analysis (Naganuma et al., 2005). Clustering methods based on supervised learning have also been proposed (Akiba et al., 2006; Burnstein et al., 1998; Takahashi et al., 1992; Taira et al., 1998). They are mainly applied for searching the abstracts of research papers and for automated scoring of descriptive answers, and they are effective for searching for similarities between texts based on the given trend terms/information. Text similarity evaluation methods were introduced in Agrawal and Srikaut (1995), Kawatani (2003), Manning and Schutze (1999), and Sato and Nakagawa (2006); these evaluated sequential patterns of sentences or the term frequencies of words.
The relationship between products and childhood injuries, for the purpose of designing safe products, has been analyzed from evidence-based textual data from the viewpoint of text mining (Nomori et al., 2010). Since that research derived the relationship between product and injury from the term frequency of dependency parsing, it is effective for product designers who can spend a certain period of time thinking about and improving a product. However, since understanding the meaning of textual data with accuracy is essential, a two-term relationship is not enough for telecommunication services. Some business magazines have published articles introducing analysis work and support systems for customer enquiries at call centers. These articles described only the correlation analysis of terms. As far as the authors know, no research on understanding the meaning of such text has been reported. Therefore, a structural type of description in syntactic, semantic and reasoning ontology levels (Akama, 2010) is needed. Ontologies of information appliances have been studied (Ohnuma et al., 2010) for conveying the friendliness of information appliances to customers in the semantic web. The purpose of that research at the first step is to provide lexicons for describing information appliances in syntactic ontology level. These approaches are valid for verifying classified text, but knowing the meaning of the text is impossible.
3. Features of textual data in telecom services

3.1. Necessities of data classification and text mining

In general, a telecom operator saves a customer enquiry and the coping process as information. The aim of this is to enable the finding of similar problems by searching with related keywords and to enable quick action when such problems occur. These coping processes are effective for sharing knowledge among assigned telecom operators and for improving their skills. Therefore, this method is useful when problems happen. However, the drawback of this method is that it is impossible to obtain an overview of all the possible patterns of a problem and to establish coping processes for more complex problems in advance. Therefore, it is necessary to establish an effective coping process for customer enquiries, assign optimal operators through advance classification of customer enquiries, and survey failure trends. Information generally consists of text. It takes a long time to analyze text word by word and to classify large amounts of textual data. Therefore, an effective method based on text-mining techniques, such as term frequency analysis, determination of the number of synonymous words, and extraction of related terms, is needed.

3.2. Limitations of conventional methods

3.2.1. Morphological analysis

Figs. 1 and 2 show the relationships between term frequencies and their rankings for 10,000 customer enquiries about telecom services. The terms were classified and counted by morphological analysis, and those that appear more than 50 times are shown in Figs. 1 and 2. "Category A" is text relating to network components, and "Categories A-1 and A-2" are subsets of "Category A". Category A-1 includes network equipment, such as modems, PCs, and telephones, whereas Category A-2 is text relating to services, such as Internet and VoIP. "Category B-1" is text relating to problem events, such as failure, misconfiguration, and power faults, whereas "Category B-2" includes work-arounds, such as switching off and on the power supply, rebooting, and dispatching the problem. Both are subsets of "Category B". "Category 0" indicates terms analyzed without considering categories.
Fig. 1. Relationship between term frequencies and rankings: Category A.
Fig. 2. Relationship between term frequencies and rankings: Category B.
Categories 0, A, and A-1 show almost a 1/n feature in Fig. 1, so they follow a power law (Newman, 2005; Zipf, 1949), as is normal in general sentences. A power law implies that there are many kinds of terms in the textual data and means that there are many types of customer enquiries. Since the slope of Category A-2 is steeper than that of Category A-1 (Fig. 1), the terms used in the textual data for services are limited compared with the terms for network equipment. In contrast, the slopes of Categories B, B-1, and B-2 (Fig. 2) are gradual compared with those of Category A. This shows the variety of failure phenomenon terms. Since the slope of Category B-2 is almost the same as that of Category B-1, operators use the terms for both problem events and work-arounds depending on the situation. These results show that it is possible to survey only the terms with high frequency, such as PC, telephone, and Internet. However, this implies that classification in terms of problem events and work-arounds must be difficult.

3.2.2. Limitations of correspondence analysis

Correspondence analysis (Benzecri, 1992; Hayashi, 1993; Takahashi, 1996) is applied to classify terms into groups by term features. In our work, the terms with frequency rankings higher than 400 were selected from 10,000 customer enquiries. The results showed that the accumulated contribution rate was less than 20%, even considering the 10th factor, and most of the information was gathered in a small area. These results indicate the difficulty of describing features.

3.3. Categorization of textual data

Textual data contains several sentences, written by an operator, containing information about a customer enquiry, including the failure point/failure phenomena. This data for each customer enquiry is stored individually in an operation support system. Therefore, the style of the description deeply depends on the operator. That is, the description style is not regulated.
Let us take a customer Internet enquiry about no-connection as an example. There are several causes, such as failure of the optical fiber, modem misconfiguration, or application setup problem, and several work-arounds, such as switching the power supply off and on, rebooting, or dispatching the problem. Both cause and work-around situations are included in text such as ‘‘switch off
Fig. 3. Telecom features categorization.
and on the power supply because the modem is blinking" or "no-connection to the Internet, so reboot PC". To summarize, it is difficult to apply a text-mining technique directly to raw textual data in telecom management for semantic/structural classification. Therefore, we need a modification to distinguish the features of telecommunication services. A telecommunication service is generally provided by an end-to-end network consisting of a telephone, PC, the carrier's network, the provider's server, etc., as shown in Fig. 3. There is clearly an event feature for each element of the network. We can predict that the equipment, such as the telephone, PC, and network, is strongly related to the problem, such as failure, misconfigured setup, and cable breakdown, respectively. Moreover, the operator recommends to the customer an efficient work-around, such as switching off and on the power supply, rebooting the PC, or dispatching the problem to the provider/vendor. Therefore, by designating the network factor as one event (Category A) and the problem/work-around as the other event (Category B), we can construct a semantic representation. Moreover, if a problem occurs in one piece of equipment in a network, it is expected to lead to problems in the other equipment or to other events.

Next, we discuss the textual structure of a customer enquiry. To understand a customer enquiry and analyze the failure trend, it is desirable to include four kinds of information (Fig. 4) as follows.
1. Troubled equipment,
2. Problem event,
3. Affected equipment/service,
4. Work-around/related problem event.

Fig. 4. Textual structure for telecom enquiry.

How much information is input deeply depends on the operators. There are six patterns for a two-term relationship when we select two terms. "PC is misconfigured", "Modem trouble induces Internet trouble", "PC trouble induces no-connection", "Blinking affects Internet trouble", "Switch on and off the power supply because of blinking", and "Internet is not connected" correspond to P21, P22, P23, P24, P25, and P26, respectively. We can construct the meaning of the classified text, but we need some sentences to fill the gap between two terms. We have four patterns for a three-term relationship when we select three terms. "PC misconfiguration causes Internet trouble", "Switch on and off power supply because of modem blinking", "PC trouble affects no-connection to the Internet", and "Software reinstall because of security trouble" correspond to P31, P32, P33, and P34, respectively. It is possible to classify the text more precisely using three terms. These patterns can support the construction of semantic content for classified text and the establishment of coping processes.

Fig. 5. Framework of classification.
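The pattern counts of Section 3.3 (six two-term patterns, four three-term patterns) are what one obtains by choosing two or three of the four information slots of Fig. 4. The following is a minimal sketch of that counting. The slot names are taken from the list above, while the assignment of the labels P21 to P26 and P31 to P34 in enumeration order is an assumption made here for illustration and is not stated in the text.

```python
from itertools import combinations

# The four information slots of Fig. 4, in the order listed in the text.
SLOTS = (
    "troubled equipment",
    "problem event",
    "affected equipment/service",
    "work-around/related problem event",
)

def slot_patterns(k):
    """Return every way of choosing k of the four slots, in listing order."""
    return list(combinations(SLOTS, k))

two_term = slot_patterns(2)    # C(4, 2) = 6 combinations
three_term = slot_patterns(3)  # C(4, 3) = 4 combinations

# Assumption: labels are assigned in enumeration order (hypothetical mapping).
labels = {f"P2{i}": p for i, p in enumerate(two_term, start=1)}
labels.update({f"P3{i}": p for i, p in enumerate(three_term, start=1)})
```

Under this reading, the example "PC is misconfigured" pairs a troubled equipment with a problem event, which is the first combination in the enumeration.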
4. Classification method

4.1. Framework of classification

Since knowing the meaning of telecommunication text is difficult, as previously described, a structural type of description consisting of syntactic, semantic and reasoning ontology levels is needed, as shown in Fig. 5. The purpose of the syntactic ontology level definition in telecom services is to classify terms based on categories as network characteristics. A large number of customer enquiries is input. Morphological analysis and dependency parsing are effective preprocessing steps for classifying this input in terms of category. The requirement for customer enquiry classification is based on the ability to cover all textual data and to fit the operator's thought process. We want this classification to be determined from the viewpoints of term frequency, co-occurrence, and transition-rate calculation. Therefore, the purpose of the semantic ontology level definition is to correlate multiple terms. Term frequency can tell us what kind of customer enquiries often appear, while co-occurrence tells us which terms are strongly correlated. The transition-rate calculation tells us the relationship among multiple terms. The framework in which customer enquiries (textual data) are classified in terms of these three criteria is shown in Fig. 5. Term frequencies and co-occurrences of these selected terms are calculated and selected by threshold parameters as input. The purpose of the reasoning ontology level definition is to construct the meaning of a customer enquiry with selected terms efficiently. Therefore, we construct the semantic content for the classified text in order to establish coping processes at the final stage.

4.2. Term classification for telecom enquiries in syntactic ontology level

The terms are defined to be classified according to ALC (Attributive Concept Language with Complements) (Schmidt-Schauss and Smolka, 1991) in Description Logic as syntactic ontology level.

Formal definition of syntactic ontology level for telecom enquiry: Let $A$ and $B$ be defined as the sets of concept names by network component and problem/work-around, respectively, and let $\Omega \equiv A \cup B$, i.e., $\Omega$ is defined by $A \cup B$. Let $r$ be defined as a role such that $r \in \Omega \times \Omega$. Then the following are concepts for telecom enquiry.
(1) $\top$ (all), $\bot$ (none), and $K \in \Omega$; e.g., if $K$ is [provider's equipment], then $K \in \Omega$.
(2) $K \sqcap L$ (intersection of $K$ and $L$); e.g., [PC] $\equiv$ [terminal equipment] $\sqcap$ [Internet].
(3) $K \sqcup L$ (union of $K$ and $L$); e.g., [terminal equipment] $\equiv$ [PC] $\sqcup$ [Tel] $\sqcup$ [TV].
(4) $\neg K$ (complement of $K$); e.g., [modem] $\sqsubseteq$ $\neg$[PC].
(5) $\forall r.K$ (value restriction); e.g., $\forall$[Internet] $\times$ [trouble].[no-connection] (i.e., all "troubles" are "no-connection", and "no-connection" is a modifying word of "Internet").
(6) $\exists r.K$ (existential restriction); e.g., $\exists$[PC] $\times$ [trouble].[misconfiguration] (i.e., some "troubles" are "misconfiguration", and "misconfiguration" is a modifying word of "PC").

Preprocessing customer enquiry classification for semantic analysis is important in syntactic ontology level. The outputs of
this level are sets of textual data classified by access types and the relationship between terms and categories. The classification rules are Procedures 1 and 2, which are also shown in Figs. 6 and 7.

Procedure 1: Classification by morphological analysis and dependency parsing (Fig. 6):
- First, textual data is classified by morphological analysis.
- Each term that has a negative meaning is classified by dependency parsing (e.g., VoIP + not good).
- The same negative meanings, such as "VoIP + not good" and "VoIP + no-connection", are grouped and belong to the same term (e.g., VoIP).

The steps of morphological analysis and dependency parsing can make telecom operation work more efficient by grouping many synonymous terms and synonyms as the same term. Moreover, associated factors that complicate the situation include partial trouble, e.g., the Internet works whereas e-mail does not. In particular, dependency parsing with negative meaning can efficiently identify such situations. Therefore, this procedure is effective.

Procedure 2: Classification by type of access and category (Fig. 7):
- Each customer enquiry has a given code for the access type, i.e., type of service, such as dial-up, ADSL, and FTTH. Therefore, a customer enquiry is classified in terms of access according to the identifier code.
- The customer enquiry is classified in terms given by Procedure 1. Category A indicates the network component, such as a telephone, PC, or modem, while Category B indicates the problem event/work-around, such as no-connection, misconfiguration, or rebooting. The classification is allowed to contain several terms for one customer enquiry.
Fig. 6. Classification by morphological analysis and dependency parsing.
Fig. 7. Classification by category.
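As a rough illustration of Procedures 1 and 2, the sketch below assumes that morphological analysis and dependency parsing have already been run by some external tool and have produced (term, modifier) pairs for each enquiry. The lexicons `CATEGORY_A`, `CATEGORY_B`, and `NEGATIVE`, the function names, and the rule that an unmodified Category B term is kept are all hypothetical choices made for this example; the paper does not specify them.

```python
# Hypothetical lexicons for illustration; the actual term lists are not given in the text.
CATEGORY_A = {"Internet", "e-mail", "VoIP", "PC", "modem"}    # network components
CATEGORY_B = {"misconfiguration", "reboot", "no-connection"}   # problem events / work-arounds
NEGATIVE = {"not good", "no-connection", "does not work"}      # negative modifiers

def procedure1(dependencies):
    """Group negative expressions under their term.

    `dependencies` is a list of (term, modifier) pairs; modifier is None when
    the term has no dependency relation. ("VoIP", "not good") and
    ("VoIP", "no-connection") both collapse onto the single term "VoIP".
    """
    kept = set()
    for term, modifier in dependencies:
        if modifier in NEGATIVE:
            kept.add(term)
        elif modifier is None and term in CATEGORY_B:
            kept.add(term)  # assumption: problem/work-around terms stand alone
    return kept

def procedure2(access_code, terms):
    """Classify one enquiry by access type and by Category A / Category B."""
    return {
        "access": access_code,
        "A": sorted(terms & CATEGORY_A),
        "B": sorted(terms & CATEGORY_B),
    }

# Partial trouble: the Internet works whereas e-mail does not.
enquiry = [("Internet", "works"), ("e-mail", "does not work"), ("reboot", None)]
record = procedure2("FTTH", procedure1(enquiry))
```

In this toy case only "e-mail" survives as a Category A term, which mirrors the partial-trouble example in the text.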
Dial-up needs a wire line and a terminal adaptor (TA), while ADSL needs a modem instead of a TA. Moreover, FTTH needs an optical fiber and an optical network unit as a terminal. The troubles in telecom management deeply depend on the access type, because each type is built from different equipment, and it is necessary to relate the equipment and the problems. Therefore, Procedure 2, classification by access type, is essential. Since the number of access types is small, it has little effect on operation work in a real field.

4.3. Calculation based on co-occurrence in semantic ontology level

The relationships among multiple terms are defined to be classified as semantic ontology level.

Formal definition of semantic ontology level for telecom enquiry: Let $\mathcal{I}$ be defined as an interpretation satisfying the following.
(i) $\Delta$: domain of $\mathcal{I}$, i.e., the domain is defined as all term relationships among network elements, services, problem events and work-arounds in telecommunication services.
(ii) $\cdot^{\mathcal{I}}$: interpretation function as follows.
(ii-1) $x^{\mathcal{I}}$ is true for a given $x \in \Omega$ when there exists $z \in \Omega$ satisfying $C(x,z) \ge \alpha$, where $C(x,z)$ is a co-occurrence rate and $\alpha$ is a given threshold.
(ii-2) $r(x,y)^{\mathcal{I}}$ is true for a given $r(x,y) \in \Omega \times \Omega$ when there exists $z \in \Omega$ satisfying $C(x,z) \ge \alpha$ and $C(y,z) \ge \alpha$.

Then the following are interpretations of telecom enquiry.
(1) $\top^{\mathcal{I}} = \Delta^{\mathcal{I}}$, $\bot^{\mathcal{I}} = \emptyset$.
(2) $(K \sqcap L)^{\mathcal{I}} = K^{\mathcal{I}} \sqcap L^{\mathcal{I}}$; e.g., [PC]$^{\mathcal{I}}$ $\equiv$ [terminal equipment]$^{\mathcal{I}}$ $\sqcap$ [Internet]$^{\mathcal{I}}$ with [mis-input] $\in \Omega$.
(3) $(K \sqcup L)^{\mathcal{I}} = K^{\mathcal{I}} \sqcup L^{\mathcal{I}}$; e.g., [terminal equipment]$^{\mathcal{I}}$ $\equiv$ [PC]$^{\mathcal{I}}$ $\sqcup$ [Tel]$^{\mathcal{I}}$ $\sqcup$ [TV]$^{\mathcal{I}}$ with [power supply trouble] $\in \Omega$.
(4) $(\neg K)^{\mathcal{I}} = \Delta^{\mathcal{I}} \setminus K^{\mathcal{I}}$; e.g., [modem]$^{\mathcal{I}}$ $\subseteq$ ($\neg$[PC])$^{\mathcal{I}}$ with [no-connection] $\in \Omega$.
(5) $(\forall r.K)^{\mathcal{I}} = \{x \in \Delta^{\mathcal{I}} \mid \forall y \in \Delta^{\mathcal{I}}\,((x,y) \in r^{\mathcal{I}} \rightarrow y \in K^{\mathcal{I}})\}$; e.g., ($\forall$[Internet] $\times$ [trouble].[no-connection])$^{\mathcal{I}}$ with [misconfiguration] $\in \Omega$.
(6) $(\exists r.K)^{\mathcal{I}} = \{x \in \Delta^{\mathcal{I}} \mid \exists y \in \Delta^{\mathcal{I}}\,((x,y) \in r^{\mathcal{I}} \text{ and } y \in K^{\mathcal{I}})\}$; e.g., ($\exists$[PC] $\times$ [trouble].[misconfiguration])$^{\mathcal{I}}$ with [no-connection] $\in \Omega$.

The classification rules in semantic ontology level are as follows.

Procedure 3: Calculation of term frequency and co-occurrence (Fig. 8):
- The frequency of both terms, $A_i$ and $A_j$, $B_i$ and $B_j$, and $A_i$ and $B_j$, where $A_i \in A$ and $B_j \in B$, appearing in textual data is represented by $f(A_i,A_j)$, $f(B_i,B_j)$, and $f(A_i,B_j)$, respectively. Let us select a pair in terms of a frequency greater than $\beta$, where $\beta$ is a given threshold. $\beta$ is decided by the ratio of operation works for the issue per day/month from the management point of view. Therefore, problems of rare occurrence are eliminated. Then, calculate co-occurrence as follows:

$$C(A_i,A_j) = \frac{f(A_i,A_j)}{f(A_i) + f(A_j) - f(A_i,A_j)} \qquad (1)$$
for any $A_i$ and $A_j$ such that $A_i, A_j \in A$.

$$C(B_i,B_j) = \frac{f(B_i,B_j)}{f(B_i) + f(B_j) - f(B_i,B_j)} \qquad (2)$$
for any $B_i$ and $B_j$ such that $B_i, B_j \in B$.

$$C(A_i,B_j) = \frac{f(A_i,B_j)}{f(A_i) + f(B_j) - f(A_i,B_j)} \qquad (3)$$
for any $A_i$ and $B_j$ such that $A_i \in A$ and $B_j \in B$.

Fig. 8. Calculation of term frequency and co-occurrence rate in the same category.

Procedure 4: Transition among multiple terms (Fig. 9):
Step 1: Set $\alpha_i \le 1$, where $i = 1, 2$, as given thresholds.
Step 2: Set $\alpha = \alpha_1 \max_{i,j}\{C(A_i,A_j), C(B_i,B_j), C(A_i,B_j)\}$.
Step 3: Select a pair with Categories A and B satisfying $\alpha \le C(A_i,B_j)$, $A_i \in A$, and $B_j \in B$.
Step 4: Select a pair with Category A satisfying $\alpha \le C(A_i,A_j)$, $A_i, A_j \in A$, where $\alpha \le C(A_i,B_k)$, $A_i \in A$, and $B_k \in B$.
Step 5: Select a pair with Category B satisfying $\alpha \le C(B_i,B_j)$, $B_i, B_j \in B$, where $\alpha \le C(B_i,A_k)$, $B_i \in B$, and $A_k \in A$.
Step 6: Compare the frequency value ($f_c$) of the pair of terms selected by term frequency and that ($f_p$) of the first pair selected by Step 4 or 5. If the difference of these values, $|f_c - f_p| / f_c$, is greater than $\alpha_2$, revise the value of $\alpha$ as $\alpha_1 \alpha$ and go to Step 2, else end.

Fig. 9. Transition rate among multiple terms.
Fig. 10. Relationship construction rule for three terms.
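The co-occurrence rate used in Procedure 3 can be sketched as follows, reading Eqs. (1)-(3) as the joint frequency divided by the number of enquiries containing either term. The sketch assumes each enquiry has already been reduced to a set of terms by Procedures 1 and 2. The function name `cooccurrence`, the data layout, and the sample term sets are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(enquiries, beta=0):
    """Return {(x, y): C(x, y)} for term pairs whose joint frequency exceeds beta.

    `enquiries` is an iterable of term sets, one set per customer enquiry.
    C(x, y) = f(x, y) / (f(x) + f(y) - f(x, y)).
    """
    single = Counter()  # f(x): number of enquiries containing x
    joint = Counter()   # f(x, y): number of enquiries containing both x and y
    for terms in enquiries:
        terms = set(terms)
        single.update(terms)
        joint.update(combinations(sorted(terms), 2))
    return {
        pair: n / (single[pair[0]] + single[pair[1]] - n)
        for pair, n in joint.items()
        if n > beta
    }

# Made-up sample data using the paper's term labels.
enquiries = [
    {"A1", "B2"},
    {"A1", "B2", "A5"},
    {"A5", "B1"},
    {"A1"},
]
rates = cooccurrence(enquiries)
```

Pairs are keyed by their terms in sorted order, so a single dictionary covers the A-A, B-B, and A-B cases; splitting it by category is left to the caller.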
In general, $\alpha_1$ and $\alpha_2$ should be adjusted through many evaluation processes. We set $\alpha_1 = 0.9$ and $\alpha_2 = 0.1$ at the starting point, because it was difficult to tell how the number of pairs would increase or decrease. The effectiveness of Procedure 4 is explained as follows. For example, a customer makes a general complaint that he/she is not able to send e-mail from his/her PC because there is no connection with the Internet. In that case, the cause might not be in the PC but in the modem setup. Therefore, as a transitional way of thinking, we need related keywords to suggest other causes. The relationship among multiple terms should be clear. If we calculate all co-occurrence values between the pairs of terms in Categories A and B, we need a long calculation time: $n^2 \times n^2 = O(n^4)$. If we use Procedure 4, on the other hand, we can reduce the calculation time to $n \times n = O(n^2)$. This is because co-occurrence is calculated for each category as a unit.
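The threshold iteration of Procedure 4 might be sketched as below. Several points are assumptions made for this example because the procedure leaves them open: a same-category pair is accepted when either of its terms appears in a selected A-B pair, the "first selected pair" is taken to be the same-category pair with the highest co-occurrence rate, the threshold is lowered when no same-category pair has been selected yet, and the loop returns to the selection steps with the lowered threshold. The reference frequency `f_c` is passed in by the caller, and all names and sample values are hypothetical.

```python
def select_pairs(c_ab, c_aa, c_bb, alpha):
    """Steps 3-5 for one threshold value alpha."""
    ab = {p: c for p, c in c_ab.items() if c >= alpha}           # Step 3
    linked_a = {a for a, _ in ab}
    linked_b = {b for _, b in ab}
    aa = {p: c for p, c in c_aa.items()                           # Step 4
          if c >= alpha and linked_a & set(p)}
    bb = {p: c for p, c in c_bb.items()                           # Step 5
          if c >= alpha and linked_b & set(p)}
    return ab, aa, bb

def procedure4(c_ab, c_aa, c_bb, pair_freq, f_c,
               alpha1=0.9, alpha2=0.1, max_iter=100):
    """Lower alpha by the factor alpha1 until the Step 6 test passes."""
    alpha = alpha1 * max([*c_ab.values(), *c_aa.values(), *c_bb.values()])  # Step 2
    for _ in range(max_iter):
        ab, aa, bb = select_pairs(c_ab, c_aa, c_bb, alpha)
        same_category = {**aa, **bb}
        if same_category:
            # Assumption: the "first" pair is the one with the highest rate.
            first = max(same_category, key=same_category.get)
            f_p = pair_freq[first]
            if abs(f_c - f_p) / f_c <= alpha2:                     # Step 6
                return alpha, ab, aa, bb
        alpha *= alpha1
    return (alpha, *select_pairs(c_ab, c_aa, c_bb, alpha))

# Made-up co-occurrence rates and frequencies.
c_ab = {("A1", "B2"): 0.30, ("A5", "B1"): 0.20}
c_aa = {("A1", "A5"): 0.10}
c_bb = {("B1", "B2"): 0.02}
pair_freq = {("A1", "A5"): 950, ("B1", "B2"): 100}
alpha, ab, aa, bb = procedure4(c_ab, c_aa, c_bb, pair_freq, f_c=1000)
```

With these values the threshold starts at 0.27 and is lowered until the A1-A5 pair is admitted, at which point its frequency is within 10% of `f_c` and the iteration stops.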
4.4. Construction of semantic content with multiple terms in reasoning ontology level

To establish efficient coping processes, it is important to understand the meaning of a customer enquiry. Therefore, we need a construction method for semantic content with the selected terms in reasoning ontology level. Basically, we need four terms according to the textual structure shown in Fig. 4. However, it is difficult to predict the complete text because the description depends on the operators. The selected two terms with a high co-occurrence rate are effective for understanding the failure trend. Moreover, the terms fit one of the six patterns in a two-term relationship as described in Section 3.3. However, it is difficult to construct the meaning because it is unclear whether the selected two terms belong to the troubled equipment, problem event, affected equipment/services, or work-around/related problem event. Three or more terms are obtained by transition-rate calculation. Constructing semantic content is also difficult with the selected three terms. However, there are fewer patterns according to the textual structure in the case of a three-term relationship.

The meaning of text based on three terms is defined to be classified as reasoning ontology level.

Formal definition of reasoning ontology level for telecom enquiry: For both $(\forall r(x,y).K)^{\mathcal{I}}$ with $z \in \Omega$ and $(\exists r(x,y).K)^{\mathcal{I}}$ with $z \in \Omega$, the semantic content is constructed as one of the following four forms.
P31: "z" is caused by the fact that "x" is "K".
P32: "z" because "x" is "K".
P33: "z" causes that "x" is "K".
P34: "x" is "K" because of "z".

The above definition does not always cover all textual data for customer enquiries. This is because there exist sentences with different meanings even when the same relationship among the same terms is kept. The validity of this definition is evaluated in Section 5.3. An efficient semantic-content construction rule is proposed as Procedure 5 and shown in Fig. 10.
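The construction rule that Procedure 5 states can be sketched as a small function: the three selected terms are split by category, the pattern is chosen from how many Category A terms are present, and the term of the same-category pair that co-occurs more strongly with the remaining term is bound to it. The function name, the `frozenset`-keyed rate table, and the sample values are assumptions made for this illustration. Both candidate readings are returned, since the rule does not decide between the two directions.

```python
def procedure5(terms_a, terms_b, rate):
    """Return (pattern, [two candidate readings]) for three selected terms.

    terms_a / terms_b: lists of the selected Category A / Category B terms.
    rate: dict mapping frozenset({x, y}) to the co-occurrence rate C(x, y).
    """
    if len(terms_a) == 2 and len(terms_b) == 1:      # pattern 1 (P31, P33)
        pattern, pair, single = "pattern 1", list(terms_a), terms_b[0]
    elif len(terms_a) == 1 and len(terms_b) == 2:    # pattern 2 (P32, P34)
        pattern, pair, single = "pattern 2", list(terms_b), terms_a[0]
    else:
        raise ValueError("expected three terms, two of them from one category")
    x, y = pair
    # Bind the term that co-occurs more strongly with the single term.
    if rate[frozenset((x, single))] < rate[frozenset((y, single))]:
        x, y = y, x
    bound = f"({x} & {single})"
    return pattern, [f"{bound} -> {y}", f"{y} -> {bound}"]

# Made-up co-occurrence rates.
rate = {
    frozenset(("PC", "misconfiguration")): 0.4,
    frozenset(("Internet", "misconfiguration")): 0.1,
}
result = procedure5(["Internet", "PC"], ["misconfiguration"], rate)
```

Here the first reading, "(PC & misconfiguration) -> Internet", corresponds to the Section 3.3 example "PC misconfiguration causes Internet trouble".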
Procedure 5:
Step 1: Classification of text into pattern 1 (P31, P33) or pattern 2 (P32, P34):
- If the selected terms include two terms in Category A, then apply pattern 1, else pattern 2.
Step 2: Semantic-content construction for pattern 1:
- If $C(A_i,B_k) \ge C(A_j,B_k)$, then $(A_i \,\&\, B_k) \rightarrow A_j$, or $A_j \rightarrow (A_i \,\&\, B_k)$.
- If $C(A_i,B_k) \le C(A_j,B_k)$, then $(A_j \,\&\, B_k) \rightarrow A_i$, or $A_i \rightarrow (A_j \,\&\, B_k)$.
Step 3: Semantic-content construction for pattern 2:
- If $C(B_i,A_k) \ge C(B_j,A_k)$, then $(B_i \,\&\, A_k) \rightarrow B_j$, or $B_j \rightarrow (B_i \,\&\, A_k)$.
- If $C(B_i,A_k) \le C(B_j,A_k)$, then $(B_j \,\&\, A_k) \rightarrow B_i$, or $B_i \rightarrow (B_j \,\&\, A_k)$.

The patterns described in Section 3 are selected in each classification.

5. Evaluation of results

5.1. Coverage by categorization in syntactic ontology level

The evaluation point of Procedures 1 and 2 is the coverage rate of textual data by classified terms in terms of syntactic ontology level. Since the coverage rate by classified terms deeply depends
on the amount of textual data input, term frequencies are evaluated based on the data amount as a parameter at first. Then, the coverage rate by procedures 1 and 2 is evaluated. We used about 180,000 customer enquiries made during a month in 2007. The term frequency of categorized terms was calculated and ordered as shown in Fig. 11. Terms A1, A2, A3, A4, and A5 correspond to ‘‘Internet’’, ‘‘e-mail’’, ‘‘Telephone using IP technology (IP-Tel)’’, ‘‘PC’’ and ‘‘Modem’’ respectively, while B1, B2, B3, B4, B5, and B6 correspond to ‘‘misconfiguration’’, ‘‘blinking’’, ‘‘power supply on and off’’, ‘‘security trouble’’, ‘‘transmission speed degradation’’, and ‘‘line failure’’ respectively. The term frequencies for one month and for three days were compared. We confirmed that the results obtained using about 180,000 data items were almost the same as the results obtained using about 18,000 data items from three days. Therefore, the input for one month of data was stable. The coverage rate by the classification method based on procedures 1 and 2 was evaluated next. The proposed method with ten terms was compared with the conventional method by simply using morphological analysis and the top ten terms. The coverage rate on a daily basis is shown in Fig. 12. In general, the conventional method covers the data at a higher rate. This is because the number of enquiries including simply the given term is counted, whereas the proposed method counts only negative texts. However, the results show little difference between the two methods. Procedure 1 using dependency parsing neglects complicated text structures, including partial trouble, such as ‘‘Internet works whereas e-mail does not,’’ and classifies only real
Fig. 11. Validity of enquiry data.
problems. Therefore, the proposed Procedure 1 is effective for classifying customer enquiries.

5.2. Co-occurrence in semantic ontology level

The evaluation point of Procedures 3 and 4 is the co-occurrence rate of multiple terms as transition rate in terms of semantic ontology level. These procedures are compared with the conventional pair-of-terms frequency.

5.2.1. Co-occurrence in a category

The order of choice strongly depends on the threshold in classification procedures. The relationship between term frequency and co-occurrence in a category from about 180,000 customer enquiries is shown in Table 1. Pairs of terms are ordered by term frequency. For example, the pair of terms A1 and A4 is chosen first by term frequency. There are three cases of co-occurrence with the threshold as a parameter. The selected pairs of terms become similar with respect to frequency when the threshold decreases. Using the choice decided by co-occurrence was effective for representing the features of the text because we screened and chose pairs with high frequency by Procedure 3. The threshold is given a high value in the first step. If the number of pairs is small, $\alpha$ decreases in the second step. In this way, the iteration step of decreasing $\alpha$ enables the observation of relationships among terms.

5.2.2. Co-occurrence among categories

Co-occurrence among categories was calculated for the textual data under the given thresholds $\alpha = 0.05$ and $0.09$. Pairs of terms obtained by our proposed method are shown in Table 2 with the pairs of terms obtained by term frequency (conventional method). The frequency value means how many times a pair of terms appeared in 180,000 enquiries. The ratio of pure pairs means the number of enquiries including only selected terms. Let us first focus on the choice of pairs with the threshold as a parameter. The number of candidates increases when the threshold decreases. This is because of the weakness of co-occurrence.
Fig. 12. Coverage rate by categorization.

Table 1 Choice of pair of terms.

Pair of terms   Term frequency   Co-occurrence
                                 a = 0.09   a = 0.05   a = 0.02
A1 & A4         1st choice       2nd        2nd        2nd
A1 & A5         2nd              3rd        3rd        3rd
A1 & A3         3rd              –          4th        4th
A4 & A5         4th              1st        1st        1st
A3 & A5         5th              –          –          5th
B2 & B3         1st choice       1st        1st        1st
B1 & B4         2nd              2nd        2nd        2nd
B1 & B2         3rd              –          3rd        3rd
B1 & B3         4th              –          –          5th
B2 & B4         5th              –          –          –
B3 & B4         6th              –          4th        4th

Table 2 Comparison of two methods.

No.  Proposed method                                                             Conventional method
     a = 0.09          a = 0.05
     Pair of terms     Pair of terms         Frequency  Ratio of pure pairs (%)  Pair of terms  Frequency  Ratio of pure pairs (%)
1    A5 & B1           A5 & B1               5065       51.4                     A1 & B2        7981       66.9
2    A1 & B2           A1 & B2               7981       66.9                     A1 & B1        5978       51.1
3    A4 & B1           A4 & B1               4050       43.6                     A5 & B1        5065       51.4
4    A1 & B1           A1 & B1               5978       51.1                     A4 & B1        4050       43.6
5    (A4 & A5) & B1    A5 & B3               1264       29.3                     A3 & B2        2377       62.9
6    (A1 & A4) & B1    A5 & B2               2142       28.6                     A1 & B3        2364       30.5
7    (A1 & A5) & B1    A3 & B2               2377       62.9                     A5 & B2        2142       28.6
8    –                 A4 & B4               994        45.2                     A5 & B3        1264       29.3
9    –                 (A4 & A5) & B1        1100       41.8                     A3 & B1        1194       60.4
10   –                 (A1 & A4) & B1        1492       56.6                     A3 & B3        1127       41.7
11   –                 (A1 & A5) & B1        1340       47.9                     A1 & B4        1063       42.6
12   –                 A5 & (B2 & B3)        315        34.3                     A4 & B4        994        45.2
13   –                 (A1 & A5) & B2        1188       69.0                     A5 & B4        686        22.9
14   –                 (A1 & A3 & A5) & B2   227        31.3                     A1 & B6        558        63.4
15   –                 A4 & (B1 & B4)        336        62.8                     A4 & B3        556        35.6

Table 3 Coverage rate with co-occurrence.

Pair of terms                 Ratio of pure pairs (%)   Coverage rate (%)
A4 & B1, A5 & B1              43.6, 51.4
(A4 & A5) & B1                41.8                      60.2
A1 & B1, A4 & B1              51.1, 43.6
(A1 & A4) & B1                56.6                      66.3
A1 & B1, A5 & B1              51.1, 51.4
(A1 & A5) & B1                47.9                      65.6
A5 & B2, A5 & B3              28.6, 29.3
A5 & (B2 & B3)                34.3                      35.3
A1 & B2, A5 & B2              69.9, 28.6
(A1 & A5) & B2                69.0                      75.9
A4 & B1, A4 & B4              43.6, 45.2
A4 & (B1 & B4)                62.8                      51.5
A1 & B2, A3 & B2, A5 & B2     66.9, 62.9, 28.6
(A1 & A3 & A5) & B2           31.3                      74.6

Table 4 Three-term relationship.

Pair of terms           Pattern     Concordance rate (%)
(A4 & A5) & B1          P31, P33    81.5
(A1 & A4) & B1          P31, P33    90.2
(A1 & A5) & B1          P31, P33    86.9
A5 & (B2 & B3)          P32, P34    93.3
(A1 & A5) & B2          P31, P33    97.1
(A1 & A3 & A5) & B2     P31, P33    80.2
A4 & (B1 & B4)          P32, P34    81.0

The number of pairs with a transition rate grows when the threshold decreases. Let us compare the proposed method and the method using only term frequency for selecting pairs of terms. As for the threshold value, the frequency of the 5th choice obtained by pair-of-terms frequency was 2377, while that of the 5th choice obtained by the proposed method was 1100 with a = 0.09. The choice obtained by term frequency was appropriate in this case because the difference between these two choices was large. Therefore, a should be reduced to a smaller value. The frequency
of the 9th choice obtained by pair-of-terms frequency was 1194, while that obtained by the proposed method was 1100 when a = 0.05. In this case, the difference was small. Therefore, the choice obtained by co-occurrence is appropriate, and a can be determined from the difference between the frequency obtained by pair-of-terms frequency and that obtained by the proposed method.

5.3. Construction rule evaluation of three-term relationship in reasoning ontology level

The pairs of two terms selected by the proposed method had a high co-occurrence rate and term frequency. We compared the pairs of three terms obtained by the proposed method with the pairs of two terms obtained using term frequency from the 9th to the 15th choice. A pair of two terms cannot always construct applicable semantic content because its frequency is sometimes small compared with that of a pair of three terms (A3 & B3 and (A1 & A4) & B1, for example, as shown in Table 2). Moreover, the coverage rate by pure pairs, that is, by text consisting of only the selected terms, was evaluated as shown in Table 3. Taking pairs of two terms as an example, the coverage rates of A4 & B1 and A5 & B1 are 43.6% and 51.4%, respectively. If we also consider (A4 & A5) & B1, the coverage rate becomes 60.2% for A4 & B1 ∨ A5 & B1 ∨ (A4 & A5) & B1, where “∨” denotes logical disjunction. The coverage rate mostly improved to more than 50% when pairs of three terms were used, while the pairs of two terms chosen by term frequency did not reach a high coverage rate. The semantic content constructed from pairs of two terms is sometimes ambiguous, while pairs of three terms can construct semantic content more concretely. Therefore, introducing pairs of three terms by the proposed method is effective for classifying semantic content.

Finally, we discuss the validity of constructing semantic content from the selected three terms. We calculated the concordance rate between the constructed meaning and the classified text by applying the method explained in Section 3. In the case of (A4 & A5) & B1, which corresponds to patterns P31 or P33, a high concordance rate of 81.5% was obtained. We also obtained high concordance rates for the other six pairs in Table 4. This suggests that the coping processes can be efficiently established.
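The ratio of pure pairs and the coverage rate by disjunction discussed in this section can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation. It assumes that every enquiry is reduced to the set of category terms it contains, that a “pure” enquiry is one whose term set equals a selected pattern exactly, and that the coverage rate is taken over all enquiries containing at least one of the two-term pairs. The choice of denominator, the helper names, and the toy data are all assumptions.

```python
def pure_ratio(enquiries, pattern):
    """Share of enquiries containing `pattern` that contain nothing else."""
    pattern = frozenset(pattern)
    containing = [e for e in enquiries if pattern <= e]
    pure = [e for e in containing if e == pattern]
    return len(pure) / len(containing) if containing else 0.0


def coverage_rate(enquiries, two_term_pairs, patterns):
    """Share of enquiries containing any two-term pair whose term set
    equals one of `patterns` (a disjunction of pure patterns)."""
    pairs = [frozenset(p) for p in two_term_pairs]
    allowed = {frozenset(p) for p in patterns}
    base = [e for e in enquiries if any(p <= e for p in pairs)]
    covered = [e for e in base if e in allowed]
    return len(covered) / len(base) if base else 0.0


# Hypothetical toy data (not the enquiry data used in the paper).
toy = [frozenset(s) for s in (
    {"A4", "B1"}, {"A4", "B1"}, {"A5", "B1"},
    {"A4", "A5", "B1"}, {"A4", "B1", "B4"}, {"A1", "A5", "B1"},
)]
two = [("A4", "B1"), ("A5", "B1")]
three = [("A4", "A5", "B1")]

print(pure_ratio(toy, ("A4", "B1")))
print(coverage_rate(toy, two, two))
print(coverage_rate(toy, two, two + three))
```

In the toy data, adding the three-term pattern to the disjunction raises the coverage because enquiries that mention both A4 and A5 with B1 are no longer left unclassified.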
6. Application to real operation

This section describes the role of an operation support system (OSS) that embeds our method and its application to the workflow in the real field. Telecom operators have two roles. One is the front operator, who deals with customers, and the other is the back operator, who solves technical problems with a high level of skill. Telecom management generally involves planning work and routine work as operations. Since the OSS consists of our method, a knowledge database, and textual data, it is effective for both kinds of work, as shown in Fig. 13. In particular, our method focuses on planning work. A back operator can analyze textual data using our method through the OSS (Op-1) and plan coping processes in advance (Op-2). Therefore, the coping processes stored in the knowledge database of the OSS can be searched easily and efficiently by a front operator in routine work (Op-3). Moreover, a front operator can improve the operation manual as feedback (Op-4). This means that a unified representation of textual data is possible for front operators, based on the text structure with the four-term relationship described in Section 3 (Op-5). Therefore, the analysis time of back operators is expected to be reduced by using the unified representation of textual data through the OSS.

Fig. 13. System and workflow for classification.

In general, customer enquiries change when a new service is provided, new network equipment is installed, or a new problem event occurs. Therefore, it is necessary to update the category input (Fig. 5) to follow these changes over time. The threshold parameter for term frequency also needs to be adjusted. Hence, new indications for coping can be discovered through the proposed method.
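When the threshold is revisited, the frequency-gap comparison from Section 5.2.2 can be reused. The short Python sketch below recomputes that comparison from the frequencies reported in Table 2. The helper name `frequency_gap` and the tolerance of 100 are hypothetical and only illustrate how a “small” difference might be checked; the paper does not state a numerical cut-off.

```python
def frequency_gap(conventional_freq, proposed_freq):
    """Difference between the k-th choice frequencies of the two methods."""
    return conventional_freq - proposed_freq


# Frequencies taken from Table 2: (k-th choice, conventional, proposed).
cases = {
    0.09: (5, 2377, 1100),  # 5th choice at a = 0.09
    0.05: (9, 1194, 1100),  # 9th choice at a = 0.05
}
TOLERANCE = 100  # ASSUMPTION: illustrative cut-off, not from the paper

gaps = {a: frequency_gap(conv, prop) for a, (_, conv, prop) in cases.items()}
acceptable = [a for a, gap in gaps.items() if gap <= TOLERANCE]
print(gaps)        # {0.09: 1277, 0.05: 94}
print(acceptable)  # [0.05]
```

Under this assumed tolerance, only a = 0.05 is accepted, which agrees with the conclusion drawn from Table 2.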
7. Conclusion

A classification technique for customer enquiries is needed because of the increasing complexity of connections in end-to-end networks in the telecom operating field. We presented a method for analyzing and classifying customer enquiries, based on the structural type of description, that enables quick and efficient responses. Because customer enquiries are generally stored as unstructured textual data, the method is based on dependency parsing and co-occurrence techniques to enable classification of a large amount of unstructured data into patterns. Moreover, a construction method for semantic content based on a comparison of the co-occurrence rates among selected terms was proposed. We applied the proposed method to around 180,000 customer enquiries and found it to be effective not only for understanding the meaning of enquiries but also for obtaining an overview of all the possible patterns of a problem, so that coping processes can be established in advance.
Acknowledgements

This work was conducted while the first author was working at NTT Service Integration Laboratories.