Information Sciences 415–416 (2017) 269–282
Information Sciences journal homepage: www.elsevier.com/locate/ins
Quantifying textual terms of items for similarity measurement
Longquan Tao∗, Jinli Cao, Fei Liu
Department of Computer Science and Information Technology, La Trobe University, 1 Kingsbury Dr, Bundoora, VIC 3083, Australia
Article info
Article history: Received 10 October 2016; Revised 19 June 2017; Accepted 23 June 2017; Available online 26 June 2017
Keywords: Recommender system; Item similarity; Dimension classification; Textual attribute quantifying
Abstract
It is well known that recommender systems rely on the similarity between the items to be recommended. Most current research projects in this area utilize traditional similarity measurement algorithms, such as cosine distance or its derivatives. However, the most challenging problem facing these approaches is quantifying the non-numerical attributes of items, which regular similarity measurement algorithms cannot solve. This paper proposes two novel methods, the Taxonomic Trees Similarity Measurement (TTSM) and the Decomposed Structures Similarity Measurement (SDSM), so that the similarities between textual attributes can be measured using numeric values after they have been quantified. The quantifying process is based entirely on the semantic meanings of the textual terms. Furthermore, a maximized term matching (MTM) mechanism is introduced and applied to the group-based textual attributes of items in recommender systems. Finally, we evaluate our methods by implementing a recipe recommender system, which achieves a 74.4% overall satisfaction rate as evaluated by real users. © 2017 Elsevier Inc. All rights reserved.
∗ Corresponding author. E-mail address: [email protected] (L. Tao).
http://dx.doi.org/10.1016/j.ins.2017.06.030
0020-0255/© 2017 Elsevier Inc. All rights reserved.

1. Introduction

With the development of Internet technologies, the volume of information online has grown enormously. Consequently, it has become very difficult to find information of interest on the Internet. Additionally, it is desirable for items of interest to be actively recommended to users when they are making a decision. Recommender systems are a type of information system that can make intelligent decisions based on known information associated with each user, such as the user's profile, preferences, buying/browsing history and even context such as the weather and the time of day. The items to be recommended are general items relevant to the corresponding recommender system, usually information of interest, consumables or types of services [38]. Moreover, personalization is always a desirable feature in a recommender system.

Within this large-scale research area, most current research projects focus on the accuracy of recommendations, which is based on measuring the similarity between items using traditional algorithms or their derivatives, such as Euclidean distance and cosine similarity. These traditional techniques, such as the Pearson correlation coefficient, cosine similarity and the distance-based similarity mentioned in [18], all require the terms being measured to be numbers. Although several algorithms are able to measure distances between text terms, such as Jaccard distance [2] and Levenshtein distance [25], they do not take the semantic meaning of the text terms into consideration. In addition to these morphology-based algorithms, ontology plays a critical role in measuring the similarities of the textual attributes of items for recommender systems, for example the LCA ontology for scenario-based recommender systems [42] and the use of big data as a knowledge base to support ontology-based similarity measurement [17]. Nonetheless, these ontology methods still have limitations because they rely heavily on large knowledge bases.

Therefore, in order to retain the accuracy of numerical similarity measurement techniques while also taking the semantic meanings of textual attributes into account, quantifying these textual terms is the main objective and motivation of this paper. This paper proposes two novel methods of similarity measurement from a new perspective: an item's similarity measurement can be improved by taking semantic taxonomies and structuralized decomposition into account. The original problems are thereby converted into domain-specific problems, which enables recommender systems to capture the meanings of these terms through their natural characteristics.

2. Related work

Since the 1990s, recommender systems have been applied in numerous areas, such as e-commerce, and in industries such as tourism, film and music. There are two major approaches to collecting users' preferences, namely explicit and implicit [7]. The former involves explicitly collecting users' ratings of items, which is the most commonly utilized method. The latter involves strategies that do not depend on users' ratings, such as monitoring their behaviours, or context-aware methods that take the circumstances of the users (e.g. geographic location) as the parameters and criteria for recommendations [49]. It is commonly believed that the most critical component of a recommender system is the filtering method that filters and ranks the items to be recommended to users. The most widely accepted classifications are presented in a survey paper [23] as follows:
• content-based filtering, which assumes users will prefer items with similar features [6,21,35],
• collaborative filtering, which assumes that pairs of users are correlated based on their preferences [23,28,37],
• matrix factorization, which is a subcategory of collaborative filtering but utilizes matrix decomposition [26,35,46],
• knowledge-based filtering, which verifies item features and user demands against a knowledge base that is usually built on an ontology [15,32,41],
• demographic filtering, which analyses the attributes of users in their profiles, such as gender and age [40,44],
• hybrid filtering, which combines any of the above to produce more accurate results [10,36,48].
In order to conduct item content-based filtering, the similarity measurement between items is the most critical technique. Ref. [21] suggests utilizing LOD (Linked Open Data) to analyse the content of the items, so that the dimensions obtained form a VSM (Vector Space Model) in which cosine distances are calculated. Although the approaches for discovering the dimensions vary, the similarity measurement algorithms themselves remain common and conservative. For example, cosine distance is also used by Cappella et al. [14] and [22], while [9] utilized Jaccard distance as a component to improve their kNN algorithms, and [31] relied solely on probability models. Regarding these conservative similarity measurements, [20] argued that the traditional methods implicitly assign equal weights to all the features of the items, i.e. the dimensions. However, the latent criteria by which users judge whether items are preferable usually differ in importance. For example, the price of a camera is subjectively more important than its colour for most customers. Therefore, they proposed a weighted similarity measurement as follows:
$$S(O_i, O_j) = \omega_1 f(A_{1i}, A_{1j}) + \omega_2 f(A_{2i}, A_{2j}) + \cdots + \omega_n f(A_{ni}, A_{nj}) \tag{1}$$
where $\omega_1, \omega_2, \ldots, \omega_n$ are the weights of the features $A_1, A_2, \ldots, A_n$ of the two items $O_i$ and $O_j$, and $f$ is the similarity measure, which is either cosine distance or another standard one. They then analysed all the existing rated items in their database to obtain a weight for every dimension, which improved both accuracy and recall considerably compared to the equally weighted approaches. However, the inherent issue facing these algorithms is that all the dimensions need to be quantified and mapped into numbers. Additionally, string distance measurements are still not able to reflect the nature of the features of items, since they only measure the appearance of the words rather than their real meanings. Semantic-based methods are able to address this issue. Ref. [30] proposed a semantic-based news recommender system that utilizes a knowledge base to extract the concepts mentioned in the news, as well as the preferred concepts of the users. Specifically, text snippets from the news are formed into vectors based on their concepts as follows:
$$SemRel(t_i, t_j) = \cos(V_i, V_j) = \frac{V_i \cdot V_j}{\|V_i\| \cdot \|V_j\|} \tag{2}$$

$$V_l = \{c_1^l, w_1^l, c_2^l, w_2^l, \ldots, c_p^l, w_p^l\} \tag{3}$$
where $t_i$ and $t_j$ are the two pieces of news whose similarity is being calculated. They are converted to vectors $V_i$ and $V_j$ in order to utilize the classic cosine similarity. For any vector $V_l$ where $l \in \{i, j\}$, $c_1, c_2, \ldots, c_p$ denote the concepts extracted from the text, and $w_1, w_2, \ldots, w_p$ are dynamic coefficients based on the user preferences, whose values fall into $[0, 1]$.
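Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It is not taken from [20] or [30]: the function names `sem_rel` and `weighted_similarity`, the dictionary representation of a concept vector, and the rule that a concept missing from one vector has weight 0 are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, Sequence


def sem_rel(v_i: Dict[str, float], v_j: Dict[str, float]) -> float:
    """Cosine similarity of two concept vectors, as in Eq. (2).

    Each vector maps a concept to its weight in [0, 1]. A concept that is
    absent from a vector is treated as weight 0 (an assumption).
    """
    dot = sum(w * v_j.get(concept, 0.0) for concept, w in v_i.items())
    norm_i = math.sqrt(sum(w * w for w in v_i.values()))
    norm_j = math.sqrt(sum(w * w for w in v_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_i * norm_j)


def weighted_similarity(
    item_i: Sequence,
    item_j: Sequence,
    weights: Sequence[float],
    f: Callable[[object, object], float],
) -> float:
    """Weighted sum of per-feature similarities, as in Eq. (1)."""
    return sum(w * f(a, b) for w, a, b in zip(weights, item_i, item_j))


def exact_match(a: object, b: object) -> float:
    """Hypothetical per-feature measure: 1 if the values are equal, else 0."""
    return 1.0 if a == b else 0.0


# Two cameras described by (price band, colour); price is weighted higher.
camera_a = ("mid-range", "red")
camera_b = ("mid-range", "blue")
print(weighted_similarity(camera_a, camera_b, (0.7, 0.3), exact_match))  # 0.7
```

With an exact-match feature measure the result is only as fine-grained as the weights allow, which is the limitation the rest of this paper addresses for textual features.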
Therefore, semantic relatedness is calculated based on the ontology concept vectors. Similarly, [1,13,34] utilized an ontology-based similarity measurement for topic matching. Additionally, [12] suggested taking other users' real-time activities into account in order to build a general context-aware, ontology-based news recommender system on top of user profiles. However, the ontologies used in these research projects are all utilized for extracting the main topics or concepts from free text rather than the attributes of entities. At the same time, other distance or similarity measurement techniques for text terms either rarely take semantic meanings into account, or rely completely on external support that needs more information from the environment, such as location, emotion and other context parameters.
• Jaro is a technique which considers the number and sequence of common letters in text terms. It is generally popular in word typing derivation research such as [4] and [5].
• N-gram algorithms detect the subsequences of characters in the text terms, compute the n-grams for sub-combinations in their original order, and then calculate the final similarity by dividing the number of similar n-grams by the maximal number of n-grams [11,19].
• The Hamming distance-based approximate similarity text search (HASTS) algorithm proposed in [29] improved the quality of queries over massive text data in order to reduce the size of the candidate set. The logic behind it is quite similar to zip techniques, which utilize a text "fingerprint" to measure similarities.
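The n-gram measure described in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the use of character bigrams, lower-casing, and the treatment of n-grams as a set are choices made here, not details given in [11,19].

```python
def char_ngrams(term: str, n: int = 2) -> set:
    """Return the set of character n-grams of a term, in original order."""
    text = term.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Number of shared n-grams divided by the maximal number of n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    largest = max(len(grams_a), len(grams_b))
    if largest == 0:
        return 0.0  # both terms are shorter than n characters
    return len(grams_a & grams_b) / largest


print(ngram_similarity("steak", "steam"))  # 0.75, similar spelling
print(ngram_similarity("beef", "veal"))    # 0.0, no shared bigrams
```

The second call shows the limitation in question: "beef" and "veal" are closely related as food terms, yet a purely character-based measure scores them as entirely dissimilar.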
All these approaches consider the appearance of text terms rather than taking the semantic meaning of the text into account [24].
• A context-aware autoencoder, which is an extension of the basic autoencoder in [3], improved the word sense disambiguation ability by deeply considering the document's context.
• Hyperspace Analogue to Language (HAL) relies on a semantic space built from word co-occurrences. The words are organized into a matrix together with entropies that determine the probability of the similarity between those words, so that the highest entropy means the words that represent the row and the column are quite similar [16,39].
• Normalized Google Distance (NGD) utilizes the Google search engine logs to determine the similarities between words. Specifically, the similarities are defined by the number of clicks and user satisfaction, which reflect underlying correlations between the words that may not be evident on the surface [19].
Although the aforementioned methods measure the similarities between text terms based on their semantic meaning, they require a great amount of context information, either from the circumstances of the users of the items or from the third-party services that they depend on, which might not be available because of a lack of sensors or unreliable services. The approach taken by this paper proposes that the similarities of the textual attributes should be measured using numeric values after they have been quantified. The quantifying process is based entirely on the semantic meaning of the textual terms. The next section introduces the proposed methodology to resolve this problem.

3. Methodology

It is well known that the items in general recommender systems vary in type. Consequently, their attributes, which are also called dimensions or fields, are of different types. In this section, the dimensions of various items are first distinguished and classified based on their idiosyncrasies. After this, the concept of textual attributes (or textual dimensions) is clarified. Since textual attributes are the main obstacle in this research area, we propose two novel and systematic strategies: the taxonomic tree-based similarity measurement (TTSM) and the structuralized decomposition-based similarity measurement (SDSM). These two strategies target the two different types of textual attributes, and together cover all types of domain-specific textual attributes that might appear in the items of any recommender system. After this, another algorithm is introduced to apply these two methodologies to a specific type of dimension, namely the group-based textual dimension.

3.1. Attribute types classification

In any recommender system, the similarity between items plays a critical role in measuring how much users may like or dislike the items, especially in content-based recommender systems, where the similarity values are the mainstay of the criteria for recommendations. However, the dimension types of the items used for these measurements have rarely been analysed systematically in previous research projects. Instead of concentrating on the algorithms only, clarifying the different types of dimensions greatly benefits algorithm design. Specifically, classifying dimension types helps avoid over-generalisation and allows accurate solutions to be developed by drilling down into more specific problems.

This research proposes that dimensions can be classified into seven categories based on their intrinsic forms. In general, a dimension can be either numeric or textual. Numeric attributes can be either continuous or discrete numbers. Textual fields can be classified according to whether there are any standards or constraints on how they are composed: they are either domain-specific textual attributes, or free text that is completely free-style without any restrictions. Furthermore, depending on their composition, both numeric and textual dimensions are further classified into single or grouped attributes. While a single attribute has only one value, a grouped attribute is composed of a set or series of values. The hierarchy and relationships between these classifications are shown in Fig. 1.

Fig. 1. Item dimensions classifications.

Therefore, the item dimensions in any recommender system can be classified into seven atomic types:
1. Single domain-specific text dimensions. This type of field contains only one textual value which is domain-specific, e.g. a movie's country of origin or the language used.
2. Grouped domain-specific text dimensions. This type of field contains multiple textual values which are domain-specific, e.g. the names of a movie's actors or the ingredients of a recipe.
3. Free text dimensions. This type of field contains a snippet of general text, e.g. users' comments on a product.
4. Single continuous number dimension. This type of field contains only one continuous numeric value, e.g. the price of a product.
5. Grouped continuous number dimension. This type of field contains multiple continuous numeric values, e.g. the RGB values of a website's main theme, which are three values that could be any number between 0 and 255.
6. Single discrete number dimension. This type of field contains only one discrete numeric value, e.g. users' ratings of items.
7. Grouped discrete number dimension. This type of field contains multiple discrete numeric values, e.g. users' ratings of different aspects of a movie, such as sound effects, visual effects, plot and acting skills, which are usually represented by a star graph.

In the above list, dimension types 4–7 have been well researched, and the solutions have proven to be robust, since the difference between two numbers provides an inherent distance and numbers are hence relatively easy to measure.
For example, [8] extended the traditional k-neighbours approach with the contextual information of users at the time they were rating items, [27] utilized a Bayesian similarity measurement for their recommender system, and [18] improved the Pearson correlation coefficient and cosine distance by considering the target items when measuring distances between two users. On the other hand, dimension types 1–3 are text fields, which are thought to be unquantifiable. Although numerous studies focus on ontology, their solutions are over-generalized and not sufficient to solve the problems in recommender systems. Specifically, the items of any single recommender system usually contain a number of domain-specific textual attributes, such as the genres of books, whereas the common ontology methods utilize generic tools such as WordNet or ConceptNet, which do not cope effectively with these domain-specific problems. This motivates us to develop approaches that overcome these problems within the scope of recommender systems. In other words, the approaches should be generic within this research area, while resolving the quantification issues of the domain-specific text dimensions of any recommender system.

3.2. Taxonomic tree-based similarity measurement (TTSM)

It is widely accepted that there are numerous ambiguities in the real world, and their underlying cause is ambiguity in language. For example, one can use several words such as "beef", "veal" and "steak" to describe an almost identical concept, while the same words may also be used to describe thoroughly different food. In fact, mapping such implicit words onto explicit data that can be utilized by computer programs has always been a challenging issue in Natural Language Processing (NLP) research.
Fig. 2. Taxonomic tree architecture example.
A general ontology is effective to some extent, as it uses general language tools such as WordNet. Moreover, some NLP algorithms are able to determine the real meaning of words within their context when there is sufficient information to serve as clues for derivation. However, there are two disadvantages when applying traditional NLP methods to similarity measurement for recommended items with textual fields. Firstly, textual fields are domain-specific and should be resolved by domain-specific methods, which outperform general ones in terms of speed and accuracy. Secondly, the text attributes of items are usually short phrases or single words, which provide far less informative context than other scenarios such as a document. This makes it extremely difficult to find clues with which to derive meanings and filter out ambiguities.

Apart from these two obstacles for general NLP approaches, there is another barrier which has confounded this research area for a long time: the quantification of textual dimensions. Specifically, even though general NLP tools are sometimes able to infer the correct meanings of textual terms, the only method that traditional item similarity measurement can utilise is string matching. This results in a "black and white" determination. Although it is intuitively reasonable for an algorithm to count the co-occurrences between the items' grouped domain-specific text dimensions, this is still not correct, since co-occurrences are greatly limited in formulating the "similarity" between terms. For example, consider a job recommender system with two positions: iOS Application Developer and Android Application Developer. They may have a grouped domain-specific text dimension called "programming skills", which is composed of the programming languages that applicants must know in order to apply for the job. If the iOS job has "Objective-C" in this field while the Android job has "Java", then the traditional algorithms will treat these two programming languages as completely different. However, both are object-oriented programming languages which share many traits, and an experienced programmer who has mastered one of them is very likely to be able to program in the other. Consequently, in this example, the similarity between "Objective-C" and "Java" should not be 0, but a number that falls into the (0, 1) interval.

In order to address this problem, we propose a domain-specific method, the taxonomic tree-based similarity measuring technique, which is able to address the three aforementioned disadvantages of traditional NLP methods. Generally, for each realm of recommender system items, the textual attributes of certain fields are all domain-specific, such as the genre of movies. Based on this fact, it is conceivable that a domain taxonomy can be found which maintains all the possible textual terms of the field. The taxonomic tree-based similarity measurement (TTSM) is therefore a domain-specific algorithm that is capable of quantifying the similarities between textual terms by utilizing the features of the tree.

First, the structure of the tree needs to be designed. Unlike traditional trees as a kind of data structure, a taxonomic tree has a number of additional concepts to accommodate the textual terms of recommender system items. In other words, in addition to ancestors, offspring and leaf nodes, a taxonomic tree has more specific types of nodes, as shown in Fig. 2. The concepts in taxonomic trees are as follows:
1. ROOT (Root). Similar to traditional data trees, taxonomic trees must also have a root, which represents the cornerstone of the domain for a particular item's textual attribute.
2. Category (C). Categories are general terms that must contain offspring. They might be terms that appear as item textual attributes, but more probably will only find their meanings by classifying offspring nodes.
3. Direct node (D). These nodes are the direct offspring of categories, which means that they must not have any ancestors which are other direct nodes or indirect nodes.
4. Indirect node (ID). These nodes are the offspring of either direct nodes or other indirect nodes, which means that their immediate ancestors must not be categories.
5. Transformation (T). Occasionally, some terms are transformed from other terms within the domain. In this case they usually inherit fewer traits from their ancestors, although they do share a few similarities with them. The process of transforming terms into other terms is called transformation in taxonomic trees.
6. Transformed node (TID). These are the nodes that are transformed from their ancestors. Transformed nodes must be indirect nodes, since they have at least one ancestor which may be a direct node or an indirect node.
7. Transformation type. Transformed nodes must be associated with a certain transformation type, which is usually the reason for the transformation. For example, 'whipped' is the transformation type of 'cream' and 'butter', which are transformed from dairy, whereas for 'cheese', which is transformed from the same parent node, the transformation type is fermentation. Therefore, transformed nodes that have the same transformation type have a slightly higher similarity, and vice versa.
8. Internal node. Similar to traditional data structures, a taxonomic tree also has internal nodes, which must have other nodes as offspring.
9. Leaf node. Similar to traditional data structures, a taxonomic tree also has leaf nodes, which do not have any offspring.
10. Category level (lv). The children on the branches directly under the root are category level 1, and so on.

To further clarify these concepts, all the nodes in a taxonomic tree can be generally classified into two types: general nodes and species nodes. The general nodes, including the root, categories and transformations, are usually utilities for distinguishing the species among their offspring, whereas the species nodes, including direct nodes, indirect nodes and transformed nodes, are usually the specific textual terms that appear in the corresponding textual attributes. All these concepts facilitate the definitions introduced below, in order to form an advanced and well-rounded similarity measurement methodology for domain-specific textual terms.

First, one of the most crucial relationships between nodes is the ancestor-offspring relationship, such as that between D121 and ID1211 in Fig. 2. In the context of recommender systems, such nodes should be considered identical in terms of similarity, because D121 contains ID1211. In other words, the children inherit all properties from their parents, as the meaning of the parent value is more general. For example, a job that requires "object-oriented programming skill" will always be satisfied by "Java programming skill". This leads to the following definition:

Definition 1. The similarity between a node and its ancestor is 1.

However, some nodes are transformed, derived or made from another node. A common example is soy and tofu, where the latter is made from the former. They have major differences in terms of shape, taste and cooking style, but they also share similar traits, especially in nutrition and chemical composition. Therefore, they are different from each other, but not completely.

Definition 2. When measuring the similarity between two nodes, if node A is transformed from node B or vice versa, a penalty is applied to the similarity.

In Definition 2, the penalty is a predefined constant value within the scope of the taxonomic tree. In different domains (taxonomic trees), the value varies depending on how strongly the transformation affects the similarity between the transformed nodes and their ancestors.

The range of any similarity should be between 0 and 1. Consequently, there should be some scenarios where the similarity between two textual terms is equal to or converges towards 0. In TTSM, items will in most cases have a similarity greater than 0, except for the case described in Definition 3.

Definition 3. If the nearest common ancestor of two nodes is the root, then the similarity between them is 0.

Specifically, the nearest common ancestor (NCA) of two nodes is the node that is an ancestor of both, and for which the path to these two nodes is the shortest among all their common ancestors. For example, because the NCA of nodes ID11111 and TID12211 is C1, the similarity between them could be relatively small but is still larger than 0, while the similarity between the nodes ID11111 and ID2121 is 0 according to Definition 3, since their NCA is the root node.

Finally, in the most general case, the two nodes are not on one branch (so the similarity is not 1) and their nearest common ancestor is not the root (so the similarity is not 0). In this case the similarity is calculated as follows:
$$fsim(t_i, t_j) = \frac{(N - 1)\mu}{(1 - \mu)\, lv_{ij} + N\mu - 1} \tag{4}$$
In this formula, $t_i$ and $t_j$ are the two terms whose similarity is being measured, $N$ denotes the total number of levels of the tree, $\mu \in (0, 1)$ is a pre-defined minimum similarity value, and $lv_{ij}$ is the level number of the nearest common ancestor of $t_i$ and $t_j$. The complete TTSM algorithm is presented in Algorithm 1.

Algorithm 1. Taxonomic tree-based similarity measurement (TTSM).

    t_i, t_j ← pair of terms whose similarity is being calculated
    p ← transformation penalty
    {commonAncestors} ← allParents(t_i) ∩ allParents(t_j)
    if t_i = t_j then
        sim_ij = 1
    else if size({commonAncestors}) > 0 then
        if isSpeciesNode(t_i) AND isSpeciesNode(t_j) then
            if isTransformedNode(t_i) AND isTransformedNode(t_j) then
                if t_i.transformationType = t_j.transformationType then
                    if t_i.isParentOf(t_j) OR t_j.isParentOf(t_i) then
                        sim_ij = 1
                    else
                        sim_ij = fsim(t_i, t_j)
                else
                    sim_ij = fsim(t_i, t_j) × p
            else if isTransformedNode(t_i) OR isTransformedNode(t_j) then
                sim_ij = fsim(t_i, t_j) × p
            else if t_i.isParentOf(t_j) OR t_j.isParentOf(t_i) then
                sim_ij = 1
            else
                sim_ij = fsim(t_i, t_j)
        else if t_i.isParentOf(t_j) OR t_j.isParentOf(t_i) then
            sim_ij = 1
        else
            sim_ij = fsim(t_i, t_j)
    else
        sim_ij = 0
    output: sim_ij
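The control flow of Algorithm 1 can be sketched in Python as below. Several parts are assumptions rather than details from the paper: the `Node` class and its field names are hypothetical, `allParents` is taken to exclude the root so that Definition 3 yields 0, `isParentOf` is read as "is an ancestor of", and `fsim` is passed in as a function instead of being fixed to Eq. (4). The small dairy tree, the penalty of 0.8 and the constant stand-in for `fsim` are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(eq=False)  # nodes are compared and hashed by identity
class Node:
    name: str
    parent: Optional["Node"] = None
    species: bool = True  # False for the root, categories and transformations
    transformation_type: Optional[str] = None  # set only on transformed nodes

    def ancestors(self) -> List["Node"]:
        """All ancestors except the root (assumed meaning of allParents)."""
        found, node = [], self.parent
        while node is not None and node.parent is not None:
            found.append(node)
            node = node.parent
        return found

    def is_ancestor_of(self, other: "Node") -> bool:
        return any(a is self for a in other.ancestors())


def ttsm(ti: Node, tj: Node,
         fsim: Callable[[Node, Node], float], penalty: float) -> float:
    """Follow the branches of Algorithm 1 literally."""
    if ti is tj:
        return 1.0
    if not set(ti.ancestors()) & set(tj.ancestors()):
        return 0.0  # only the root is shared (Definition 3)
    related = ti.is_ancestor_of(tj) or tj.is_ancestor_of(ti)
    if ti.species and tj.species:
        ti_tr = ti.transformation_type is not None
        tj_tr = tj.transformation_type is not None
        if ti_tr and tj_tr:
            if ti.transformation_type == tj.transformation_type:
                return 1.0 if related else fsim(ti, tj)
            return fsim(ti, tj) * penalty
        if ti_tr or tj_tr:
            return fsim(ti, tj) * penalty  # Definition 2
    return 1.0 if related else fsim(ti, tj)  # Definition 1 or general case


# Hypothetical tree built from the dairy example in the text.
root = Node("ROOT", species=False)
dairy = Node("dairy", root, species=False)
vegetables = Node("vegetables", root, species=False)
milk = Node("milk", dairy)
skim_milk = Node("skim milk", milk)
cream = Node("cream", milk, transformation_type="whipped")
butter = Node("butter", milk, transformation_type="whipped")
cheese = Node("cheese", milk, transformation_type="fermentation")
carrot = Node("carrot", vegetables)


def stand_in_fsim(a: Node, b: Node) -> float:
    return 0.5  # placeholder value, not Eq. (4)


print(ttsm(milk, skim_milk, stand_in_fsim, 0.8))  # 1.0, ancestor and offspring
print(ttsm(cream, butter, stand_in_fsim, 0.8))    # 0.5, same transformation type
print(ttsm(milk, carrot, stand_in_fsim, 0.8))     # 0.0, only the root in common
```

Keeping `fsim` as a parameter lets the branch logic be checked independently of the exact form chosen for the level-based similarity.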
3.3. Structuralized decomposition-based similarity measurement (SDSM)

In the previous section, taxonomy-based textual dimensions were investigated, and a corresponding taxonomy-based similarity measurement was proposed. However, in the real world, not all textual attributes can be placed in a taxonomy tree, because some do not belong to any taxonomic system. In this case, they are either atomic, so that they cannot be described with sub-concepts, or they can be described by a number of fields with certain values. For example, food can be categorised as sweet, sour, salty, bitter or umami according to one's sense of taste. However, some food comprises a combination of flavours, such as soy sauce, which can be identified as being salty, umami and sweet. These flavours can be quantified and used as the criteria by which to determine the similarity between different flavours.

To generalize, all textual dimensions that can be denoted by data structures can be decomposed, and this methodology can be applied to them. Because all these dimensions are composed of finite and intrinsic fields in different proportions, such as different flavours, they can be described by a certain data structure. Compound textual dimensions can then be defined as those dimensions with textual terms as their values, where each textual value is represented by elements in differing proportions in the data structure. Therefore, when it is too difficult to decide directly how similar two terms are, the terms can be decomposed into structuralized fields, which enables a comparison to be made between them. More specifically, every possible term that may appear in a certain domain (the values of textual attributes) may be illustrated by a histogram with exactly the same number of fields, as shown in Fig. 3.
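The flavour example can be made concrete with a short Python sketch. The flavour proportions below are invented for illustration and are not measured values, and the names `FLAVOURS`, `PROFILES` and `cosine` are hypothetical. The comparison uses the cosine measure that this section adopts for decomposed terms.

```python
import math
from typing import Sequence

# Subfields shared by every term in the domain (the taste categories above).
FLAVOURS = ("sweet", "sour", "salty", "bitter", "umami")

# Hypothetical decompositions; each value is a proportion in [0, 1].
PROFILES = {
    "soy sauce":   (0.2, 0.0, 0.9, 0.0, 0.7),
    "fish sauce":  (0.1, 0.0, 0.9, 0.0, 0.8),
    "lemon juice": (0.1, 0.9, 0.0, 0.1, 0.0),
}


def cosine(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine similarity between two decomposed terms of equal length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a term with no measurable subfield matches nothing
    return dot / (norm1 * norm2)


soy = PROFILES["soy sauce"]
print(cosine(soy, PROFILES["fish sauce"]))   # close to 1: similar profiles
print(cosine(soy, PROFILES["lemon juice"]))  # close to 0: different profiles
```

Because every term is described over the same subfields, any two terms in the domain become comparable, even when their names share no characters.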
Fig. 3. Examples of textual terms structural decomposition.

On the x-axis, f1 to f6 are the decomposed subfields shared by all the terms, and the y-axis is the interval [0, 1], indicating the scale of each subfield. Since the terms are domain-specific and can be decomposed into fixed subfields, they become comparable under the SDSM algorithm. The charts in Fig. 3 show four possible terms for a compound textual attribute, each decomposed into six subfields and represented with this data structure. The original unsolvable problem is thereby mapped to another domain, which is the major contribution of the algorithm: quantifying the textual attributes. Based on the bar charts in the figure, it is obvious that "term1" and "term4" are quite similar, while "term2" and "term3" are similar but share fewer similarities than the former pair. More precisely, these terms are converted into vectors, so cosine distance can be used to measure the distances between them. The similarity between values that belong to compound textual attributes can then be calculated as follows:
$$\mathrm{sim}(v_1, v_2) = \frac{\sum_{i=1}^{n} v_{1d_i}\, v_{2d_i}}{\sqrt{\sum_{i=1}^{n} v_{1d_i}^{2}}\;\sqrt{\sum_{i=1}^{n} v_{2d_i}^{2}}} \qquad (5)$$
where $v_1$ and $v_2$ are the terms of the compound textual attribute, and $v_{xd_i}$ denotes the value of the ith subfield of the xth term. This formula applies to a compound textual attribute that forms a single dimension.
3.4. Group-based textual dimensions similarity measurement
As long as the textual dimensions are considered as domain-specific problems, the above two algorithms cover them thoroughly. Either TTSM or SDSM resolves the similarity measurement for any single domain-specific textual dimension. However, group-based dimensions are more complicated, since each attribute consists of multiple terms. In addition, the number of terms contained in one attribute may vary from item to item. In this section, a maximized term matching (MTM) technique is introduced to resolve this problem, and it can be used with both the TTSM and SDSM algorithms.
Fig. 4. Example of item similarity between group-based textual dimensions with terms.
Request 1. Obtain TSN for “soybean” (response fields: “soybean”, “English”, TSN 26716).
Let $T_A$ be an item with n different types of attributes, $T_A = (a_{A1}, a_{A2}, \ldots, a_{An})$, and let $T_B$ be another item with the same number of attributes, $T_B = (a_{B1}, a_{B2}, \ldots, a_{Bn})$, where the attributes of $T_A$ and $T_B$ correspond. Suppose that $a_{An}$ and $a_{Bn}$ are the compound textual properties of $T_A$ and $T_B$ respectively, and that their term sets are $S_{An} = \{t_{A1}, t_{A2}, \ldots, t_{Ax}\}$ and $S_{Bn} = \{t_{B1}, t_{B2}, \ldots, t_{By}\}$. Note that the numbers of elements in $S_{An}$ and $S_{Bn}$ may differ, that is, $x \neq y$ is possible. Then TTSM (if the terms are taxonomic tree-based) or SDSM (if the terms are structuralized) is used to calculate the similarity of every possible pair of elements from $S_{An}$ and $S_{Bn}$. Then the MTM algorithm is applied to decide the final similarity between the attributes $a_{An}$ and $a_{Bn}$. As shown in Fig. 4, suppose that Sim(A3, B2) has
the highest value among all the similarities. Consequently, Sim(A3, B1), Sim(A1, B2) and Sim(A2, B2) are eliminated because each of these pairs contains either A3 or B2. After this, suppose Sim(A2, B1) is the highest in the remaining set, so that Sim(A1, B1) is eliminated. Finally, the similarity between the attributes $a_{An}$ and $a_{Bn}$ is calculated as (Sim(A3, B2) + Sim(A2, B1))/2.
In this research, the similarity measurement of both single and group-based textual attributes achieved promising results. The similarities between corresponding textual attributes of items can be quantified and calculated. After this, according to the weights assigned to each attribute of those items, the similarities between items can be calculated using any typical similarity measurement technique, such as cosine similarity or squared Euclidean distance, or even clustering techniques.
4. Evaluation
In order to validate the aforementioned algorithms, a web-based recipe recommender system is implemented. There are several reasons why this particular recommender system is chosen. Firstly, recipe recommender systems are popular with the general public, but there are usually too many recipes on most websites, making it difficult for users to choose the ones they prefer. Secondly, the most significant dimension of a recipe is its ingredients, which are classified as grouped textual terms, so they are suitable for evaluating the effectiveness of the algorithms. More importantly, the ingredients of recipes may contain both types of text terms: ones that form a taxonomic tree, such as animal and plant types, and ones that can be decomposed into data structures, such as the various flavours which comprise different
food. The following subsections describe how the experiments are conducted and how the evaluation results support the effectiveness of the methods used in this paper.
Fig. 5. Example of structuralized decomposition of flavours used in recipes.
Request 2. Obtain taxonomic tree for “soybean” (XML response for TSN 26716, listing ranks from Kingdom Plantae (TSN 202422) to Genus Glycine (TSN 26715, author Willd., parent Fabaceae, parent TSN 500059)).
First, two websites hosting large-scale recipe databases, BigOven (over 350,000 recipes) and Yummly (which has a systematic structure for the data on ingredients), are chosen as the data sources. The recipes used in this experiment are selected from these two recipe databases through the APIs they provide. To ensure that recipe types are well-distributed, the API search facilities are used by manually feeding different keywords into them (Table 1). After duplicates are removed, a total of 206 recipes are stored in the local database. Then, the 7-type-rule is implemented in a JSON file together with the importance levels of the corresponding ingredients. The 7-type-rule is an approach that transforms and cleanses textual terms by matching and combining them based on their meanings. There are seven types of rules for this disambiguation process, which are comprehensively discussed in [43].
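The ingredients handled here are group-based textual attributes, so the maximized term matching (MTM) step of Section 3.4 applies to them. As a minimal sketch, assuming the pairwise similarities have already been computed with TTSM or SDSM and are supplied as a matrix, the greedy elimination described there could be written as follows. The function name, the matrix layout and the similarity values are illustrative assumptions, and averaging over the matched pairs follows the worked example of Fig. 4 rather than a general definition stated in the text.

```python
from typing import List, Sequence


def maximized_term_matching(sim: Sequence[Sequence[float]]) -> float:
    """Greedy MTM sketch: sim[i][j] is the similarity of term i of A and term j of B."""
    # List every candidate pair together with its similarity.
    candidates = [
        (value, i, j)
        for i, row in enumerate(sim)
        for j, value in enumerate(row)
    ]
    # Highest similarity first; equal values keep their original order (an assumption).
    candidates.sort(key=lambda c: c[0], reverse=True)

    used_a, used_b = set(), set()
    matched: List[float] = []
    for value, i, j in candidates:
        # A pair is eliminated once either of its terms has already been matched.
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matched.append(value)

    # Average over the matched pairs, as in (Sim(A3, B2) + Sim(A2, B1)) / 2.
    return sum(matched) / len(matched) if matched else 0.0


# Hypothetical similarities for three terms of A (rows) and two terms of B (columns).
example = [
    [0.2, 0.3],  # A1
    [0.6, 0.5],  # A2
    [0.4, 0.9],  # A3
]
score = maximized_term_matching(example)
```

With these hypothetical values the pairs (A3, B2) and (A2, B1) survive, which mirrors the elimination order of the example in Section 3.4.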
Table 1 Keywords fed into recipe databases API and number of recipes retrieved.
Keyword            Qty   Keyword              Qty   Keyword             Qty
Chinese, Beef      15    Chinese, Chicken     13    Chinese, Potato     11
American + Beef    19    American + Chicken   14    American + Potato   17
Indian + Beef      17    Indian + Chicken     21    Indian + Potato     6
Japanese + Beef    17    Japanese + Chicken   12    Japanese + Potato   14
Beef               5     Chicken              5     Potato              5
Chinese            18    American             2     Indian              12
Japanese           12                               Total               235
Table 2 Similarity coefficients of each taxonomy level.
$lv_k$   Level       $f_{sim}(t_i, t_j) = \frac{\mu (N-1)}{(1-\mu)\, lv_{ij} + \mu N - 1}$
1        “Species”   1
2        “Genus”     0.4000
3        “Family”    0.2500
4        “Order”     0.1818
5        “Class”     0.1429
6        “Phylum”    0.1176
7        “Kingdom”   0.1000
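The coefficients in Table 2 can be reproduced with a few lines of code. The sketch below assumes the closed form μ(N − 1) / ((1 − μ)·lv + μN − 1), which yields the seven tabulated values for μ = 0.1 and N = 7 levels; the function and variable names are illustrative and not part of the original system.

```python
def level_coefficient(level: int, mu: float = 0.1, n_levels: int = 7) -> float:
    """Similarity coefficient for a taxonomy level (1 = "Species", n_levels = "Kingdom").

    Assumed closed form: mu * (N - 1) / ((1 - mu) * level + mu * N - 1).
    """
    if not 1 <= level <= n_levels:
        raise ValueError("level must lie between 1 and n_levels")
    return mu * (n_levels - 1) / ((1 - mu) * level + mu * n_levels - 1)


LEVELS = ["Species", "Genus", "Family", "Order", "Class", "Phylum", "Kingdom"]

# Coefficient per level name, rounded to four decimals as in Table 2.
coefficients = {
    name: round(level_coefficient(k), 4)
    for k, name in enumerate(LEVELS, start=1)
}
```

Under this form the coefficient is 1 at the “Species” level and falls to μ at the “Kingdom” level, so μ acts as the floor similarity for two terms that only share a kingdom.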
After the cleansing using the 7-type-rule [43], the number of ingredients is reduced from 549 to 159, as ingredients with identical meanings, such as “beef mince” and “ground beef”, are amalgamated. The recipe database is then ready for the next stage.
Then, for those ingredients which are purely animal or plant types, a taxonomic knowledge tree of the ingredients is built from their natural classifications, that is, their biological taxonomies. The taxonomies are retrieved from the Integrated Taxonomic Information System (ITIS) [33], which is an authoritative source of taxonomic information on plants, animals, fungi and microbes. The retrieval is performed via the API provided by ITIS. Specifically, ITIS provides web services which accept URL requests via the GET method, and the response is returned in XML format. The retrieval is done in two steps. First, the species' name is sent to an API URL to get the exact ID of the species (the ID is called “TSN” in the ITIS database). Then, the ID is used to retrieve the hierarchy tree of all the species' parents (i.e. from the scientific name of the species up to the kingdom it belongs to). The whole process can be automated by looping over the term list. An example of retrieving the ingredient “soybean” is shown in Requests 1 and 2.
Since biological taxonomies classify all species into seven levels, μ in this case is pre-defined as 0.1, which means the similarity within each “Kingdom” is 0.1, such as “Plantae”, which covers all vegetables. Table 2 lists the calculated similarity coefficients of each level.
Yummly.com classifies the tastes of recipes into 5 types [47], which are “salty”, “sour”, “bitter”, “meaty” and “piquant”. This may be adequate for measuring the flavours of a whole recipe, but not those of specific ingredients. According to the definition of “taste” in Wikipedia [45], tastes are classified more specifically as follows:
• Sweetness (basic taste)
• Sourness (basic taste)
• Saltiness (basic taste)
• Bitterness (basic taste)
• Umami (basic taste)
• Pungency (also called hotness or spiciness, typically the taste of alcohol and chili)
• Coolness (taste of peppermint)
• Numbness (taste of Sichuan pepper)
• Heartiness (also called kokumi, taste of cheese)
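The nine tastes above can serve as the fixed subfields of the SDSM data structure. As a minimal sketch, assuming each ingredient is given a value in [0, 1] for every taste, two ingredients can be compared with the cosine similarity of Eq. (5). The field names, helper functions and flavour values below are illustrative assumptions and are not taken from the system described here.

```python
import math
from typing import Dict, List

# The nine taste subfields, in a fixed order shared by every ingredient.
TASTES = (
    "sweetness", "sourness", "saltiness", "bitterness", "umami",
    "pungency", "coolness", "numbness", "heartiness",
)


def to_vector(profile: Dict[str, float]) -> List[float]:
    """Expand a sparse flavour profile into a 9-field vector; missing tastes are 0."""
    unknown = set(profile) - set(TASTES)
    if unknown:
        raise ValueError(f"unknown taste fields: {sorted(unknown)}")
    if any(not 0.0 <= value <= 1.0 for value in profile.values()):
        raise ValueError("taste values must lie in [0, 1]")
    return [profile.get(taste, 0.0) for taste in TASTES]


def cosine_similarity(v1: List[float], v2: List[float]) -> float:
    """Cosine similarity of two subfield vectors, as in Eq. (5)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for an all-zero profile
    return dot / (norm1 * norm2)


# Hypothetical flavour profiles, for illustration only.
soy_sauce = to_vector({"saltiness": 0.9, "umami": 0.6, "sweetness": 0.2})
fish_sauce = to_vector({"saltiness": 0.9, "umami": 0.7})
honey = to_vector({"sweetness": 1.0})

sim_soy_fish = cosine_similarity(soy_sauce, fish_sauce)
sim_soy_honey = cosine_similarity(soy_sauce, honey)
```

With these made-up profiles, soy sauce comes out closer to fish sauce than to honey, which is the kind of ordering the decomposition is meant to capture.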
Based on this nine-taste classification, the flavour of every ingredient is decomposed into a data structure holding the values of these 9 fields. After this, a recipe recommender system is implemented with PHP on the backend server and presented through a web UI (Fig. 6). All fields, such as preparation time and cuisine type, were considered as dimensions of the recipe vectors. However, the flavour and ingredients were assigned greater weights since they have the most impact on the recipes. The website enables users to easily access the system via web browsers.
Fig. 6. Recipe recommender system implemented with TTSM and SDSM.
Fig. 7. Statistics of user satisfaction rates.
The evaluation activities are conducted with 30 volunteers from different backgrounds. Since content-based similarity filtering algorithms are effective in addressing the cold-start problem, the volunteers are not required to register or to provide a profile. The volunteers are first required to rate 15 recipes that are selected from the recipe database so that they are well-distributed, in other words, as far apart from one another as possible. Then, these 15 recipes are removed from the database, and the scores of the
remaining ones are calculated using the cumulative ranked recommendation algorithm. Then, five recommended recipes are displayed which are among the user's most preferred, but which also keep a pre-defined distance of 0.8. Finally, the volunteers are asked to rate these five recipes as feedback. The results show that the recipe recommender system achieved a 74.4% satisfaction rate based on the statistics in Fig. 7, calculated as follows:
$$S = \frac{\sum_{u \in U} \sum_{r_i \in R_u} r_i}{\sum_{u \in U} \sum_{n=1}^{|R_u|} r_{max}} \qquad (6)$$
where $U$ is the set of all users, $u$ is an individual user, $R_u$ is the set of feedback ratings given by user $u$, $r_i$ is the ith rating in that set, and $r_{max}$ is the maximum possible rating, which equals 5 in this case.
5. Conclusion
Recommender systems have become necessary tools for a large number of online services and provide useful information to users on items of interest. This paper addressed the current weaknesses in this research area and proposed solutions. Since the core activity of recommender systems is to measure the similarity between items, the pivotal step in improving the current performance level is to enhance the similarity measurements. Therefore, this paper suggested that items have different types of dimensions, so different algorithms must be used. The most significant contribution of this paper is the taxonomy tree-based similarity measurement (TTSM) and the structuralized decomposition-based similarity measurement (SDSM), which resolve the similarity measurement of the textual attributes of items by measuring natural semantic distance rather than simply counting co-occurrences. Furthermore, a maximized term matching (MTM) mechanism is proposed to apply TTSM and SDSM to group-based textual dimensions. Hence, the textual attributes of items in recommender systems can be quantified and their similarity measured.
References
[1] N. Aggarwal, K. Asooja, P. Buitelaar, DERI&UPM: pushing corpus based relatedness to similarity: shared task system description, in: Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1; Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, 2012, pp. 643–647. http://dl.acm.org/citation.cfm?id=2387745.
[2] M.Y.H. Al-Shamri, Power coefficient as a similarity measure for memory-based collaborative recommender systems, Expert Syst. Appl. 41 (13) (2014) 5680–5688.
[3] H. Amiri, P. Resnik, J. Boyd-Graber, H. Daumé III, Learning text pair similarity with context-sensitive autoencoders, in: Association for Computational Linguistics (1), 2016.
[4] D. Bär, C. Biemann, I. Gurevych, T. Zesch, UKP: computing semantic textual similarity by combining multiple content similarity measures, in: Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1; Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, 2012, pp. 435–440. http://dl.acm.org/citation.cfm?id=2387707.
[5] A. Barrón-Cedeno, P. Rosso, E. Agirre, G. Labaka, Plagiarism detection across distant language pairs, in: Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 37–45.
[6] J. Beel, S. Langer, M. Genzmehr, A. Nürnberger, Introducing Docear’s research paper recommender system, in: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, 2013, pp. 459–460.
[7] J. Bobadilla, F. Ortega, A. Hernando, J. Bernal, A collaborative filtering approach to mitigate the new user cold start problem, Knowl. Based Syst. 26 (2012a) 225–238.
[8] J. Bobadilla, F. Ortega, A. Hernando, A collaborative filtering similarity measure based on singularities, Inf. Process Manag. 48 (2) (2012b) 204–217.
[9] J. Bobadilla, F. Ortega, A. Hernando, G. Glez-de Rivera, A similarity metric designed to speed up, using hardware, the recommender systems k-nearest neighbors algorithm, Knowl. Based Syst. 51 (2013) 27–34.
[10] S. Bostandjiev, J. O’Donovan, T. Höllerer, Tasteweights: a visual interactive hybrid recommender system, in: Proceedings of the Sixth ACM Conference on Recommender Systems, ACM, 2012, pp. 35–42.
[11] D. Buscaldi, R. Tournier, N. Aussenac-Gilles, J. Mothe, IRIT: textual similarity combining conceptual similarity with an N-gram comparison method, in: Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1; Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, 2012, pp. 552–556. http://dl.acm.org/citation.cfm?id=2387729.
[12] I. Cantador, A. Bellogín, P. Castells, A multilayer ontology-based hybrid recommendation model, AI Commun. 21 (2–3) (2008a) 203–210.
[13] I. Cantador, A. Bellogín, P. Castells, Ontology-based personalised and context-aware recommendations of news items, in: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology-Volume 01, IEEE Computer Society, 2008b, pp. 562–565.
[14] J.N. Cappella, S. Yang, S. Lee, Constructing recommendation systems for effective health messages using content, collaborative, and hybrid algorithms, Ann. Am. Acad. Pol. Soc. Sci. 659 (1) (2015) 290–306.
[15] W. Carrer-Neto, M.L. Hernández-Alcaraz, R. Valencia-García, F. García-Sánchez, Social knowledge-based recommender system. application to the movies domain, Expert Syst. Appl. 39 (12) (2012) 10990–11000.
[16] S. Chapman, SimMetrics: a Java & C# .NET library of similarity metrics, 2006. http://sourceforge.net/projects/simmetrics/. Accessed 27 June 2017.
[17] L.C. Chen, P.J. Kuo, I.E. Liao, Ontology-based library recommender system using mapreduce, Cluster Comput. 18 (1) (2015) 113–121.
[18] K. Choi, Y. Suh, A new similarity function for selecting neighbors for each target item in collaborative filtering, Knowl. Based Syst. 37 (2013) 146–153.
[19] R.L. Cilibrasi, P.M. Vitanyi, The Google similarity distance, IEEE Trans. Knowl. Data Eng. 19 (3) (2007) 370–383.
[20] S. Debnath, N. Ganguly, P. Mitra, Feature weighting in content based recommendation system using social network analysis, in: Proceedings of the Seventeenth International Conference on World Wide Web, ACM, 2008, pp. 1041–1042.
[21] T. Di Noia, R. Mirizzi, V.C. Ostuni, D. Romito, M. Zanker, Linked open data to support content-based recommender systems, in: Proceedings of the Eighth International Conference on Semantic Systems, ACM, 2012, pp. 1–8.
[22] F. Fouss, A. Pirotte, J.M. Renders, M. Saerens, Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation, IEEE Trans. Knowl. Data Eng. 19 (3) (2007) 355–369.
[23] D. Gavalas, C. Konstantopoulos, K. Mastakas, G. Pantziou, Mobile recommender systems in tourism, J. Netw. Comput. Appl. 39 (2014) 319–333.
[24] W.H. Gomaa, A.A. Fahmy, A survey of text similarity approaches, Int. J. Comput. Appl. 68 (13) (2013).
[25] M. Gómez, R. Rouvoy, M. Monperrus, L. Seinturier, A recommender system of buggy app checkers for app store moderators, in: Proceedings of the Second ACM International Conference Mobile Software Engineering and Systems (MOBILESoft), IEEE, 2015, pp. 1–11.
[26] N. Guan, D. Tao, Z. Luo, B. Yuan, NeNMF: an optimal gradient method for nonnegative matrix factorization, IEEE Trans. Signal Process. 60 (6) (2012) 2882–2898.
[27] G. Guo, J. Zhang, N. Yorke-Smith, A novel Bayesian similarity measure for recommender systems, IJCAI (2013).
[28] W. Hill, L. Stead, M. Rosenstein, G. Furnas, Recommending and evaluating choices in a virtual community of use, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM Press/Addison-Wesley Publishing Co., 1995, pp. 194–201.
[29] H. Hu, L. Zhang, J. Wu, Hamming distance based approximate similarity text search algorithm, in: Proceedings of the Seventh International Conference on Advanced Computational Intelligence (ICACI), IEEE, 2015, pp. 1–6.
[30] W. IJntema, F. Goossen, F. Frasincar, F. Hogenboom, Ontology-based news recommendation, in: Proceedings of the 2010 EDBT/ICDT Workshops, ACM, 2010, p. 16.
[31] N. Koenigstein, Y. Koren, Towards scalable and accurate item-oriented recommendations, in: Proceedings of the Seventh ACM Conference on Recommender Systems, ACM, 2013, pp. 419–422.
[32] W.T. Lin, M.H. Wang, C.S. Lee, K. Kurozumi, Y. Majima, FML-Based Recommender System for Restaurants, in: Proceedings of the IEEE Conference on Technologies and Applications of Artificial Intelligence, 2013, pp. 234–239.
[33] Integrated taxonomic information system, 2015, Retrieved 9 November. https://www.itis.gov/.
[34] S.E. Middleton, D. De Roure, N.R. Shadbolt, Ontology-based recommender systems, in: Handbook on Ontologies, Springer, Berlin Heidelberg, 2009, pp. 779–796.
[35] T.T. Nguyen, P.M. Hui, F.M. Harper, L. Terveen, J.A. Konstan, Exploring the filter bubble: the effect of using recommender systems on content diversity, in: Proceedings of the Twenty-Third International Conference on World Wide Web, ACM, 2014, pp. 677–686.
[36] C. Porcel, A. Tejeda-Lorente, M.A. Martìnez, E. Herrera-Viedma, A hybrid recommender system for the selective dissemination of research resources in a technology transfer office, Inf. Sci. (Ny) 184 (1) (2012) 1–19.
[37] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, Grouplens: an open architecture for collaborative filtering of netnews, in: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, ACM, 1994, pp. 175–186.
[38] F. Ricci, L. Rokach, B. Shapira, M. Goossens, F. Mittelbach, A. Samarin, in: Introduction to Recommender Systems Handbook, Springer US, 2011, pp. 1–35.
[39] C. Royer, Term representation with generalized latent semantic analysis, Proceedings of the Recent Advances Natural Language Processing IV: Selected Papers (RANLP 2005) 292 (2007) 45.
[40] L. Safoury, A. Salah, Exploiting user demographic attributes for solving cold-start problem in recommender system, Lect. Notes Softw. Eng. 1 (3) (2013) 303.
[41] S. Shishehchi, S.Y. Banihashem, N.A.M. Zin, S.A.M. Noah, K. Malaysia, Ontological approach in knowledge based recommender system to develop the quality of e-learning system, Aust. J. Basic Appl. Sci. 6 (2) (2012) 115–123.
[42] A. Takhom, M. Ikeda, B. Suntisrivaraporn, T. Supnithi, R. Hintemann, K. Fichter, G. Stevens, Toward collaborative LCA ontology development: a scenario-based recommender system for environmental data qualification, in: Proceedings of the 29th International Conference on Informatics for Environmental Protection (EnviroInfo 2015), 2015, pp. 157–164.
[43] L. Tao, F. Liu, J. Cao, Taxonomy tree based similarity measurement of textual attributes of items for recommender systems, in: Proceedings of the International Conference on Web Information Systems Engineering, Springer International Publishing, 2016.
[44] Y. Wang, S.C.F. Chan, G. Ngai, Applicability of demographic recommender system to tourist attractions: a case study on trip advisor, in: Proceedings of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology-Volume 03, IEEE Computer Society, 2012, pp. 97–101.
[45] Wikipedia.com, Taste, 2016. Retrieved 8 May. https://en.wikipedia.org/wiki/Taste
[46] H.F. Yu, C.J. Hsieh, S. Si, I. Dhillon, Scalable coordinate descent approaches to parallel matrix factorization for recommender systems, in: Proceedings of the IEEE 12th International Conference on Data Mining, IEEE, 2012, pp. 765–774.
[47] Recipes recommendation website, 2016, Retrieved 8 May. https://www.yummly.com/
[48] Z. Zhang, H. Lin, K. Liu, D. Wu, G. Zhang, J. Lu, A hybrid fuzzy-based personalized recommender system for telecom products/services, Inf. Sci. (Ny) 235 (2013) 117–129.
[49] Y. Zheng, X. Xie, Learning travel recommendations from user-generated GPS traces, ACM Trans. Intell. Syst. Technol. 2 (1) (2011) 2.