Developing an approach for lifestyle identification based on explicit and implicit features from social media




Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 136 (2018) 236–245

www.elsevier.com/locate/procedia

7th International Young Scientist Conference on Computational Science

Maria Khodorchenko∗, Nikolay Butakov

ITMO University, 49 Kronverksky pr., St Petersburg, 197101, Russia

Abstract

Due to the rapid development of social networks and their increasing availability from almost any device and platform, they have become an essential part of daily life for most users all over the world. Hence the natural assumption that the way a user prefers to live offline affects the content of his/her internet profile. The possibility to evaluate and interpret the available set of features in order to get a notion of the user's preferences in daily actions would be beneficial for many stakeholders.

When trying to search for the features that define lifestyle, it is vital to formalize this phenomenon in terms of social media. Here, we assess lifestyle as the set of specific user interests, descriptions of direct actions (activities) and their degree of manifestation for a particular user. Such a lifestyle definition, combined with a division into 5 activity domains (sport and health, culture and religion, work and family, entertainment and leisure, general activities), makes it possible to measure lifestyle as the set of performed activities that can be uncovered using topic modeling.

In order to measure lifestyle, the paper proposes a two-level topic modeling approach to extraction, where the crucial part is to ensure the quality of the resulting topics and to choose their proper number. The latter requires a comparative study of topic quality assessment methods for further usage, which has been done in the paper.

© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/3.0/). Peer-review under responsibility of the scientific committee of the 7th International Young Scientist Conference on Computational Science.
Keywords: Social Media; Topic Modelling; Lifestyle Assessment; Activity Extraction

1. Introduction

Due to the rapid development of social networks and their increasing availability from almost any device and platform, they have become an essential part of daily life for most users all over the world. The ease of sharing current interests, thoughts and activities instantly with others explains the vast amount of time spent online. Hence the natural assumption that the way a user prefers to live offline affects

∗ Corresponding author. Tel.: +8-911-214-2575. E-mail address: [email protected]

1877-0509 © 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/3.0/). Peer-review under responsibility of the scientific committee of the 7th International Young Scientist Conference on Computational Science.
10.1016/j.procs.2018.08.262


the content of his/her internet profile. In this paper we address the social network "Instagram", popular in many countries, including Russia. The possibility to evaluate and interpret the available set of features in order to get a notion of the user's preferences in daily actions will be beneficial at least for HR and travel agencies. Firstly, for human resources (HR) search objectives, the inclination of a particular person towards, for instance, a sedentary or an active lifestyle may turn out to be the main criterion when choosing a candidate, and a fast way of determining such tendencies will increase the speed of final decision making. Secondly, in the field of advertisement connected to travelling, more elaborate recommendations can be formed based on preliminary knowledge of the user's lifestyle and the way he or she prefers to spend vacations or free time after work (visit a bar or go to the gym).

When trying to search for the features that define lifestyle, it is vital to formalize this phenomenon in terms of social media. Here, we assess lifestyle as the set of specific user interests, descriptions of direct actions (activities) and their degree of manifestation for a particular user. In our paper we propose 5 activity domains:

• Sport and Health
• Entertainment and Leisure
• Family and Work
• Culture and Religion
• General

Such a definition makes it possible to measure lifestyle as the set of performed activities and the amount of data closely connected to each of these aspects of life. An activity itself is understood as an answer to the question "What is this person doing now?" (at the moment of publishing a post) and can be accessed via the content of the message.

The research question is to determine the structure of activity types available in the network from user-generated content (posts). The issue at hand mainly requires evaluating the applicability of automatic topic quality measures for defining the unknown underlying subtypes of the predefined activity domains listed above. The obtained information will help to get numeric vectors of subtopic probabilities for a user, where each subtopic belongs to one of the 5 main activity domains. Having such a representation of users is essential for the future development of a recommendation system.

All features from social media fall into one of 2 groups: explicit or implicit. The former, such as text and images, are created by the user's will and may contain information about the current activity. However, they have significant disadvantages by nature: irregularity and noisiness. The latter includes metainformation (like geolocation) and online patterns, which have regularity but contain ambiguity. Although any single data modality has its advantages and disadvantages, which should be counterbalanced by combining modalities, in this particular work we focus on the first group, which can be directly accessed from the user-generated content. That includes visual and, mainly, textual information processing: defining the general topic of the message and extracting more fine-grained activities.
To measure lifestyle we have implemented a two-level topic modeling approach for extracting specific traits (related to different domains), which can later be used to profile individual users (from the stakeholders' point of view). The crucial part of our approach is to ensure the quality of the resulting topics and to choose their proper number. The latter requires a comparative study of topic quality assessment methods to be used further. The main contribution of this work is a comparative study of topic quality estimation methods based on experiments on public data of 130 000 users collected from "Instagram".

The rest of the paper is organized as follows. Section 2 contains an overview of background and related work. Section 3 describes the approach used for two-level topic modeling. Section 4 describes the experiments and the results of the comparative study, and Section 5 contains the conclusion and a description of future work.


2. Background and related work

The main source of data for user profiling is text, and thus it is crucial to choose a proper approach for such analysis. The most widely used approach at the current moment is topic modeling, which aims to find distributions of topics in documents and of words within the topics themselves. There are multiple methods based on this approach with some modifications, including the most popular: latent semantic analysis (LSA) [7], latent Dirichlet allocation (LDA) [2], guided LDA [12], and additive regularization of topic models (ARTM) [13].

Taking into account the characteristics of the task we are trying to solve, it is not enough just to filter topics related to the five listed domains. The need for a more fine-grained division of activities within one domain arises because customers, for instance the abovementioned HR or travel agencies, require access to specific user traits and interests. For example, fishing and bicycle riding both belong to the sport domain, but their consumers are different. This leads to the necessity of dividing top-level topics into smaller and more specific ones and implies building a topic hierarchy.

A number of approaches exist to compose a hierarchy, such as top-down (in ARTM) or bottom-up, level-by-level or node-by-node, and others, but there is no commonly adopted best solution [5]. In the case of ARTM the number of topics on each layer has to be defined by the user [6], as in many other hierarchical approaches [14], which is inconvenient when the connections are unknown. A step towards automatic structure detection was made in [16], where the hierarchy is built according to the nested Chinese Restaurant Process and k-sets of words that are likely to belong to the same topic at a certain level. The user still has to define the number of topics on the upper level, and no multiple inheritance is assumed. Instead of creating a hierarchy, a large number of works concentrate on the search for specific topics of interest.
In [17] the Word Embedding Informed Focused Topic Model was introduced, where a sparsity-enforcing prior defines a topic to be described by a subset of words. For social studies, ethnic-specific content was mined using lists of keywords (interval semi-supervised LDA). The authors highlight that the number of topics should be quite large (100 and more for a corpus of 11,000 entries in their case), so that "trash" topics can be wiped out later. This concerns the issues with texts, e.g. in social media, where posts are short and aggregation by some meta-parameter, like author, leads to a loss of flexibility. Still, one of the successful examples in this area is tweet aggregation by hashtags, which is used to summarize the content of posts.

All of the aforementioned leads us to a hybrid approach that uses metainformation (hashtags) for topic modeling on the first level, to select all texts related to one domain, and then performs flat topic modeling on these texts to form the second level. Our approach thus allows performing two-level topic modeling: finding texts relevant to our goals and extracting specific topics by which users can be precisely distinguished, utilizing the cleaner and more exact, though sparse, metainformation available from users' hashtags.

To use this approach, however, one still needs to identify the precise number of topics and ensure their quality, which may be estimated from two points of view: the first is the quality of the division into topics, and the second is the quality of the obtained topics themselves. As for the former, a metric such as perplexity (log-likelihood of a held-out test dataset) was used in [1]. The classical coherence measure, which considers the co-occurrence of words, tends to rate highly the topics with common words.
A modification of this metric, called tf-idf (tf: term frequency, idf: inverse document frequency) coherence [11], is based on the idea that the words which define a topic should appear together in the relevant topic and rarely be met in others. In [8] the quality is assessed by calculating the pointwise mutual information (PMI) between the most relevant topic words; the normalized variant of PMI is known to improve on the standard PMI. The authors of the tf-idf coherence metric proposed one more evaluation approach based on word vectors [10], where topic quality is defined as the average distance between the top words in the topic. Compared against human judgment, this metric performs slightly better than other measures, like coherence or mutual information. In [15] ways to stabilize the obtained solutions were proposed; nevertheless, as it was found out, consistency is not ultimately related to topic quality.

Automatic quality evaluation of hierarchical topic models is rather poorly developed, and researchers try to use the same, but modified, measurements as for flat modeling, such as topic intrusion [14]. For more precise results


a human-in-the-loop approach still takes place. The Hungarian algorithm and one-to-one topic correspondence are used in [9] for evaluating topic stability. Evaluation of the quality of hierarchical topic models remains mainly unsolved, as does automatic structure composition.

Currently, there are multiple alternatives (namely, word2vec-based measures, tf-idf coherence and NPMI) for assessing the quality of domain-related (and thus quite similar) topics. It is not clear which method should be used, and this calls for a comparative study of several alternatives in order to automate the second step of our approach and to draw conclusions about their applicability to the case of user profiling.

3. Two-level topic modeling

The whole workflow of extracting specific activity information is shown in Fig. 1. The overall process can be divided into two parts: obtaining keywords and extracting fine-grained activities.

The first step requires filtering out as many posts with specific activity information as possible. To perform this, we process the available post metainformation to extract domain-related words (as they are used in this particular social network). This approach helps to get the set of keywords for each domain without addressing any external dictionaries. However, human control is needed for selecting the relevant topics and wiping out the noise. After that, we filter posts by a minimum number of words from the domain keyword list (for instance, we assume that a sport post should contain at least 2 words from our sport dictionary subset). This simple threshold helps to avoid further processing of highly irrelevant information.

The second step is to process the resulting filtered posts, corresponding to the various datasets, in order to identify the subtypes of activities within each domain. This task can be performed using the ARTM approach. Such a two-level approach provides a way to perform filtration and to uncover the unknown structure of activity domains.

Concerning the metainformation, one of the possible ways to quickly assess the topic of a post is to look at its hashtags. On the one hand, such short descriptions, selected by the users themselves, convey the main idea of the content; on the other hand, the number of hashtags per post is rather small for applying methods based on probability distributions, due to the sparsity of the final document-term vector caused by the wide range of hashtags that may be unique to each post (see an example in Table 1).
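The keyword-threshold filtering described above can be sketched as follows; the domain keyword lists here are hypothetical toy examples, not the dictionaries actually mined from hashtags.

```python
# Illustrative sketch of the first-level filtering step: a post is routed to a
# domain only if its tokens contain at least MIN_MATCHES words from that
# domain's keyword list (keyword lists below are hypothetical examples).

DOMAIN_KEYWORDS = {
    "sport_and_health": {"fitness", "gym", "workout", "training", "yoga"},
    "family_and_work": {"family", "children", "mother", "office", "career"},
}

MIN_MATCHES = 2  # minimum keyword hits, as assumed for sport posts in the text

def assign_domains(post_tokens, domain_keywords=DOMAIN_KEYWORDS,
                   min_matches=MIN_MATCHES):
    """Return the domains whose keyword lists overlap the post by at least
    `min_matches` distinct tokens."""
    tokens = set(post_tokens)
    return [domain for domain, kws in domain_keywords.items()
            if len(tokens & kws) >= min_matches]

# A post with two sport keywords passes the sport filter; a generic post
# matches nothing and is dropped from further processing.
print(assign_domains(["morning", "gym", "workout", "selfie"]))  # ['sport_and_health']
print(assign_domains(["hello", "world"]))                       # []
```

The same routine applied per domain yields the filtered per-domain corpora used on the second level.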
From the table it is clear that NMF gives more precise topics than LDA, where noise such as the city name "самара" (Samara) appears, or overly high-level words, like "девушка" (girl) and "жизнь" (life), end up in one topic.

Table 1. Difference in detected topics.

LDA: спорт зож селфи я пп самара друзья девушка жизнь красота
NMF: fitness sport gym motivation bodybuilding fit workout health training fitnessmodel

To handle the sparsity problem we used the non-negative matrix factorization (NMF) approach with α = 1 and l1 = 0.1, where the l1 prior defines the type of regularization and α is responsible for its intensity. The selected number of topics was deliberately large, equal to 100, in order to obtain more precise sets of keywords. This approach helps to preserve rare words in the final division into topics. Examples of extracted words are shown in Table 2.

For the further extraction of more fine-grained activities, two ways are possible: working with each group of posts using a separate flat model, or trying to recreate the top-level distribution by combining the resulting filtered topics. Here, we used the former approach in order to assess the quality of filtering and analyze the underlying structure.


Fig. 1. Scheme of activity extraction.

Table 2. Examples of extracted keywords for each activity domain.

Sport and Health: fitness правильноепитание тренировки зож fitnessmotivation
Entertainment and Leisure: summer природа party relax weekend
Family and Work: family children mother bride dad
Culture and Religion: music show bass festival piano
General: cars ремонт водитель news nhk

4. Experimental study

When performing specific topic extraction, it is important to understand how to assess the resulting quality. Such knowledge should help to automatically identify the number of topics when the underlying structure is not known a priori.


For the purposes of the quality check we collected a dataset containing the posts of 133 000 Instagram users from Saint Petersburg, using the crawling system from [3][4]. The users were randomly chosen from a list of collected Instagram user identifiers belonging to Saint Petersburg; for each user, all posts from his/her profile were collected. The resulting dataset contains 50 000 000 posts, each consisting of text, hashtags and a reference to the posted image.

Here, we experiment with four out of the five activity domains, namely "Sport and Health", "Family and Work", "Leisure and Entertainment" and "Culture and Religion". They contain the posts obtained after the filtration step. The resulting distribution is skewed towards the entertainment and leisure domain, while some domains' aspects, such as religion and work, are poorly represented.

Table 3. Examples of topics for various numbers of topics (topic index in parentheses).

4 topics: (3) йога заниматься цель рука программа
8 topics: (0) спортивныедевушка сезон проект качалка модель; (1) вкусно вода здоровоепитание ужин обед
16 topics: (8) скандинавскаяходьба функциональныйтренинг пилатес фитнеснасвежемвоздух шпагат
25 topics: (3) пока смотреть любимый ходить думать; (2) спортсмен треня акробатика весело кроссфит

Firstly, the evaluation was performed according to how the topics are perceived by a human. If the number of topics is less than 4, it is difficult to grasp the meaning of each topic due to the poor separation quality (see Table 3). A clearer division appears from 8 topics, where a specific sport orientation starts to be seen. Beginning with the division into 20 topics, the results become more polluted by background topics, though clear ones can still be seen. As the number rises further, topics begin to repeat each other and combine unwanted common words into one topic.

Three popular metrics were used to evaluate the quality of the obtained topics, namely:

(1) tf-idf coherence, which is calculated as

C_{tfidf}(t, W_t) = \sum_{w_1, w_2 \in W_t} \log \frac{\sum_{d: w_1, w_2 \in d} tfidf(w_1, d)\, tfidf(w_2, d) + \epsilon}{\sum_{d: w_1 \in d} tfidf(w_1, d)};   (1)

(2) a metric based on word embeddings,

C_{emb}(t, W_t) = \frac{1}{|W_t| (|W_t| - 1)} \sum_{w_1 \neq w_2} d(v_{w_1}, v_{w_2}),   (2)

where two types of distance d were used: cosine similarity from Word2Vec and the l2 distance;

(3) normalized PMI,

C_{NPMI}(t, W_t) = \sum_{i < j} \frac{\log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}}{-\log p(w_i, w_j)}.   (3)
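The NPMI measure of Eq. (3) can be sketched in plain Python, estimating p(w) and p(w_i, w_j) as document frequencies over a toy corpus; real corpora would use sparse co-occurrence counts.

```python
# Sketch of the normalized PMI topic-quality measure (Eq. 3), averaged over
# all pairs of a topic's top words.
import math
from itertools import combinations

def npmi_coherence(top_words, documents, eps=1e-12):
    """Average NPMI over all pairs of top words; probabilities are
    estimated as document frequencies."""
    n_docs = len(documents)
    doc_sets = [set(d) for d in documents]
    def p(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs
    scores = []
    for wi, wj in combinations(top_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0:
            scores.append(-1.0)  # words never co-occur: minimal NPMI
            continue
        pmi = math.log(p_ij / (p(wi) * p(wj)))
        scores.append(pmi / -math.log(p_ij + eps))
    return sum(scores) / len(scores)

docs = [["gym", "fitness"], ["gym", "fitness"], ["family", "dad"]]
coherent = npmi_coherence(["gym", "fitness"], docs)    # words co-occur often
incoherent = npmi_coherence(["gym", "family"], docs)   # words never co-occur
print(coherent > incoherent)  # True
```

Higher values indicate that the topic's top words tend to appear in the same documents, which is what the human labeling rewards as well.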

The corresponding word embeddings were obtained using a Word2Vec model, whose goal is to fit the word vectors into one space according to their cosine similarity. Word2Vec for topics was trained on the whole


set of posts. The need to prepare our own model, instead of using a pretrained one, arises from the presence of hashtags (merged word combinations) in posts, which are not covered by existing models. Theta and phi sparsity for the main and background topics were tuned for each of the domains, and the percentage of background topics was set to 0.2 for all domains except "Family and Work", where this value reached 0.3.
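Given such a trained model, the embedding-based measure of Eq. (2) reduces to an average pairwise distance over a topic's top-word vectors. A stdlib-only sketch with toy 2-d vectors standing in for the self-trained Word2Vec embeddings (the vector values are hypothetical):

```python
# Embedding-based topic quality (Eq. 2): mean pairwise distance between the
# vectors of a topic's top words; smaller distance = tighter topic.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def embedding_coherence(top_words, vectors, dist=cosine_distance):
    # Sum over ordered pairs w1 != w2, normalized by |W|(|W|-1) as in Eq. (2).
    n = len(top_words)
    total = sum(dist(vectors[w1], vectors[w2])
                for w1 in top_words for w2 in top_words if w1 != w2)
    return total / (n * (n - 1))

# Hypothetical toy vectors; real ones come from the trained Word2Vec model.
vecs = {"gym": (1.0, 0.1), "fitness": (0.9, 0.2), "opera": (0.0, 1.0)}
tight = embedding_coherence(["gym", "fitness"], vecs)
loose = embedding_coherence(["gym", "opera"], vecs)
print(tight < loose)  # True: related words have closer vectors
```

An l2 variant is obtained by passing a Euclidean `dist`; note that for distance-like metrics, smaller values are better.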

Fig. 2. Results of topic quality evaluation for "Sport and Health" domain.

Fig. 3. Results of topic quality evaluation for "Culture and Religion" domain.


Fig. 4. Results of topic quality evaluation for "Family and Work" domain.

Fig. 5. Results of topic quality evaluation for "Leisure and Entertainment" domain.

For the purposes of quality evaluation, the top 10 most representative words of each topic were picked. All computed values were normalized according to (4), such that a value of 1 identifies the best result on the observed data and 0 the worst (within each metric):

Q^{i}_{k,norm} = \frac{C^{i}_{k} - \min_n(C_k)}{\max_n(C_k) - \min_n(C_k)},   (4)

where Q^{i}_{k,norm} is the quality result for i topics obtained with metric k. For the measure based on the l2 distance, min and max should be swapped. Here, we address the quality of the worst topics (5th percentile) for each number of topics in the range 4-40.
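The min-max normalization of Eq. (4) is a one-liner per metric; a small sketch that also handles the swapped min/max for distance-like metrics such as the l2 measure:

```python
# Eq. (4): min-max normalization of a metric's raw values onto [0, 1],
# so 1 is the best observed result and 0 the worst (within one metric).
def normalize(values, higher_is_better=True):
    # Assumes the values are not all identical.
    lo, hi = min(values), max(values)
    norm = [(v - lo) / (hi - lo) for v in values]
    # For distance-like metrics (e.g. l2) smaller raw values are better,
    # so the scale is flipped (min and max swap roles).
    return norm if higher_is_better else [1.0 - q for q in norm]

raw = [0.2, 0.5, 0.8]
print(normalize(raw))                          # ~[0.0, 0.5, 1.0]
print(normalize(raw, higher_is_better=False))  # ~[1.0, 0.5, 0.0]
```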


Due to fluctuations in the quality results connected with the closeness of the subtopics, the obtained curves were smoothed with a Savitzky-Golay filter with window size 11. The results of the calculations are presented in Figs. 2-5 for all domains except "General", due to its weak definition and the absence of ground-truth topics to be evaluated. Expert human opinion for subsets of topics is provided under the x axis. The qualitative assessment ranges from "very good" (all the topics are coherent and related to the main topic) to "very bad" (all the topics are incoherent or irrelevant), through the categories "good", "comprehensible" and "bad". The optimal number of topics is marked by "Human judgment". The label "good" could be assigned to several candidate numbers of topics, but the "Human judgment" mark indicates that this particular variant is slightly more coherent, from the human perspective, than the other labeled ones, due to better word alignment inside topics or less noise.

It is evident that for numbers of topics less than or equal to 30, rises in the resulting values indicate at least "comprehensible" topics, while beyond this threshold the statement no longer holds. It can be seen that the Word2Vec measure with cosine distance does not correspond well to the human labeling and may dip when the quality is considered to be at least "comprehensible" (Fig. 2). With the l2 norm the situation is better, but the smallest values of the metric are weakly connected with the ground-truth opinion. NPMI comes close to reflecting the "Human judgment" labeling, but it acts worse at the ends of the interval; the same holds for tf-idf coherence. Overall, several rising trends (on the interval of 4-40 topics) can be seen for all domains. The peaks in quality with a growing number of topics can be explained by topics with common words, which are more strongly reflected by the NPMI and tf-idf coherence measures.
Metrics based on word embeddings have difficulties reflecting the exact peaks and troughs, while globally the trend itself is correct. The main reason for such behavior lies in training on the initial corpus of posts, which turned out to be insufficient to embed the word vectors in a way that reflects their real-world relations. Looking at the results, we can conclude that automatic evaluation with NPMI and tf-idf coherence gives a usable approximation to human judgment. The absence of a "very good" number of topics indicates the need to produce better separation between topics.
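The smoothing step applied to the quality curves can be sketched with SciPy; window size 11 follows the text, while the polynomial order is not stated in the paper, so polyorder=3 here is an assumption, and the noisy curve is synthetic.

```python
# Savitzky-Golay smoothing of a quality-vs-number-of-topics curve, as applied
# to the metric curves before plotting (polyorder=3 is an assumed value).
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
topics = np.arange(4, 41)                       # 4-40 topics, as in the study
raw = np.sin(topics / 6.0) + 0.2 * rng.normal(size=topics.size)  # synthetic noisy curve
smoothed = savgol_filter(raw, window_length=11, polyorder=3)
```

The filter fits a local polynomial in each window, suppressing topic-to-topic fluctuations while preserving the rising and falling trends that the human labels are compared against.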

5. Conclusion and Future work

The obtained results show that the whole procedure of activity extraction depends highly on human perception, and the process needs to be guided. By defining two steps, we are trying to reduce this human presence to the first step only. Assessment of fine-grained activity using the two-step schema is very promising, because it relies solely on information present in the dataset: the whole structure cannot be predefined externally, since it is a latent property of the social media data itself. The results show that metadata such as hashtags is extremely helpful in defining domain-specific keyword lists.

The evaluation of the topics in the second step may be made semi-automatic by looking at metric trends. The best behavior is demonstrated by the tf-idf coherence and NPMI measures, though there is no clear leader. While it is difficult to determine the exact number of topics, we can narrow the range of possible numbers within each activity domain. However, due to their significant dependence on how the information is perceived by humans, these measures give only a rough approximation of the quality. A step towards better results can be made by an ensemble implementation, where the disadvantages of using only one metric are counterbalanced by their combination, or by introducing a classifier based on human estimations ("very good", "bad", etc.) trained on manually labeled pairs (quality value, given estimation).

In future work, we are going to fully automate the second step of extracting activity subtypes by choosing the most reasonable way to do so (comparing a classifier based on human labeling with the ensemble approach). This will help to define and fix the set of fine-grained actions that can be performed by a user. Based on this knowledge, numerical vectors indicating the level of manifestation of each activity type can be assigned to users for further clustering and recommendation building.




Acknowledgments

This research was financially supported by the Ministry of Education and Science of the Russian Federation, Agreement #14.578.21.0196 (03.10.2016), Unique Identification RFMEFI57816X0196.

References

[1] Apishev, M., Koltcov, S., Koltsova, O., Nikolenko, S., Vorontsov, K., 2016. Additive regularization for topic modeling in sociological studies of user-generated texts, in: Mexican International Conference on Artificial Intelligence, Springer. pp. 169–184.
[2] Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
[3] Butakov, N., Petrov, M., Mukhina, K., Nasonov, D., Kovalchuk, S., 2018. Unified domain-specific language for collecting and processing data of social media. Journal of Intelligent Information Systems, 1–26.
[4] Butakov, N., Petrov, M., Radice, A., 2016. Multitenant approach to crawling of online social networks. Procedia Computer Science 101, 115–124.
[5] Chirkova, N., Vorontsov, K., 2016. Additive regularization for hierarchical multimodal topic modeling. Journal Machine Learning and Data Analysis 2, 187–200.
[6] Kochedykov, D., Apishev, M., Golitsyn, L., Vorontsov, K. Fast and modular regularized topic modelling.
[7] Landauer, T.K., 2006. Latent semantic analysis. Wiley Online Library.
[8] Lau, J.H., Newman, D., Baldwin, T., 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 530–539.
[9] Miller, J., McCoy, K., 2017. Topic model stability for hierarchical summarization, in: Proceedings of the Workshop on New Frontiers in Summarization, pp. 64–73.
[10] Nikolenko, S.I., 2016. Topic quality metrics based on distributed word representations, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM. pp. 1029–1032.
[11] Nikolenko, S.I., Koltcov, S., Koltsova, O., 2017. Topic modelling for qualitative studies. Journal of Information Science 43, 88–102.
[12] Qin, Y., Yu, Z., Wang, Y., Gao, S., Shi, L., 2015. Approaches to detect micro-blog user interest communities through the integration of explicit user relationship and implicit topic relations, in: Chinese National Conference on Social Media Processing, Springer. pp. 95–106.
[13] Vorontsov, K., Frei, O., Apishev, M., Romov, P., Dudarenko, M., 2015. BigARTM: Open source library for regularized multimodal topic modeling of large collections, in: International Conference on Analysis of Images, Social Networks and Texts, Springer. pp. 370–381.
[14] Wang, C., Liu, X., Song, Y., Han, J., 2014. Scalable and robust construction of topical hierarchies. arXiv preprint arXiv:1403.3460.
[15] Xing, L., Paul, M.J., 2018. Diagnosing and improving topic models by analyzing posterior variability.
[16] Xu, Y., Yin, J., Huang, J., Yin, Y., 2018. Hierarchical topic modeling with automatic knowledge mining. Expert Systems with Applications 103, 106–117.
[17] Zhao, H., Du, L., Buntine, W., 2017. A word embeddings informed focused topic model, in: Asian Conference on Machine Learning, pp. 423–438.