Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 143 (2018) 626–634
www.elsevier.com/locate/procedia
8th International Conference on Advances in Computing and Communication (ICACC-2018)
Sentiment Extraction from Naturalistic Video

Vignesh Radhakrishnan∗, Christina Joseph, K. Chandrasekaran
Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal - 575025, India
Abstract

Sentiment analysis on video is quite an unexplored field of research wherein the emotion and sentiment of the speaker are extracted by processing the frames, audio and text obtained from the video. In recent times, sentiment analysis of naturalistic audio has been an upcoming field of research. This is typically done by performing automatic speech recognition on the audio, followed by extracting the sentiment exhibited by the speaker. On the other hand, techniques for extracting sentiment from text are well developed, and tech giants have already optimized these methods to process large amounts of customer reviews, feedback and reactions. In this paper, a new model for sentiment analysis from audio is proposed which is a hybrid of a Keyword Spotting System (KWS) and a Maximum Entropy (ME) classifier system. This model is developed with the aim of outperforming other conventional classifiers and providing a single integrated system for audio and text processing. In addition, a web application for dynamic processing of YouTube videos is described. The web application provides an index-based result for each phrase that is detected in the video. Often, the emotion of the viewer of a video corresponds to its content. In this regard, it is useful to map these emotions to the text transcript of the video and assign a suitable weight to them while predicting the sentiment that the speaker exhibits. This paper describes such an application, developed to analyze facial expressions using the Affdex API. Thus, using the combined statistics from all three aforementioned components, a robust and portable system for emotion detection is obtained that provides accurate predictions and can be deployed on any modern system with minimal configuration changes.
© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the scientific committee of the 8th International Conference on Advances in Computing and Communication (ICACC-2018).

Keywords: Keyword Spotting; Maximum Entropy; Sentiment Analysis; Emotion Detection
1. Introduction

Sentiment analysis, also called opinion mining, is a term that is often misjudged. Fundamentally, it is the means of determining the emotional tone behind a series of words, used to understand the states of mind and feelings communicated in online or electronic form. It is extremely valuable in web-based social networking environments, as it enables us to obtain an outline of the broader popular feeling behind specific themes. As mentioned in [7], the uses of sentiment analysis are varied and have tremendous potential.
∗ Corresponding author at: National Institute of Technology Karnataka, Surathkal, India.
E-mail address: [email protected]

1877-0509 © 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the scientific committee of the 8th International Conference on Advances in Computing and Communication (ICACC-2018).
10.1016/j.procs.2018.10.454
The ability to extract insights from social data is a practice that is by and large well received by organizations all over the world. Shifts in sentiment on social media have been shown to correlate with shifts in the stock market. The Obama administration used sentiment analysis to gauge public opinion on policy announcements and campaign messages in the run-up to the 2012 presidential election. The ability to rapidly understand consumer attitudes and react accordingly is something that Expedia Canada took advantage of when it noticed a steady increase in negative feedback about the music used in one of its television adverts.

Although quite popular, the science has not been perfected yet. Some of the challenges are highlighted in [6]. Human language is complex. Teaching a machine to analyze the various grammatical nuances, cultural variations, slang and misspellings that occur in online platforms is a difficult process. Teaching a machine to understand how context affects tone is even harder. For sarcasm, as in many areas of Natural Language Processing (NLP), context matters. Analyzing natural language data will be a focus of the next two to three decades. It is an enormously challenging task, and sarcasm and other forms of ironic language are inherently difficult for machines to detect when considered in isolation during classification. It is essential to have a sufficiently sophisticated and thorough approach so that relevant context can be considered. For instance, that would require knowing that a specific user is generally sarcastic or ironic, or having a larger corpus of natural language data that provides clues for deciding whether an expression is sarcastic.

Given the aforementioned challenges and opportunities, it is evident that such a system for video would be extremely useful, and that it could be quite complex to build. This paper describes the development of such a system, which is composed of three main subsystems. The first is the main classifier, which uses a combination of Keyword Spotting and Maximum Entropy classification techniques to increase prediction accuracy and speed on text transcripts. The second is a web application built to dynamically process YouTube videos with emphasis on the context of the content. It indexes the sentiments of the phrases using the Google Speech Recognition API and hence provides a timeline-based graph of the overall emotion of the speaker. The third is a reaction-analysis web application that is capable of detecting minute facial features such as eyebrow elevation and jaw drop to give an accurate prediction of the viewer's emotion, which quite often reflects that of the speaker in the video. The main classifier assigns suitable weights to the probabilities returned by the second and third subsystems and predicts the overall sentiment.

The rest of the paper is organized as follows. In Section 2, work related to opinion mining is explored and compared with that described in this paper. A detailed description of the models used is given in Section 3, along with the implementation details. The results are presented in Section 4. The merits and demerits of the model are discussed in Section 5. The steps to be taken to fine-tune the algorithm and increase prediction speed are explored in Section 6.
2. Literature Survey

The majority of the peer work considered has few implementation statistics, since the field itself is quite new and experts are still validating the claims made by these papers (such as [11]) by applying the techniques in industry. Beginning with the most simplistic approach, [1] presents a new approach to phrase-level sentiment analysis that first determines whether the text is neutral or polar and then disambiguates the polarity of the polar text. [15] discusses a similar approach with emphasis on a marketing point of view. With this approach, the framework can automatically recognize the contextual polarity for a large subset of sentiment expressions, achieving performance that is significantly better than the baseline. As described in [12], an alternative is to classify individual statements instead of the entire document in one pass. Sentiment analysis in multiple languages has been attempted in [14] with limited success.

To analyze audio in real time, the method discussed in [2] presents a multimodal framework for real-time analysis of audio from social interaction in a small group of users. Two significant objectives are met: measuring the synchronization of emotional behavior within a small group, and detecting the emergence of functional roles, such as leadership. A small group of users is modeled as an intricate system consisting of individual interacting components that can self-organize and exhibit global properties. Methods are developed for computing quantitative measures of both synchronization and leadership. Music is chosen as the experimental test bed, since it is a clear case of interactive and social activity where emotion and nonverbal communication are of importance. It has been used both in exploratory setups and in real applications (such as user-driven applications for active music listening).
There is a high degree of sarcasm in social media. While a few systems exist that can detect sarcasm, little work [3] has been done on measuring the impact that sarcasm has on sentiment in tweets and on incorporating this into automatic tools for emotion detection. This group investigates the effect of sarcasm scope on the polarity of tweets, and has assembled a number of tweets that make it possible to improve the accuracy of opinion mining when sarcasm is known to exist. It considers specifically the effect of sentiment and sarcasm contained in hashtags, and has created a hashtag tokeniser for GATE, so that sentiment and sarcasm found within hashtags can be identified more easily. In the reported experiments, hashtag tokenisation, sarcasm detection and polarity identification achieved 98%, 91% and 80% precision respectively.

Deep Convolutional Neural Networks have been applied to analyze short texts such as tweets. All information ranging from character level to sentence level is used in [10] to extract as much contextual information as possible. It has given satisfactory accuracy on two datasets, viz. the Stanford Sentiment Treebank (SSTb) and the Stanford Twitter Sentiment (STS) corpus. The performance of this method on other datasets has not been demonstrated.

Another parallel line of research in [4] addresses the task of multimodal sentiment analysis and shows that a joint model integrating visual, audio, and textual features can be successfully used to recognize sentiment in web videos. That paper makes three critical contributions. First, it addresses the task of tri-modal sentiment analysis and demonstrates that it is a feasible task that can benefit from the joint exploitation of the visual, audio and textual modalities. Second, it identifies a subset of audio-visual features relevant to sentiment analysis and presents guidelines on how to integrate these features. Third, it presents a new dataset consisting of genuine web data, which will be valuable for future research.

A major part of the work [5] to date on subjectivity and sentiment analysis has concentrated on textual data, and various resources have been created, including dictionaries. Given the rapid growth of multimedia on the Web, which includes enormous collections of videos (e.g., YouTube, Vimeo, VideoLectures), pictures (e.g., Flickr, Picasa, Facebook) and audio (e.g., podcasts), the capacity to identify opinions and sentiment across differing modalities is becoming progressively essential. In particular, movie reviews on YouTube are especially difficult to examine, since the sentiment of the content depends on the genre of the movie. For example, a negative sentiment word such as 'terrifying' could be found in a positive review of a horror movie. Therefore, facial expressions and tonal cues have been taken into consideration in the work proposed in [8]. Despite the large amount of previous work on multimodal emotion analysis, that work has not addressed the explicit polarity (or sentiment) of the data, has generally centered on visual and audio signals, and has for the most part overlooked what can be learned from textual analysis.
Until recently, the field of multimodal sentiment analysis had not received much attention, and no earlier work had specifically addressed the extraction of features and the fusion of information extracted from the various modalities. The point of multimodal information fusion is to improve the precision and quality of estimates. Numerous applications, e.g., routing devices, have effectively demonstrated the potential of such information processing. This illustrates the significance and feasibility of building a multimodal system that can cope with all three sensing modalities, text, audio, and video, in human-centric situations. The way people communicate and express their feelings and sentiments is inherently multimodal: the textual, audio, and visual modalities are used simultaneously and cognitively to enable effective extraction of the semantic and emotional information conveyed during interactions. Features are extracted from the various modalities and used to build a novel multimodal sentiment analysis system. For experiments, the group in [9] used datasets from YouTube and employed several machine-learning classifiers for the sentiment analysis task. The best performance was obtained with the Extreme Learning Machine (ELM), a popular learning method that provides efficient approximate solutions for feedforward networks including (but not limited to) single- and multi-hidden-layer neural networks, radial basis function networks, and kernel learning. ELMs offer considerable advantages, for example fast learning speed, ease of use, and minimal human intervention. It is therefore a strong candidate as a reasonable and viable technique for large-scale computing and machine learning across a wide range of media, including image and speech processing as well as multimodal information analysis.

Deep Convolutional Neural Networks have been applied in [13] to leverage various feedback mechanisms for multimodal sentiment analysis. Research in this field is developing quickly and drawing the attention of both academia and industry alike. This, in combination with advances in signal processing and AI, has prompted the development of cutting-edge frameworks that aim to identify and process the complete emotional information contained in multimodal sources. The majority of such
systems, however, rely on processing a single medium, i.e., text, audio, or video. Further, these frameworks are known to exhibit limitations in terms of meeting accuracy and overall performance requirements, which, in turn, greatly restrict their usability in real applications. Although the aforementioned techniques are effective, they may require significant computing resources, and configuring the systems for execution is not straightforward. In this paper, we propose a simple yet innovative approach for sentiment analysis that is portable and produces results with good accuracy.

3. A Comprehensive Sentiment Analysis System

The collection of videos that forms the dataset for training and validation was chosen manually, placing special emphasis on including videos wherein the speakers have different opinions (positive, negative and neutral), tonality, accent, speed etc. The aim is to provide a varied training set to prevent over-fitting of the model. The dataset (playlist) of videos created can be found here: goo.gl/W5VuPv. The three main components of the video sentiment analyzer, viz. the hybrid classifier, the live video analyzer and the reaction analyzer, are explored below. A comprehensive definition and evaluation of each model is given in the following sub-sections.

3.1. Hybrid KWS-ME Classifier

The majority of the work described in this paper revolves around the development and fine-tuning of the hybrid model. The following are the steps in building the hybrid classifier:
1. Extract Audio from Video
2. Process Transcript
3. Feature Selection
4. Apply Keyword Spotting
5. Apply Maximum Entropy
6. Sentiment Prediction
The most important algorithms that the model consists of are Keyword Spotting (KWS) and Maximum Entropy (ME). Keyword Spotting, mostly associated with speech processing, deals with the identification of certain words that are present in a predefined list of words (henceforth referred to as keywords) in the speech being processed. As technological advancements were made and speech recognition engines began providing satisfactory accuracy, KWS was adopted in the realm of text processing as well. In this experiment, we apply KWS to the text extracted from audio using CMU's PocketSphinx library. It should be noted that KWS requires a significant amount of data preparation and normalization.

The principle of maximum entropy is fundamental to the ME algorithm. In simple words, it states that if prior information about a probability distribution is encoded using various candidate probability functions, the one with the maximum entropy is the most accurate. Using this function, we can predict future data that is likely to be in accordance with the pre-existing data. If a parallel to modern machine learning techniques is to be drawn, the pre-existing data would be the 'dataset' used for training and validation, and finding the maximum entropy function would be the feature selection or 'training' of the model. Building the classifier then consists of appending the probability function to the overall model.

The preprocessing step includes downloading all the videos in the aforementioned YouTube playlist using the youtube-dl library for Python. Using this library provides the advantage of downloading the audio in wav format, which is the format required by the Automatic Speech Recognition (ASR) engine, PocketSphinx. PocketSphinx is chosen over other ASR engines such as Google SR or Python SR since it requires minimal configuration changes and the entire audio can be processed at once instead of in batches. It is noteworthy that the time required to process an audio file and extract text from it depends on several factors such as network speed, bandwidth, the SR engine used, CPU speed etc. Therefore, the audio files were converted to text transcripts on an on-demand basis.

The various operations on the text transcript are described in Figure 1. There are three operations that can be performed simultaneously on the extracted transcript, which can be exploited to achieve parallelism in the prediction phase.
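Before turning to those operations, the preprocessing step above can be made concrete with a minimal sketch, assuming the youtube-dl library and the SpeechRecognition wrapper around PocketSphinx; the URL and file names are placeholders, not those of the actual dataset.

```python
import youtube_dl                 # downloads the playlist videos
import speech_recognition as sr   # wraps CMU's PocketSphinx ASR engine

# Download the audio track of one video directly as WAV, the format
# required by PocketSphinx.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "clip.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL

# Transcribe the whole file at once, offline.
recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:
    audio = recognizer.record(source)
transcript = recognizer.recognize_sphinx(audio)
print(transcript)
```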
Fig. 1. Operations on the Transcript extracted from Audio
The first operation focuses on the syntax of the text. Using Parts-of-Speech tagging, the keywords in the text are obtained. These keywords are tagged as adjectives, verbs, nouns (singular, plural) etc. Some examples of keywords that were identified are: romantic, bubbly, Shirley and purchasing. It has been observed that adjectives (such as beautiful, horrible etc.) have a greater influence on the sentiment of the context. Therefore, the adjectives among the keywords are extracted using a regular expression engine. The second task is to extract the frequency of the words that occur in the transcript. This is achieved with TextBlob, an open-source Python library built for Natural Language Processing. Only the most frequently occurring words are then used for sentiment analysis, since they are most likely to influence the probabilistic classifier. The final parallel task deals with the semantics of the text. This includes extracting the noun phrases from the text using TextBlob, which helps in isolating the sentences from the transcript that will be further used to obtain phrase-level polarity from the TextBlob sentiment analysis library.

The results of the three operations are combined to produce a database whose attributes are the various features extracted during the execution of the tasks. Various data cleaning and normalization operations are performed on this database, and tuples with missing values are removed. A Naive Bayes classifier, such as that provided by the Python Natural Language Toolkit (nltk), is trained on the Movie Review Corpus (from nltk). Similarly, the same dataset is used to train and validate the Vader classifier (nltk). The models can also be trained on all the text transcripts obtained from the audio database; however, the number of samples would be quite small owing to the large size of the audio files and the huge computational power required to process them and extract the underlying text. One solution is to perform cross-validation on the limited samples (around 63) available in the audio repository. On comparison of the Naive Bayes and Vader classifiers, it is evident that both models produce similar results, with the former's accuracy at 81% as opposed to the latter's 83%.
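The following is a minimal sketch of the three parallel transcript operations and the Vader scoring, using TextBlob and nltk as described above; the sample transcript is invented for illustration.

```python
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Requires the TextBlob corpora and nltk.download("vader_lexicon").

transcript = ("The hotel was beautiful and the staff were wonderful, "
              "but the food was horrible.")
blob = TextBlob(transcript)

# Task 1 (syntax): POS-tag the transcript and keep the adjectives (JJ tags).
adjectives = [word for word, tag in blob.tags if tag.startswith("JJ")]

# Task 2 (frequency): the most frequently occurring words.
frequent = sorted(blob.word_counts.items(), key=lambda kv: -kv[1])[:10]

# Task 3 (semantics): noun phrases and phrase-level polarity per sentence.
phrases = blob.noun_phrases
polarity = {str(s): s.sentiment.polarity for s in blob.sentences}

# Vader scoring, trained and validated alongside the Naive Bayes model.
vader = SentimentIntensityAnalyzer()
print(adjectives, frequent, phrases, polarity,
      vader.polarity_scores(transcript))
```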
Fig. 2. Keyword Spotting on Audio
The KWS implementation used is depicted in Figure 2. Prior to implementing the KWS algorithm, it is essential to have the keywords for the various sentiments (happiness, sadness, disgust, anger etc.) ready. There are two ways this can be achieved. The first, perhaps more straightforward and precise, is to obtain them from the Python ConceptNet support centre. The second option is to extract them by means of opinion mining on any standard review dataset. In this paper, the former method has been used. The ME algorithm, as mentioned previously, is well established, and its implementation has the additional task of incorporating the KWS results. For this, a process call to the KWS algorithm is made during the initiation of the ME algorithm.
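The sketch below illustrates this KWS-into-ME step under simplifying assumptions: the per-sentiment keyword lists and the toy training transcripts are invented stand-ins for the ConceptNet-derived lists and the audio corpus, and nltk's MaxentClassifier plays the role of the ME stage.

```python
import nltk

# Hypothetical per-sentiment keyword lists (stand-ins for the ConceptNet lists).
KEYWORDS = {
    "happy":   {"wonderful", "beautiful", "great", "love"},
    "sadness": {"terrible", "horrible", "sad", "awful"},
}

def spot(transcript):
    """Keyword Spotting: binary features for every keyword found in the text."""
    tokens = set(transcript.lower().split())
    return {kw: True for kws in KEYWORDS.values() for kw in kws & tokens}

# Toy labeled transcripts; the real model is trained on the audio corpus.
train = [(spot("the view was wonderful and beautiful"), "positive"),
         (spot("what a horrible and terrible experience"), "negative")]

# Maximum Entropy classifier over the KWS-reduced feature space.
me = nltk.classify.MaxentClassifier.train(train, max_iter=10, trace=0)
print(me.classify(spot("i love this wonderful place")))
```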
Fig. 3. Hybrid Sentiment Classifier
Thus, the most frequent keywords are scanned for and extracted from the dataset and then passed to the ME algorithm. This produces results faster, since the sample size is greatly reduced by removing irrelevant and infrequent words. The hybrid model runs for several hundred iterations, and after each iteration the error is calculated. Using backpropagation, the model is further fine-tuned by correcting the weight matrix. It was observed that the hybrid model (Figure 3) outperforms all other conventional models for prediction (including standalone versions of KWS and ME).

3.2. Live Video Analysis

The aforementioned KWS-ME system produces the best results, although it works in a static setup, i.e. when the required audio file is already available. Therefore, to analyze videos while they are being played, a more unconventional approach has to be taken. The subsystem should be independent of the source of the video; any context that is established should come solely from the contents of the video. In addition, the component must be portable to any system with minimal configuration changes. The obvious solution is therefore to build a web application. A web application can run on a variety of browsers, and since the browser is an application-level client, any system-level detail is abstracted away, giving the web application a uniform execution environment across all systems on which it runs. A server written in JavaScript handles the requests for video analysis issued from the browser. This server uses the trained hybrid model to make predictions. It should be noted that once the hybrid KWS-ME classifier is trained, it is quite lightweight in terms of resource utilization. This, in combination with its excellent accuracy, makes it the model of choice for mobile and web applications.

In order to store video analysis results permanently and reuse them as and when required, the server stores all prediction results in MongoDB. This particular database was chosen since it is easy to set up and supports non-persistent connections as well. This is essential since the majority of users will not always have the web application running in the background and will use it on an on-demand basis.

A Graphical User Interface (GUI) is also provided to allow the user to easily access the full functionality of the web application. The components of the GUI are: a message box where video-related details such as name, URL and description are provided; a set of buttons for playing, pausing and stopping the analyzer; a text box into which the results of Automatic Speech Recognition are loaded; and a graphical canvas on which sentiment is plotted against the index of the phrase detected in the video.

To run the web application, it is hosted on one of the localhost ports of the user's machine. The user needs to locate the video whose sentiment is to be detected. Once found, the URL is supplied to the active server, along with any other details that may be desired. The server is instructed to start listening for audio. Simultaneously, the video is played, ensuring that its audio reaches the microphone of the user's machine with sufficient decibel levels. The received audio is broken down into phrases and sent to Google Speech Recognition for extracting the text. Once the text is received, it can optionally be displayed on the GUI so that it can be verified by a human counterpart.
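The production server is written in JavaScript; the following Python sketch, using the SpeechRecognition and pymongo libraries as stand-ins, shows the same listening loop: each detected phrase is sent to Google Speech Recognition and would then be scored by the hybrid model and stored in MongoDB. The database and collection names are illustrative assumptions.

```python
import speech_recognition as sr
from pymongo import MongoClient

results = MongoClient()["webapp"]["predictions"]  # assumed local MongoDB
recognizer = sr.Recognizer()

with sr.Microphone() as mic:
    recognizer.adjust_for_ambient_noise(mic)   # calibrate the decibel threshold
    index = 0
    while True:
        audio = recognizer.listen(mic)          # blocks until a phrase ends
        try:
            phrase = recognizer.recognize_google(audio)
        except sr.UnknownValueError:
            continue                            # unintelligible audio; skip
        # The phrase would be shown in the GUI text box and scored by the
        # trained hybrid classifier here; the stored record is illustrative.
        results.insert_one({"index": index, "phrase": phrase})
        index += 1
```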
The sentiment analysis runs automatically in the background, and the pointer on the graph moves along the index dimension as time elapses.
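As an illustration of the timeline graph, the sketch below plots phrase-level polarity against phrase index; TextBlob's polarity score (also in the -1 to 1 range) stands in for the hybrid classifier, and the phrases are invented.

```python
import matplotlib.pyplot as plt
from textblob import TextBlob

phrases = ["this movie was wonderful", "the plot was average",
           "the ending felt horrible"]
polarity = [TextBlob(p).sentiment.polarity for p in phrases]

plt.step(range(len(phrases)), polarity, where="mid")
plt.ylim(-1.1, 1.1)
plt.xlabel("Phrase index")
plt.ylabel("Sentiment (-1 = negative, 0 = neutral, 1 = positive)")
plt.title("Speaker sentiment over time")
plt.show()
```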
The dimension along the index is scaled suitably to give an idea of the overall sentiment of the video. A sentiment of 1, 0 or -1 denotes a positive, neutral or negative opinion of the speaker respectively. For example, when the speaker in the video mentions the word 'wonderful', the curve in the graph moves towards 1 on the sentiment axis. Similarly, the word 'average' directs the curve towards 0. The data can also be accessed by logging into the MongoDB instance and issuing simple queries.

3.3. Video Reaction Analysis

Often, when we are deeply engrossed in a particular scene in a movie, we find ourselves imitating the expression of the character in consideration. This ability of humans to empathize with fellow humans is a deep-rooted character trait, and hence it is not uncommon. Similarly, when we watch a video online, our facial expression is characteristic of our feeling towards the content delivered to us. For example, if we are watching a video of a baby crawling, we feel happy and hence we smile. On the other hand, if we are watching a speaker condemn a terrorist attack, we feel hatred and we frown. It is important to note that all this happens subconsciously; thus, the reaction can be considered more genuine than the user's verbal or textual feedback. In this paper, this emotional feedback of the user is viewed as an opportunity to further enhance prediction accuracy, and a subsystem is built to exploit it.

To analyze the facial expressions, we make use of the Affdex Web API from Affectiva, a publicly available AI-based emotion detection project that has detected millions of expressions so far and has been extensively integrated into other large-scale projects with excellent results. The web application is developed using HTML, CSS and PHP. The link for the web application is specified as a source in the HTML file, making it runnable on any browser. To run the web application, the user must run a server on the local machine. The application accesses the visual recording device on the user's machine to analyze the video stream continuously. Initially, the application delays analysis until the user's face has been registered in the frame. Once the features have been identified, it tracks them continuously, regardless of the alignment and movement speed of the user's view. Some examples of the features that are detected include eyebrow position, percentage of jaw drop, teeth exposure etc. These features are considered cumulatively to predict the expression the user is exhibiting. For example, a small jaw drop with raised eyebrows implies surprise; similarly, a high degree of teeth exposure with relaxed eyebrows indicates happiness. This data is then stored locally along with the timestamp at which it was detected.

On the server side, several operations are carried out. These form the bulk of the Natural Language Processing and video analytics. Some of these operations can be performed in parallel, which helps in reducing the prediction time. This comes in handy since the operations are quite demanding in terms of network and CPU time. A script continuously listens to the microphone to detect any incoming audio. It then breaks the audio stream into smaller audio files, each 5 seconds long. Another script analyses the generated audio files and converts them into text files using automatic speech recognition (PocketSphinx). Each of these text files carries the timestamp that represents the instant the phrase occurred in the video.
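A minimal sketch of these two server-side scripts, under stated assumptions: pydub (an assumption here; any audio splitter would do) slices a captured WAV file into 5-second chunks, and each chunk is transcribed offline via the SpeechRecognition wrapper for PocketSphinx while retaining its timestamp. The capture file name is a placeholder.

```python
import speech_recognition as sr
from pydub import AudioSegment   # assumed for slicing the audio stream

recognizer = sr.Recognizer()
stream = AudioSegment.from_wav("session.wav")   # placeholder capture file
CHUNK_MS = 5000                                 # 5-second chunks

for start_ms in range(0, len(stream), CHUNK_MS):
    chunk = stream[start_ms:start_ms + CHUNK_MS]
    chunk.export("chunk.wav", format="wav")
    with sr.AudioFile("chunk.wav") as source:
        audio = recognizer.record(source)
    try:
        text = recognizer.recognize_sphinx(audio)   # offline PocketSphinx ASR
    except sr.UnknownValueError:
        continue
    # (timestamp in seconds, phrase) to be matched with the expression log
    print(start_ms / 1000.0, text)
```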
This data, along with the information about the user's expression, is stored in a database and matched using the timestamps. Thus, we now have the phrases spoken in the video and the user's feeling towards each of the phrases.

4. Results

The accuracy of prediction using only KWS is 81%. On the other hand, the accuracy of the system using only ME is 83%. The accuracy of the hybrid model that uses both KWS and ME is the best of all, at 94.6%. In addition, several other algorithms were also applied, whose results are displayed in Table 1. It is evident that the SVM model has the best accuracy and precision when compared to Bayes and ME. This is due to the innate ability of the model to define separation boundaries better. The model was tuned further using Grid Search, an exhaustive search through a manually specified subset of the hyper-parameter space. This results in the most optimal model under the training constraints. Another trend that can be observed is that the metrics of models trained using N-fold cross-validation are significantly better than those of their single-fold counterparts. This is expected, since N-fold CV normalizes the data and removes the bias in the data to a large extent. The increased number of training and testing cycles is a substantial factor that contributes to this improvement.
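For illustration, the sketch below reproduces this tuning procedure with scikit-learn: an exhaustive grid search over a small SVM hyper-parameter grid combined with N-fold cross-validation. The dataset and grid values are placeholders, not those used in the experiments.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris  # stand-in for the transcript features

X, y = load_iris(return_X_y=True)

# Exhaustive search through a manually specified hyper-parameter subset.
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    cv=10)                       # N-fold cross-validation
grid.fit(X, y)

print(grid.best_params_,
      cross_val_score(grid.best_estimator_, X, y, cv=10).mean())
```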
Table 1. Prediction metrics for the Naive Bayes, Maximum Entropy and Support Vector Machine classifiers.

Type     Model   Accuracy   Precision   Recall   F-measure
Single   Bayes   0.712      0.8088      0.712    0.6875
Single   ME      0.696      0.8017      0.696    0.6668
Single   SVM     0.884      0.8842      0.884    0.8839
N-Fold   Bayes   0.752      0.8281      0.752    0.7357
N-Fold   ME      0.726      0.8172      0.728    0.7029
N-Fold   SVM     0.855      0.8549      0.854    0.8544
The significant difference between the SVM and KWS-ME systems' accuracy validates the superiority of the hybrid approach. A comparison of the two models on the basis of hardware and power consumption is work in progress.

5. Conclusion

In this paper, the importance of sentiment analysis on video is discussed, the various existing techniques are explored, and several novel methods are proposed. The following are the novelties that have been suggested and implemented:

1. Consideration of adjectives instead of nouns. In the research work as initially proposed, nouns and verbs were considered to have the maximum impact on the overall sentiment of the text. However, this proved counter-intuitive, and experimental results have shown that adjectives have a stronger influence on the hybrid classification algorithm and lead to faster and more accurate results.

2. Live video sentiment analysis. It was observed that no reliable application exists to perform sentiment analysis dynamically on video. Technology giants such as Google and Microsoft have such systems, although these are not open source. Therefore, an efficient and robust web application was created to perform live analysis with on-screen results. To promote ease of use, an interactive GUI was also designed for it.

3. Video reaction analysis. For the reasons mentioned in the previous section, analyzing the user's expression while he or she is watching the video is crucial for obtaining the most genuine feedback regarding the video content. Although many emotion detection APIs exist, none of them provide a direct way of mapping the expression to the content of the video. Therefore, in this paper, the development of such a system has been demonstrated and its usefulness discussed in detail.

By analyzing the text, audio and video frames, it is possible to get the most accurate and comprehensive predictions about the sentiment of the speaker. This system can be applied in many fields of research such as suspect interrogation, product advertising and market forecast analysis, among others.

6. Future Work

The proposed system is quite accurate, although it takes significant time to train the hybrid model and employ the various recognition techniques. Some independent tasks can be run in parallel, but the reduction in CPU time they provide is overshadowed by the increase in CPU load. In this regard, the aim is to introduce parallelism in the algorithm itself. If a Graphics Processing Unit (GPU) is available in the machine, it can be exploited to run the routine Generalized Matrix Multiplication (GEMM) operations in batches. Such parallel logic can be written and implemented using the Compute Unified Device Architecture (CUDA). Another parallel line of research would be to increase the size of the dataset and incorporate more varied samples to ensure that the hybrid model is more adaptable to new environments.
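As a sketch of the proposed GPU offloading, the snippet below batches GEMM operations with CuPy, which dispatches to CUDA kernels under the hood; CuPy is an assumption here, since the CUDA implementation itself is left as future work.

```python
import numpy as np
import cupy as cp  # assumed GPU backend; raw CUDA kernels are the stated goal

# Batch the routine GEMM operations and run them on the GPU instead of the CPU.
a = np.random.rand(64, 256, 256).astype(np.float32)   # batch of 64 matrices
b = np.random.rand(64, 256, 256).astype(np.float32)

c_gpu = cp.matmul(cp.asarray(a), cp.asarray(b))       # batched GEMM on device
c = cp.asnumpy(c_gpu)                                  # copy the result back
```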
References

[1] Theresa Wilson, Janyce Wiebe and Paul Hoffmann, Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis, ACM Proceedings, 2005.
[2] Giovanna Varni, Gualtiero Volpe and Antonio Camurri, A System for Real-Time Multimodal Analysis of Nonverbal Affective Social Interaction in User-Centric Media, IEEE Transactions on Multimedia, October 2010.
[3] Diana Maynard and Mark A. Greenwood, Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis, LREC, 2014.
[4] Louis-Philippe Morency, Rada Mihalcea and Payal Doshi, Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web, ACM Proceedings, 2011.
[5] Soujanya Poria, Erik Cambria, Newton Howard, Guang-Bin Huang and Amir Hussain, Fusing audio, visual and textual clues for sentiment analysis from multimodal content, Neurocomputing, Elsevier, January 2016.
[6] David Osimo and Francesco Mureddu, Research Challenge on Opinion Mining and Sentiment Analysis, W3 Organization, 2012.
[7] Erik Cambria, Björn Schuller, Yunqing Xia and Catherine Havasi, New Avenues in Opinion Mining and Sentiment Analysis, IEEE Intelligent Systems, IEEE Xplore, 2013.
[8] Martin Wöllmer, Felix Weninger, Tobias Knaup and Björn Schuller, YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context, IEEE Intelligent Systems, IEEE Xplore, 2013.
[9] Soujanya Poria, Erik Cambria, Grégoire Winterstein and Guang-Bin Huang, Sentic patterns: Dependency-based rules for concept-level sentiment analysis, Knowledge-Based Systems, Elsevier, 2014.
[10] Cícero Nogueira dos Santos and Maíra Gatti, Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts, Proceedings of COLING, 2014.
[11] Erik Cambria, Björn Schuller, Bing Liu, Haixun Wang and Catherine Havasi, Statistical Approaches to Concept-Level Sentiment Analysis, IEEE Intelligent Systems, September 2013.
[12] Tetsuya Nasukawa and Jeonghee Yi, Sentiment Analysis: Capturing Favorability Using Natural Language Processing, Proceedings of the 2nd International Conference on Knowledge Capture, October 2003.
[13] Soujanya Poria, Erik Cambria and Alexander Gelbukh, Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-Level Multimodal Sentiment Analysis, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[14] Ahmed Abbasi, Hsinchun Chen and Arab Salem, Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums, ACM Transactions on Information Systems (TOIS), June 2008.
[15] Prem Melville, Wojciech Gryc and Richard D. Lawrence, Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, June 2009.