Information Processing and Management 57 (2020) 102189
ViSSa: Recognizing the appropriateness of videos on social media with on-demand crowdsourcing
Sankar Kumar Mridha (a), Braznev Sarkar (a), Sujoy Chatterjee (b), Malay Bhattacharyya (c,⁎)
(a) Department of Information Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah – 711103, India
(b) Université Côte d'Azur, CNRS, I3S, France
(c) Machine Intelligence Unit, Indian Statistical Institute, Kolkata, West Bengal – 700108, India
⁎ Corresponding author.
ARTICLE INFO
ABSTRACT
Keywords: Crowdsourcing; Streaming data; Video analysis; Judgment analysis
The recent significant growth of social media has drawn the attention of researchers toward monitoring the enormous amount of streaming data using real-time approaches. Such data may appear in different forms like streaming text, images, audio, videos, etc. In this paper, we address the problem of deciding the appropriateness of streaming videos with the help of on-demand crowdsourcing. We propose a novel crowd-powered model, ViSSa, an open crowdsourcing platform that helps to automatically detect the appropriateness of videos being uploaded online by employing the viewers of existing videos. The proposed model presents a unique approach of not only identifying unsafe videos but also detecting the portions of inappropriateness (in terms of the platform's vulnerabilities). Our experiments with 47 crowd contributors demonstrate the effectiveness of the proposed approach. On the designed ViSSa platform, 18 safe videos are initially posted. After getting access, 20 new videos are added by different users. These videos are assessed (and marked as safe or unsafe) by the users, and finally a consensus judgment is obtained with judgment analysis. The approach detects the unsafe videos with high accuracy (95%) and points out the portions of inappropriateness. Interestingly, changing the mode of video segment allocation (homogeneous versus heterogeneous) is found to have a significant impact on the viewers' feedback. However, the proposed approach performs consistently well in different modes of viewing (with varying diversity of opinions), and with any arbitrary video size and type. The users are found to be motivated by their sense of responsibility. This paper also highlights the importance of identifying spammers through such models.
1. Introduction

Monitoring real-time streaming data is a challenging task in the current era of the social media revolution (Hasan, Orgun, & Schwitter, 2019). Streaming data (which may be in the form of text, images, audio or videos) need to be analyzed on the fly, at the rate at which the data arrive. The overall processing time and storage space should simultaneously remain small in terms of N and t (polylogarithmic complexity is preferred) (Muthukrishnan, 2005), where N denotes the number of data items available at a time instant t. In some applications, for the sake of simplicity, streaming videos are considered in a semi-streaming setting (Galasso, Keuper, Brox, & Schiele, 2014). We can assume that streaming videos are captured as a set of segments (comprising a number of successive frames) for better processing. Day by day, the global activity through the Web is increasing at scale. Therefore, the task of monitoring unlawful
activities is becoming more and more crucial. Social networking is popular among the younger generation due to the social reforms and opportunities (like searching for jobs) it provides. However, not all the content posted on social media is beneficial; it may sometimes create a negative impact on the users. Content that might affect the dignity of an individual, violate the policy of a platform, or have an intimidating effect on humanity may be considered inappropriate. Social video sharing platforms, where a huge number of videos are uploaded by different users, are one potential target for such inappropriate content, either as a complete video or as unsafe portions embedded within another video that is apparently safe. Detecting the inappropriate (unsafe) portions in streaming videos with the help of artificial intelligence is time-consuming and may not be suitable in many circumstances. Diverse kinds of videos, with widely varying features and characteristics, are uploaded by the users on social media, so it is quite difficult for a machine to recognize the features violating a platform policy. As an alternative, crowdsourcing can provide an efficient and effective solution to this problem (Yeung, Yeo, & Liu, 1998). We expect the results generated by machines to have low accuracy, whereas in a human judgment process it is much easier to judge the contents collectively and decide their appropriateness. Crowdsourcing is a model of virtual platforms where the crowd can share their views and innovative ideas, contribute skills, etc. (Neto & Santos, 2018). Through crowdsourcing, different types of micro-tasks can be solved by crowd workers in real-time (Gonen, Raban, Brady, & Mazor, 2014; Ho & Vaughan, 2012). Crowdsourcing mechanisms either rely on volunteers or offer remuneration to the crowd workers for solving the task. Both approaches can be applied for analyzing text, images, audio or video files (arriving in a streaming fashion) depending on the importance of the application. Users of public media like YouTube, Facebook, Twitter, etc. may be employed in this regard. Note that it might be necessary to involve other domain expertise, like social network analysis (Doan, Ramakrishnan, & Halevy, 2011) and judgment analysis (Chatterjee & Bhattacharyya, 2017), for processing the responses obtained from the crowd. Content moderation in social media is a way of accepting or flagging data (videos, audio, images, articles, etc.) based on the guidelines of these platforms. There are limited works on verifying the appropriateness of online contents. Recent studies are motivated by applications like the transcription of audio files (Vashistha, Sethi, & Anderson, 2017). However, very limited studies deal with video monitoring (on social media) with crowdsourcing. Due to the increasing popularity of social media like Facebook, Twitter, YouTube, etc., users tend to misuse them by posting improper videos. Such unacceptable videos are often flooded by hostile users for their self-promotion. Identifying the improper content in a video within a limited time is challenging enough, and an automatic tool is very much needed to filter such content. In this paper, we propose an approach that can detect unsafe videos (as per the general policy for online content) before making them public, with the help of active crowd contributors, in a minimal time.
We have developed a model where viewers can be engaged as volunteers for identifying inappropriate streaming videos. Here, videos are divided into segments and each segment is sent to a set of users for obtaining their feedback. Upon receiving the opinions (for a video segment) from the users, the final judgment is obtained by aggregating those opinions. Based on the judgment on every segment, the appropriateness of the respective video is decided according to the safety rules of the platform. To elicit true responses from the users, as is often done in judgment analysis (Liu & Wang, 2012), a normalized performance score is assigned to each user based on their performance. This score (corresponding to a user) changes dynamically depending on whether the user's opinion matches with the majority of the users (Lee, Ha, Lee, & Kim, 2018).

The rest of this paper is organized as follows. Related works are described in Section 2. Section 3 discusses the motivation behind this work and Section 4 introduces the proposed framework. Section 5 describes the details of the platform design and the empirical analysis. In Section 6, we discuss the deployment details and include the outcome of a focused group study. Some additional discussions are incorporated in Section 7. Finally, Section 8 concludes the paper.

2. Related work

A substantial amount of work has been carried out in the last decade to deploy the power of crowdsourcing for diverse applications. Crowd contributors can be engaged in collective actions to support different interactive tasks (Chen, Meng, Zhao, & Fjeld, 2017). More interestingly, volunteer-based crowdsourcing has also been shown to be effective for solving micro-tasks (Amer-Yahia & Roy, 2016; Ikeda et al., 2016). This can be helpful for content management in a decentralized yet powerful way. Crowd contributors can collectively keep an eye on the images, audio or video files uploaded on public media. So, they can monitor the data traffic and even decide the appropriateness of such contents in a distributed fashion. Several studies exist on data traffic monitoring (Biersack, Callegari, & Matijasevic, 2013), but monitoring of traffic data concentrates on data communication and networking. There are also some studies on activity monitoring over social media (Kietzmann, Hermkens, McCarthy, & Silvestre, 2011); these mainly focus on ways to monitor social media from a business perspective, which is far different from our motivation. Video monitoring is a tedious task. There are recent attempts to distribute such tedious tasks as microtasks and employ crowd contributors to handle them (Loni, Larson, Bozzon, & Gottlieb, 2013; Vondrick, Patterson, & Ramanan, 2013). However, as the contributors are often volunteers, they lose focus over time while dealing with such tasks. There are some previous research attempts that aim to keep the users engaged (with attention) during video surveillance (Rahman, 2012). A recent study has also tried to extend the attention of users by augmenting dummy events (Elmalech, Sarne, David, & Hajaj, 2016); the authors propose a method of rewarding the users for reporting unusual content or dummy events in a video. This mechanism can be applied well in surveillance systems too. However, the motivation of the current work is different from the said ones. Unlike surveillance videos, which demand a binary decision (to impose an alert), videos in social media require the scaling and identification of inappropriateness.
Numerous research studies have been carried out earlier to investigate the activity of spammers involved in different social media (Chakraborty, Pal, & Chowdary, 2016). For example, Edge Rank Checker (ERC) is an algorithm used by Facebook to determine which stories should turn up in each user's Newsfeed (Zheng, Zeng, Chen, Yu, & Rong, 2015). Some of the existing approaches employ keyword-based and URL-based detection methods. However, the URL-based detection methods cannot identify the spam if the destination address (hyperlink) is kept hidden. Feature selection is another important technique employed by various models to classify YouTube videos (Benevenuto, Magno, Rodrigues, & Almeida, 2010; Sureka, 2011). In these approaches, features are classified into two categories: user-based features and user-behavior features. The user-based features include the date of joining, the number of videos shared, etc., and the user-behavior features include the total numbers of videos disliked and liked, etc. In contrast, our proposed framework is based on showing short video clippings (either homogeneously or heterogeneously) to a user to decide their appropriateness. A preliminary version of this work has appeared merely as a concept earlier (Mridha, Sarkar, Chatterjee, & Bhattacharyya, 2017), with no focus on empirical analysis.

The content moderation guidelines depend on the policy of a specific platform. These guidelines restrict the users from uploading inappropriate contents. A lot of AI-based filtering techniques are available for content moderation (Hanafusa, Morita, Fuketa, & Aoe, 2011; Heymann, Koutrika, & Garcia-Molina, 2007; Roberts, 2016; Wever, Keer, Schellens, & Valcke, 2007). Indecent contents of a video have been classified depending on the motion vectors of frames in an earlier study (Endeshaw, Garcia, & Jakobsson, 2008). Jansohn et al. have later presented a framework for identifying pornographic video contents by combining image features with motion information (Jansohn, Ulges, & Breuel, 2009). Most of these are advanced usages of image processing and skin detection. However, human skin color varies with demography, which acts as a barrier to skin-based detection at a global scale; moreover, it was a static approach. Machine learning is also being used for video classification (Ochoa, Yayilgan, & Cheikh, 2012; Shao, Cai, Liu, & Lu, 2017). Aggarwal et al. have presented an approach to identify privacy-invading harassment and misdemeanor videos by mining the video metadata (Aggarwal, Agrawal, & Sureka, 2014); based on the metadata of videos, the authors detect harassment and misdemeanor in videos by employing a one-class classifier. In another study, spatio-temporal motion trajectories have been used for detecting adult scenes in a video (Jung, Youn, & Sull, 2014). Most of these rely on the extraction of information from motion patterns. Some works are related to video content classification for forensic detection (Garcia, 2015). There is another recent study on the determination of unsafe video contents for kids (Kaushal, Saha, Bajaj, & Kumaraguru, 2016). In a recent study, Kaur et al. have proposed a crowd-powered mechanism that hides the private content of data (Kaur et al., 2017). Seufert et al. have proposed a stream-based machine learning approach on YouTube that predicts, in real-time, the causes of stopping a YouTube streaming session (Seufert, Casas, Wehner, Gang, & Li, 2019).
Here, streaming videos are divided into timed frames of 1 s each and features are extracted from the frames; these features determine whether a frame contains stalling or not. When a large number of videos get uploaded, it becomes challenging to arrange and search the videos according to their categories. Dessì et al. have proposed a machine learning technique that automatically classifies educational videos into pre-defined categories through the analysis of video frames (Dessì, Fenu, Marras, & Recupero, 2019). They neither extracted new features from the uploaded video frames for better future analysis, nor ensured whether the uploaded content is suitable for the platform. Zhu et al. have proposed a deep learning model that analyzes video frames and categorizes the videos when noisy labels are associated with them (Zhu, Chen, & Wu, 2019); classifying videos whose labels contain noise is a challenging task. Another challenging task is emotion recognition from the uploaded videos. Zhu et al. have proposed a deep learning model that automatically detects emotions from the video content (Zhu et al., 2019). Schulc et al. have proposed a CNN-LSTM model that detects human behavior (like attention and non-attention during a presentation) from webcam-based video (Schulc, Cohn, Shen, & Pantic, 2019). There is a recent attempt to automatically recognize toddler-oriented inappropriate contents in YouTube videos through a deep learning framework (Papadamou et al., 2019). However, this work heavily depends on various features derived from other inappropriate contents or from past user interactions - essentially metadata of the video, such as video statistics, title, description, tags, like/dislike counts, etc. The approach achieves a classification accuracy of 82.8%. Consequently, it is not possible to make an early decision about the inappropriateness of contents without having access to such metadata. Several other works have been carried out with a particular focus on recognizing spam video responses (Benevenuto et al., 2012; Chaudhary & Sureka, 2013). However, the proposed approach significantly differs from the previous ones because of its on-demand processing capability driven by the power of the crowd. The way it automates the recognition of inappropriate portions in videos is also unique. The proposed approach pursues a pre-moderation technique in which the uploaded contents are judged by users before they get published online.

3. Motivation

There has been a significant growth in social networking and online human activities in recent years. As a result, an enormous amount of data is getting included in the online public media. Many real-life applications demand analyzing such contents in real-time before making them public. We focus on streaming videos to judge whether they satisfy the policies of the respective platforms or not. A large number of people are always engaged in different activities (e.g., video search, watching videos, etc.) online. We propose to involve these persons as users and take their opinions in deciding the appropriateness of video contents. We expect that a user can voluntarily take part in a short job (termed a micro-task) while performing his own work. So, we can post a segment of the video, covering a short time-frame, as a crowdsourced task.
Many social media platforms now have their own guidelines (although such policies do not differ substantially) to define unsafe videos. Fig. 1 shows the policy of YouTube regarding the nature of videos that may be uploaded. Considering the YouTube platform, let us
Fig. 1. Classification of unsafe videos into different categories under the policy guidelines of YouTube.
demonstrate the working principle of our proposed model. As per the latest statistics, YouTube experiences a video upload rate of about 6.7 h of content per second (on average), and more than 11.6k hours of videos are watched in each second.1 More than 350 viewers watch these videos on YouTube in each second. Mobile users also spend a long time (40 min on average) watching videos on YouTube. These facts highlight a huge potential for using the viewers as feedback providers for verifying the appropriateness of uploaded videos on demand. To be precise, we aim at the following research objectives in this paper.
• Evaluating the appropriateness of user-uploaded video contents on social media in bounded time with a working model.
• Detecting the inappropriate portions in the uploaded videos through a model that can overcome the current limitations of standard AI tools.
• Ensuring the consistency (in terms of performance) of the model across different sizes, types, and modes of viewing.

4. Proposed framework

Let us consider that a streaming video (or a complete video) V is uploaded by a user. The video V is divided into n segments (each containing a set of frames) represented as {v_1, v_2, …, v_n}, where ∪_i v_i = V and v_i ∩ v_j = ∅ for all i ≠ j. A frame and a segment are defined as follows.

Definition 4.1 (Frame). (Rangan, Venkat, & Harrick, 1991) A frame is the basic unit of a video.

Definition 4.2 (Segment). A segment is a collection of successive video frames.

The proposed framework attempts to demarcate the videos uploaded on public media either as safe or unsafe. Each video segment has a minimum running time of τ (to judge whether it is safe or unsafe). These terms are defined hereunder.

Definition 4.3 (Safe Video). A video is called safe on a public media if its contents abide by the guidelines of that public media on which it is getting uploaded.

Definition 4.4 (Unsafe Video). A video is called unsafe on a public media if its contents do not abide by the guidelines of that public media on which it is getting uploaded.

Each segment is judged by a set of users represented as W_i = {w_i1, w_i2, …, w_im_i}, for all i, where m_i denotes the number of users assigned to the ith segment.
Fig. 2. Graphical representation of the proposed model. Here, the video is divided into segments and judged by the unique users. The labels 'S' and 'U' denote the opinions 'Safe' and 'Unsafe', respectively, against the segments. The final decision is taken by combining the judgments on all the segments.
Notably, the sets of users making judgments on different segments are not necessarily distinct. The users for each segment provide their opinions in the form of 'Safe' (S) or 'Unsafe' (U), which signify whether the video segment is accepted or rejected as being safe, respectively. After receiving the responses for each segment, a decision is taken by majority voting. So, for each video segment, we derive a judgment either as 'Safe' or 'Unsafe'. The video is considered safe (appropriate) if all the segments are judged as 'Safe', otherwise not. Note that if a video segment is recognized as 'Unsafe' on ViSSa, we further ask the user to declare the reason why it is unsafe. A schematic view of the entire approach is shown in Fig. 2. The steps are detailed hereunder.

1. Segmentation of videos: The foremost step is to decompose a video into minimum time slots such that each slot contains a sufficient amount of information for making a judgment. We adopt a recent approach of Deza et al. explaining how to improve human performance in a realistic sequential visual search task (Deza, Peters, Taylor, Surana, & Eckstein, 2017). We set the (minimum) time duration τ = 6 s (inspired by an advertisement policy of YouTube2,3) for allowing the users to judge the video segments.

2. Assigning videos to the users (crowd contributors): Thereafter comes the proper assignment of video segments to the users. If a segment is judged by a single user, the outcome is not always reliable, so we assign a set of users to solve a common segment independently. Online viewers are of two types - registered (signed-in) or anonymous. The registered viewers (whose profile details are accessible) formally log into a platform, whereas anonymous viewers (whose profile details are unknown) enter a platform without logging in. We allow only the registered viewers to judge the video segments; this also lets us verify whether a user is a minor. We also track the users' profiles (for the registered ones) for a better assignment. A single user is never assigned the same segment more than once. For example, a video recorded with an American accent is more likely to be perceived correctly by a user who is a native American English speaker.

3. Judgment analysis on each segment: As the videos arrive in a streaming fashion, majority voting is initially applied to obtain the judgment on a particular segment of a video, and based on the agreement of each user's opinion with the majority, the final judgment of each segment is computed. In the current context, majority voting is a responsive social choice function between two alternatives (Kirsch, 2016). If N users vote between the two options (Safe and Unsafe), the winning option requires at least ⌊N/2⌋ + 1 votes (more than half of the votes). In the case of a tie, the final judgment is taken as 'Unsafe' to make the model sensitive toward inappropriateness of contents. Here, as the videos are segmented into sub-videos of length 6 s, majority voting is done on each sub-video under this setting. The final judgment on the overall video can then be derived from the judgments on the sub-videos. To obtain a better result, we use a weight factor, which is updated based on the performance of every user. This weight factor reflects the performance score of a user and is updated in each iteration (as soon as a decision is taken on a video segment).
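To make the segmentation and per-segment voting concrete, the following minimal sketch (our own illustration in Python, not the ViSSa implementation) marks a video with successive 6-s segments and applies the majority-voting rule with the tie-breaking behavior described in step 3. All function and variable names here are illustrative assumptions.

```python
# Minimal sketch of steps 1 and 3 above, under our own naming assumptions.

from collections import Counter
from typing import List, Tuple

TAU = 6  # minimum segment length in seconds, as chosen in the paper


def mark_segments(duration: float, tau: int = TAU) -> List[Tuple[float, float]]:
    """Mark a video of the given duration (seconds) with successive,
    non-overlapping time segments of `tau` seconds each. A trailing
    remainder shorter than `tau` is neglected, as stated in Section 5.2."""
    segments = []
    start = 0.0
    while start + tau <= duration:
        segments.append((start, start + tau))
        start += tau
    return segments


def judge_segment(opinions: List[str]) -> str:
    """Majority vote over 'Safe'/'Unsafe' opinions on one segment.
    Ties are resolved as 'Unsafe' to keep the model sensitive to
    inappropriate content."""
    counts = Counter(opinions)
    if counts["Safe"] > counts["Unsafe"]:
        return "Safe"
    return "Unsafe"


# Example: a 20-second video yields three 6-second segments; the last vote is tied.
print(mark_segments(20))                                    # [(0.0, 6.0), (6.0, 12.0), (12.0, 18.0)]
print(judge_segment(["Safe", "Safe", "Unsafe"]))            # Safe
print(judge_segment(["Safe", "Unsafe", "Safe", "Unsafe"]))  # Unsafe (tie)
```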
In the proposed model, we prefer registered users (each user i having a performance score s_i ∈ (0, 1)) to judge the segments. The winning option is k ∈ {Safe, Unsafe} if ∑_{i: o_i = k} s_i > ∑_{i: o_i ≠ k} s_i, where o_i is the option given by user i. This means the final judgment will be k if the sum of the performance scores
of the users who have opined for k is more than the sum of the performance scores of the other users. Here, the incentive of a user is nothing but the performance score reflecting his rank. After a certain period of time, users with a higher rank will get more weightage from the platform. In this situation, majority voting is applied once again, taking the said weights into account. Note that one can alternatively choose a threshold value to classify the video as 'Safe' or 'Unsafe'; relying on a high threshold value for accepting a video will, however, make the platform stricter and help avoid any malfunctioning.

4. Final decision making: The final decision is obtained from the set of collective judgments on all the video segments. If all the segments are marked as 'Safe', the entire video is acceptable to the users and hence appropriate for the platform. The decision D is defined as follows.
D = Safe, if d(v_i) = Safe for all i ∈ {1, …, n}; Unsafe, otherwise.

Here, d(v_i) represents the final decision on the video segment v_i ∈ V, and d: V → {Safe, Unsafe}. It is interesting to note that we can alternatively use a threshold on the number of acceptable segments (instead of requiring acceptance of all the segments) to decide whether a video is safe or unsafe. This relaxes the final step of the above approach.
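The score-weighted vote and the decision rule D can be summarized with a small sketch. The following Python fragment is again our own illustration (not the ViSSa code); the user identifiers and scores in the example are hypothetical, and the optional threshold parameter corresponds to the relaxed variant mentioned above.

```python
# Sketch of the score-weighted segment vote and the final decision rule D.

from typing import Dict, List


def weighted_segment_judgment(opinions: Dict[str, str],
                              scores: Dict[str, float]) -> str:
    """Weighted vote on one segment: option k wins if the sum of the
    performance scores of the users opting for k exceeds that of the
    others. Ties fall back to 'Unsafe', as in the unweighted rule."""
    safe_weight = sum(scores[u] for u, o in opinions.items() if o == "Safe")
    unsafe_weight = sum(scores[u] for u, o in opinions.items() if o == "Unsafe")
    return "Safe" if safe_weight > unsafe_weight else "Unsafe"


def final_decision(segment_judgments: List[str],
                   safe_fraction_threshold: float = 1.0) -> str:
    """Decision D over all segments. With threshold 1.0 this is the strict
    rule (every segment must be 'Safe'); lower values give the relaxed rule."""
    safe = sum(1 for j in segment_judgments if j == "Safe")
    return "Safe" if safe >= safe_fraction_threshold * len(segment_judgments) else "Unsafe"


# Example with three registered users and hypothetical performance scores.
scores = {"u1": 0.9, "u2": 0.4, "u3": 0.4}
print(weighted_segment_judgment({"u1": "Unsafe", "u2": "Safe", "u3": "Safe"}, scores))  # Unsafe
print(final_decision(["Safe", "Safe", "Unsafe"]))       # Unsafe (strict rule)
print(final_decision(["Safe", "Safe", "Unsafe"], 0.6))  # Safe (relaxed rule)
```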
5. Empirical analysis

We create a video judgment platform named ViSSa. ViSSa is an open crowdsourcing platform that allows posting of videos to be seen publicly. In this framework, the uploaded videos are not initially shared among the users. They are rather fragmented into segments so that the appropriateness of their contents can be judged by the users. These video segments are embedded within already verified videos and displayed to the viewers (users) dynamically (see Fig. 3). In this way, the uploaded videos get judged by the viewers before becoming public. We use the terminology 'sub-video' hereafter to denote a segment of an unverified video. The working principle of ViSSa is to share videos only after their content has been judged as safe with the help of the users (viewers of the platform) while they view the available videos. Note that we receive multiple feedbacks on the same sub-video from the users. Based on these, we initially derive the judgment on each sub-video by majority voting (ties are broken by arbitrary choice). Finally, the judgment on the full video is determined by combining the judgments on the sub-videos. The platform is restricted to registered users only, so that the reliability of their opinions can be tracked. The purpose is to share, in real-time, only safe videos suitable for the platform among those uploaded by different users. If any single sub-video is found unsafe, the video under test is treated as unsafe and removed from the platform. In the proposed model, different videos can have different lengths; therefore, the number of sub-videos differs across videos, but the minimum time duration of each sub-video is 6 s.

To carry out experiments on ViSSa, we initially posted 18 videos and allowed the registered users (viewers) to add more. All the 18 videos were appropriately chosen to be safe for the platform. The registered users (to be precise, 4 of them) added 20 new videos (within a timespan of about 24 min) on ViSSa. There was no restriction on the length of the uploaded videos. After getting uploaded, each video is decomposed into multiple sub-videos having a minimum time duration of 6 s. The users are allowed to watch at most 10 sub-videos (covering 1 min) at a time to judge them. Users may skip a sub-video only after viewing the first 6 s. English and Hindi are the languages of all the uploaded videos. In case a sub-video is labeled as 'Unsafe' by a user, we ask the user to select the reason of inappropriateness from the set {harmful, private, sexual, violent}. We initially carried out some cognitive experiments for designing the empirical platform, as explained below.

5.1. Cognitive experiments for platform design

To better understand the design challenges, we surveyed 45 daily viewers of YouTube, of whom 71% were willing to be users of ViSSa. The participants of this survey were chosen from academic institutes (mostly students or researchers) and belong to the age group of 25-35. They filled up a common online form to enter different options and ranges. This form was judiciously designed to avoid any kind of framing effect.
The survey aimed to answer the following cognitive research questions so that we could design an effective platform for experimentation.

1. What is the preferred length of sub-videos (embedded within another video) to be watched (screened) by the viewers?
2. At what position are the viewers comfortable watching the embedded sub-video for providing feedback?
3. At what level (multiple-choice based or detailed) are the viewers willing to provide feedback on an embedded sub-video?

On average, the respondents prefer the size of a sub-video to be approximately 140 s and a gap of roughly 37 min between the arrivals of two successive segments. To answer the second question, we asked whether the viewers prefer to watch the embedded video at the beginning, middle, or end (although this is less significant because viewers are unlikely to wait to provide feedback after seeing the video). The survey report is presented in Fig. 4. Out of the 31.11% of viewers who have an interest in judging the appropriateness of videos, 71.43% prefer to see the sub-videos at the beginning, 14.28% prefer the middle, and the rest the end. Another 40% of viewers have a weak interest (answered "May be") in judging the appropriateness of videos; out of them, 55.56% prefer to see the sub-videos at the beginning, 33.33% prefer the middle, and the rest the end. On the other hand, out of the 28.89% of viewers
Fig. 3. Overview of the proposed ViSSa platform. (a) The viewers can freely choose and watch a video from the videos already posted and marked as safe for viewing. (b) An embedded short sub-video is shown to the viewer for the purpose of collective feedback on it. The embedded sub-video can be skipped only after watching the first 6 s. (c) The viewers can submit their feedback (whether the videos are safe or unsafe) after watching a sub-video.
Fig. 4. Distribution of interest of the users in judging the appropriateness of videos (Yes, No, Maybe) and the preferable position (Beginning, Middle, End) to embed the video to be judged. The responses are collected through a survey.
who have no interest in judging the appropriateness of videos, 23.07% suggest inserting the sub-videos at the beginning, 23.07% suggest the middle, and the rest the end (the reason why this last percentage is higher is obvious). Overall, 51.11% of viewers prefer the video content at the beginning, 33.33% prefer the middle, and the rest prefer the end. Interestingly, most of the viewers are ready to watch the video at the beginning. It is also seen that interested users prefer to judge the embedded sub-videos at the beginning of the display (see Fig. 4). This motivated us to embed sub-videos at the beginning (after only 2 s) of an ongoing safe video. The less interested users (as identified through the survey) are found to prefer placing sub-videos at the end. In our survey, most of the people are found to be interested in working as users for judging the appropriateness of the embedded videos on the platform; here, we have treated those who answered "may be" as interested users. Moreover, most of the respondents (73%) were ready to provide only a binary response, not detailed feedback. Based on the above inputs, the ViSSa platform was designed and constructed.

5.2. Dataset details

To create the dataset, each uploaded video is sampled into sub-videos in succession, not randomly. These sub-videos neither overlap nor leave out any content. While sampling a video into sub-videos of length 6 s, we do not decompose the actual video into separate files; we rather mark the video with small time-segments (of 6 s each). If watched in continuation, a sub-video may extend beyond the minimum length of 6 s (if possible), and the next one starts immediately after it closes. If the last sub-video contains less than 6 s of content, it is neglected. When a user plays a video on the platform, we embed the sub-videos (to be judged) into it. They are shown only after 2 s of playing the original video. The maximum length of a sub-video to be judged is 1 min. Users may skip the video only after 6 s (covering the first segment). All the annotations from the different users are collected on all the segments of the videos to be judged on ViSSa. The dataset basically comprises these response matrices (Chatterjee & Bhattacharyya, 2015; 2017). The users are allocated different sub-videos following two separate principles; the rationale behind this is described in Section 5.4. To decide the true (gold) label of the videos, several domain experts were consulted. As per the gold labels, 7 videos are unsafe and the remaining 13 are safe in the dataset.

5.3. Demographic analysis of the users

The number of registered users (volunteer viewers) on the ViSSa platform is 56 (male = 39, female = 17) with an average age of 27. Out of these, 47 users participated in the experiments by watching sub-videos and submitting their corresponding opinions; the remaining 9 users viewed sub-videos without submitting their opinions. In total, 45 and 37 users participated in the homogeneous and heterogeneous modes of experiments, respectively, among whom 35 users are common. A majority of the users are students (either at the Masters or Doctoral level) and belong to different institutes. Out of the 56 users, the primary language of 38, 23 and 8 users is Bengali, English and Hindi, respectively. Only 3 users know both Bengali and English, and 5 users know Bengali, Hindi and English together.
Fig. 5. Identifying the unsafe sub-videos through the user allocation in (a) homogeneous mode and (b) heterogeneous mode.
5.4. Homogeneous versus heterogeneous modes of allocation

To verify whether the mode of sub-video allocation has any impact on the viewers' feedback, we carry out two parallel experiments. For this, we set up two different video allocation settings, namely the homogeneous and the heterogeneous mode. These are exemplified in Fig. 5. As can be seen from Fig. 5, in the homogeneous mode the same viewer is shown multiple segments taken successively from the same video. On the other hand, in the heterogeneous mode, a single viewer is never shown more than one segment from a single video. In both the modes, sub-videos are assigned to the viewers randomly. In the first mode of experiment (homogeneous), each user judges the segments of the same video while watching the original video. In the second mode (heterogeneous), each user judges segments of different videos. The numbers of users assigned to judge the videos in the two modes are shown in Table 1. Each sub-video is judged by at least 5 users. As the segments of a single video in the heterogeneous mode are shown to distinct users, more users are needed in this mode. From the response matrix (Chatterjee & Bhattacharyya, 2015; 2017), we obtain the final decision on each segment with judgment analysis. Once we obtain the decisions over all the segments of a single video, we can derive the final judgment.

5.5. Preliminary analysis

In total, 254 unique sub-videos are obtained from the 20 videos (to be tested) on which the users are asked to provide their opinions (Safe/Unsafe). In the homogeneous and heterogeneous modes, the total numbers of opinions obtained across all the sub-videos are 1297 and 1475, respectively. By applying majority voting, we derive judgments over the sub-videos (in both cases). Finally, the ultimate judgment is obtained through a consensus of the judgments on the sub-videos. Out of the 20 videos judged in both the modes, 8 are found to be unsafe and 12 safe.

Table 1
Predicted results obtained by combining the responses of individual users for all the uploaded videos in both the homogeneous and heterogeneous modes. The number of users (assigned randomly) who gave feedback on a video is provided in brackets, and the count of common users (who judged the same video in both modes) is also mentioned. The gold judgment is the true label of a particular video.
Video ID    Homogeneous (# Users)   Heterogeneous (# Users)   # Common users   Gold judgment
10000019    Safe (7)                Safe (10)                 1                Safe
10000020    Safe (8)                Safe (15)                 3                Safe
10000021    Unsafe (9)              Unsafe (11)               3                Safe
10000022    Unsafe (5)              Unsafe (7)                2                Unsafe
10000023    Unsafe (5)              Unsafe (11)               2                Unsafe
10000024    Safe (6)                Safe (11)                 1                Safe
10000025    Safe (9)                Safe (23)                 3                Safe
10000026    Safe (7)                Safe (12)                 3                Safe
10000027    Safe (6)                Safe (11)                 2                Safe
10000028    Safe (9)                Safe (20)                 6                Safe
10000029    Unsafe (6)              Unsafe (10)               1                Unsafe
10000030    Unsafe (6)              Unsafe (12)               2                Unsafe
10000031    Safe (5)                Safe (10)                 1                Safe
10000032    Unsafe (6)              Unsafe (12)               1                Unsafe
10000033    Safe (6)                Safe (15)                 3                Safe
10000034    Safe (8)                Safe (14)                 3                Safe
10000035    Unsafe (7)              Unsafe (14)               2                Unsafe
10000036    Safe (5)                Safe (14)                 3                Safe
10000037    Unsafe (7)              Unsafe (12)               3                Unsafe
10000038    Safe (5)                Safe (11)                 2                Safe
Accuracy    95%                     95%
Table 2
Gender-specific responses of users received over all the segments of a sample video (ID: 10000037) in both the homogeneous and heterogeneous modes. Responses are given against the question whether a particular segment is safe (or not).

Mode            Sub-video ID   Safe (Male)   Safe (Female)   Unsafe (Male)   Unsafe (Female)
Homogeneous     100000370      1             0               2               2
Homogeneous     100000376      1             0               2               2
Homogeneous     1000003712     1             0               2               2
Homogeneous     1000003718     2             0               2               1
Homogeneous     1000003724     2             0               2               1
Homogeneous     1000003730     1             0               3               1
Homogeneous     1000003736     1             0               3               1
Homogeneous     1000003742     1             0               3               1
Homogeneous     1000003748     3             0               1               1
Homogeneous     1000003754     3             0               1               1
Homogeneous     1000003760     1             0               0               0
Heterogeneous   100000370      0             0               3               2
Heterogeneous   100000376      0             0               4               2
Heterogeneous   1000003712     0             0               4               2
Heterogeneous   1000003718     0             0               3               2
Heterogeneous   1000003724     0             0               3               2
Heterogeneous   1000003730     1             0               3               1
Heterogeneous   1000003736     2             2               1               0
Heterogeneous   1000003742     2             2               1               0
Heterogeneous   1000003748     2             2               1               0
Heterogeneous   1000003754     2             2               1               0
Heterogeneous   1000003760     2             3               0               0
As can be seen from Table 1, employing the power of the crowd in both the modes results in the same accuracy of 95%. The only video (ID: 10000021) predicted wrongly by the crowd is actually a safe video. In spite of yielding similar results over the full videos in the homogeneous and heterogeneous modes, mismatches are found between the judgments over the sub-videos. As an example, the responses received against all the sub-videos of an unsafe video (ID: 10000037) are provided in Table 2 for both modes of experimentation; we report the results specific to gender. It is interesting to note that, in spite of the involvement of more users in the heterogeneous mode, the opinion diversity is less; the opinions are more stable (the calculated entropy values are lower) in the heterogeneous mode across all the sub-videos. However, the performances are comparable and consistent (as reflected through the same accuracy values) in both modes of experimentation. On the other hand, the results indicate that females are more sensitive toward accepting a content as safe: in the homogeneous mode, many of the sub-videos that are marked as Safe by male users have been marked as Unsafe by female users, although such a pattern is not visible in the heterogeneous mode due to the mixture of contents. The higher opinion diversity in the homogeneous mode (as compared to the heterogeneous one) might be attributed to a learning effect acquired by the participants. A learning effect causes adaptive bias because the human brain often infers from previous experiences in an adaptive way (Haselton, Nettle, & Murray, 2015); in this case, the learning effect might lead to judgment bias. As the users in the homogeneous mode watch multiple sub-videos from the same source, a misjudgment on one sub-video might influence the judgments on other sub-videos (from the same source). Due to such misjudgments (caused by the learning effect), the opinion diversity increases for a particular video. On the contrary, the users in the heterogeneous mode are not affected by previous misjudgments.

5.6. Viability analysis of the power of the crowd

Note that, in the crowdsourced setting of ViSSa, providing opinions on sub-videos is principally the same as solving microtasks. So, our aim is to solve 254 microtasks with the help of the users (viewers). In our experimental setup, tasks are distributed among the users in two modes - homogeneous and heterogeneous. To avoid any kind of bias in task allocation, we prevent the users from casting opinions on their own (uploaded) videos. Interestingly, the time spent by the users in the heterogeneous mode (about 239 s on average) is significantly higher than in the homogeneous mode (about 177 s on average). This indicates the interest of users toward judging a diverse set of videos. We have already seen that user opinions may disagree on some sub-videos in the different modes of experiments and yet yield the same accuracy; this is because majority voting produces a binary answer, whereas the weights of the opinions may vary. For example, the sub-video with ID 100000370 (see Table 2) gets a majority-voting judgment of 'Unsafe' in both the modes, but with different weights (4/5 in the homogeneous and 5/5 in the heterogeneous mode).
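As a concrete illustration of how such opinion diversity can be quantified, the sketch below computes the Shannon entropy of the Safe/Unsafe vote split on a sub-video. The paper reports entropy values but does not spell out the exact formulation, so this particular choice is our assumption; the example reuses the vote counts of sub-video 100000370 from Table 2.

```python
# Minimal sketch (our assumption) of per-sub-video opinion diversity as the
# Shannon entropy of the Safe/Unsafe vote split.

from math import log2


def opinion_entropy(safe_votes: int, unsafe_votes: int) -> float:
    """Shannon entropy (in bits) of the Safe/Unsafe vote distribution.
    0.0 means complete agreement; 1.0 means a perfect 50/50 split."""
    total = safe_votes + unsafe_votes
    entropy = 0.0
    for count in (safe_votes, unsafe_votes):
        if count:
            p = count / total
            entropy -= p * log2(p)
    return entropy


# Sub-video 100000370 of video 10000037 (see Table 2):
print(round(opinion_entropy(1, 4), 3))  # homogeneous mode: 1 Safe, 4 Unsafe -> 0.722
print(round(opinion_entropy(0, 5), 3))  # heterogeneous mode: 0 Safe, 5 Unsafe -> 0.0
```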
To examine how much the opinions differ between the modes of experimentation, we assign a safety score (within [0, 1]) to each of the sub-videos, where higher values indicate higher safety (appropriateness) of the sub-video as opined in aggregation. The safety score vectors (across all the sub-videos) obtained in the two modes are compared for the 20 videos. The statistical details are reported in Table 3. It can be seen from Table 3 that there is a statistically significant difference between the safety scores obtained in the two modes for some videos (12 out of 20).
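The sketch below illustrates, under our own assumptions, one way to compute such a safety score (as the fraction of 'Safe' opinions on a sub-video) and to compare the score vectors of the two modes; scipy's standard two-sample t-test stands in for the unpaired t-test used throughout this subsection, and the vote counts in the example are hypothetical.

```python
# Hedged sketch of the safety score and the unpaired t-test comparison.

from typing import List
from scipy import stats


def safety_score(safe_votes: int, unsafe_votes: int) -> float:
    """Safety score in [0, 1]: the aggregated share of 'Safe' opinions."""
    return safe_votes / (safe_votes + unsafe_votes)


def compare_modes(scores_homogeneous: List[float],
                  scores_heterogeneous: List[float]) -> float:
    """Unpaired (two-sample) t-test between the per-sub-video safety-score
    vectors of one video obtained in the two modes; returns the p-value."""
    _, p_value = stats.ttest_ind(scores_homogeneous, scores_heterogeneous)
    return p_value


# Toy example with hypothetical per-sub-video vote counts for one video.
homo = [safety_score(4, 1), safety_score(5, 0), safety_score(3, 2)]
hetero = [safety_score(1, 4), safety_score(2, 3), safety_score(0, 5)]
print(compare_modes(homo, hetero))  # a small p-value indicates a significant difference
```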
Table 3
The safety scores obtained for the sub-videos in the homogeneous and heterogeneous modes of experimentation.

Video ID    # Segments   Homogeneous Mean   Homogeneous SD   Heterogeneous Mean   Heterogeneous SD   Significant difference   p-value
10000019    10           0.9400             0.0093           0.8800               0.0107             No                       0.0811
10000020    18           0.8532             0.0289           1                    0                  Yes                      0.0019
10000021    11           0.4990             0.0637           0.6753               0.0164             No                       0.0657
10000022    6            0.2333             0.0067           0                    0                  Yes                      0.0009
10000023    8            0.5500             0.0086           0.2833               0.0384             Yes                      0.0346
10000024    11           0.8000             0                1                    0                  Yes                      < 0.0001
10000025    27           0.7543             0.0166           1                    0                  Yes                      < 0.0001
10000026    12           1                  0                0.9028               0.0104             Yes                      0.0070
10000027    12           0.8917             0.0144           1                    0                  Yes                      0.0096
10000028    20           1                  0                0.9400               0.0088             Yes                      0.0102
10000029    10           0                  0                0.1000               0.0111             Yes                      0.0150
10000030    11           0.0909             0.0109           0.4909               0.0109             Yes                      < 0.0001
10000031    9            1                  0                1                    0                  No                       NA
10000032    10           0.2000             0                0.1400               0.0360             No                       0.3434
10000033    15           1                  0                1                    0                  No                       NA
10000034    12           0.9000             0.0109           0.9333               0.0097             No                       0.1661
10000035    16           0.0208             0.0032           0.0375               0.0065             No                       0.5419
10000036    14           1                  0                0.9286               0.0099             Yes                      0.0186
10000037    11           0.3818             0.0676           0.4000               0.1840             No                       0.8628
10000038    11           0.9273             0.0102           1                    0                  Yes                      0.0379
So, the way the sub-videos are judged may vary between the modes, and this variation is not consistent across videos. Interestingly, the safety scores are observed to be higher in the heterogeneous mode, thereby establishing it as a more liberal judgment mode; however, this has no impact on the final accuracy obtained. In fact, we see that more robust judgments are obtained in the homogeneous mode. Interestingly, the variation (as highlighted by the SD values) of the opinions on unsafe videos becomes very diverse in the heterogeneous mode due to conflicts, whereas it is more or less stable in the homogeneous mode.

To further verify whether the video size has any influence on the opinions received, we tested the degree to which the safety scores differ for shorter and longer videos. As the average segment count (number of sub-videos) of the videos is 12.2, we segregated the videos into two groups: those having no more than 12 sub-videos and the rest. Upon comparing the safety vectors between the two modes in these two groups, no significant difference is observed. The mean safety scores for shorter videos (up to 12 segments) in the homogeneous and heterogeneous modes are found to be 0.601 (σ = 0.365) and 0.629 (σ = 0.381), respectively. On the other hand, the mean safety scores for longer videos (more than 12 segments) in the homogeneous and heterogeneous modes are found to be 0.771 (σ = 0.381) and 0.818 (σ = 0.384), respectively. None of these comparisons are statistically significant (unpaired t-test; p-value = 0.5565 for shorter videos and p-value = 0.4056 for longer videos). This reveals that both modes of experiments exhibit indistinguishable results irrespective of the size. Therefore, the video size has no influence on the user opinions used to decide the safety of videos.

We also tested whether the safety scores are affected by the video type (safe or unsafe). No significant difference is found in this experiment either. The mean safety scores for safe videos (based on the gold label) in the homogeneous and heterogeneous modes are found to be 0.890 (σ = 0.143) and 0.943 (σ = 0.091), respectively. On the other hand, the mean safety scores for unsafe videos (based on the gold label) in the homogeneous and heterogeneous modes are found to be 0.211 (σ = 0.120) and 0.207 (σ = 0.188), respectively. None of these comparisons are statistically significant (unpaired t-test; p-value = 0.1167 for safe videos and p-value = 0.9675 for unsafe videos). So, the video type also has no influence on the user opinions in the homogeneous and heterogeneous modes. The proposed approach consistently performs well for any arbitrary video size or type.

It has already been noted that the accuracy of the proposed model is 95% (19 out of 20 videos correctly judged). For a stronger evaluation of the model, we measure accuracy values based on consensus opinions over the individual sub-videos in both the modes (homogeneous and heterogeneous). The results are highlighted in Table 4. The overall accuracy obtained in the homogeneous setting is 93.54% (while considering the true labels of the sub-videos). The true positive rate (sensitivity) and false positive rate are found to be 93.37% and 94.03%, respectively.

Table 4
Confusion matrices obtained for (a) the homogeneous mode of experimentation on all the sub-videos, and (b) the heterogeneous mode of experimentation on all the sub-videos.
Fig. 6. Identifying the portion of inappropriateness in an unsafe video (Video ID: 10000037) by the users. A darker shade denotes that the corresponding portion is judged by the viewers as more unsafe. (a) The users watch the video in the homogeneous mode; (b) the users watch the video in the heterogeneous mode.
On the other hand, the accuracy is 90.87% in the heterogeneous setting, with a true positive rate and false positive rate of 95.41% and 77.61%, respectively. While considering the sub-video-specific accuracy, no significant difference is observed between the two modes of experimentation.

5.7. Recognizing the portions of inappropriateness

As described earlier, the ViSSa platform helps in automatically judging whether a video is safe or not. Moreover, as the opinions are taken at the bottom level (for each sub-video), it becomes inevitably useful in automatically detecting the portion of inappropriateness within a video and cropping it out, if required. For example, as shown in Table 2, the first sub-video is judged 'Unsafe' by the majority. We have studied the particular time durations throughout which the sub-videos are opined as unsafe by the majority of people. In the homogeneous mode, the majority of users have chosen the time intervals of 1-25 s and 37-49 s as 'Unsafe' (as shown in Fig. 6(a)). On the other hand, the time interval 1-37 s is chosen as 'Unsafe' in the heterogeneous mode (as shown in Fig. 6(b)). Thus, the platform can be helpful in automatically detecting the obscene portions or undesired contents within the posted videos. We analyzed the portions of inappropriateness (unsafe) in both the modes for all the videos and observed no significant difference.

Through ViSSa, we not only recognize whether a video content is suitable for the users but also identify the reason of inappropriateness. To better understand how much the reasons of inappropriateness vary among the different modes of experiments, we have plotted the share of all categories (i.e., harmful, private, sexual and violent) under the 'Unsafe' option and the 'Safe' option opined by the users on the videos. We comparatively plot the share of responses in the different categories in the homogeneous and heterogeneous modes in Fig. 7. These responses are not counted directly toward the final judgment of a video (majority voting is applied instead). As can be seen from Fig. 7, although the opinions vary, there is a resemblance of the majority of opinions by option ('Unsafe' or 'Safe') and also by category (unsafe videos categorized as harmful, private, sexual and violent). Therefore, the final results in both the modes yield similar accuracy.

5.8. Analysis of performance scores

The performance score describes the precision of users while performing the tasks. By incorporating the performance score, we are able to rank the users. The performance score also provides a systematic evaluation of the contribution of users in a crowdsourcing platform like ViSSa. Therefore, we have analyzed the performance scores of all the participating users who took part in judging the sub-videos during the empirical analysis. Computation of the performance score is carried out in three iterative steps - (i) initialization of the performance score, (ii) calculation of a reputation score for each sub-video, and (iii) revision of the performance score. Initially, a neutral score (0.5) is assigned as the performance score of a new user. Based on their successive participation and performance, we calculate a reputation score for each sub-video the user has judged. Finally, the performance score of a user is updated (by penalizing a wrong answer and crediting a right answer) using the reputation score. Suppose a user i has judged k sub-videos given by the set V_i = {V_i^1, V_i^2, …, V_i^k}.
Note that this k may vary from user to user. The reputation score of the ith user over V_i^j (i.e., the jth sub-video judged by that user) is defined as R_i^j = 1 − 1/(n + 1), where n denotes the number of other users who have given the same judgment (either Safe or Unsafe) as the ith user over V_i^j. Based on this, the performance score of the ith user after completing the jth sub-video is defined as follows.

P_i^j = ((j − 1) × P_i^{j−1} + R_i^j) / j, if the judgment of the ith user is the same as the gold judgment;
P_i^j = ((j − 1) × P_i^{j−1} − R_i^j) / j, otherwise.

Here, P_i^{j−1} denotes the performance score of the ith user after judging the (j − 1)th sub-video (i.e., before judging the jth sub-video).
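A minimal sketch of this update rule is given below; the function names are ours, the gold judgment is assumed to be known for every sub-video (as in the experiments), and the judgment history in the example is hypothetical.

```python
# Sketch of the reputation-score and performance-score update rule above.

from typing import List, Tuple


def reputation_score(n_agreeing_others: int) -> float:
    """R_i^j = 1 - 1/(n+1), where n is the number of other users who gave
    the same judgment as user i on the j-th sub-video."""
    return 1.0 - 1.0 / (n_agreeing_others + 1)


def update_performance(history: List[Tuple[bool, int]],
                       initial_score: float = 0.5) -> float:
    """Iteratively apply the update rule over a user's judgment history.
    Each history item is (matched_gold, n_agreeing_others); the score
    starts at the neutral value 0.5."""
    score = initial_score
    for j, (matched_gold, n_others) in enumerate(history, start=1):
        r = reputation_score(n_others)
        if matched_gold:
            score = ((j - 1) * score + r) / j
        else:
            score = ((j - 1) * score - r) / j
    return score


# A user who agrees with the gold label on two sub-videos (with 4 and 2 other
# agreeing users) and then misjudges one sub-video alone.
print(round(update_performance([(True, 4), (True, 2), (False, 0)]), 3))
```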
We calculate the performance scores of all the users across all the sub-videos. Note that performance scores are available for a user only for those sub-videos that the user has judged. For a better realization of the effect of performance scores on users, we plot the scores of 10 users, across the different sub-videos judged by them, in Fig. 8. Out of these 10, 5 users have the highest average performance score and the remaining 5 have the lowest. A higher average performance score reflects the superiority of a user. It is observed that 84% and 94% of the users have attained a score higher than the initial performance score of 0.5 in the homogeneous and heterogeneous modes, respectively. As the performance score reflects the quality of users, we can prioritize
Fig. 7. The share of different ‘Unsafe’ (harmful, private, sexual and violent) and ‘Safe’ options opined by the users in the (a) homogeneous mode and (b) heterogeneous mode.
their contribution for obtaining a better judgment.

5.9. Behavioral analysis of users

Let us now study the users' interest toward the homogeneous and heterogeneous modes of experiments. With this goal, we compare the time spent by the users in watching sub-videos in these two modes. Note that the users participating in the homogeneous and heterogeneous modes are not necessarily the same. Therefore, to avoid any kind of bias, we consider only those users who participated in both modes of experiments. We prepare a comparative plot of the times spent by the 35 common users in the two modes. Fig. 9 represents the time spent by the users in the two different modes in decreasing order. The mean watching times of sub-videos in the homogeneous and heterogeneous modes are found to be 176.86 s (range 36-408 s) and 239.18 s (range 102-432 s), respectively. As can be seen from Fig. 9, the users are more interested (statistically significantly so) in the heterogeneous mode, possibly because it includes a greater variety of videos than the homogeneous case. To delve deeper into the time spent (while watching the sub-videos) in the homogeneous and
Fig. 8. The best 5 users having the highest average performance scores (left) and worst 5 users having the least average performance scores (right) based on the judgment of videos in (a) homogeneous mode and (b) heterogeneous mode.
heterogeneous modes, we prepare an alluvial diagram to understand the users' interest in detail. An alluvial diagram represents the connection between different clusters of objects in two different settings. Here, we assign different colors to distinguish the users and connect the times spent in the two different modes side by side in the alluvial diagram shown in Fig. 10. As can be seen from this figure, only 8 users spent more than 200 s in the homogeneous mode, whereas 23 users did so in the heterogeneous mode. This may be because of the variety of sub-videos in the heterogeneous mode. However, we also notice that the times spent by the users in the two modes are not influenced by each other (the time spent in the homogeneous mode is not necessarily on par with the time spent in the heterogeneous mode). Based on the total time spent on sub-videos, we primarily hypothesized (from Figs. 9 and 10) that users are more interested in heterogeneous videos because of their variety. However, the supporting analyses do not fully reveal the curiosity of users toward the heterogeneous mode. To better understand the interest of users toward the sub-videos, we plot the histogram of the number of users
Fig. 9. The time spent (in seconds) by the 35 users, who took part in both the homogeneous and heterogeneous modes of experimentation, plotted in decreasing order.
Fig. 10. Watching time of sub-videos by the users in the homogeneous and heterogeneous mode. Here, a single color represents a unique user. The labels along the vertical axis denote the time (in sec). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
with respect to the count of sub-videos watched. Interestingly, we observe (see Fig. 11) that more users prefer to watch a smaller number of sub-videos in the homogeneous mode, whereas more users prefer to watch a larger number of sub-videos in the heterogeneous mode. This is probably because the variety of content encourages the users to watch more sub-videos, thereby keeping them engaged in the environment.
Fig. 11. Histogram showing the number of users against the number of sub-videos they judged during our experiment.
6. Deployment and follow-up analysis

To deploy the system, we involved the users by invitation. To comprehend the sheer scale of the problem, we estimate the ratio between the number of users required and the amount of video to be judged. As highlighted earlier (see Section 3), the number of active users on YouTube is more than 350 per second, and 6.7 h of video get uploaded within this time frame.¹ Hence, the ratio between the number of users and the number of sub-video segments (of length 6 s) to be judged becomes 350:4020 ≈ 1:11 (a short arithmetic check is sketched below). On ViSSa, we employ 47 users to judge 264 sub-video segments (of length 6 s), resulting in a ratio of 47:264 ≈ 1:6. Hence, the deployment is very much on par with the scale of the real scenario.

Since a majority (two-thirds) of adult social media users are within the age group of 18–34 and this young generation is open-minded, we choose users accordingly. Upon registering on the ViSSa platform, users are sent detailed emails with the necessary guidelines to visit the platform and judge the uploaded videos. The platform details were described in the mail. The users are also allowed to upload new videos. The criterion of inappropriateness of a video was described on the platform. The interested participants registered themselves by providing their details (such as age, mother tongue, etc.). Initially, the participants are requested to judge segments of videos in the homogeneous mode and, at a later time, in the heterogeneous mode. Notably, the choice of 6 s video segments is guided by the advertising scheme of YouTube introduced in 2016 that includes video advertisements of 6 s.²,³

To dig deeper into the user experience, we carried out a focus group interview with some users who participated in the deployment phase. In total, 14 users volunteered to take part in this. Regarding the nature of content, users appear to disfavor videos pertaining to child abuse, domestic violence, and sexual content. However, some of them insisted that such videos can be shown if the goal is a noble one. A user expressed: "... it might be okay if it has some social message associated within the content." [Male, 26]
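Returning to the scale estimate at the start of this section, the following sketch reproduces the arithmetic behind the two ratios; its only inputs are the figures quoted above (350 active users per second, 6.7 h of uploads per second, 6 s segments, and the 47 users and 264 segments on ViSSa).

# Back-of-the-envelope check of the user-to-segment ratios quoted above.
users_per_second = 350                 # active YouTube users per second
upload_seconds = 6.7 * 3600            # 6.7 hours of video uploaded per second
segment_length = 6                     # seconds per sub-video segment

segments_per_second = upload_seconds / segment_length
print(f"segments per second   : {segments_per_second:.0f}")                       # ~4020
print(f"YouTube-scale ratio   : 1:{segments_per_second / users_per_second:.1f}")  # ~1:11.5
print(f"ViSSa deployment ratio: 1:{264 / 47:.1f}")                                # ~1:5.6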
When asked about the emotional impact of judging inappropriate videos, a user pointed out psychological problems like tension and anxiety. In general, however, users did not feel disturbed by this. Interestingly, one user felt anger while watching the inappropriate content; however, there was a feel-good factor too. The user stated: "Feel good to make them out from internet" [Male, 31]
To better understand the effect of environmental factors on participation, we ask users to use ViSSa in their professional space.
1. https://www.youtube.com/yt/about/press (Accessed in November, 2018).
2. https://www.youtube.com/yt/about/press (Accessed in November, 2018).
3. https://www.thedrum.com/opinion/2018/12/03/small-mighty-making-sense-6-second-video
Most of them were unwilling to do so. However, some of them revealed that it is the degree of inappropriateness that influences such decisions. This highlights that ViSSa must seek permission before showing an embedded video; however, doing so may reduce participation significantly. It is interesting to note that judging sub-videos for a long time is not an obligation on ViSSa: a user may skip an embedded video after 6 s. As many users were nevertheless engaged for longer times (see Fig. 9), we asked about the motivation behind their engagement. They were basically happy because they were serving as filters to the Internet. One user was so concerned that she stated: "To judge the appropriateness of a video is important as they may not be deemed proper for streaming on social media where even a child can access it." [Female, 31]

As stated by most of the users, the deployment was fair in general. Among the points raised were the simplicity of the mechanism (positive), the absence of any algorithm in the opinion-seeking process (positive), and the partial view of the videos (negative). Notably, a user pointed out that the sub-video recommendation process might include some bias. "The experiment was not very fair because the level of appropriateness varies from person to person. There is no set benchmark as such. So, the final answer may not reflect the proper appropriateness of a video." [Female, 31]

To gauge the confidence of users, we asked whether they felt they had sufficient expertise. Most of the users were confident about judging videos on ViSSa. Going beyond the question of expertise, a user pointed to the issue of ethical responsibility in this regard: "Yes, cause i have some morals to stand upon" [Male, 21]

We essentially deploy a platform to identify unsuitable video content; the users, however, are found to welcome this with a positive feeling. When asked whether they prefer to say "I have judged the appropriateness of videos" or "I have judged the inappropriateness of videos" (with no framing effect), a majority of the users (two-thirds) chose the former statement. A user also added: "Would say judged the appropriateness of videos, termed any content as inappropriate only if found it objectionable by viewing the part of the content else always searching for some positiveness in the content." [Male, 26]
7. Discussion

The proposed model is a good fit for social media platforms like YouTube, Facebook, etc., because videos can be checked (against the platforms' safety policies) before being made public. The model can also be utilized for detecting abnormal activities in streaming videos received directly from CCTVs in an area under surveillance. The approach is also applicable to measuring the appropriateness of images and audio (as per the policies) when they get uploaded to other online social media. Quantifying the appropriateness of videos might also be useful in determining their suitability for different viewerships. Unlike the existing approaches, where videos are blocked with a delay after receiving users' complaints, the proposed model provides on-demand support.

Flagged videos are occasionally marked on social media; in such cases, a flag is raised by the viewers against videos that are inappropriate for the platform. Duplicate content, or similar frames across two videos, can also be treated as flagged content. Flagged content is mainly judged by YouTube staff (experts), although viewers may also judge it by clicking the report button and providing a proper justification. For flagged videos, the copyrighted content can be muted or even removed. In our approach, by contrast, the entire video is judged before a decision is taken on whether it should be made public. Our model works in real time and also points out the region of inappropriateness.

After carrying out an in-depth analysis, we observe that among the safe videos (according to the gold judgment set by the experts), only one is considered unsafe for the platform (ViSSa) by the users. The reason may be that the last segment did not carry any message to the user: as the video duration was 66 s and a user can judge a sub-video of length at most 60 s, there remained a 6 s segment with no meaningful content. This suggests that such a framework should present only meaningful sub-videos for judgment. It is interesting to note that the imbalance of Safe and Unsafe videos that we collectively consider in the current analysis does not play any role in the final decision making. This is because the focus of workers in such crowdsourcing environments is only on the microtasks (here, the labeling of sub-videos); hence, the crowd contributions are independent of each other.

By conducting a follow-up focus group survey of the ViSSa users (close to 50), we have obtained a few important insights. Many respondents appreciated the framework's simplicity. Some users' comments highlight their deep level of engagement, for example, "User can report for unsafe videos there is no need for extra buttons for it" (reflecting the users' interest) and "embed video in such a way so that it cannot be skipped" (showing the users' sense of responsibility). However, many of them wanted to restrict the number of sub-videos to no more than one; given that it is a volunteer job, the reason is understandable. On the other hand, we have perceived that the users do not prefer to judge certain types of videos (impersonated, private, sexual, etc.) on the ViSSa platform. Finally, remuneration is found to be a major concern for many of the users (about 60% demanded it).
8. Conclusion

In the proposed model, majority voting plays a major role in computing the final judgment. But since majority voting treats all the users as equally expert, and hence weights their opinions equally, their relative accuracies are not taken into account. Clearly, the accuracy of a user is neither constant over different time frames nor the same as that of other users within a single time frame. Therefore, the accuracy of a user can be computed after certain time intervals to employ a weighted majority voting scheme. Moreover, we can consider the bias of users toward a particular type of video. Here, bias means an inclination (due to the varying perception and conservativeness of viewers) toward the 'Safe' or 'Unsafe' option. To identify these characteristics, some videos (as test cases) with known ground truth labels can be included, and the bias can be determined based on the responses to them. However, there remain some ethical challenges, such as managing the risks involved in having contact with potentially harmful media and their impact on the relationship with the users.

Recent features like live streaming of videos on various online social media (e.g., Facebook Live) add a new dimension to the problem addressed in this paper. Judging streaming videos in real time is itself a challenging task for the users. However, there are several explicit factors that influence the performance. These include the selection of appropriate users, deciding the number of users (and hence the number of opinions) required for each sub-video (Carvalho, Dimitrov, & Larson, 2016), fixing the duration of each sub-video, the response time of each user, fixing the performance score of cold-start users, etc. This calls for contributions in different directions of future research, such as recommendation systems and crowd computing. The attributes of a video and the demography of the assigned users also play a major role in evaluating the judgments; the same video may not be equally appropriate for a pair of users with different demographics. It is also challenging to find ways to segment a video into overlapping segments to reduce misinterpretation at the time of judgment. Estimating the quality of crowd contributions is always a demanding task (Lyu, Ouyang, Shen, & Cheng, 2019). User reputation is another important concern in crowdsourcing (Horton & Golden, 2015; Jøsang, Ismail, & Boyd, 2007). In non-paid crowdsourcing markets, reputed users generally produce high-quality work. Identifying the reputed users and motivating them to participate in the tasks is another challenging job. One may think about using a reputation score for the users to motivate them.

Declaration of Competing Interest

SKM, BS and MB have conceptualized the idea and the model; SKM has implemented the model; SKM, BS and SC have carried out formal analysis and further investigation; MB has supervised the overall progress; SKM, BS and SC have written the first draft; all the authors have read and approved the final version of the manuscript for submission.

Acknowledgments

The authors are thankful to the anonymous reviewers for their valuable comments that greatly helped to improve the paper. Additionally, the authors would like to thank the crowd contributors involved in this work.

References

Aggarwal, N., Agrawal, S., & Sureka, A. (2014). Mining YouTube metadata for detecting privacy invading harassment and misdemeanor videos. Proceedings of the 12th annual international conference on privacy, security and trust. IEEE, 84–93.
Amer-Yahia, S., & Roy, S. B. (2016). Toward worker-centric crowdsourcing. IEEE Data Engineering Bulletin, 39(4), 3–13.
Benevenuto, F., Magno, G., Rodrigues, T., & Almeida, V. (2010). Detecting spammers on Twitter. Proceedings of the collaboration, electronic messaging, anti-abuse and spam conference (CEAS). Redmond, Washington, USA.
Benevenuto, F., Rodrigues, T., Veloso, A., Almeida, J., Gonçalves, M., & Almeida, V. (2012). Practical detection of spammers and content promoters in online video sharing systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(3), 688–701.
Biersack, E., Callegari, C., & Matijasevic, M. (2013). Data traffic monitoring and analysis. Lecture notes in computer science, 7754. Springer.
Carvalho, A., Dimitrov, S., & Larson, K. (2016). How many crowdsourced workers should a requester hire? Annals of Mathematics and Artificial Intelligence, 78(1), 45–72.
Chakraborty, M., Pal, S., Pramanik, R., & Chowdary, C. R. (2016). Recent developments in social spam detection and combating techniques: A survey. Information Processing & Management, 52(6), 1053–1073.
Chatterjee, S., & Bhattacharyya, M. (2015). A biclustering approach for crowd judgment analysis. Proceedings of the 2nd ACM IKDD conference on data sciences. ACM, 118–119.
Chatterjee, S., & Bhattacharyya, M. (2017). Judgment analysis of crowdsourced opinions using biclustering. Information Sciences, 375, 138–154.
Chaudhary, V., & Sureka, A. (2013). Contextual feature based one-class classifier approach for detecting video response spam on YouTube. Proceedings of the 11th annual international conference on privacy, security and trust (PST). IEEE, 195–204.
Chen, C., Meng, X., Zhao, S., & Fjeld, M. (2017). ReTool: Interactive microtask and workflow design through demonstration. Proceedings of the CHI conference on human factors in computing systems. Denver, Colorado, USA: ACM, 3551–3556.
Dessì, D., Fenu, G., Marras, M., & Recupero, D. R. (2019). Bridging learning analytics and cognitive computing for big data classification in micro-learning video collections. Computers in Human Behavior, 92, 468–477.
Deza, A., Peters, J. R., Taylor, G. S., Surana, A., & Eckstein, M. P. (2017). Attention allocation aid for visual search. Proceedings of the CHI conference on human factors in computing systems. ACM, 220–231.
Doan, A., Ramakrishnan, R., & Halevy, A. Y. (2011). Crowdsourcing systems on the world-wide web. Communications of the ACM, 54(4), 86–96.
Elmalech, A., Sarne, D., David, E., & Hajaj, C. (2016). Extending workers' attention span through dummy events. Proceedings of the 4th AAAI conference on human computation and crowdsourcing. Austin, TX, USA: AAAI, 42–51.
Endeshaw, T., Garcia, J., & Jakobsson, A. (2008). Classification of indecent videos by low complexity repetitive motion detection. Proceedings of the 37th IEEE applied imagery pattern recognition workshop (AIPR'08). IEEE, 1–7.
Galasso, F., Keuper, M., Brox, T., & Schiele, B. (2014). Spectral graph reduction for efficient image and streaming video segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 49–56.
Garcia, J. (2015). Examining the performance for forensic detection of rare videos under time constraints. Proceedings of the 12th international joint conference on e-business and telecommunications (ICETE). IEEE, 419–426.
Gonen, R., Raban, D., Brady, C., & Mazor, M. (2014). Increased efficiency through pricing in online labor markets. Journal of Electronic Commerce Research, 15(1), 58.
Hanafusa, H., Morita, K., Fuketa, M., & Aoe, J. (2011). A method of extracting malicious expressions in bulletin board systems by using context analysis. Information Processing & Management, 47(3), 323–335.
Hasan, M., Orgun, M. A., & Schwitter, R. (2019). Real-time event detection from the Twitter data stream using the TwitterNews+ framework. Information Processing & Management, 56(3), 1146–1165.
Haselton, M. G., Nettle, D., & Murray, D. R. (2015). The evolution of cognitive bias. The Handbook of Evolutionary Psychology, 1–20.
Heymann, P., Koutrika, G., & Garcia-Molina, H. (2007). Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Computing, 11(6), 36–45.
Ho, C. J., & Vaughan, J. W. (2012). Online task assignment in crowdsourcing markets. Proceedings of the 26th AAAI conference on artificial intelligence. Toronto, Ontario, Canada: AAAI, 45–51.
Horton, J., & Golden, J. (2015). Reputation inflation: Evidence from an online labor market. Workshop paper, 1, 1–31.
Ikeda, K., Morishima, A., Rahman, H., Roy, S. B., Thirumuruganathan, S., Amer-Yahia, S., et al. (2016). Collaborative crowdsourcing with Crowd4U. Proceedings of the VLDB Endowment, 9(13), 1497–1500.
Jansohn, C., Ulges, A., & Breuel, T. M. (2009). Detecting pornographic video content by combining image features with motion information. Proceedings of the 17th ACM international conference on multimedia. ACM, 601–604.
Jøsang, A., Ismail, R., & Boyd, C. (2007). A survey of trust and reputation systems for online service provision. Decision Support Systems, 43(2), 618–644.
Jung, S., Youn, J., & Sull, S. (2014). A real-time system for detecting indecent videos based on spatiotemporal patterns. IEEE Transactions on Consumer Electronics, 60(4), 696–701.
Kaur, H., Gordon, M., Yang, Y., Bigham, J. P., Teevan, J., Kamar, E., et al. (2017). CrowdMask: Using crowds to preserve privacy in crowd-powered systems via progressive filtering. Proceedings of the 5th AAAI conference on human computation (HCOMP). AAAI.
Kaushal, R., Saha, S., Bajaj, P., & Kumaraguru, P. (2016). KidsTube: Detection, characterization and analysis of child unsafe content & promoters on YouTube. Proceedings of the 14th annual conference on privacy, security and trust (PST). IEEE, 157–164.
Kietzmann, J. H., Hermkens, K., McCarthy, I. P., & Silvestre, B. S. (2011). Social media? Get serious! Understanding the functional building blocks of social media. Business Horizons, 54(3), 241–251.
Kirsch, W. (2016). A mathematical view on voting and power. European Mathematical Society, 251–279.
Lee, S., Ha, T., Lee, D., & Kim, J. H. (2018). Understanding the majority opinion formation process in online environments: An exploratory approach to Facebook. Information Processing & Management, 54(6), 1115–1128.
Liu, C., & Wang, Y. (2012). True Label + Confusions: A spectrum of probabilistic models in analyzing multiple ratings. Proceedings of the 29th international conference on machine learning, 17–24.
Loni, B., Larson, M., Bozzon, A., & Gottlieb, L. (2013). Crowdsourcing for social multimedia at MediaEval 2013: Challenges, data set, and evaluation. Proceedings of the MediaEval workshop, Vol. 1043. Barcelona, Spain.
Lyu, S., Ouyang, W., Shen, H., & Cheng, X. (2019). Learning representations for quality estimation of crowdsourced submissions. Information Processing & Management, 56(4), 1484–1493.
Mridha, S. K., Sarkar, B., Chatterjee, S., & Bhattacharyya, M. (2017). Identifying unsafe videos on online public media using real-time crowdsourcing. Proceedings of the 5th AAAI conference on human computation and crowdsourcing (HCOMP WIP track). Quebec City, Canada. arXiv:1708.09654.
Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Foundations and Trends® in Theoretical Computer Science, 1(2), 117–236.
Neto, F. R. A., & Santos, C. A. S. (2018). Understanding crowdsourcing projects: A systematic review of tendencies, workflow, and quality management. Information Processing & Management, 54(4), 490–506.
Ochoa, V. M. T., Yayilgan, S. Y., & Cheikh, F. A. (2012). Adult video content detection using machine learning techniques. Proceedings of the 8th international conference on signal image technology and internet based systems (SITIS). IEEE, 967–974.
Papadamou, K., Papasavva, A., Zannettou, S., Blackburn, J., Kourtellis, N., Leontiadis, I., et al. (2019). Disturbed YouTube for kids: Characterizing and detecting disturbing content on YouTube. arXiv:1901.07046.
Rahman, D. (2012). But who will monitor the monitor? The American Economic Review, 102(6), 2767–2797.
Rangan, P. V., & Vin, H. M. (1991). Designing file systems for digital video and audio. Vol. 25. ACM.
Roberts, S. T. (2016). Commercial content moderation: Digital laborers' dirty work.
Schulc, A., Cohn, J. F., Shen, J., & Pantic, M. (2019). Automatic measurement of visual attention to video content using deep learning. Proceedings of the 16th international conference on machine vision applications (MVA). IEEE, 1–6.
Seufert, M., Casas, P., Wehner, N., Gang, L., & Li, K. (2019). Stream-based machine learning for real-time QoE analysis of encrypted video streaming traffic. Proceedings of the 22nd conference on innovation in clouds, internet and networks and workshops (ICIN). IEEE, 76–81.
Shao, L., Cai, Z., Liu, L., & Lu, K. (2017). Performance evaluation of deep feature learning for RGB-D image/video classification. Information Sciences, 385, 266–283.
Sureka, A. (2011). Mining user comment activity for detecting forum spammers in YouTube. Proceedings of the 1st international workshop on usage analysis and the web of data. Hyderabad, India.
Vashistha, A., Sethi, P., & Anderson, R. (2017). Respeak: A voice-based, crowd-powered speech transcription system. Proceedings of the CHI conference on human factors in computing systems. Denver, Colorado, USA: ACM, 1855–1866.
Vondrick, C., Patterson, D., & Ramanan, D. (2013). Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision, 101(1), 184–204.
Wever, B. D., Keer, H. V., Schellens, T., & Valcke, M. (2007). Applying multilevel modelling to content analysis data: Methodological issues in the study of role assignment in asynchronous discussion groups. Learning and Instruction, 17(4), 436–447.
Yeung, M., Yeo, B., & Liu, B. (1998). Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding, 71(1), 94–109.
Zheng, X., Zeng, Z., Chen, Z., Yu, Y., & Rong, C. (2015). Detecting spammers on social networks. Neurocomputing, 159, 27–34.
Zhu, Y., Chen, Z., & Wu, F. (2019a). Multimodal deep denoise framework for affective video content analysis. Proceedings of the 27th ACM international conference on multimedia. ACM, 130–138.
Zhu, Y., Tong, M., Jiang, Z., Zhong, S., Tian, Q., et al. (2019). Hybrid feature-based analysis of video's affective content using protagonist detection. Expert Systems with Applications, 128, 316–326.