
Closed-loop Big Data Analysis with Visualization and Scalable Computing ✩


Guangchen Ruan a, Hui Zhang b

a Indiana University, Bloomington, IN, United States
b University of Louisville, Louisville, KY, United States

Article history: Received 17 April 2016; received in revised form 12 November 2016; accepted 2 January 2017; available online xxxx.

Abstract

Many scientific investigations require data-intensive research in which big data are collected and analyzed. To get big insights from big data, we need to first develop initial hypotheses from the data and then test and validate those hypotheses about the data. Visualization is often considered a good means to suggest hypotheses from a given dataset. Computational algorithms, coupled with scalable computing, can perform hypothesis testing with big data. Furthermore, interactive visual interfaces allow domain experts to directly interact with data and participate in the loop to refine their research questions and redirect their research directions. In this paper we discuss a framework that integrates information visualization, scalable computing, and user interfaces to explore large-scale multi-modal data streams. Discovering new knowledge from such data requires the means to exploratively analyze datasets of this scale—allowing us to freely “wander” around the data and make discoveries by combining bottom-up pattern discovery and top-down human knowledge to leverage the power of the human perceptual system. We start with a novel interactive temporal data mining method that allows us to discover reliable sequential patterns and precise timing information in multivariate time series. We then proceed to a parallelized solution that can extract reliable patterns from large-scale time series using iterative MapReduce tasks. Our work exploits visual-based information technologies to allow scientists to interactively explore, visualize, and make sense of their data; for example, the parallel mining algorithm running on HPC resources is accessible to users through asynchronous web services. In this way, scientists can examine intermediate results and propose new rounds of analysis to extract more scientifically meaningful and statistically reliable patterns, so that statistical computing and visualization bootstrap each other. Furthermore, visual interfaces in the framework allow scientists to directly participate in the loop and redirect the direction of the analysis. All of these combine into an effective and efficient way to perform closed-loop big data analysis with visualization and scalable computing.
Published by Elsevier Inc.


1. Introduction


A recent trend in many scientific investigations is to conduct data-intensive research by collecting large amounts of high-density, high-quality data [14,18,38,40,36]. These data, such as text, video, audio, images, RFID, and motion tracking, are usually multi-faceted, dynamic, and extremely large in size, and are increasingly made publicly accessible for the purposes of continued and deeper data analysis. Indeed, data-driven discovery has already happened in various research fields, such as earth sciences, medical sciences, biology, and physics, to name a few.

✩ This article belongs to HPC Tutorial for Big Data.
E-mail address: [email protected] (H. Zhang).
http://dx.doi.org/10.1016/j.bdr.2017.01.002

Top-down human knowledge plays an important role in knowledge discovery, and getting big insights from big data is no exception. We need to first develop initial hypotheses from the data, and visualization is considered a good means to suggest initial hypotheses from a given dataset—very often a small one. We next need to test and validate our hypotheses about the data, and make discoveries by combining bottom-up pattern discovery and top-down human knowledge to leverage the power of the human perceptual system. Computational algorithms, coupled with scalable computing, can perform hypothesis testing with much larger data sets. Discovering new knowledge from the data also requires the means to freely “wander” around the data, and to refine and retest our hypotheses. Interactive visual interfaces allow us to directly interact with the data and participate in the analysis loop, refining our research questions and redirecting our research directions over multiple iterations.


Fig. 1. (a)→(b): Using multi-modal sensing systems to collect and analyze fine-grained behavioral data, including motion tracking data, eye tracking data, video, and audio data. A family of studies using this research paradigm is conducted to collect multi-stream multi-modal data and convert them into multi-streaming time series for further data analysis and knowledge discovery. (c)–(g): Integrating ROI event streams by overlaying the pictures that record individual ROI streams. (c) Robot face-looking event. (d) Human face-looking event. (e) Face-to-face coordination: (Robot face-looking) XOR (Human face-looking). (f) Human eye–hand coordination: (Human eye gaze) XOR (Human hand movement). (g) Human–robot joint attention: (Human eye–hand coordination) XOR (Face-to-face coordination). (h) Six instances of momentary interactive behavior highlighted in the AOI streams. This sequential pattern starts with the situation that the robot agent and the human teacher attend to different objects (①); the human teacher then checks the robot agent’s gaze (②), follows the robot agent’s attention to the same object (③), and finally reaches for that object (④).

Thus one goal in this research is to develop, use, and share tools that enable researchers to find new patterns and gain new knowledge from such data. But how can we discover new and meaningful patterns if we do not know what we are looking for? Although standard statistics or data mining algorithms may find a subset of these meaningful patterns, they may miss a great deal more. We suggest here that visualization and visual mining constitute a powerful approach for new pattern discovery. Discovering new knowledge requires the ability to detect unknown, surprising, novel, and unexpected patterns. To achieve this goal, our proposed solution is a visualization system that allows us to easily spot interesting patterns through both our visual perceptual system and our domain knowledge. Consequently, the exploratory process should be highly iterative and interactive—visualizing not only raw data, but also the intermediate results of current statistical computations for further analysis. In this way, computational algorithms and visualization can bootstrap each other: informative visualization based on new results leads to the discovery of more complex patterns, which can in turn be visualized, leading to more findings. Human experts play a critical role in this human-in-the-loop knowledge discovery by applying statistical techniques to the data, examining visualization results, and deciding and directing the research focus based on their theoretical knowledge. In this way, domain knowledge, computational power, and information visualization techniques can be integrated to understand massive datasets.

2. A motivating use case—developing top-down knowledge hypotheses from visual analysis of multi-modal data streams

Analyzing fine-grained behavioral data in the format of multiple time series (see, e.g., Fig. 1(a)) is the motivating use case in our work. Interacting embodied agents, be they groups of people engaged in a coordinated task, autonomous robots acting in an environment, or avatars on a computer screen interacting with a human user, must seamlessly coordinate their actions to achieve a collaborative goal. The pursuit of a shared goal requires mutual recognition of the goal, appropriate sequencing and coordination of each agent’s behavior with others, and making predictions from and about the likely behavior of others. Such interaction is multi-modal, as we interact with each other and with intelligent artificial agents through multiple communication channels, including looking, speaking, touching, feeling, and pointing. To gain insight into how interaction happens across multiple channels, we first need a way to “look” at the data. Information visualization can be effectively exploited to explore moments-of-interest in multi-modal data streams (see, e.g., [37]). Computational algorithms such as data mining and sequential pattern mining (see, e.g., [21,25,28,29,24,39,16]) address requirements similar to ours. While we of course exploit many techniques of information visualization and interactive data mining that have been widely used in other interfaces, many of the problems we encounter are fairly unique, and thus require customized hybrid approaches.

Color-based representation of temporal events. Multi-modal data streams are first converted to multi-streaming temporal events with categorical type values. We represent an event e as a rectangular bar, assigning a distinct color key to the event type (i.e., e.t), with the length corresponding to the event’s duration (i.e., e.d). The visual display of temporal events in a sequence allows us to examine how frequently each temporal event happens over time, how long each instance of an event takes, how one event relates to other events, and whether an event appears more or less periodically or shows other trends over time. The colored-bar representation in principle is sufficient to allow the viewer to examine global temporal patterns by scanning through the event data stream. In practice, however, many interesting temporal pattern discoveries are based on user-defined AOIs, so distinct event types often correspond to one AOI in the analytics. We illustrate this point with an example in which multi-stream behavioral data in real-time multi-modal communication between autonomous agents (humans or robots) are recorded and analyzed. As shown in Fig. 1(a), a human teacher demonstrated how to name a set of shapes to a robot learner that can demonstrate different kinds of social skills and perform actions in the study. Multi-modal interactions between the two agents (speaking, eye contact, pointing, gazing, and hand movement) are monitored and recorded in real time. Fig. 1(b) visualizes the three processed streams derived from the raw action data of both the human and the robot. The first is the AOI stream from the robot agent’s eye gaze, indicating which object the robot agent attends to (e.g., gazing at one of the three virtual objects or looking straight toward the human teacher). The second is the AOI stream (three objects and the robot agent’s face) from the human teacher’s gaze, and the third encodes which object the human teacher is manipulating and moving. Most often, visual attention from both agents was concentrated either on each other’s face (represented as a yellow bar, as shown in Fig. 1(c) and Fig. 1(d)) or on the three objects in the work space (represented in three colors, e.g., red, blue, and green, respectively). Color-based representation of AOIs in temporal events allows us to examine how visual attention happens over time, how long each instance of an event takes, and how one event relates to another.

Generating logical conjunctions. We can then connect these patterns back to top-down research questions and hypotheses. For example, our color-based representation also allows an end-user to overlay data streams by dragging a rendered data stream on top of another. Our system performs a binary XOR operation when two streams are integrated. By XORing the pixel values of two Regions of Interest (ROI) event streams, we can integrate two sensory events to find the joint attention over these two sensory channels (XOR produces true if the two pixels being integrated are of different colors, and false if they are of the same color). For example, in Fig. 1(e) we can overlay the newly derived robot face-looking stream (see, e.g., Fig. 1(c)) and the human face-looking stream (see, e.g., Fig. 1(d)) to identify the human–robot face-to-face coordination moments. Similarly, in Fig. 1(f) we can integrate the human eye gaze ROI stream and the human hand movement ROI stream to obtain a new eye–hand coordination stream, which represents two concurrent events: (1) the human gazes at an object; and (2) the human manipulates the same object with the hand. Finally, in Fig. 1(g) we can integrate the human–robot joint attention event stream by overlaying the eye–hand coordination stream (i.e., Fig. 1(f)) and the face-to-face coordination stream (i.e., Fig. 1(e)). In this way, the color-based event visualization scheme is able to represent various logical conjunctions of those events.
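To make the overlay operation concrete, the following minimal Python sketch illustrates the XOR integration of two ROI streams. The array encoding and function name are our own illustrative assumptions, not the system’s actual implementation:

```python
import numpy as np

# Sketch of the stream-overlay (XOR) operation described above. Each ROI stream
# is a 1-D array of categorical color keys, one entry per time bin; the values
# and names are illustrative assumptions (1..3 = objects, 4 = partner's face).
robot_gaze = np.array([1, 1, 4, 4, 4, 2, 2, 1])
human_gaze = np.array([1, 1, 1, 4, 4, 2, 3, 1])

def overlay_xor(stream_a, stream_b):
    """Integrate two ROI event streams pixel-wise: the derived stream is True
    wherever the two color keys differ, mirroring the XOR of rendered bars."""
    return stream_a != stream_b

# Moments where the two agents attend to different ROIs:
print(overlay_xor(robot_gaze, human_gaze))
# [False False  True False False False  True False]
```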


Developing hypotheses from visual analysis. Eye-balling the color-based representation of AOI streams helps the viewer discover momentary interactive behaviors. For example, one sequential behavioral pattern that seems to frequently happen in those data streams is labeled ①→④ in Fig. 1(h): this sequential pattern starts with the situation that the robot agent and the human teacher attend to different objects (see, e.g., ①, where the robot attends to the object indicated by the green bar marked by ① while the human does not); then the human teacher checks the robot agent’s gaze (e.g., ②, human face-looking indicated by the yellow bar marked by ②), follows the robot agent’s attention to the same object (e.g., ③, human object-looking indicated by the green bar marked by ③), and finally reaches for that object (e.g., ④, human hand movement indicated by the green bar marked by ④).


2.1. Overall scenario—towards closed-loop big data analysis


First, when behavioral scientists draw conclusions from a data stream—such as eye gaze—the final published results often take the form of overall statistics: in the case of eye gaze, for example, total looking time, time to fixation, or longest look. A huge amount of information is thus reduced and usefully summarized. But statistics and measures extracted from raw data may exclude embedded patterns or even be misleading, because multiple steps are involved in reducing raw data to final statistics, and each step involves decisions about both the statistical methods used to compute the final results and the parameters of those methods. Different decisions may change the final outcome dramatically, leading to different results and consequently different interpretations. We argue that sensible decisions about streams of data analysis cannot be completely pre-determined, but must be derived at least in part from the structure of the data itself. Most often, even if we have some predictions from our experimental designs, we nonetheless lack precise predictions about the structure and patterns of the data as a whole; yet it may be the unexpected dependencies that prove most informative about developmental process and about the possibility of alternate pathways that might be leveraged in interventions, and that may enable us to understand development both in aggregate and in the individual. Since we cannot specify all of the data analysis details a priori, we need insights from both top-down knowledge and the raw data themselves to make sensible decisions step by step as we systematically reduce the data to extract reliable and interesting patterns. These issues are particularly critical in dense time series of micro-behaviors, such as measures of momentary eye-gaze direction and electroencephalography (EEG) signals. The magnification of behavior that these advanced techniques deliver is allowing behavioral scientists to see the substructure of macro-level behaviors for the first time.

Due to the complexity of multi-modal streaming data, it has been very challenging to design a useful analytics system. In practice, the study of raw multi-modal data streams is often transformed into the statistical investigation of continuous categorical streams using derived measures such as user-defined Areas Of Interest (AOIs), whose complexity may initially suggest that one must resort to an equally complicated approach for statistical and mathematical analyses. Most often, however, to better accomplish such exploration and analytics tasks, researchers should play a critical role in the knowledge discovery by exploring the information visualization, suggesting initial analyses, examining the intermediate analytics results, and directing the research focus for the next round of scientific findings. In this paper we are motivated to develop a hybrid solution—a novel approach to integrate top-down domain knowledge with bottom-up information visualization and temporal pattern discovery. A few key design rules behind our implementation are as follows:


• Enabling human-in-the-loop knowledge discovery—We argue that sensible decisions about streams of data analysis cannot be completely pre-determined. Since we cannot specify all of the data analysis details a priori, we need insights from both top-down knowledge and the raw data themselves to conduct interactive data analysis and visual data mining iteratively. In these tasks, researchers play a critical role in human-in-the-loop knowledge discovery by applying statistical techniques to the data, examining visualization results, determining how to chunk the streams of data, and deciding and directing the research focus based on their theoretical knowledge. Redirected bottom-up analytics plays a complementary role in assisting scientists to not only validate their hypotheses but also quantify their scientific findings.
• Generating bottom-up analysis using machine computation and information visualization—Discovering new knowledge requires the ability to detect unknown, surprising, novel, and unexpected patterns. With fine-grained massive data sets, algorithms and tools are needed to extract temporal properties from multi-modal data streams. Examples include statistics such as the typical duration of a sensory event, temporal correlations of derived variables such as the triggering relationships among events (or the sequential association of events), and visual displays of derived variables or their logical conjunctions.
• Scalable architecture for data-intensive computing—Massive data sets collected from modern instruments impose high demands on both storage and computing power. Knowledge discovery at terabyte scale is nontrivial, as the time cost of simply applying traditional algorithms in a sequential manner can be prohibitive. Thus, distributed parallel processing and visualization at scale becomes a requirement rather than an option.


3. Details of implementation models—visualization as part of a larger process of data analysis


In this section, we introduce an “interactive event-based temporal pattern discovery” paradigm to investigate “moments-of-interest” in multi-modal data streams. To be effective, such a paradigm must meet two goals: on one hand, it must follow the general spirit of information visualization and allow users to interactively explore the visually displayed data streams; on the other hand, machine computation should be applied in a way that facilitates the identification of statistically reliable patterns, whose visual pictures can then be examined for closer investigation and can trigger a new round of knowledge discovery. Before we detail the logical series of steps, several definitions are in order:


Definition 3.1 (Event). An event is a tuple e = (t, s, d), where t is the event type (or label), s is the starting time of the event, and d indicates its duration. The ending time of the event is s + d.

Definition 3.2 (Example). An example of length k is a tuple ex = (id, du, {e_1, e_2, ..., e_k}), where id uniquely identifies the example, du is the total duration of the example, and {e_1, e_2, ..., e_k} is the sequence of events that forms the example. We note that events can overlap with each other, namely overlap(e_i, e_j) = true ⟺ min(e_i.e, e_j.e) > max(e_i.s, e_j.s), where overlap(e_i, e_j) is the predicate that checks whether two events e_i and e_j overlap, and ‘.e’ denotes the ending time of an event, i.e., .s + .d.


Definition 3.3 (Example set). An example set of size n is a set S = {ex_1, ex_2, ..., ex_n}, where ex_i is an example and 1 ≤ i ≤ n. Examples in the set have the same total duration, namely ex_1.du = ex_2.du = ... = ex_n.du.


Definition 3.4 (Pattern). A pattern of length k is a tuple p = (du, {e_1, e_2, ..., e_k}); the definitions of du and e_i, 1 ≤ i ≤ k, are the same as those in Definition 3.2. In order to represent a pattern uniquely, the sorted form of a pattern is an ordered event sequence in which events are ordered by beginning times, then by ending times, and then by lexicographic order of event types. Unless explicitly stated otherwise, all patterns in this paper are presented in sorted form.


Definition 3.5 (Symbolic projection). The symbolic projection of a pattern (either unsorted or sorted) p = (du, {e_1, e_2, ..., e_k}) is the sequence of its event type symbols: π_sym(p) = {e_1.t, e_2.t, ..., e_k.t}. The order of the symbols is the same as that in the pattern. In particular, we call the symbolic projection of the sorted form of p its signature, denoted sig(p).
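For readers who want to experiment with these definitions, the sketch below transcribes them into Python data structures; it is an illustrative rendering, not the authors’ released code:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative transcription of Definitions 3.1-3.5 into Python; not the
# authors' released data structures.

@dataclass
class Event:
    t: str      # event type (label)
    s: float    # starting time
    d: float    # duration

    @property
    def e(self) -> float:
        """Ending time, s + d."""
        return self.s + self.d

@dataclass
class Example:
    id: int
    du: float                              # total duration
    events: List[Event] = field(default_factory=list)

def overlap(ei: Event, ej: Event) -> bool:
    """overlap(e_i, e_j) = true  <=>  min(e_i.e, e_j.e) > max(e_i.s, e_j.s)."""
    return min(ei.e, ej.e) > max(ei.s, ej.s)

def signature(events: List[Event]) -> List[str]:
    """Symbolic projection of the sorted form (Definition 3.5): sort by
    beginning time, then ending time, then event type; keep the type symbols."""
    return [ev.t for ev in sorted(events, key=lambda ev: (ev.s, ev.e, ev.t))]
```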


3.1. Complementing hypotheses with bottom-up quantitative measures


We next discuss how to employ machine computation to detect and quantify reliable moment-by-moment multi-modal behaviors that one could discover by scanning the color-based representation illustrated in Fig. 1(h). The visual-based knowledge discovery can be complemented and validated by quantitative temporal measures powered by analytics algorithms—e.g., the detection of statistically reliable temporal patterns, the quantification of typical durations and frequencies of repeating temporal phenomena, and the measurement of precise timing between associated events. Our task in this section is to develop bottom-up machine computation methods to extract two major complementary measures: one focusing on the detection of sequential relationships between events, and the other focusing on precise timing, such as the duration of a reliable temporal event and the time between events in a sequential pattern. We assume continuous data streams are first pre-processed and chunked into a large sequence of event examples that all have the same total duration. These examples are encapsulated into an example set as the input data structure of our machine computation (see Fig. 2(a)). Our basic implementation exploits an Apriori-like “event-based temporal pattern mining algorithm” in a 2D event space where each event in the example set is represented as a point. The algorithm is based on the Apriori algorithm [7] for searching frequent patterns, to be discussed shortly in Algorithm 1. The logical series of modeling steps, the problems they induce, and the ultimate resolution of those problems are as follows:


• Project events into the 2D event space for clustering. The algorithm projects examples into an event space, and clustering is performed in the event space based on event types (see Fig. 2(b)).
• Pattern discovery using an Apriori-like algorithm. These clusters become the first set of candidate patterns for an Apriori-like procedure, which then computes representative and frequent sequential pattern prototypes iteratively (see Fig. 2(c)).
• Refine temporal relationships. The prototype patterns are refined by considering their temporal relationships through a pattern adjustment procedure (see Figs. 2(d) and 2(e)).
• Interactive temporal pattern discovery by user query. Discovered frequent patterns and their matching examples, as well as statistical information, are visualized upon user query (see Fig. 2(f)). Users can trigger and refine new rounds of machine computation to complement their discovery based on this information representation.

Fig. 2. Interactive event-based temporal pattern discovery. (a) A subset of the example data set, displaying two examples, each with three event types. The example set is the input of the algorithm. (b) The algorithm projects each example, comprised of events, into points in the 2D event space (event starting time × event duration), and clustering is performed based on event types. (c) Each cluster is used as an initial length-1 candidate pattern for an iterative Apriori-like procedure. (d) One potential pattern discovered by the Apriori-like procedure. (e) A final adjustment to refine the temporal relations by removing potential noise from the data. Whether or not this is used is based on top-down knowledge of the data set, i.e., whether or not you would expect events to be completely synchronous/aligned. (f) Pattern query results that visualize the pattern and the corresponding matching examples, sorted in descending order of matching probability, as well as statistical information, e.g., a histogram showing the distribution of the matching probability of matching examples.

3.1.1. Clustering in 2D event space

Given an example set as input, the mining algorithm first projects the examples into the 2D event space (event starting time × event duration), where each point is an event (see Fig. 2(b)). Visualizing a high-dimensional space (e.g., the example space) is challenging and difficult; the 2D event space, however, is straightforward to inspect. By examining the distribution of events in event space and choosing an appropriate clustering technique (e.g., centroid-based, distribution-based, or density-based), and by re-examining the clustering results and making optional adjustments, human knowledge and machine computation can bootstrap each other. The centroids of the clusters are then used as seed length-1 (single-event) candidate patterns for the subsequent Apriori-like procedure. We note that PESMiner efficiently performs clustering for each event type in the 2D event space only once, instead of clustering repeatedly in the high-dimensional example space during the mining process as the algorithms in [19,20,26] do.
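As one concrete (and assumed) realization of this step, the sketch below clusters the events of a single type in the 2D event space with scikit-learn’s KMeans, one possible centroid-based choice; the input format and the value of k are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed realization of the per-event-type clustering step: events of one
# type are points (start, duration) in the 2D event space; k is chosen by
# visually inspecting the event-space plot.
def cluster_event_type(events, k):
    """events: list of (start, duration) pairs for a single event type.
    Returns cluster centroids, used as length-1 seed candidate patterns."""
    X = np.asarray(events, dtype=float)            # shape (n_events, 2)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.cluster_centers_                     # rows: (mean start, mean duration)

seeds = cluster_event_type([(2.0, 1.5), (2.1, 1.4), (6.0, 1.5), (6.2, 1.6)], k=2)
```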


3.1.2. Apriori-like pattern searching

Our customized Apriori-like procedure, shown in Algorithm 1, uses the centroids in event space as the length-1 candidates. In each iteration, the frequent patterns of length n discovered in the previous iteration and the frequent length-1 patterns are used to generate the length-(n+1) candidates (line 7). The algorithm then checks the frequency of each candidate and keeps only the sufficiently frequent ones (line 8). These candidate generation and frequency check steps are performed iteratively until there are no new frequent patterns or no candidates can be generated. The final output is a set of frequent patterns. Our approach to generating longer candidates (line 7 of Algorithm 1) is to append a length-1 frequent pattern (denoted p_1) to a length-n frequent pattern (denoted p_n) as long as p_1’s starting time is no earlier than that of any event in p_n, i.e., C_{n+1} = {c_{n+1} = p_n ⊕ p_1 | p_1 ∈ L_1, p_n ∈ L_n, ∀e ∈ p_n, p_1.s ≥ e.s}, where ⊕ denotes the concatenation operation on patterns. We note that the Apriori procedure generates redundant patterns (denoted R) that can actually be derived from longer patterns, i.e., R = {r | r ∈ P, ∃p ∈ P ∧ p ≠ r, ∀e_i ∈ r, ∃e_{f(i)} ∈ p ∧ e_i = e_{f(i)}}, where f is a monotonically increasing function, i.e., f : N+ → N+, ∀i, j ∈ N+, i < j ⇒ f(i) < f(j). After the Apriori-like procedure terminates, the algorithm removes all redundant patterns (line 10).

Algorithm 1: AprioriLikeProcedure.
Input : S: example set; M: centroids in event space; f_min: frequency threshold; ε: pattern similarity threshold
Output: P: set of frequent patterns
1   P = {};
2   C_1 = M;
3   L_1 = FindFrequentCandidates(S, C_1, f_min, ε);
4   n = 1;
5   while L_n ≠ ∅ do
6       P = P ∪ L_n;
7       C_{n+1} = GenerateCandidates(L_n, L_1);
8       L_{n+1} = FindFrequentCandidates(S, C_{n+1}, f_min, ε);
9       n = n + 1;
10  P = RemoveRedundantPatterns(P);


Algorithm 2: FindFrequentCandidates.
Input : S: example set; C: candidate pattern set; f_min: frequency threshold; ε: pattern similarity threshold
Output: FP: set of frequent candidates
1   FP = {};
2   foreach candidate ∈ C do
3       score_table = {};  // maps each matching example id to its matching score
4       foreach example ∈ S do
5           score = GetMatchProbability(candidate, example);
6           if score > ε then
7               score_table.put(example.id, score);
8       normalized_prob = (1 / |S|) · Σ_{entry ∈ score_table} entry.score;
9       if normalized_prob > f_min then
10          FP.put(candidate);
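The control flow of Algorithms 1 and 2 can be summarized in a few lines of Python. The sketch below uses simplified stand-in types (a pattern is a tuple of events) and a pluggable match function standing in for Algorithm 3; it is a structural outline, not the full PESMiner implementation:

```python
# Structural sketch of Algorithms 1-2. A pattern is a tuple of Event objects;
# `match` stands in for Algorithm 3 (GetMatchProbability). Names are illustrative.
def find_frequent(S, candidates, f_min, eps, match):
    frequent = []
    for cand in candidates:
        # score_table: maps matching example id to its matching score
        scores = {ex.id: s for ex in S if (s := match(cand, ex)) > eps}
        if sum(scores.values()) / len(S) > f_min:    # normalized frequency
            frequent.append(cand)
    return frequent

def apriori_like(S, seeds, f_min, eps, match):
    P = []
    L1 = find_frequent(S, seeds, f_min, eps, match)
    Ln = L1
    while Ln:
        P.extend(Ln)
        # Grow by one event: the appended length-1 pattern must start no
        # earlier than every event already in the pattern (line 7, Alg. 1).
        Cn1 = [pn + p1 for pn in Ln for p1 in L1
               if all(p1[0].s >= ev.s for ev in pn)]
        Ln = find_frequent(S, Cn1, f_min, eps, match)
    return P    # redundant-pattern removal (line 10 of Algorithm 1) omitted
```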


Algorithm 2 outlines the logic of finding frequent candidate patterns, and Algorithm 3 shows the logic of calculating the similarity score between an example and a candidate. The method GetEventMatchScore invoked in Algorithm 3 is the most crucial part, and


Fig. 3. The event matching process. (a) An event of type B in the candidate. (b) An example. (c) All events in the example of type B that overlap with the type-B event in the candidate. (d) The mismatch (denoted by ↔) calculated by this process.


41 42 43 44 45 46 47 48 49 50

Algorithm 4 gives a general implementation. Fig. 3 shows the process of event match conducted by Algorithm 4. We note that Algorithms 3 and 4 are fuzzy-matching based and Algorithm 4 is able to handle signal interruption (the four events in Fig. 3(c) should effectively be treated as a single event) we mostly encounter during time series stream recording process. We also note that Algorithms 3 and 4 provide a generic implementation to measure similarity given a temporal example and a pattern while algorithms devised from specific domain knowledge are pluggable, i.e. the user can devise specific matching algorithm based on top-down knowledge and insight from underlying data set. 3.1.3. Integrating bottom-up machine computation and top-down domain knowledge Aligning patterns. Statistical analysis always comes with noises, and one important step that requires human in the loop is to apply domain knowledge to align the patterns discovered by machine computation. For example, an event that represents human gaze at object A simply should not overlap with another event that represents human gaze at object B in time within one pattern; if such patterns are presented from computation they should be synchronized instead. Algorithm 5 gives a general probability based adjustment using Gaussian distribution of the events. As shown in lines 6 and 7, for each event in the pattern, the algorithm estimates Gaussian from all events in matching examples that overlap with the event and uses estimated Gaussian as adjusted event. Adjusted events in turn compose the adjusted pattern. Algorithm 3: GetMatchProbability. Input : example: an example, candidate: a candidate pattern Output: score: score that measures the similarity between example and candidate 1 score = 1.0; 2 foreach event ∈ candidate do 3 event_match_score = GetEventMatchScore(event, example ); 4 if event_match_score > 0 then 5 score *= event_match_score; 6 else 7 return 0.0 // not a match; 8 return score

1/length(candidate )

// normalized score;

51

Algorithm 5: AdjustPattern. Input : S: example set, P : pattern set Output: A P : adjusted pattern set 1 A P = {}; 2 foreach pattern ∈ P do 3 matching_ex = GetMatchedExamples( pattern, S ) // get matching

56 57

Algorithm 4: GetEventMatchScore. Input : event: an event in candidate pattern, example: an example Output: score: score that measures the similarity between event and example

58

1 events = GetOverlapEventsByType(example, event ) // return a list

59

containing all events in example of type event .t that overlap with event; 2 mismatch = FindAllMismatch (events, event); // see Fig. 3(d) ; 3 score = (example.duration − mismatch) / example.duration // normalize;

60 61 62 63 64 65 66

77 78 79 80 81 82

7 8 9

A P .add(adjusted_pattern);

89

4 5 6

83 84 85 86 87 88 90 91

Validating hypotheses. In this step the user queries patterns by specifying criteria, e.g., the length of the pattern (the number of events in the pattern), the minimum number of matching examples, and the minimum averaged matching probability. The user also specifies parameters to control the visualization, e.g., the maximum number of most similar patterns (in terms of matching probability) to display, the number of matching examples to display per page (the bottom of each page displays the pattern), the number of bins of the histogram, and whether the histogram represents frequencies or probability densities (see Fig. 2(f)). The user visually examines the discovered patterns in this step, gains knowledge of the underlying data, and in turn guides the next round of pattern mining by providing top-down knowledge.

3.2. Large-scale multi-modal data analytics with iterative MapReduce tasks

Recently, vast volumes of scientific data have been captured by new instruments in various ways, and data-driven discovery has already happened in various research and experimentation settings. Oftentimes machine computation needs to work with tens of thousands of data points to generate reliable and meaningful discoveries. Large-scale, high-density, high-quality multi-modal data make the cost of data processing prohibitively high and the aforementioned “human-in-the-loop” interactive data analysis infeasible. We therefore need today’s state-of-the-art parallelization techniques and peta-scale computing power to deal with the problem. In this section we first discuss the rationale of our parallel design choices and then describe the parallelization of the proposed sequential pattern mining algorithm.

3.2.1. Parallelization choices

Choosing a parallel computing model. Both MPI and MapReduce [13] are popular choices for implementing embarrassingly parallel computational tasks where no complicated communication is needed (e.g., our temporal pattern mining algorithm). MPI is a popular parallelization standard that provides a rich set of communication and synchronization constructs from which the user can create diverse communication topologies. Although MPI is quite appealing in terms of this flexibility (great control for the programmer) and its high-performance implementations, it requires the programmer to explicitly handle the mechanics of data partitioning, communication, and data flow, exposed via low-level C routines and constructs such as sockets, in addition to the higher-level analysis algorithm. MapReduce, on the contrary, operates only at the higher level: the programmer thinks in terms of functions (map and reduce) and the data flow is implicit [35]. Since several pieces of our algorithm are pluggable, e.g., the logic of event-space clustering, event matching, and pattern adjustment, we prefer a high-level programming model like MapReduce, which allows us to focus on the business logic.


Enabling iterative discovery. The original MapReduce framework proposed by Google [13] focuses on fault tolerance, under the assumption that the underlying hardware environment is built from heterogeneous and inexpensive commodity machines on which component failures are the norm rather than the exception. Therefore, input data are replicated across multiple nodes, and intermediate output, such as the output of a map task, is stored in the file system for fault tolerance. This assumption needs to be re-examined, however, as most researchers run their scientific applications on HPC resources such as XSEDE and FutureGrid, which are built from high-performance, homogeneous hardware on which failures are actually rare. The second issue is that Google’s MapReduce focuses on single-step MapReduce jobs. However, many parallel algorithms in domains such as data clustering, dimension reduction, link analysis, machine learning, and computer vision have iterative structures; examples include K-Means [23], deterministic annealing clustering [32], PageRank [27], and dimension reduction algorithms such as SMACOF [22], to name a few. With MapReduce’s open-source implementation Hadoop, input data partitions for iterative applications have to be repeatedly loaded from disk (HDFS) into memory, and map/reduce tasks need to be initiated in each iteration, which is very inefficient and can degrade performance severely. In-memory MapReduce runtimes such as Twister [15], HaLoop [12], and M3R [34], on the contrary, trade fault tolerance for performance and are particularly designed for iterative data processing. An in-memory MapReduce runtime does require a larger amount of memory; however, in an HPC environment where 16/32 GB of memory is a common node configuration, this should never be a practical issue. We choose Twister as our runtime, which supports long-running map/reduce tasks. Moreover, in Twister the static input data only need to be loaded once, and intermediate map outputs are transferred to reducers through efficient in-memory communication via a publish-subscribe system, without touching the disk.


Supporting the human-in-the-loop principle. To support interactive data analysis and visual mining we need to provision a user-friendly interface to the pattern mining algorithm running on HPC resources. We expose the algorithm as asynchronous web services, which makes it easily accessible from any endpoint outside the HPC system, such as a local desktop or a mobile device. We currently implement two web services: (1) a web service that accepts user input (choices of pluggable components such as the event matching algorithm, and configurable parameters for the algorithm/query/visualization) and downloads visualization results to the client by invoking a user-registered callback; and (2) a web service that takes the same input but only notifies completion in the callback. When using the first web service, the user employs a local tool to display the downloaded figures. For the second web service, the user leverages an advanced visualization tool with a client-server architecture. One example is ParaView: in this scenario the ParaView server runs on the same HPC system as the algorithm, and the user simply invokes the ParaView client for visualization in the callback.
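To illustrate how a client might use the first web service, the sketch below submits a mining job and registers a callback URL. The endpoint path and payload fields are hypothetical, not the actual service interface:

```python
import json
import urllib.request

# Hypothetical client for the asynchronous mining service; the endpoint URL
# and payload fields are illustrative assumptions, not the actual interface.
SERVICE_URL = "http://hpc-gateway.example.org/pesminer/submit"

job = {
    "matcher": "default-fuzzy",     # pluggable event-matching component
    "f_min": 0.05,                  # algorithm parameters
    "epsilon": 0.9,
    "callback_url": "http://desktop.example.org:8080/done",
}

req = urllib.request.Request(
    SERVICE_URL,
    data=json.dumps(job).encode(),
    headers={"Content-Type": "application/json"},
)
# The call returns immediately; figures arrive later via the callback URL.
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```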


3.2.2. Parallel temporal pattern mining using Twister MapReduce tasks

Fig. 4(a) shows an overview of our system architecture. Bottom-up machine computation takes place on the HPC system, where one master node hosts the job client that submits the Twister MapReduce job (the pattern mining algorithm in our case), as well as the web service server and an optional visualization server such as a ParaView server. A set of worker nodes is deployed for computation. On the client side, the user at a desktop or mobile device receives visual representations of query results and intermediate results through a callback function, generates initial hypotheses, and invokes the machine computation via the web service client or a ParaView client. In the following we describe the parallelization of each algorithm step in detail. Interested readers can refer to our GitHub repository [5], where all the source code of our algorithms, the web service client and server, and the PBS scripts for deploying the system is hosted.


Event space visualization using MapReduce tasks. Visualizing the distribution of events in a large-scale data set is time-consuming and needs to be conducted in parallel. Algorithm 6 gives such a parallel implementation. A map task plots a partial event space using the events in its partition (a subset of examples, see line 2), based on event type. In the reduce task, partial event-space plots belonging to the same event type are grouped by the reduce input key and stacked together to form the complete event-space plot of that particular event type.


Algorithm 6: ParallelEventSpaceVisualization.
1   function map(key, value):
2       events = GetEventsByType(value.event_type, partition);
3       eventspace_plot = PlotEventSpace(events);
4       emit_intermediate(value.event_type, eventspace_plot);
5   function reduce(key, value_list):
6       stacked_plot = Stack(value_list);
7       emit(key, stacked_plot);

Algorithm 7: ParallelKMeansClustering.
1   function map(key, value):
2       events = GetEventsByType(value.event_type, partition);
3       centroids = GetCentroidsByType(value.event_type, previous_centroids);
4       current_centroids = Clustering(events, centroids);
5       emit_intermediate(value.event_type, current_centroids);
6   function reduce(key, value_list):
7       merged_centroids = Merge(value_list);
8       emit(key, merged_centroids);

Clustering using MapReduce tasks. Algorithm 7 shows the pseudocode of parallel KMeans clustering. In each iteration, each map task processes its partition; the data sets need to be loaded from disk into memory only once, in the first iteration. Within the map, clustering is performed based on event type. For each event type, the map task calculates and emits new centroids (current_centroids in line 4) using the centroids derived from the previous iteration (previous_centroids in line 3, which constitute the state of each iteration and are broadcast to each map task by the Twister runtime). We note that, since each map task only deals with a portion of the examples, current_centroids contains only partial results: each newly derived centroid keeps track of only the partial average over its member events and the number of member events. In the reduce task, these partial results are merged to obtain the final result (line 7), as shown in Eq. (1). The iteration continues until convergence. Many other clustering algorithms can be parallelized in the same fashion.

\frac{\sum_{v \in value\_list} v.num\_events \times v.centroid}{\sum_{v \in value\_list} v.num\_events}    (1)
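The reduce-side merge of Eq. (1) amounts to a weighted average of the partial centroids. A sketch, with assumed field names:

```python
import numpy as np

# Sketch of the reduce-side merge in Algorithm 7 / Eq. (1): each map task
# emits (partial centroid, number of member events); field names assumed.
def merge_centroids(value_list):
    """value_list: iterable of (centroid: np.ndarray, num_events: int)."""
    total = sum(n for _, n in value_list)
    return sum(n * c for c, n in value_list) / total

# Two partial results for the same event type from two map tasks:
merged = merge_centroids([(np.array([2.0, 1.5]), 10),
                          (np.array([2.4, 1.3]), 30)])
# -> array([2.3, 1.35])
```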


Fig. 4. (a) System architecture. Interactive exploration of large-scale multi-modal data streams consists of three component steps: remote visualization of raw data and intermediate analysis results; applying top-down knowledge to refine patterns and trigger new rounds of pattern discovery; and bottom-up temporal pattern mining powered by a parallel computing model and HPC resources. (b)–(c): The two most reliable sequential patterns found in the human’s gaze data stream. (b) A sequential pattern showing a 0.85 s human gaze following a robot partner’s gaze event, with 1.2 s statistically detected between the two events. (c) A sequential pattern showing a 1.2 s human eye–hand coordination event followed by a 1.1 s human gaze (at partner) event, with 0.6 s between the two events. (d)–(e): Two relatively complicated sequential patterns detected from the interacting agents’ action streams. (d) Example pattern showing that one agent’s gaze at an attentional object will trigger face-to-face coordination between the two interacting agents and will attract the other agent’s visual attention to the same object. (e) A more complicated sequential pattern showing that the human partner dynamically adapts his/her behavior to reach the same visual and hand attention as the robot partner.


49 50 51 52 53 54 55 56

96

Apriori-like procedure using mapreduce tasks. Algorithm 8 shows the parallel Apriori-like procedure. In each iteration, the map task retrieves all matching examples for the given candidate using aforementioned Algorithm 2 and emits the partial score_table which maintains mapping between matching example id and corresponding score. In reduce task, partial score tables are merged (line 6) and frequent candidate is emitted (lines 7–8). The job launcher calculates 1-item longer candidate set and initiates next round iteration. The whole process terminates when no frequent patterns or no candidates can be generated. Algorithm 9: ParallelPatternAdjust. 1 2 3 4

function map (key , value ): pattern = value.pattern; pattern.gaussians = EstimateGaussian( pattern.score_table, partition); emit_intermediate( pattern.id, pattern.gaussians);

5 6 7 8

function reduce (key , value_list ): merged_gaussians = MergeGaussians( value_list ); adjusted_pattern = DerivePattern(merged_gaussians ); emit(key, adjusted_pattern);

57 58 59 60 61 62 63 64 65 66

91

5 pattern.displaying_examples = GetQualifiedExamples( pattern.score_

partition);

47 48

89

93

36

39

88

1 function map (key , value ): 2 pattern = value.pattern; 3 if not meetQueryCriteria( pattern)) then 4 return

35

38

87

1 function map (key , value ): 2 candidate = value.candidate; 3 candidate.score_table = FindAllMatchingExamples(candidate,

5 function reduce (key , value_list ): 6 merged_score_table = MergeScoreTables( value_list ); 7 if isFrequent(merged_score_table ) then 8 emit(key, candidate );

37

86

Algorithm 10: ParallelQueryResultsVisualization.

31

34

85

Algorithm 8: ParallelApriori-likeProcedure.

4 emit_intermediate(candidate.id, candidate.score_table);

33

84

90

30 32

83

The parallel pattern adjustment is described in Algorithm 9. In line 3, method EstimateGaussian retrieves all matching examples of pattern from partition and for each event in pattern, a Gaussian is estimated from overlapping events of the same event type in matching examples (see Algorithm 5). In reduce task, method MergeGaussians merges Gaussians for each event and the mean of merged Gaussian is used as adjusted event (line 6 and see Algorithm 5). Adjusted events in turn are used to compose the adjusted pattern (line 7).

table, partition); 6 emit_intermediate( pattern.id, pattern.displaying_examples); 7 8 9 10

function reduce (key , value_list ): all_displaying_examples = value_list; f igures = DrawFigures(all_displaying_examples ); emit(key, f igures );

92 94 95 97 98 99 100 101 102 103

Query and visualization using MapReduce tasks. Algorithm 10 shows parallel visualization based on query parameters. In map task it first filters out pattern that doesn’t meet query criteria (e.g., number of minimum matching examples/average matching probability) as shown in lines 3–4. For qualified pattern, map task then retrieves all matching examples in partition that need to be displayed based on the user’s query condition (e.g., only displaying matching examples whose matching probability is greater than the specified threshold). In reduce task, all matching examples and some statistics are visualized. Each step (i.e., visualization, clustering, pattern matching and etc.) described in PESMiner is implemented in a parallel manner, here we use Fig. 5 to illustrate the parallel iterative processing of the frequent pattern discovery process, which lies in the core of PESMiner.
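To connect Algorithms 2 and 8, the sketch below shows what the reduce side of Algorithm 8 might look like when the partial score tables are plain dictionaries; the representation is an assumption for illustration:

```python
# Sketch of Algorithm 8's reduce step under the assumption that a score table
# is a plain dict mapping example id -> matching score. The example ids are
# disjoint across partitions, so merging is a simple union.
def reduce_candidate(candidate, partial_tables, f_min, num_examples):
    merged = {}
    for table in partial_tables:        # one partial table per map task
        merged.update(table)
    normalized_prob = sum(merged.values()) / num_examples
    return candidate if normalized_prob > f_min else None  # emit if frequent
```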


4. Performance evaluation


We use the XSEDE Gordon [1] cluster to evaluate the performance of PESMiner and compare it with QTempIntMiner in [19] and with QTIPrefixSpan-AP (AP stands for affinity propagation clustering), QTIPrefixSpan-KMeans, QTIApriori-AP, and QTIApriori-KMeans in [20]. On Gordon, each compute node contains two 8-core 2.6 GHz Intel EM64T Xeon E5 processors, 64 GB of DDR3-1333 memory, and a single 300 GB SSD.

Data set. We adopt a similar approach as in [19,20] to generate synthetic data sets for evaluation. Random temporal sequences are generated based on temporal pattern prototypes to which temporal noise is added.


Fig. 5. The parallel iterative MapReduce tasks to discover frequent patterns. (1) Examples are distributed across map tasks. (2) Within a map task, each candidate pattern is matched against each example that the map task hosts, producing a match score. (3) The match scores for a candidate pattern from multiple map tasks are collected by a single reduce task through the shuffling phase. (4) The reduce task merges the scores for a candidate pattern and emits the pattern if its frequency exceeds the threshold. (5) One-event-longer candidates are generated by combining the frequent patterns discovered in the current iteration with the centroids derived from the event-space clustering process. The new candidates trigger the next round of pattern matching, until no frequent pattern can be found.


A temporal pattern prototype of length n is a set of n events (see Definition 3.4) with Gaussian noise, i.e., {(t_i, μ_{s_i}, σ_{s_i}, μ_{d_i}, σ_{d_i})}_{i ∈ N_n}, where t_i is the event type and μ_{s_i}, μ_{d_i} (resp. σ_{s_i}, σ_{d_i}) are the means (resp. the standard deviations) of the generated starting times and durations of the ith event e_i. The prototype specifies the temporal pattern to discover, {(t_i, μ_{s_i}, μ_{d_i})}_{i ∈ N_n}, and the standard deviation parameters quantify the Gaussian noise. In addition, we add random noise by generating random examples with ratio r ∈ [0, 1]. A random example is generated by first randomly choosing the example length, and then generating each event by randomly picking its starting time, duration, and event type.

Similarity measure. Given a pattern prototype and a discovered pattern, we need an approach to measure the similarity between them. In [19,20], the similarity between two patterns is considered to be 0 if they have different signatures (see Definitions 3.4 and 3.5). This approach is problematic, since events are interval-based and the order of events dictated by the signature cannot truly reflect the extent of similarity. Take prototype #1 in Table 2 as an example: under the above metric, the similarity between the prototype and a discovered pattern {(A, [2, 3.5]), (D, [4, 6.5]), (G, [5.9, 7.5]), (C, [6, 7.5])} is 0 since they have different signatures, i.e., (A, D, C, G) vs. (A, D, G, C); however, they are indeed quite similar, and only the starting times of event G differ by 0.1. Therefore, we do not simply judge similarity by signature. Inspired by the Earth Mover’s Distance [?], we use the measure shown in Eq. (2) as our similarity metric, where π_sym denotes the symbolic projection of the given pattern, sort_lex denotes the symbolic projection sorted in lexicographic order, PERM denotes the set of all permutations of the events in the given pattern, and overlap is a similarity measure of two patterns in terms of overlapping ratio, normalized by the patterns’ total durations, as shown in Eq. (3), where s_i^p and e_i^p denote the starting and ending times of the ith event of pattern p, respectively.

sim(a, b) = \begin{cases} 0 & \text{if } sort_{lex}(\pi_{sym}(a)) \neq sort_{lex}(\pi_{sym}(b)) \\ \max(\{overlap(a, b_p) \mid b_p \in PERM(b)\}) & \text{otherwise} \end{cases}    (2)

overlap(a, b) = \begin{cases} 0 & \text{if } \pi_{sym}(a) \neq \pi_{sym}(b) \\ \frac{\sum_{i=1}^{n} \max(0, \min(e_i^a, e_i^b) - \max(s_i^a, s_i^b))}{\max(\sum_{i=1}^{n} (e_i^a - s_i^a), \sum_{i=1}^{n} (e_i^b - s_i^b))} & \text{otherwise} \end{cases}    (3)

The term max(0, min(e_i^a, e_i^b) − max(s_i^a, s_i^b)) gives the size of the intersection of two event intervals. Given a prototype a and a discovered pattern b, Eq. (2) (note that it is a symmetric measure) calculates the similarity between a and all possible permutations of b and takes the maximum. We note that the first condition in Eq. (2) checks whether the lexicographically sorted symbolic projections of the two given patterns are exactly the same, i.e., the same event types and the same number of events per event type; this is different from Algorithm 3, which conducts a partial match between a candidate and an example. Checking the lexicographically sorted symbolic projection first allows us to reject unmatched patterns as early as possible.

Metrics. The accuracy of the pattern mining algorithm is evaluated by the similarity between each prototype and the best extracted temporal pattern, as shown in Eq. (4), where P is the set of prototypes, C is the set of extracted patterns, and sim is the similarity measure in Eq. (2). In addition, we define precision as the ratio of discovered prototypes to extracted patterns, as shown in Eq. (5), and recall as the ratio of discovered prototypes to prototypes, as shown in Eq. (6); ε in Eqs. (5) and (6) is the similarity threshold. In order to obtain statistically significant results, all curves presented in the sequel are obtained by averaging the results on 10 different data sets generated from the same parameters. Table 1 lists the notations that we use. Note that, to simplify the presentation of results, we fix a single parameter σ_s (resp. σ_d) for all the standard deviations of starting times (resp. durations); these two parameters quantify the temporal noise of a data set. Where not stated otherwise, each prototype takes the same proportion of the data set and the following default parameter values are used: |D| = 10,000, |P| = 10, |p| = 6, |E| = 8, σ_s = 0.05, σ_d = 0.05, r = 0, N_node = 16, N_worker = 16. Each example in the data set has a 16 s total duration. The pattern similarity threshold ε in Algorithm 2 takes the default value 0.9.

$accuracy = \frac{1}{|P|} \sum_{p \in P} \max(\{\, sim(p, c) \mid c \in C \,\})$  (4)

$precision = \frac{|\{\, c \in C \mid \max(\{sim(p, c) \mid p \in P\}) > \varepsilon \,\}|}{|C|}$  (5)

$recall = \frac{|\{\, p \in P \mid \max(\{sim(p, c) \mid c \in C\}) > \varepsilon \,\}|}{|P|}$  (6)
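To make the metric concrete, the following Python sketch implements Eqs. (2)–(6) for small patterns. It is an illustration only, not the paper's implementation (which is part of the parallel miner); in particular, the per-event normalizer in overlap is our assumption of an intersection-over-enclosing-span ratio, since the original denominator of Eq. (3) is not spelled out in the text.

```python
import itertools

def overlap(a, b):
    # Eq. (3): average interval-overlap ratio over aligned event pairs.
    # A pattern is a list of (event_type, start, end) triples.
    total = 0.0
    for (_, sa, ea), (_, sb, eb) in zip(a, b):
        inter = max(0.0, min(ea, eb) - max(sa, sb))  # intersection of the intervals
        span = max(ea, eb) - min(sa, sb)             # enclosing span (assumed normalizer)
        total += inter / span if span > 0 else 1.0   # identical point intervals count as 1
    return total / len(a)

def sim(a, b):
    # Eq. (2): zero if the lexicographically sorted symbolic projections differ,
    # otherwise the maximum overlap over type-preserving permutations of b.
    if sorted(t for t, _, _ in a) != sorted(t for t, _, _ in b):
        return 0.0
    return max(overlap(a, p) for p in itertools.permutations(b)
               if all(x[0] == y[0] for x, y in zip(a, p)))

def accuracy(P, C):                    # Eq. (4)
    return sum(max(sim(p, c) for c in C) for p in P) / len(P)

def precision(P, C, eps):              # Eq. (5)
    return sum(max(sim(p, c) for p in P) > eps for c in C) / len(C)

def recall(P, C, eps):                 # Eq. (6)
    return sum(max(sim(p, c) for c in C) > eps for p in P) / len(P)

# Prototype #1 vs. the discovered pattern from the text: the signatures differ,
# (A, D, C, G) vs. (A, D, G, C), yet the similarity is close to 1.
proto = [("A", 2, 3.5), ("D", 4, 6.5), ("C", 6, 7.5), ("G", 6, 7.5)]
found = [("A", 2, 3.5), ("D", 4, 6.5), ("G", 5.9, 7.5), ("C", 6, 7.5)]
print(sim(proto, found))  # ~0.98
```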


Fig. 6. Distribution of events of type C in event space under different temporal noise levels.

24 25 26 27 28 29 30 31 32 33 34 35

Table 1
Notations.

Notation    Description
D           data set; |D| is the number of examples
P           prototype set; |P| is the number of prototypes
p           prototype; |p| is the number of events in the prototype
E           event type set; |E| is the number of event types
σ_s         standard deviation of generated starting times
σ_d         standard deviation of generated durations
r           ratio of random examples in the data set
N_node      number of nodes in the Twister MapReduce cluster
N_worker    number of worker threads per node

Quality. We measure the quality of the algorithm using the aforementioned metrics (i.e. accuracy, precision, and recall) under different noise conditions (i.e. standard deviation in the prototype and random noise). In order to measure quality at a fine grain, we vary one factor at a time. Table 2 lists the prototypes used in the experiments. For ease of reading we use [starting time, ending time] instead of [starting time, duration] in the presentation. The following points are noteworthy: (1) events can overlap, e.g. (D, [4, 6.5]) and (C, [6, 7.5]) in prototype 1; (2) events can be completely synchronous in terms of starting time and/or ending time, e.g. (C, [6, 7.5]) and (G, [6, 7.5]) in prototype 1, and (H, [9, 10]), (C, [9, 10.5]) in prototype 4; and (3) two different patterns may have the same signature, e.g. prototypes 3 and 4. Fig. 6(a) to Fig. 6(e) show the visualization of the event space of event type C under 0.1, 0.2, 0.3, 0.4 and 0.5 temporal noise (i.e. σ_s and σ_d), respectively. Based on the visualization, we choose a centroid-based technique (i.e. KMeans) and specify the number of clusters and initial seeds, which can be readily derived from the figure. Fig. 7(a) shows the accuracy of the six algorithms under different temporal noise levels, with σ_s and σ_d taking the same value, as indicated by the X-axis. The frequency threshold f_min (see Algorithm 2) of PESMiner is set to 0.05. We observe that PESMiner significantly outperforms the other algorithms and achieves over 0.92 accuracy even under 0.5 temporal noise for both starting time and duration. With fixed σ_s = σ_d = 0.05, Fig. 10(a) to Fig. 10(c) show the visualization of the event space of event type C under 20%, 40% and 60% random noise, respectively. Note that the trapezoidal appearance of the distribution results from the restriction that an event's ending time cannot exceed the total example duration, i.e. 16 s in our case. Through visualization of the event space we are able to identify potential clusters even under very high random noise levels (see Fig. 10(c) and Fig. 11(a)). Fig. 11 shows the interactive clustering process for event type C under an 80% random noise level. By examining the distribution in event space (Fig. 11(a)), we choose a density-based clustering technique (i.e. DBScan [?]). Fig. 11(b) visualizes the clustering results; three clusters are identified (marked in red, green and black, respectively; blue points are classified as noise by DBScan). Since DBScan only specifies cluster membership instead of centroids, we further run KMeans on the identified cluster points to find the centroids, and the results are in turn visualized for human verification, as shown in Fig. 11(c) (a sketch of this two-stage step is given below). We note that visualization of the event space and the clustering process are conducted per event type, and we may choose different clustering techniques for different event types based on their distributions. Fig. 7(b) shows the accuracy of the six algorithms under different random noise levels, with fixed σ_s = σ_d = 0.05. The frequency threshold f_min of PESMiner is set to 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02 and 0.01 for the corresponding 0.1 to 0.8 random noise levels. PESMiner again performs best and achieves above 99.6% accuracy even under 80% random noise. Fig. 8(a) to Fig. 8(c) show the precision of the six algorithms under different temporal noises, with the similarity threshold ε set to 0.7, 0.8 and 0.9, respectively. Fig. 8(d) shows the case under different random noises, with ε set to 0.7, 0.8 and 0.9 (the same curve for all three ε settings). Fig. 9 shows the recall under different temporal noises (figures (a) to (c)) and random noises (figure (d)), using the same ε settings as in Fig. 8. From Fig. 8 and Fig. 9 we observe that under all temporal and random noise settings, PESMiner significantly outperforms the other algorithms and achieves 100% precision and recall even when ε is set to 0.9.
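As an illustration of this interactive DBScan-then-KMeans step, the following sketch uses scikit-learn. The parameter values (eps, min_samples) and the placeholder array X are assumptions the analyst would set from the event-space visualization, not values taken from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# X: N x 2 array of (starting time, ending time) points for one event type,
# i.e. the 2-D event space the analyst inspects; placeholder data here.
X = np.random.rand(500, 2) * 16

# Stage 1: density-based clustering, chosen after viewing the distribution.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)  # -1 marks noise points

# Stage 2: DBSCAN gives memberships but no centroids, so run KMeans on the
# non-noise points to recover centroids for human verification.
core = X[labels != -1]
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
if n_clusters > 0:
    centroids = KMeans(n_clusters=n_clusters, n_init=10).fit(core).cluster_centers_
```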


Fig. 7. Accuracy of 6 algorithms under different experimental settings.

Table 2
Prototypes.

#    Length   Prototype
1    4        { (A, [2, 3.5]), (D, [4, 6.5]), (C, [6, 7.5]), (G, [6, 7.5]) }
2    4        { (B, [4, 6]), (F, [7.5, 9]), (B, [8, 9.5]), (H, [9, 10]) }
3    6        { (C, [2, 3.5]), (H, [5, 7]), (C, [6, 7.5]), (B, [8, 9.5]), (H, [9, 10]), (E, [12, 14]) }
4    6        { (C, [6, 7.5]), (H, [9, 10]), (C, [9, 10.5]), (B, [11, 12.5]), (H, [12, 13.5]), (E, [12, 14]) }
5    6        { (D, [4, 6.5]), (H, [5, 7]), (A, [6, 8]), (B, [8, 9.5]), (C, [9, 10.5]), (H, [12, 13.5]) }
6    6        { (E, [3, 4.5]), (D, [4, 6.5]), (C, [6, 7.5]), (H, [9, 10]), (C, [9, 10.5]), (E, [12, 14]) }
7    6        { (G, [2, 4]), (E, [3, 4.5]), (B, [4, 6]), (F, [7.5, 9]), (H, [9, 10]), (H, [12, 13.5]) }
8    6        { (C, [2, 3.5]), (D, [4, 6.5]), (H, [5, 7]), (G, [6, 7.5]), (E, [8, 10]), (B, [11, 12.5]) }
9    8        { (C, [2, 3.5]), (G, [2, 4]), (E, [3, 4.5]), (H, [5, 7]), (G, [6, 7.5]), (H, [9, 10]), (B, [11, 12.5]), (E, [12, 14]) }
10   8        { (G, [2, 4]), (B, [4, 6]), (C, [6, 7.5]), (F, [7.5, 9]), (E, [8, 10]), (H, [9, 10]), (B, [11, 12.5]), (H, [12, 13.5]) }
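These prototypes feed the synthetic data generation described earlier (Gaussian noise σ_s and σ_d on starting times and durations, a fraction r of random examples, and clipping at the 16 s example duration). The sketch below shows how such a data set could be synthesized; the uniform ranges and the six-event size of the random examples are our own assumptions for illustration, and the prototypes are given here as (event_type, start, duration) triples rather than the [start, end] presentation of Table 2.

```python
import random

def synthesize_example(prototype, sigma_s, sigma_d, total_duration=16.0):
    # prototype: list of (event_type, start, duration) triples. Gaussian noise
    # is applied to starting times and durations; ending times are clipped to
    # the total example duration (the source of the trapezoidal distributions).
    example = []
    for etype, start, dur in prototype:
        s = max(0.0, random.gauss(start, sigma_s))
        e = min(s + max(0.0, random.gauss(dur, sigma_d)), total_duration)
        example.append((etype, s, e))
    return example

def synthesize_dataset(prototypes, n, sigma_s, sigma_d, r, event_types="ABCDEFGH"):
    # A fraction r of the n examples are random (noise) examples.
    data = []
    for _ in range(n):
        if random.random() < r:
            proto = [(random.choice(event_types), random.uniform(0, 14),
                      random.uniform(0.5, 2.0)) for _ in range(6)]
        else:
            proto = random.choice(prototypes)
        data.append(synthesize_example(proto, sigma_s, sigma_d))
    return data
```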


Fig. 8. Precision of 6 algorithms under different similarity thresholds.


Scalability. Since QTempIntMiner [19], QTIPrefixSpan and QTIApriori [20] are all sequential algorithms running on a single node, their time complexity is prohibitive on the very large data sets used for the scalability test; we therefore focus on the scalability of PESMiner itself. The time complexity of PESMiner is determined by multiple factors, which fall into two categories: factors from the data set itself (i.e. the number of examples, the number of patterns, and the length of the pattern) and factors from parallelization (i.e. increased communication overhead as the level of parallelization increases, e.g., when using more computing nodes). A full analysis of the parallelization overhead is out of the scope of this paper; here we focus on the factors related to the data set. As in the experiments that measure the quality of the algorithm, one factor is varied at a time. The running times reported in the sequel include all five steps of the algorithm, and we run the experiments on a 64-node cluster on XSEDE Gordon. Fig. 12(a) shows execution times under different data set sizes, using the 10 prototypes listed in Table 2. We observe that PESMiner scales linearly and is capable of handling 100 thousand examples within 36.15 minutes; by utilizing more computing nodes, we can expect a further reduction in turnaround time. Fig. 12(b) shows the breakdown of the execution time of Fig. 12(a) by algorithm step. The execution time is overwhelmingly dominated by the Apriori-like procedure. Among the five steps, event space visualization and clustering are efficient since they are conducted in 2-D space.


Fig. 9. Recall of 6 algorithms under different similarity thresholds.


Fig. 10. Distribution of events of type C in event space under different random noise levels.


Fig. 11. An interactive clustering process. (a) Visualization of event space of event type C under 80% random noise. (b) Choosing DBScan based on observation of the event distribution in (a) and three clusters are identified. (c) Further launching KMeans to find out centroids. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)

By removing redundant patterns (see line 10 in Algorithm 1) and bookkeeping the matching examples (see line 3 in Algorithm 2), pattern adjustment and visualization can be conducted efficiently. However, the Apriori-like procedure has high time complexity due to its iterative nature and large search space. Fig. 12(c) shows the case of varying the example length, with fixed |D| = 20,000 and |P| = 2. In this evaluation, in each trial all examples in the data set and the underlying pattern prototypes have the same length. From Fig. 12(c) we observe that execution time increases exponentially as the example length increases. This is because Algorithm 3, which calculates the matching score given an example and a candidate, has O(MN) time complexity, where M and N are the lengths of the example and the candidate, respectively (a sketch of such a scoring recurrence is given after this paragraph). Fig. 12(d) shows the case of varying the number of underlying pattern prototypes, with fixed |D| = 20,000 and |p| = 6. From Fig. 12(d) we observe that more pattern prototypes incur only a small increment in running time, i.e. less than a 4% increase in execution time for every 2 additional prototypes.
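The quadratic cost is easy to see from the shape of such a dynamic program. The sketch below is a hypothetical order-preserving alignment written in the spirit of the matching step (the paper does not reproduce Algorithm 3's exact recurrence); sim_event is assumed to return a similarity in [0, 1] for two interval-based events. Filling the M x N table is where the O(MN) term comes from.

```python
def match_score(example, candidate, sim_event):
    # dp[i][j]: best score matching the first i example events
    # against the first j candidate events, preserving order.
    M, N = len(example), len(candidate)
    dp = [[0.0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            dp[i][j] = max(
                dp[i - 1][j],      # leave example event i unmatched
                dp[i][j - 1],      # leave candidate event j unmatched
                dp[i - 1][j - 1] + sim_event(example[i - 1], candidate[j - 1]),
            )
    return dp[M][N]  # computed after M * N constant-time cell updates
```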


Fig. 12. Scalability of the algorithm under different experimental settings.

The increase in execution time mainly occurs in the pattern adjustment and visualization phases, whose computation time is proportional to the number of discovered patterns.

5. Preliminary results

Discovering sequential patterns from multimodal data is an important topic in various research fields. Our use case and principal test data in this work focus on the study of human–human communication, human–agent interactions, and human development and learning, as such fine-grained patterns can advance our understanding of human cognition and learning, and also provide quantitative evidence that we can directly incorporate in designing life-like social robots. The size of the data set is 5.3 GB. We use an 8-node cluster; each node has two 4-core 2.6 GHz Intel EM64T Xeon E5 processors and 16 GB of DDR3-1333 memory. On average, the mining algorithm finishes within 6.2 minutes. Users employ a ParaView client on a local desktop to examine the results. We report some statistically reliable and scientifically meaningful interactive behavioral patterns from our human development and learning study below:

• Quantitative timing information. Figs. 4(b)–(c) show two reliable patterns of human–robot eye gaze events detected by the algorithm. Fig. 4(b) shows that a 0.85 s human gaze (at robot) reliably follows a robot partner's gaze (at human) event, with a 1.2 s pause between these two events. Fig. 4(c) illustrates a sequential pattern in which a 1.2 s human eye–hand coordination event is followed by a 1.1 s human gaze (at partner) event, with 0.6 s between the two events.

• Sequential relationships between events. Complicated temporal patterns such as adaptive behavior patterns can be identified by the algorithm as well. Fig. 4 provides two examples of interesting interactive behavioral patterns. The informal meaning of the identified pattern is that when the robot learner is not visually attending to the human teacher, the human will first initiate a face-look at the robot and expect eye contact, followed by a gaze event at the same attentional object (see e.g., Fig. 4(d)); even more complicated, the human partner shows a repeated pattern of using the hand to manipulate that object in the hope of re-engaging the robot's attention (see e.g., Fig. 4(e)).

6. Related work


Pattern mining algorithms. Many studies have contributed to the efficient discovery of association rules and sequential patterns, e.g., [7,8,11,17,31,30]. These efforts typically employ temporal pattern mining methods developed for point-based events, which date back to Agrawal and Srikant in 1994 [7]. Although these data-driven approaches provide useful means to understand sequential relationships among temporal events, an intrinsic limitation is that many real-world data streams are in fact composed of real-valued interval-based events, in which a condition holds true for a duration of time. As Allen discusses in [10], interval-based event mining differs from point-based event mining in that interval-based events are time-stamped with continuous starting and ending timestamps rather than the discrete time points by which point-based events are represented. In addition to sequential orders of temporal events, many studies are interested in extracting statistically reliable temporal properties of sequential patterns, such as timings and durations. Interval-based event mining was first proposed by Kam and Fu [33] in the form of an algorithm that discovers temporal patterns for interval-based events. Similar to point-based event mining algorithms, interval-based event mining algorithms use an Apriori-like algorithm to find longer frequent patterns from examples. However, most interval-based event mining algorithms developed since Kam and Fu focus on finding temporal orders between events [21,25,28,29,24,39] (e.g., event A happens before/after event B) or delta-patterns, i.e. rules of the form $A \xrightarrow{[t_l, t_u]} B$, meaning that the delay between A and B lies between $t_l$ and $t_u$. Quantitative temporal properties, such as at what moment a temporal event happens and with what duration, have been an important


yet under-addressed component in various data-mining and data-analysis scenarios. To the best of our knowledge, [19,20,26] are the only efforts focused on analyzing continuous interval-based events using quantitative temporal mining algorithms. In [19], QTempIntMiner was proposed to represent temporal interval sequences using hyper-cubes and to extend GSP [9] with a temporal sequence clustering algorithm to extract typical temporal intervals. QTempIntMiner, however, has significant time complexity due to the sampling method and the EM algorithm that must be applied repeatedly during the Apriori procedure. Moreover, its frequency check of a candidate pattern is based on the symbolic signature and therefore cannot filter out infrequent example instances. In [20], QTIPrefixSpan, a recursive depth-first algorithm based on PrefixSpan [30], was proposed. However, the clustering in QTIPrefixSpan still needs to be conducted in the high-dimensional example space repeatedly during the depth-first search. In [26], QPrefixSpan was proposed to mine item sets with attached interval-based quantitative bounds. However, QPrefixSpan relies on a reference point (i.e. starting time, mid time or ending time) to order temporal events, and different kinds of reference points yield different mining results. Moreover, the clustering in QPrefixSpan needs to be conducted individually and recursively from the last temporal event to the first temporal event in the candidate pattern, which incurs significant time complexity.

Interactive data analytics systems. Hue [2] is an open source web interface for analyzing data with Apache Hadoop. It provides query and search interfaces in the form of SQL editors for manipulating popular data management systems such as Hive, MySQL, and Solr SQL; to support interactive analysis, it also provides Spark and Hadoop notebooks. The Jupyter notebook [3] is a web application for creating and sharing documents that contain live code, equations, visualizations and explanatory text. Through Jupyter's kernel and messaging architecture, the notebook allows code to be run in a range of programming languages, such as Python and R. Other popular data analytics and scripting environments include RStudio [6] and Matlab [4]. The pattern mining algorithms proposed in this paper can be wrapped as libraries and invoked through the aforementioned data analytics systems, as sketched below.
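For instance, a notebook session might drive the miner through a thin wrapper. The module and function names below (pesminer, mine_patterns) and their parameters are hypothetical illustrations, not an API shipped with the paper's code.

```python
# Hypothetical notebook usage; names and parameters are illustrative only.
import pesminer

patterns = pesminer.mine_patterns(
    data="hdfs:///streams/dyadic_interactions",  # placeholder dataset location
    f_min=0.05,        # frequency threshold, as in the experiments above
    similarity=0.9,    # pattern similarity threshold
)
for p in patterns[:5]:
    print(p.events, p.score)
```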

7. Conclusion and future work

Big data analytics requires two types of iterations: iterative computation supported by advanced distributed computing models, and iterative knowledge discovery enabled by visualization and visual mining. The goal of this paper is to discuss how to construct closed-loop big data analytics with visualization and scalable computing. We first present a novel temporal data mining method focused on extracting exact timings and durations of sequential patterns from large-scale datasets of multiple temporal events. We then show how to extend our design to allow users to apply their top-down knowledge of the dataset to supervise the knowledge discovery process and redirect the machine computation over larger data sets. More importantly, we demonstrate a means to exploratively analyze large-scale multi-modal data streams with the human in the loop, by integrating interactive visualization, parallel data mining, and information representation. Focusing on multi-source data streams collected from longitudinal multi-modal communication studies, our experimental results demonstrate the capability of detecting and quantifying various kinds of statistically reliable and scientifically meaningful sequential patterns from multi-modal data streams. Starting from this large-scale data exploration framework and the parallel sequential pattern mining algorithm, we plan to extend our current research to various multi-modal social scenarios and scientific experiment settings.

References

[1] Gordon homepage, http://www.sdsc.edu/us/resources/gordon/.
[2] Hue homepage, http://gethue.com/.
[3] Jupyter homepage, http://jupyter.org/.
[4] Matlab homepage, https://www.mathworks.com/.
[5] Parallel sequential pattern miner homepage, https://github.com/guangchen/parallel-sequential-pattern-miner.
[6] RStudio homepage, https://www.rstudio.com/.
[7] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB'94, Santiago, Chile, 1994, pp. 487–499.
[8] R. Agrawal, R. Srikant, Mining sequential patterns, in: Proceedings of the 11th International Conference on Data Engineering, ICDE'95, 1995, pp. 3–14.
[9] R. Agrawal, R. Srikant, Mining sequential patterns: generalizations and performance improvements, in: Proceedings of Advances in Database Technology, EDBT'96, 1996, pp. 1–17.
[10] J.F. Allen, Maintaining knowledge about temporal intervals, Commun. ACM 26 (11) (Nov. 1983) 832–843.
[11] J. Ayres, J. Gehrke, T. Yiu, J. Flannick, Sequential pattern mining using a bitmap representation, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'02, Edmonton, Canada, 2002, pp. 429–435.
[12] Y. Bu, B. Howe, M. Balazinska, M.D. Ernst, HaLoop: efficient iterative data processing on large clusters, in: Proceedings of the VLDB Endowment (VLDB'10), vol. 3, Sept. 2010, pp. 285–296.
[13] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: Sixth Symposium on Operating System Design and Implementation, OSDI'04, CA, USA, vol. 37, Dec. 2004.
[14] M. Dolinsky, W. Sherman, E. Wernert, Y.C. Chi, Reordering virtual reality: recording and recreating real-time experiences, in: Proceedings of SPIE The Engineering Reality of Virtual Reality 2012, Feb. 2012, pp. 155–162.
[15] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative MapReduce, in: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC'10, 2010, pp. 810–818.
[16] D. Fricker, H. Zhang, C. Yu, Sequential pattern mining of multimodal data streams in dyadic interactions, in: Proceedings of the 2011 IEEE International Conference on Development and Learning, ICDL'11, Aug. 2011, pp. 1–6.
[17] M.N. Garofalakis, R. Rastogi, K. Shim, SPIRIT: sequential pattern mining with regular expression constraints, in: Proceedings of the 25th International Conference on Very Large Data Bases, VLDB'99, Edinburgh, Scotland, 1999.
[18] J.C. Guerri, M. Esteve, C. Palau, M. Monfort, M.A. Sarti, A software tool to acquire, synchronise and playback multimedia data: an application in kinesiology, Comput. Methods Programs Biomed. 62 (1) (May 2000) 51–58.
[19] T. Guyet, R. Quiniou, Mining temporal patterns with quantitative intervals, in: Proceedings of the IEEE International Conference on Data Mining Workshops, ICDMW'08, Dec. 2008, pp. 218–227.
[20] T. Guyet, R. Quiniou, Extracting temporal patterns from interval-based sequences, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, IJCAI'11, 2011, pp. 1306–1311.
[21] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, FreeSpan: frequent pattern-projected sequential pattern mining, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'00, Boston, USA, 2000, pp. 355–359.
[22] J.D. Leeuw, Applications of convex analysis to multidimensional scaling, Recent Develop. Stat. (1977) 133–146.
[23] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[24] F. Moerchen, D. Fradkin, Robust mining of time intervals with semi-interval partial order patterns, in: Proceedings of the SIAM Conference on Data Mining, SDM'10, 2010.
[25] M. Mouhoub, J. Liu, Managing uncertain temporal relations using a probabilistic interval algebra, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, SMC'08, Oct. 2008, pp. 3399–3404.
[26] F. Nakagaito, T. Ozaki, T. Ohkawa, Discovery of quantitative sequential patterns from event sequences, in: Proceedings of the IEEE International Conference on Data Mining Workshops, ICDMW'09, Dec. 2009, pp. 31–36.
[27] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford InfoLab, 1999.
[28] P. Papapetrou, G. Kollios, S. Sclaroff, D. Gunopulos, Discovering frequent arrangements of temporal intervals, in: Proceedings of the Fifth IEEE International Conference on Data Mining, Nov. 2005, pp. 354–361.
[29] D. Patel, W. Hsu, M.L. Lee, Mining relationships among interval-based events for classification, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD'08, Vancouver, Canada, June 2008, pp. 393–404.
[30] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M.-C. Hsu, PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth,


in: Proceedings of the 17th International Conference on Data Engineering, ICDE'01, 2001, pp. 215–224.
[31] H. Pinto, J. Han, J. Pei, K. Wang, Multi-dimensional sequential pattern mining, in: Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM'01, Atlanta, USA, 2001.
[32] K. Rose, E. Gurewitz, G. Fox, A deterministic annealing approach to clustering, Pattern Recognit. Lett. 11 (9) (Sept. 1990) 589–594.
[33] P.-S. Kam, A.W.-C. Fu, Discovering temporal patterns for interval-based events, in: Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery, DaWaK'00, 2000, pp. 317–326.
[34] A. Shinnar, D. Cunningham, V. Saraswat, B. Herta, M3R: increased performance for in-memory Hadoop jobs, in: Proceedings of the VLDB Endowment (VLDB'12), vol. 5(12), Aug. 2012, pp. 1736–1747.
[35] T. White, Hadoop: The Definitive Guide, third edition, O'Reilly Media/Yahoo Press, May 2012.


[36] X. Ye, M.C. Carroll, Exploratory space–time analysis of local economic development, Appl. Geogr. 31 (3) (July 2011) 1049–1058.
[37] C. Yu, D. Yurovsky, T. Xu, Visual data mining: an exploratory approach to analyzing temporal patterns of eye movements, Infancy 17 (1) (2012) 33–60.
[38] C. Yu, Y. Zhong, T.G. Smith, I. Park, W. Huang, Visual mining of multimedia data for social and behavioral studies, in: Proceedings of the IEEE Symposium on Visual Analytics Science and Technology, VAST'08, Oct. 2008, pp. 155–162.
[39] A.B. Zakour, S. Maabout, M. Mosbah, M. Sistiaga, Uncertainty interval temporal sequences extraction, in: Proceedings of the 6th ICISTM International Conference, ICISTM'12, 2012, pp. 259–270.
[40] H. Zhang, D. Fricker, T.G. Smith, C. Yu, Real-time adaptive behaviors in multimodal human–avatar interactions, in: Proceedings of the International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ICMI-MLMI'10, 2010.
