Decision Support Systems 48 (2010) 548–558
Toward user patterns for online security: Observation time and online user identification

Yinghui (Catherine) Yang (Graduate School of Management, University of California, Davis, AOB IV, One Shields Ave., Davis, CA 95616, USA; [email protected]) and Balaji Padmanabhan (College of Business, University of South Florida, 4202 East Fowler Ave., Tampa, FL 33620, USA; [email protected])
Article history: Received 24 December 2008; Received in revised form 21 October 2009; Accepted 6 November 2009; Available online 14 November 2009

Keywords: Web usage mining; Behavioral signatures; Online security; User identification; Biometrics; Electronic commerce
Abstract

Research in biometrics suggests that the time period a specific trait is monitored over (i.e. observing speech or handwriting "long enough") is useful for identification. Focusing on this aspect, this paper presents a data mining analysis of the effect of observation time period on user identification based on online user behavior. We show that online identification accuracies improve when user data is pooled over sessions, and we present results that quantify the number of sessions needed to identify users at desired accuracy thresholds. We discuss potential applications of this for verification of online user identity, particularly as part of multi-factor authentication methods.
1. Motivation

Humans are believed to have many unique characteristics such as fingerprints and handwriting styles. We use the term "signatures" here to refer to distinguishing characteristics that are behavioral (e.g. writing styles), as opposed to characteristics that are physiological (e.g. fingerprints). The applications of methods for unique identification are significant, ranging from forensics and law enforcement to novel biometrics-based access to personal information that protects user privacy and mitigates fraud. The development and perfection of such unique distinguishing characteristics continues to be an important area of research.

Given the vast impact technology has in everyday life, there has naturally been interest in recent years in whether there might be unique signatures in technology-mediated applications. Ref. [24] shows that users have distinct ways in which they use computer keyboards and that users have unique keystroke dynamics. Ref. [7] extends this work to the use of mouse movements in addition to keystroke dynamics and notes that the combination can often be used to uniquely identify humans. Ref. [20] shows that authors have unique writing styles that enable identifying them from text. In a similar vein, Ref. [19] shows that users have unique writing patterns when they author content for online message boards. A recent article [10]
[email protected] (Y.(C.) Yang),
[email protected] (B. Padmanabhan). 1 Tel.: + 1 813 974 6763. 0167-9236/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2009.11.005
published in Nature that studied user mobility patterns with cell phone GPS data showed that, perhaps not surprisingly, most users tend to have quite predictable daily mobility patterns. In the same spirit, are there unique "clickprints" based on how users browse, or consume content, online? This is an open question, the answer to which can have significant implications for applications such as online fraud detection and product recommendations.

If individuals can be identified based on online patterns, even to a reasonable extent, then there are important server-side and client-side applications. As an example of a server-side application, if a firm can identify a user who is not explicitly signed in, there may be opportunities for targeted recommendations. Of course in this case the firm will have to consider the online privacy needs of its customers and the firm's own privacy policy before any such action. On the client side there may be important practical applications of such technology that may mitigate online fraud and identity theft, issues that are known to be important to consumers [15]. For instance, users may opt in to download client-side software from a trusted third party (e.g. a firm such as Verisign that is known to provide certification and authentication services) that will track client-side activities to build user identification models. Such models may be used to provide behavioral authentication services on behalf of the user. For instance, when this user makes a large online brokerage transaction, the financial institution may, in real time, query the client-side software for a "user score". If the returned score suggests that the user is unlikely to be who they claim to be, the firm may then proceed to seek additional
information. Such an application may offer users real benefits such as fraud and online identity theft mitigation, while being sensitive to privacy concerns due to its opt-in nature and the limited data (a user score) that it reveals, with consent, to third parties. In this research we do take such a "user-centric" perspective — the data we analyze is user-centric browsing data and the results in this research are relevant to client-side applications such as the one noted here.

Related to the client-side security application, a key US federal agency, the Federal Financial Institutions Examination Council (FFIEC), recently issued guidance entitled Authentication in an Internet Banking Environment (http://www.ffiec.gov/pdf/authentication_guidance.pdf). This document notes that: "existing authentication methodologies involve three basic "factors":

• something the user knows (e.g., password, PIN);
• something the user has (e.g., ATM card, smart card); and
• something the user is (e.g., biometric characteristic such as a fingerprint)."

The guidance notes that fraud and identity theft are often the result of exploiting single-factor authentication systems and suggests that multi-factor authentication methods are stronger fraud deterrents. Indeed, deterrence as a mechanism to improve IT security has been stressed in the IS literature. It is known that just the use of security software by firms can deter computer abuse from a network intrusion perspective [28]. While the accuracy of such systems matters, it is often the deterrence that comes from accuracy that actually contributes to better security [4]. The hypothetical client-side application discussed previously, if designed appropriately, may provide one such additional factor in a multi-factor approach to fraud deterrence. Designing such a system will require developing accurate user identification models. This in turn requires a deeper understanding of the factors that can result in better or worse identification accuracies. Our research in this paper focuses on one such factor, as we note below in Section 2.

It should also be noted that there are efforts on the part of Internet service providers to improve security for all users. In this context, recent research [35] has proposed certification mechanisms as a manner in which incentive alignment can be achieved. Indeed such efforts are complementary to better client-side approaches for security. Further, user and computer security within organizational settings has naturally attracted specific attention in research. Ref. [12] takes such a broader view of security in organizations, notes that organizations should study employee "security behavior" in detail, and discusses organizational mechanisms for this purpose. Related is the work of Ref. [11], where a genetic algorithm is used to determine an organization's optimal security profile to balance cost as well as risk.

2. Focus of this research

There are online user behavior theories, most notably the research in Web usage mining [1,26,32], which suggest that user behavior is not random and that there is often a purpose that translates into revealed online behavior. However, they do not provide specific answers on how unique the revealed behavior is. On the other hand, there has been substantial work in biometrics over the last few decades that has specifically studied user identification from various characteristics such as fingerprints, handwriting or speech. Much of the biometrics work focusing on user identification has been experimental, and has collectively highlighted two intuitive aspects:
1. The quality of user data impacts identification accuracy. In the handwriting and fingerprinting literature, quality refers to image quality as measured by the resolution or number of pixels. There is evidence in this literature [16] that higher quality improves identification accuracies. For user identification from online behavior, the quality of data can be measured by the features created from behavior. A large number of features can be generated from every click or page viewed online. The associated intuitive hypothesis here is that better features result in better user identification models.

2. The quantity of user data impacts identification accuracy. In the speech recognition and handwriting literatures, quantity refers to how long this behavior is observed or monitored. The results show that if speech and handwriting can be observed "long enough", fairly accurate models can be built [14]. For user identification from online behavior, quantity is a measure of how much user data is observed. Intuitively we expect this result to hold as well. For instance, from just one session of browsing data a user may not look sufficiently different from others, but over time there might be enough data to highlight differences.

In this paper we focus on studying the quantity aspect in online user identification. By doing so we hope this research can shed light on how much user data is needed before accurate user identification models can be obtained. (Our approach here differs from the basic PAC learning [25] formalism that has been used in machine learning to relate the amount of training data to prediction errors in a probabilistic framework; unlike in traditional PAC learning, our approach generates a different set of features across varying time periods of aggregation.) Insights into this issue are significant for client-side applications such as the one noted above. To our knowledge this is the first research paper that addresses this question. Further, the methodology developed here can be used to study this aspect for any given quality of features used for identification.

For convenience we will use the term "aggregation" to describe the process of observing and collecting data over longer time periods. While time is a useful notion to intuitively measure "how much" data is needed, we focus our analysis in this research on aggregation over multiple Web sessions as opposed to time periods. We note that the analysis methodology presented here can be directly applied to time if desired. However, a user session is a commonly used unit of analysis when describing online behavior, and "long enough" in this context refers to observing a user over an adequate number of sessions. In this paper we answer the following questions:

(Q1a). Does aggregation result in improved user identification based on online behavior?

While prior research in biometrics has shown the value of aggregation for other problems, to our knowledge this work is the first to present such analysis for online user identification. Rather than treating this as a single hypothesis to test, we break this down into a series of tests as described below. Intuitively we expect that as the number of users in the population grows, user identification will be more difficult. Hence the significance of aggregation can be expected to depend on the number of users considered in a dataset. Further, we restrict our consideration to a range of different aggregation levels and test if accuracies at a specific level of aggregation are lower than the accuracies at the immediately higher level of aggregation. Hence we answer Q1a in a table where the rows correspond to varying numbers of users (specifically M = 2, 3, 4, 5, 10, 15, 20, 50 and 100) and the columns represent pairs of adjacent aggregations (2 over 1, 3 over 2, …, 10 over 9). Each cell represents a hypothesis that the accuracies corresponding to the higher level of aggregation are greater than the accuracies corresponding to the lower aggregation.

(Q1b). What are the accuracy gains from aggregation for online user identification?

Testing specific hypotheses related to Q1a will answer whether aggregation is useful.
The empirical analyses needed for this will also provide data to compute specific magnitude gains. These gains will provide an understanding of how much the predictive accuracies improve. Our results show that aggregation is indeed useful and that the value and amount of aggregation depend on problem difficulty, as measured by the number of users to distinguish among. Specifically, the larger the number of users, the greater the value from pooling data across sessions. This finding leads us to an important empirical question (Q2) which we study next.

(Q2). What is the minimum aggregation needed to attain a specific threshold accuracy for online user identification?

As noted above, as the number of users in the population grows, user identification is more difficult, and hence more aggregation is needed to attain a desired accuracy. Hence here we seek to answer how much aggregation is needed to achieve a desired accuracy level for a specific number of users. This is an empirical question, and we seek to develop a table where the rows correspond to different numbers of users, the columns correspond to different accuracy thresholds, and the cells are the aggregation levels needed on average to achieve the desired accuracy threshold. This is an important empirical contribution since it will provide insight into how much user-centric (client-side) information may be needed before users can be identified based on their online behavior. To our knowledge there has been no prior work that has provided such an analysis.

3. Theoretical bases and literature review

Studying the impact of aggregation on online identification accuracies is new and there is no prior research that has addressed this specifically. However, our research questions here are built on several important ideas that have been well developed in the literature, particularly prior work in the areas of Web usage mining, methods for biometric identification and behavioral signatures. Collectively they support our research in the following manner:

• The Web usage mining literature provides ample evidence that suggests that online user behavior is not random and that user patterns do exist.
• Research in biometrics and behavioral signatures suggests that users may have unique behavioral traits in some areas and that aggregation can be important for identification.

However, as noted earlier, the link between aggregation and online user identification has not yet been studied. We discuss the major related results in these areas below.

3.1. Web usage mining

An early technique developed in Ref. [2] shows how association rules could be generated from online data to build profiles. Ref. [1] describes a system for learning profiles that capture both facts about users as well as behavioral rules learned from transaction data. Instead of just individual user profiles, there has also been work on learning aggregate profiles, such as Ref. [23], which builds profiles that can apply across groups of users who may have similar interests. Rather than learning profiles from clickstream data, Ref. [9] describes an approach that unobtrusively monitors user activities on pages, such as how much they scroll, to build profiles that capture user interest in specific pages or content. Reviews of work in this area are presented in Ref. [27] and Ref. [17]. The main reason for learning profiles has been to use these for personalization and product recommendations. However, these applications do not require that users have unique profiles. On the contrary, it is recognized that these may be similar since it is possible that many users share the same interests or buy similar products. While the observation that users may be similar in certain respects is intuitive, and has direct marketing applications, it does not preclude the notion that there may be systematic differences as well. Differences are important given potential applications to online fraud, but these have not been studied as extensively. However, this literature does make the key point that users have behavioral signatures (not assumed to be unique) that manifest themselves in clickstream data. Motivated specifically by the signature learning problem, Ref. [32] shows how a pattern-based clustering approach can be used to group Web transactions such that individual user sessions could be accurately separated. This approach is based on choosing a representation for user signatures and still used only single sessions (with no aggregation considered) as the unit of analysis to learn patterns. On a broader note, this literature provides substantial empirical evidence of patterns in online behavior. However, the questions that have not been addressed are whether such patterns may sometimes enable user identification and what effect aggregation may have in this process. Finally, research in information foraging [26] also suggests that individual browsing behavior is not random and may be explained by a combination of rules and information scent-based heuristics. In this paper we focus more on understanding the extent and conditions under which any underlying patterns lend themselves to online user identification than we do on the basis for any non-randomness in information seeking behavior online.

3.2. Biometrics and behavioral signatures

Converting physiological characteristics of users into techniques for biometric identification has been an active area of research for several years, and Ref. [22] presents a review of many such approaches. With the widespread use of technology, recently there has been substantial interest in identifying unique behavioral characteristics that can possibly serve as identifiers. Ref. [24] shows that users have distinct ways in which they use computer keyboards and that users have unique keystroke dynamics. They show that, depending on the classifier used, between 80 and 90% of users can be automatically recognized using features such as the latency between keystrokes and the length of time different keys are pressed. Ref. [7] extends this work to the use of mouse movements in addition to keystroke dynamics and notes that the combination can often be used to uniquely identify humans. Ref. [20] shows that authors have unique writing styles that enable identifying them from text. In a similar vein, Ref. [19] studies online message board postings and shows that users may have online "writeprints" — unique ways in which they write messages on online bulletin boards. A post on a message board is converted to a large number of features (such as the number of sentences and the frequency of specific characters), and Ref. [19] applies a genetic algorithm based feature selection method to extract a set of subfeatures that are then used to identify users. Experiments involving 10 users in two different message boards suggest that "writeprints" could well exist since the accuracies obtained were between 92 and 99%. Research has also shown that humans may have distinct patterns even in everyday activities like walking and driving. Ref. [13] shows how driving behavior, such as the extent of braking, often can be used to identify the driver of a vehicle. Ref. [21] shows that users may also have unique gait patterns relating to how they walk when they talk on mobile phones. Such gait patterns are then used to provide an extra layer of security for personal information in mobile devices. Related to the concept of aggregation, in the field of automatic speaker recognition, Ref. [3] states that there is generally a tradeoff between accuracy and the duration of the speech used in the training and testing period. It has also been well accepted in the human speaker identification field that the longer the duration of the speech observed, the higher the probability of correct identification [14].

Ref. [6] states that exposure duration seems to affect identification accuracy positively. Ref. [18] shows a significant performance improvement when exposure duration increases. Refs. [33,34] also conclude that longer exposure duration increases the probability of correct identification, but they also find an increase in the number of false alarms. Ref. [8] states that the performance improvement with longer test sample duration for a standard speaker recognition system eventually levels off. It is also generally accepted in the field of handwriting identification that the use of several words can provide better information than a single signature [36]. Also related is work on learning user signatures for fraud detection. From a database of fraudulent and normal transactions, user signatures may be built using machine learning techniques that can then be used to predict if any transaction is fraudulent. Ref. [5] shows how user call records can be used to learn caller signatures. These signatures can then be tested against a current call to determine if fraud occurred. While this is a different application, building statistical measures across calls and generating feature distributions across multiple Web sessions are similar in spirit and lend further support to our hypothesis relating aggregation and identification accuracies.
4. Methodology for Q1a

In this section we describe the methodology used to address Q1a (Does aggregation result in improved user identification based on online behavior?). Computing magnitude gains to answer Q1b can be done directly from this process. For a given number, M, of users, determining the significance of one aggregation level over the immediately next level is done using the following procedure:

1. Randomly sample M users' user-centric Web sessions.
2. Partition the session-level data into training and testing datasets.
3. Preprocess the training and testing datasets corresponding to different aggregation levels.
4. Corresponding to each aggregation level, build a classifier and record its holdout accuracy.
5. Repeat the analysis above for a large number of runs.

The output of the process above is a column of accuracy values for each aggregation level. From these columns, significance tests across aggregation levels are performed and magnitude gains are recorded. Below we describe in detail how each of these steps is done.

4.1. Sampling

The data provided to us was the anonymized client-side (user-centric) browsing behavior of a random sample of 50,000 users over one year. There are two criteria that we used when selecting the users to sample from this population. First, each panelist in the sample represents a household that is tracked. For our analyses it is important for the clickstream data to represent a single user's data, as opposed to a household's. Hence we restrict our random selection of users to those corresponding to a household size of one in the household demographic file provided. Second, we require users with enough browsing activity such that we have enough training and holdout data on which the models can be trained and tested. Fig. 1 shows the histogram of the total number of sessions across all users corresponding to a household size of one (the first bar corresponds to the number of users with <100 sessions, the second bar indicates the number of users with 100–300 sessions, and so on).

Fig. 1. Histogram of the number of sessions across users.

The larger the (minimum) number of sessions per user, the greater the training as well as the testing datasets available. However, using a higher value may bias the sample towards users who are more active online. In the analysis here we used 300 sessions as the minimum cutoff to ensure that we would have sufficient data. Hence we randomly sampled users from those who had at least 300 sessions in the data. Given that the time period was one year, this suggests that our analysis would not generalize to users who are online for more limited times (less than a session a day on average). This is a limitation of our study which we recognize in Section 8.

4.2. Creating training and testing datasets

When a given number of users are selected, the priors in the data may be unequal to start with. For instance, when constructing a dataset of three users, the number of sessions for each user may be 300, 600 and 2100 (resulting in class priors of 0.1, 0.2 and 0.7 respectively). A naïve classifier predicting the most frequent class would start with an accuracy of 70%. To prevent this from providing any accuracy gains, we pick the same number of sessions for all users selected in any given sub-dataset (in the example just mentioned this would mean selecting the first 300 sessions for each user). This guarantees exactly equal class priors (a 20 user dataset would therefore have exactly 5% class priors for all the classes). Note that this means ignoring the later sessions of some of the users selected. However, it guarantees that any dataset of M users used in our study, at any level of aggregation, will always yield a random guessing accuracy of 1/M. This uniform baseline permits easy comparison across datasets, enhances interpretability of the results and ensures that any improved accuracy observed is due to behavioral patterns as opposed to an artifact of class distributions. A limitation of this, though, is that the actual accuracy numbers will not reflect the distribution on real data, where the priors will be different for different users.

For each user, we then keep the first 2/3 of their sessions in a master training dataset and the last 1/3 of their sessions in a master testing dataset. For any level of aggregation, the training dataset will only be based on the master training dataset and the testing will be done only from this master testing dataset. We do this to ensure that even with aggregations (i.e. pooling sessions together) there is no overlap whatsoever between any testing and training dataset.

4.3. Constructing datasets for different aggregation levels

There are two important issues here — how features are computed from aggregation, and the actual choice of features for this study. In Section 4.3.1 we describe how features are computed from aggregation. In Section 4.3.2 we then present the specific features used in this study.

4.3.1. Aggregation and feature creation

The feature creation procedure is used to generate features that can be used in classification models for different aggregation levels. Clearly, in the experiments we can use any given method for creating single-session and multi-session features. A statistical approach to doing so is to first determine the core set of variables that can be created for a single session, and then use groups of sessions to learn
the distributions of these variables. As sessions are aggregated, we obtain better estimates of the distribution of the core variables for each user. For instance, for a single session the core set of features may be session time, number of pages viewed and advertisements seen. Now, given a set of sessions, the procedure may be defined to return: ⟨avg. session time = 5.3 min, variance of session time = 1.3, avg. number of pages = 4, variance of number of pages = 1.1, avg. number of ads = 7, variance of number of ads = 2.2, userid = 4⟩.

Consider the specific example in Fig. 2 where, for simplicity, we just use the averages of the variables to represent the distributions. The term D^v_agg refers to the derived dataset at aggregation level agg, and F is the chosen feature creation method. There are two observations that we wish to draw attention to from this example:

1. A sliding window approach is used to generate the derived datasets at a given level of aggregation. Hence if the number of sessions for a user is m, then the number of records in the dataset for that user at level agg is m − (agg − 1).
2. Since the prior probabilities of the M different users are the same in the initial dataset (as noted in Section 4.2 above), the priors in any derived dataset are also exactly 1/M. Hence aggregation by itself will not artificially impact accuracy.

4.3.2. Specific features used in this study

The raw data provided to us in the user-centric dataset contains information about basic page/time statistics during a session and the specific list of domain names (Web sites) visited in the user's session. Hence for each user session we construct four continuous measures: (i) the duration of the session, (ii) the number of pages viewed, (iii) the starting time (in seconds after 12:00 am) and (iv) the number of sites visited. Since the specific sites visited are important features, for each user session we also construct a set of binary indicators for specific Web sites indicating whether there was access to a specific site in a session. Given the very large number of domains, it is impossible to construct a binary indicator for every domain in the data. Instead we construct this set of features in the following manner. Assume that for a three user dataset, there are 900 total user sessions. Recall that 2/3 of this (600 sessions) is used as the training data. From this training data (the 600 sessions) we extract the top k sites for each of the three users (we report results for k = 5 and 10) and take the union of these sites to create specific binary variables. Since there can be common Web sites across different users, this results in p ≤ 3 · k binary indicators. Note that, as expected, the binary features created here are derived only from the training data and do not use any information from the sessions in the holdout data.

During aggregation, the value of these p binary variables is decided depending on whether the p sites appear in the sessions grouped together. For example, let {site1, site2, site3} represent the p (= 3) sites used as the binary variables (note: these p sites are derived from the entire training set, and do not change as the aggregation level changes). When the aggregation is two sessions, each data record corresponds to two sessions, and the values of the p variables will be 0 or 1 depending on whether these two sessions contain these three sites. The same logic applies for other aggregation levels.

Hence for a single user session we construct four continuous variables and p binary variables. For any aggregation we compute the mean, median, variance, maximum and minimum values for the four continuous measures, giving us 5 × 4 = 20 specific variables. Based on the p binary variables in each single session, for any group of sessions we construct p integer variables that represent the counts of the binary variables in this group. Hence for any aggregation we construct a total of 20 + p variables plus the categorical dependent variable (userid).
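To make the feature creation concrete, the sketch below shows one way to implement the sliding-window aggregation of Sections 4.3.1 and 4.3.2 in Python (the paper's own experiments used weka, not Python). The dictionary keys (`duration`, `pages`, `start_time`, `num_sites`, `sites`) and the function names are illustrative assumptions rather than the authors' implementation.

```python
import statistics

# Illustrative sketch (not the authors' code): session-level features are the four
# continuous measures of Section 4.3.2 plus binary indicators for the top-k sites;
# a sliding window of size `agg` turns groups of consecutive sessions into one record.

CONTINUOUS = ["duration", "pages", "start_time", "num_sites"]

def session_features(session, top_sites):
    """One session -> 4 continuous values + p binary site indicators."""
    row = {m: session[m] for m in CONTINUOUS}
    for site in top_sites:                      # p binary indicators (0/1)
        row[site] = int(site in session["sites"])
    return row

def aggregate(sessions, top_sites, agg, userid):
    """Sliding window of size `agg`: m sessions -> m - (agg - 1) records."""
    rows = [session_features(s, top_sites) for s in sessions]
    records = []
    for start in range(len(rows) - agg + 1):
        window = rows[start:start + agg]
        rec = {}
        for m in CONTINUOUS:                    # 5 statistics x 4 measures = 20 variables
            vals = [r[m] for r in window]
            rec[f"{m}_mean"] = statistics.mean(vals)
            rec[f"{m}_median"] = statistics.median(vals)
            rec[f"{m}_var"] = statistics.pvariance(vals)
            rec[f"{m}_max"] = max(vals)
            rec[f"{m}_min"] = min(vals)
        for site in top_sites:                  # p count variables for the site indicators
            rec[f"count_{site}"] = sum(r[site] for r in window)
        rec["userid"] = userid                  # categorical class label
        records.append(rec)
    return records
```

With agg = 1 this reduces to the single-session representation, and because the window slides within one user's sessions at a time, the exactly equal class priors from Section 4.2 carry over to every derived dataset.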
4.4. Build classifiers and record accuracies

We chose weka's J4.8 [29] as the classifier since classification trees in general have been shown to be highly accurate classifiers. The specific choice of J4.8 was also for convenience, since weka is an open source data mining platform that lends itself easily to automation within scripts. We also ran experiments using neural networks, and the general conclusions do not change. As noted previously, we use time-based holdout testing where the first 2/3 of the data is used for model building, while the last 1/3 is used as the holdout set on which the accuracies are measured. Hence if 300 sessions are selected for each of three users, the first 200 sessions of each user are used in the training set, while the last 100 sessions of each user are in the holdout set. When a model corresponding to agg = 10 is built, it will therefore use 191 (= 200 − 10 + 1) points per user from the training data, and 91 (= 100 − 10 + 1) points per user from the holdout set. This guarantees that there is no overlap between the training and testing sets at any point in the process. Note that this process ensures that the holdout set also always has exactly equal class priors, making it difficult to improve on the baseline accuracy of 1/(# of users) by chance.
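One run of the five-step procedure above can be sketched end to end as follows. scikit-learn's DecisionTreeClassifier is used here purely as a stand-in for weka's J4.8 (both are C4.5-style decision trees), and `top_k_sites` and `aggregate` are assumed helpers (the latter as in the previous sketch); none of this is the authors' actual pipeline.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def run_once(user_sessions, k=5, max_agg=10):
    """One run of the Section 4 procedure.
    user_sessions: dict userid -> chronologically ordered list of sessions.
    Returns {agg: holdout accuracy} for agg = 1 .. max_agg."""
    # Equal class priors: keep the same number of (earliest) sessions per user.
    n = min(len(s) for s in user_sessions.values())
    sessions = {u: s[:n] for u, s in user_sessions.items()}
    split = (2 * n) // 3                                   # first 2/3 train, last 1/3 test
    train = {u: s[:split] for u, s in sessions.items()}
    test = {u: s[split:] for u, s in sessions.items()}

    # Top-k sites per user come from the training sessions only (assumed helper).
    top_sites = sorted({site for s in train.values() for site in top_k_sites(s, k)})

    accuracies = {}
    for agg in range(1, max_agg + 1):
        tr = pd.DataFrame([r for u, s in train.items() for r in aggregate(s, top_sites, agg, u)])
        te = pd.DataFrame([r for u, s in test.items() for r in aggregate(s, top_sites, agg, u)])
        te = te[tr.columns]                                # keep column order identical
        clf = DecisionTreeClassifier().fit(tr.drop(columns="userid"), tr["userid"])
        accuracies[agg] = clf.score(te.drop(columns="userid"), te["userid"])
    return accuracies
```

Repeating this for 50 random draws of M users, as described above, yields one column of accuracy values per aggregation level, from which the gains and significance tests of Section 5 are computed.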
Fig. 2. An example of creating datasets at different levels of aggregation.
Table 1
Accuracy gains (and p-values) across number of users and aggregation levels (columns show the increase in aggregation, e.g. 2/1 = agg 2 over agg 1).

# of users   2/1           3/2           4/3           5/4           6/5           7/6           8/7           9/8           10/9
2            0.85 (0.18)   0.95 (0.06)   -0.23 (0.6)   0.04 (0.49)   -0.64 (0.79)  -1.35 (0.92)  2.3 (0.05)    -0.7 (0.75)   0.7 (0.27)
3            1.34 (0.05)   0.25 (0.37)   0.77 (0.16)   0.15 (0.39)   0.19 (0.3)    -0.41 (0.62)  0.2 (0.41)    0.37 (0.14)   0.56 (0.11)
4            3.36 (0)      -0.16 (0.59)  1.44 (0.01)   -0.85 (0.92)  0.64 (0.11)   -0.18 (0.67)  -0.3 (0.83)   1.34 (0.04)   -0.31 (0.71)
5            3.88 (0)      1.28 (0)      0.28 (0.21)   -0.25 (0.72)  -0.05 (0.53)  1.09 (0.04)   -0.24 (0.74)  0.34 (0.21)   -0.79 (0.89)
10           5.25 (0)      1.2 (0)       0.76 (0.11)   -0.02 (0.51)  0.61 (0.09)   -0.83 (0.94)  0.43 (0.12)   0.05 (0.46)   -0.84 (0.92)
15           6.47 (0)      2.33 (0)      0.96 (0.11)   0.87 (0.01)   -0.56 (0.94)  -0.09 (0.59)  0.24 (0.3)    -0.01 (0.51)  -0.56 (0.91)
20           6.92 (0)      2.05 (0)      0.83 (0.02)   0.7 (0.02)    0.66 (0.02)   -0.13 (0.67)  -0.52 (0.92)  0.08 (0.41)   0.51 (0.08)
50           8.86 (0)      3.19 (0)      1.59 (0)      0.34 (0.07)   0.03 (0.46)   -0.06 (0.59)  0.32 (0.11)   0.03 (0.46)   -0.09 (0.62)
100          10.13 (0)     3.9 (0)       1.49 (0)      0.75 (0)      0.33 (0.04)   0.13 (0.27)   0.36 (0.06)   -0.4 (0.98)   -0.11 (0.7)
5. Results for Q1a and Q1b: Effect of aggregation

Table 1 summarizes the results of eighty-one tests examining the value of aggregation. Note that, as evident from the methodology, each of these tests is on a different dataset. Hence we do not face the problem of testing multiple hypotheses on a single dataset. Each cell in the table has an accuracy gain and a p-value computed from 50 different runs. For example, the values in the lower left cell (100, 2/1) are computed as follows. In each of 50 runs, 100 different users are randomly chosen, and the classifier is built for agg = 1 and agg = 2, thereby resulting in two columns of 50 accuracy values. From these two columns, the average accuracy gain due to the higher aggregation is computed, and a directional test of significance (testing whether the accuracy gain is significantly greater than zero) is used to compute the p-value. The average accuracy gains reported in this table are all absolute, not relative, figures. For example, in this particular cell (lower left), the average absolute accuracy improvement for agg = 2 over agg = 1 is 10.13%, and the test indicates significant value of aggregation with p-value = 0.

Does aggregation result in improved user identification online? On the face of it, the results seem mixed to negative, with only 26 out of 81 tests showing significant positive differences due to aggregation, almost suggesting a lack of value. However, the effect is not evenly distributed. There is a strong pattern where the lower left quadrant is almost entirely significant with high absolute gains. Specifically, 17 of the 20 cells in this quadrant show highly significant gains, suggesting that the main value from aggregation is in this range. These results hence show two effects:

1. The value of aggregation increases with problem complexity (# of users/classes). For the first few columns, the absolute accuracy gains on average increase with increasing complexity (i.e. going down the columns).
2. For more complex problems the gains from aggregation hold for a greater range of aggregations. For instance, for the 100 user datasets significant gains from aggregation exist all the way until agg = 6, while for the less complex problems the benefits seem to stop earlier.

This suggests that for a fixed complexity (i.e. # of users in the dataset) there may be a (different) minimum level of aggregation. This is what we investigate in Q2, where we seek to empirically determine this minimum value for different complexity levels. For illustration purposes, we graphed the average accuracy gains for 4 levels (number of users = 2, 5, 20, 100) in Fig. 3. As can be seen, the improvement levels off as the level of aggregation increases.

Table 2 examines specifically the value of going from no aggregation to agg = 2 as a function of the difference in problem complexity. While these values can be directly computed from those in Table 1, the results here focus on how much the accuracy gains increase when going from no aggregation to the first level of aggregation (agg = 2). The cell corresponding to the bottom left value with coordinates (100, 2) should be interpreted as follows. If the average increase in accuracy from going from no aggregation to agg = 2 for 100 user datasets is x, and if the same value for 2 user datasets is y, then x − y = 9.28 (the value in the cell). The significance test examines whether the accuracy gain for the higher number of users is significantly greater than that for the lower number of users. All numbers are positive and mostly significant, showing that as the relative problem complexity increases there is clear evidence of increased value from some aggregation.
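Each cell of Table 1 is therefore an average gain and a one-sided p-value over 50 paired runs. The paper does not name the specific test used, so the sketch below uses a paired one-sample t-test on the per-run differences as one plausible choice (scipy is an assumption, not part of the original toolchain).

```python
import numpy as np
from scipy import stats

def gain_and_pvalue(acc_low, acc_high):
    """acc_low / acc_high: holdout accuracies (in %) from the same 50 runs at two
    adjacent aggregation levels. Returns the average absolute gain and a one-sided
    p-value for the directional hypothesis 'gain > 0' (paired t-test on the
    per-run differences; the paper does not name the specific test it used)."""
    diff = np.asarray(acc_high) - np.asarray(acc_low)
    t, p_two_sided = stats.ttest_1samp(diff, 0.0)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return diff.mean(), p_one_sided
```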
Fig. 3. Average accuracy gains for four different numbers of users.
Table 2
Accuracy gains (and p-values) going from no aggregation to agg = 2: each cell is the gain for the row's # of users minus the gain for the column's # of users.

# of users   2             3             4             5             10            15           20           50
3            0.49 (0.347)
4            2.51 (0.015)  2.03 (0.016)
5            3.03 (0.005)  2.54 (0.003)  0.52 (0.245)
10           4.4 (0)       3.91 (0)      1.89 (0.007)  1.37 (0.033)
15           5.62 (0)      5.13 (0)      3.1 (0)       2.59 (0)      1.22 (0.035)
20           6.07 (0)      5.58 (0)      3.56 (0)      3.04 (0)      1.67 (0.01)   0.45 (0.22)
50           8 (0)         7.52 (0)      5.49 (0)      4.97 (0)      3.6 (0)       2.39 (0)     1.93 (0)
100          9.28 (0)      8.79 (0)      6.77 (0)      6.25 (0)      4.88 (0)      3.66 (0)     3.21 (0)     1.28 (0)
6. Methodology for Q2

Conditional on aggregation helping identification accuracy, in Q2 we seek to determine the minimum aggregation needed to attain a specific threshold accuracy for online user identification. This can clearly be done by exhaustive search, which may be preferred when the expected levels are likely to be fairly small. In Section 6.1 we present this formally, thereby clearly delineating the inputs which will affect the results. In Section 6.2 we present a heuristic that can apply in some cases when dealing with much larger problems.

6.1. Determining the minimum aggregation
Let D = {S1, S2, …, SN} be a Web browsing dataset representing a total of N Web sessions. Assume that in this dataset the number of unique users is M and users are identified by a userid ∈ {1, 2, …, M}. Next define F to be a feature creation procedure that takes a set of sessions belonging to the same user and returns a vector of attribute values ⟨v1, v2, …, vq, userid⟩. For any given level of aggregation (of sessions), agg, when F is repeatedly applied to sets of agg consecutive sessions – and in such a manner for all users – we derive a dataset D^v, on which a classifier, C, is built. As done in Section 4.3, when we need to make explicit the specific level of aggregation, agg, used in the process of creating D^v, we denote the dataset as D^v_agg. Here C is any classification algorithm that can be applied to data to learn the classification function g: V → U that maps any feature vector v ∈ V to a user ID in U, where U = {1, 2, …, M}. Hence we write g = C(D) to refer to the trained classifier built on the dataset D by applying algorithm C.

Fig. 4 formally presents the procedure for determining the smallest level of aggregation. Here we start with a set of sessions and first group all sessions belonging to the same user. For each user, we then sort all sessions chronologically, and then apply a sliding window of size agg to pick out all groups of consecutive sessions of size agg. The feature creation procedure is then applied to every such window to create feature vectors. A classifier is built on all the feature vectors constructed in this manner. If the goodness of this classifier is better than some threshold goodness, then we say that users can be identified at this (agg) level of aggregation. If not, the process repeats after incrementing agg by one, and terminates based on a user-specified stopping condition (we use a maximum agg of 30 for these experiments).
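In code, the procedure of Fig. 4 is a simple loop over aggregation levels. The sketch below is illustrative only; `build_and_score` is an assumed helper that applies F at level agg to the master training sessions, trains the classifier C, and returns its goodness (here, holdout accuracy) on the similarly aggregated master test sessions.

```python
def minimum_aggregation(train_sessions, test_sessions, threshold, max_agg=30):
    """Exhaustive search of Fig. 4: return the smallest agg whose holdout goodness
    reaches `threshold`, or None if the stopping condition (agg > max_agg) is hit.
    `build_and_score` is an assumed helper: apply F at level agg, train C, and
    return the classifier's goodness on the aggregated test sessions."""
    for agg in range(1, max_agg + 1):
        if build_and_score(train_sessions, test_sessions, agg) >= threshold:
            return agg
    return None
```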
Fig. 4. Determining the minimum level of aggregation needed.
Note that the goodness of the classifier is also considered an input since it can be based on different criteria. Without loss of generality we assume that goodness is defined such that a larger value corresponds to a better model. As in Section 4, in our experiments we define goodness as accuracy on a separate test dataset created from the last 1/3 of all user sessions (before any aggregation). The procedure to determine the minimum agg in Fig. 4 is for a given set of users. We iterate this over a large number of runs to derive empirical estimates for Q2.

6.2. A heuristic for large problems

In practice we may wish to limit the search to a small number of possibilities, and hence the procedure suggested in Section 6.1 may suffice; it is the methodology used for the experiments in this paper. However, for generality, in this section we consider cases where the different levels of aggregation to be considered are perhaps unknown at the outset (in the worst case, the total number of sessions a user has) or are too large to permit exhaustive search across different levels of aggregation. For such cases we present a heuristic that can reduce the complexity of the process, bringing it to O(log2 |D|).

Let g′ = C(D^v_i), denoting the classification function learned after applying the feature creation procedure F to the original dataset D at a level of aggregation i. Similarly let g″ = C(D^v_j). Also assume a given goodness measure T. If T can be guaranteed to be monotonic in the levels of aggregation considered, then binary search can be used to
determine minimum aggregation levels. The monotonicity assumption specifically is the following. Given C, D, F and T, T(g′) ≥ T(g″) whenever i > j, where, as noted earlier:

• g′ = C(D^v_i), g″ = C(D^v_j), and
• D^v_i refers to the aggregated dataset created by applying F to D at level of aggregation i.

This states that the goodness of the model when applied to more aggregated data is never worse than the goodness of the model applied to less aggregated data. It is difficult to prove that this holds for arbitrary C, D, F and T, but it may be possible to empirically test whether it is true within the range of aggregations that may be considered. When this is assumed to hold – which can be tested empirically – the procedure described below can be used to answer Q2 in a more efficient manner. Since the monotonicity assumption implies that the model goodness at increasing levels of aggregation is sorted in ascending order, binary search can be used in such cases, reducing complexity to O(log2 |D|). Using the notation in Section 6.1, we present the algorithm for this case (see Fig. 5).

When the sequence is not exactly in ascending order (i.e. when the assumption does not hold), the binary search can converge to higher levels of aggregation when perhaps smaller aggregations are adequate. In such cases it may be possible to modify the technique such that it does not consider each point per se, but uses neighborhoods around each point to compute an average goodness in the neighborhood of a specific point considered. This may make it feasible to deal with sequences that are mostly sorted, but not exactly so. This is not the focus of this paper but is an interesting possibility to consider in future work.
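Under the monotonicity assumption, the binary search of Fig. 5 finds the same minimum level with O(log2 |D|) classifier builds instead of a linear scan; the sketch below reuses the assumed `build_and_score` helper from the previous sketch and is not the authors' implementation.

```python
def minimum_aggregation_binary(train_sessions, test_sessions, threshold, max_agg=30):
    """Binary search of Fig. 5, valid when goodness is non-decreasing in agg.
    Returns the smallest feasible agg, or None if even agg = max_agg falls short."""
    lo, hi, answer = 1, max_agg, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if build_and_score(train_sessions, test_sessions, mid) >= threshold:
            answer, hi = mid, mid - 1      # feasible: try to find a smaller agg
        else:
            lo = mid + 1                   # not feasible: more aggregation needed
    return answer
```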
Fig. 5. Binary search for determining the minimum level of aggregation.
7. Results for Q2: determining the minimum level of aggregation

The results in Table 1 show that while there are benefits from aggregation, they do not consistently increase as the aggregation level increases, suggesting that monotonicity does not hold over a larger range of aggregation values. Hence we use the approach in Section 6.1 to determine the minimum level of aggregation in the range agg = 1 to agg = 30. Recall that one feature creation parameter was the number of top (k) sites to use as binary indicators, as discussed in Section 4.3. Tables 3a–d and 4a–d present a summary of the results for different accuracy thresholds for k = 5 and 10 respectively.

While the median (the second column) in all the above tables is computed based on all 20 runs, the mean is computed based only on the terminated runs, since in the other runs an agg does not exist or may be greater than 30 (the stopping criterion used). The last column shows the % of terminated runs — i.e. the % of runs where the threshold was reached before agg = 30. For instance, in the first row of Table 3a, the value in the last column (100%) indicates that all runs resulted in finding small agg values, and hence the stopping criterion (agg = 30) never came into effect.
Table 3
agg values across users for k = 5.

a (Accuracy threshold 75%)
Num of users   Median   Mean   % runs with agg < 30
2              1        1.45   100
3              1        1.15   100
4              1        1.1    100
5              1        1.61   90
10             1        1.3    100
15             1        1.5    100
20             1        1.45   100
50             2        2.43   100
100            3        3.71   100

b (Accuracy threshold 80%)
Num of users   Median   Mean   % runs with agg < 30
2              1        1.7    100
3              1        1.3    100
4              1        1.25   100
5              1        1.72   90
10             1        1.9    100
15             2        3      100
20             2        2.2    100
50             4        3.75   95
100            6.5      7.03   70

c (Accuracy threshold 85%)
Num of users   Median   Mean   % runs with agg < 30
2              1        1.85   100
3              1        1.7    100
4              1        1.82   85
5              1        1.82   85
10             2        3.55   100
15             3        5.41   85
20             3        3.68   95
50             6        6.91   55
100            N/A      N/A    0

d (Accuracy threshold 90%)
Num of users   Median   Mean   % runs with agg < 30
2              1        2.05   100
3              1        2      100
4              2        4.29   85
5              1.5      2.38   80
10             4.5      5.17   90
15             5        7.17   60
20             7        6.94   85
50             N/A      N/A    0
100            N/A      N/A    0
Table 4
agg values across users for k = 10.

a (Accuracy threshold 75%)
Num of users   Median   Mean   % runs with agg < 30
2              1        1      100
3              1        1      95
4              1        1      100
5              1        1      100
10             1        1.05   100
15             1        1.2    100
20             1        1.45   100
50             2        2.14   100
100            2        2.48   100

b (Accuracy threshold 80%)
Num of users   Median   Mean   % runs with agg < 30
2              1        1      100
3              1        1      95
4              1        2.25   100
5              1        1.1    100
10             1        1.65   100
15             2        2      100
20             2        2.5    100
50             3        2.95   95
100            4        4.8    95

c (Accuracy threshold 85%)
Num of users   Median   Mean   % runs with agg < 30
2              1        1      100
3              1        1      95
4              1        1.16   95
5              1        1.16   95
10             2        2.21   95
15             2        2.84   95
20             3        3.78   90
50             4        6.18   80
100            11.5     12.5   40

d (Accuracy threshold 90%)
Num of users   Median   Mean   % runs with agg < 30
2              1        1.05   100
3              1        1.26   95
4              2        1.78   90
5              2        2.16   95
10             3        4.24   85
15             5        5.2    75
20             6        8.9    50
50             13       13     10
100            N/A      N/A    0
As noted in detail in Section 4, all the accuracies here are computed on holdout datasets (the last 1/3 of the sessions selected for each user) created before any aggregation. The distribution of the agg values (beyond the mean/median values reported) can be seen from specific histograms. Fig. 6 plots the histograms of the values corresponding to a threshold accuracy of 90% for k = 5. The right tails of the histograms show the number of runs that were terminated due to the stopping condition. As noted before, terminating at agg = 30 without reaching threshold accuracies could mean (a) that there is a higher agg value or (b) that there is no agg for which the threshold accuracy can be reached. In these runs we did not test higher values since we have limited data, on which it may be difficult to accurately test high values. Assuming that termination implies that there is no agg for which users can be accurately identified, we plot the percentage of successful terminations (agg < 30) across datasets with different numbers of users.

The results in Tables 3a–d and 4a–d show that in most cases the models accurately predict user IDs for the range of users considered here, often with very few sessions. In the range of users considered here, these empirical results specifically indicate that: (1) the minimum level of aggregation needed increases with problem complexity, and (2) the minimum level of aggregation increases with threshold accuracy levels.
Fig. 6. Histograms of agg values across the 20 runs (accuracy threshold 90%, k = 5).
Qualitatively it is worth noting that the experiments demonstrated high levels of accuracy, often with low aggregations. For instance, attaining 75% accuracy levels in 100 user datasets from agg = 3 on average is remarkable given that, as described in Section 4, the class priors of these datasets were exactly 1%. The high accuracies obtained come from two simple factors that the biometrics literature had shown (as noted in Section 2 of this paper) — quality of data, specifically the features available to describe a single session, and the quantity of data, namely the number of sessions to be aggregated. Related to the quality factor, the set of sites visited is a strong predictor, as these experiments have shown. Across all the runs the holdout accuracies with no aggregation (i.e. for agg = 1) ranged from 56% to 98%, suggesting that the initial set of features is very informative for the classifier even at the single session level. Recall that the priors in the holdout datasets are exactly at 1/M, and hence even 56% for a 20 user dataset suggests substantial structure (from these features) in the data at the single session level alone. Aggregations did increase the accuracies, as the results show, but the gains are not always dramatic and are typically more useful for the more difficult datasets as discussed. Further, in these runs the gains are highest for the first few sessions aggregated (also perhaps due to the high initial accuracies), suggesting perhaps a diminishing returns effect.

8. Discussion

To our knowledge this is the first study that has systematically studied user identification from online behavior, particularly investigating the value of aggregation. At the highest level the results in this paper may have implications for online fraud detection as well as online privacy, as we note below.

These results do suggest that at the user-centric level we may be able to build reasonably accurate models identifying the user by observing "enough" data, at least for some users. This does suggest that client-side software, perhaps provided by a trusted third party or an authentication firm, may be developed that can monitor user behavior and provide identification scores to third parties if the user chooses to do so. Online brokerages and banks today struggle with identification mechanisms, often requiring users to register computers and remember answers to secret questions as verification mechanisms. Client-centric user scores may be complementary to such existing methods, providing yet another layer of deterrence in the multi-factor approach to fraud deterrence recommended by the US federal agency FFIEC for online financial institutions. Such an application may also be deployed in a privacy-preserving manner, whereby users may voluntarily choose to use such authentication
software, which works such that none of the client data is revealed, providing instead solely a user score.

As noted in Section 1, our goal in this paper is not to build the best user identification model. Rather it is to study one factor that influences user identification accuracies. The design of accurate user identification models will remain a question for future research and may be one that requires an understanding of other factors that contribute to improved accuracies as well; it may also require user-centric data on a much larger scale than we have access to.

On the flip side, the results also suggest that in some cases enough user data may be used to reveal identities, possibly in an unintended manner. However, the range of users considered here is limited and the results do suggest an indirect value from "crowds", where we see the need for higher aggregations as the problem complexity rises. Nevertheless, the high accuracies obtained in the experiments leave open the possibility that at least some users may be identifiable in this manner. While not the focus of this paper, in future research we might investigate this question specifically, seeking to determine how many – and possibly what type of – users may actually be identified based on user-centric data. For instance, it may be possible that users with low levels of general activity do not leave enough detailed trails to permit identification, but more active and possibly "niche" users may be more prone to such identification. These however remain conjectures and need to be separately investigated.

As with any empirical study, we developed a careful methodology but had to make specific choices in this process. Hence there are important limitations and challenges that limit the scope of our results. Here we discuss the significant issues that limit generalizability and which may need to be further studied.

First, we need to study how these results will change as the number of users considered increases to higher levels. We would expect the accuracies to diminish for large numbers of users, but it is not clear by how much, and what the improvements with aggregation may then be. However, exploring scale in this manner would still mean addressing another difficult problem: building accurate classifiers when the class label takes on a large set of values is known to be hard. In such cases the class priors will be very small and the heuristics used to build these classifiers may no longer work well. One possibility is to modify the approach here in the following manner. Rather than working on a dataset where the label takes several values, the approach can be used to build a classifier for each user. Here, for each user we can construct a dataset where the class label is now binary (this user or "anyone else") and traditional classifiers can be built (a sketch of this formulation follows below). The output of our approach then will be a level of aggregation that is required for each user to distinguish this user from anyone else. In this case, though, a large number of classifiers need to be built and the data can be extremely unbalanced. This is a direction that we will pursue in future work. There are indeed other approaches that may be investigated for this too, such as reducing the size of the problem by clustering the original set of users into a more manageable number of segments before applying this procedure to distinguish between segments instead of users. In such cases we might even iteratively break segments down in this manner to ultimately predict individual users. This is similar in spirit to the idea of multi-stage classification.
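A minimal sketch of the per-user ("this user vs. anyone else") reformulation mentioned above, again with scikit-learn as an illustrative stand-in for the paper's toolchain; `make_dataset` is an assumed helper returning the aggregated feature matrix and multi-class userid labels, as in the earlier sketches.

```python
from sklearn.tree import DecisionTreeClassifier

def per_user_minimum_aggregation(train, test, userid, threshold, max_agg=30):
    """Relabel the multi-class problem as binary ('this user' vs. 'anyone else')
    for one user and find the smallest aggregation level reaching `threshold`.
    `make_dataset(sessions, agg)` is an assumed helper returning (X, y) with the
    multi-class userid labels."""
    for agg in range(1, max_agg + 1):
        X_tr, y_tr = make_dataset(train, agg)
        X_te, y_te = make_dataset(test, agg)
        clf = DecisionTreeClassifier().fit(X_tr, y_tr == userid)   # binary relabeling
        # Note: with many users the "anyone else" class dominates, so in practice a
        # class-balance-aware goodness measure would replace raw accuracy here.
        if clf.score(X_te, y_te == userid) >= threshold:
            return agg
    return None
```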
The second challenge is determining the features to use. Our approach for determining minimum levels of aggregation, and the accuracy results, are for a specific choice of these. In this paper we presented some guidelines on how to do this: specifically, we suggest that session-level features can first be identified, and then aggregations can be used to estimate various statistical measures of these features. However, identifying good session-level features is important and may be critical in determining whether such identification can scale to higher levels. The basic set of features used in this study was constrained by the session-level data provided and does miss
potentially important information such as the content of pages and/or the nature of user activity at any of the sites (i.e. does the user spend most of the time at a site on business news or on online games?). Such missing features can be expected to have a significant impact on identification models and aggregations. We have also conducted a sensitivity analysis, which shows that the binary variables for the top sites visited are more effective at identifying users than the continuous variables (and the models with all the variables are, naturally, more accurate). The third challenge is obtaining adequate data at the user level. As noted in Section 4, we have one year's worth of data for each user in a random panel and had to select users for whom adequate data was available. This limits the generalizability of our results. We hope this research triggers follow-on studies that have access to larger datasets. Finally, we acknowledge the generalization limitations implied by the “No Free Lunch” theorems [30,31]. The results in this paper are derived from specific data sets and learning methods, and we cannot show that they will generalize to other learning methods or data.

Acknowledgements

The authors thank the seminar participants at Microsoft Research, the University of Minnesota, the University of Washington, Tsinghua University and the Winter Conference on Business Intelligence at the University of Utah for their useful comments.

References

[1] G. Adomavicius, A. Tuzhilin, Using data mining methods to build customer profiles, IEEE Computer 34 (2) (2001) 74–82.
[2] C. Aggarwal, Z. Sun, P.S. Yu, Fast algorithms for online generation of profile association rules, IEEE Transactions on Knowledge and Data Engineering 14 (5) (2002) 1017–1028.
[3] J.P. Campbell Jr., Speaker recognition: a tutorial, Proceedings of the IEEE 85 (9) (1997) 1437–1462.
[4] H. Cavusoglu, B. Mishra, S. Raghunathan, The value of intrusion detection systems in information technology security architecture, Information Systems Research 16 (1) (2005) 28–46.
[5] C. Cortes, D. Pregibon, Signature-based methods for data streams, Data Mining and Knowledge Discovery 5 (3) (2001) 167–182.
[6] K.A. Deffenbacher, J.F. Cross, R.E. Handkins, J.E. Chance, A.G. Goldstein, R.H. Hammersley, J.D. Read, Relevance of voice identification research to criteria for evaluating reliability of an identification, Journal of Psychology 123 (2) (1989) 109–119.
[7] R.A. Everitt, P.W. McOwan, Java-based Internet biometric authentication system, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9) (2003) 1166–1172.
[8] L. Ferrer, H. Bratt, V. Gadde, S. Kajarekar, E. Shriberg, K. Sonmez, A. Stolcke, A. Venkataraman, Modeling duration patterns for speaker recognition, Proc. EUROSPEECH, 2003.
[9] J. Goecks, J. Shavlik, Learning users' interests by unobtrusively observing their normal behavior, Proc. 5th Int'l Conf. Intelligent User Interfaces, 2000, pp. 129–132.
[10] M.C. González, C.A. Hidalgo, A. Barabási, Understanding individual human mobility patterns, Nature 453 (2008) 779–782.
[11] M. Gupta, J. Rees, A. Chaturvedi, J. Chi, Matching information security vulnerabilities to organizational security profiles: a genetic algorithm approach, Decision Support Systems 41 (3) (2006) 592–603.
[12] T. Herath, H.R. Rao, Encouraging information security behaviors in organizations: role of penalties, pressures and perceived effectiveness, Decision Support Systems 47 (2) (2009) 154–165.
[13] K. Igarashi, C. Miyajima, K. Itou, K. Takeda, F. Itakura, H. Abut, Biometric identification using driving behavioral signals, Proc. Int'l Conf. Multimedia and Expo, 2004, pp. 65–68.
[14] J. Kerstholt, N. Jansen, A. Van Amelsvoort, A. Broeders, Earwitnesses: effects of speech duration, retention interval and acoustic environment, Applied Cognitive Psychology 18 (3) (2004) 327–336.
[15] D.J. Kim, D.L. Ferrin, H.R. Rao, A trust-based consumer decision-making model in electronic commerce: the role of trust, perceived risk, and their antecedents, Decision Support Systems 44 (2) (2008) 544–564.
[16] T. Ko, R. Krishnan, Monitoring and reporting of fingerprint image quality and match accuracy for a large user application, Proc. 33rd Applied Imagery Pattern Recognition Workshop, 2004, pp. 159–164.
[17] R. Kosala, H. Blockeel, Web mining research: a survey, SIGKDD Explorations 2 (1) (2000) 1–15.
[18] G.E. Legge, C. Grosmann, C.M. Pieper, Learning unfamiliar voices, Journal of Experimental Psychology: Learning, Memory, and Cognition 10 (2) (1984) 298–303.
[19] J. Li, R. Zheng, H. Chen, From fingerprint to writeprint, Communications of the ACM 49 (4) (2006) 76–82.
[20] D. Madigan, A. Genkin, D.D. Lewis, S. Argamon, D. Fradkin, L. Ye, Author identification on the large scale, Proc. Interface and the Classification Society of North America, 2005.
[21] J. Mäntyjärvi, M. Lindholm, E. Vildjiounaite, S. Mäkelä, H. Ailisto, Identifying users of portable devices from gait pattern with accelerometers, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.
[22] B. Miller, Vital signs of identity, IEEE Spectrum 31 (2) (1994) 22–30.
[23] B. Mobasher, H. Dai, T. Luo, M. Nakagawa, Discovery and evaluation of aggregate usage profiles for web personalization, Data Mining and Knowledge Discovery 6 (1) (2002) 61–82.
[24] F. Monrose, A. Rubin, Authentication via keystroke dynamics, Proc. 4th ACM Conference on Computer and Communications Security, 1997, pp. 48–56.
[25] B.K. Natarajan, Machine Learning: A Theoretical Approach, Morgan Kaufmann, 1991.
[26] P.L. Pirolli, Information Foraging Theory: Adaptive Interaction with Information, Oxford University Press, 2007.
[27] J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web usage mining: discovery and applications of usage patterns from web data, SIGKDD Explorations 1 (2) (2000) 12–23.
[28] D. Straub, Effective IS security: an empirical study, Information Systems Research 1 (3) (1990) 255–276.
[29] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, 2005.
[30] D.H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Computation 8 (7) (1996) 1341–1390.
[31] D.H. Wolpert, W.G. Macready, No Free Lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1 (1) (1997) 67–82.
[32] Y. Yang, B. Padmanabhan, GHIC: a hierarchical pattern-based clustering algorithm for grouping web transactions, IEEE Transactions on Knowledge and Data Engineering 17 (9) (2005) 1300–1304.
[33] A.D. Yarmey, Voice identification over the telephone, Journal of Applied Social Psychology 21 (22) (1991) 1868–1876.
[34] A.D. Yarmey, E. Matthys, Voice identification of an abductor, Applied Cognitive Psychology 6 (5) (1992) 367–377.
[35] X. Zhao, F. Fang, A.B. Whinston, An economic mechanism for better Internet security, Decision Support Systems 45 (4) (2008) 811–821.
[36] E.N. Zois, V. Anastassopoulos, Methods for writer identification, Proc. IEEE International Conference on Electronics, Circuits and Systems, 1996.
Yinghui (Catherine) Yang is an assistant professor in the Graduate School of Management at the University of California, Davis. She received her Ph.D. in Operations and Information Management from The Wharton School at the University of Pennsylvania, and a B.E. in Management Information Systems from the School of Economics and Management at Tsinghua University. Her research is interdisciplinary, spanning data mining and marketing, and has been published in leading data mining journals (e.g. IEEE Transactions on Knowledge and Data Engineering) and marketing journals (e.g. Marketing Science).
Balaji Padmanabhan is the Anderson Professor of Global Management and Associate Professor of Information Systems and Decision Sciences at the University of South Florida. He received a B.Tech. in Computer Science from the Indian Institute of Technology (IIT), Madras, and a Ph.D. in Information Systems from New York University. His work has been published in leading conferences and journals including Management Science, Decision Support Systems, MIS Quarterly, INFORMS Journal on Computing, IEEE Transactions on Knowledge and Data Engineering, and the Proceedings of ACM KDD, IEEE ICDM, WITS and AIS. He has served on the program committees of several ACM, IEEE, SIAM and WITS conferences and has editorial appointments at Information Systems Research and INFORMS Journal on Computing.