Keystroke dynamics obfuscation using key grouping

Itay Hazan a,b, Oded Margalit a,c, Lior Rokach b

a IBM Cybersecurity Center of Excellence, Beer-Sheva, Israel
b Department of Software & Information Systems Eng., Ben-Gurion University of the Negev, Beer-Sheva, Israel
c Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Article history: Received 3 February 2019; Revised 15 November 2019; Accepted 15 November 2019; Available online 16 November 2019

Keywords: Behavioral biometrics; Keystroke dynamics; Hierarchical clustering
Abstract: Keystroke dynamics is one of the most widely adopted identity verification techniques in remote systems. It is based on modeling users' specific patterns of typing on the keyboard. When utilized in conjunction with commonly used passwords, keystroke dynamics can dramatically increase the level of security without interfering with the user experience. However, aspects of applying keystroke dynamics to passwords, such as processing keystroke events and storing feature vectors or user models, can expose users to identity theft and a new set of privacy risks, thus questioning the added value of keystroke dynamics. In addition, common encryption techniques are unable to mitigate these threats, since the user's behavior changes from one session to another. In this paper, we suggest key grouping as an obfuscation method to ensure keystroke dynamics privacy. When applied to the keystroke events, key grouping dramatically reduces the possibility of password theft. To perform the key grouping optimally, we present a novel method which produces groups that can be integrated with any keystroke dynamics algorithm. Our method divides the keys into groups using hierarchical clustering with a dedicated statistical heuristics algorithm. We tested our method's key grouping output on five keystroke dynamics algorithms using a public dataset and showed a consistent improvement of up to 7% in AUC over other, more intuitive key groupings and random key groupings.
1. Introduction

The consequences of identity theft on the internet can be harsh. Under the guise of a stolen identity, a masquerader can commit actions, such as spreading fake news, withdrawing money, or making purchases, in the name of or at the expense of the stolen identity. The use of behavioral biometrics for identity verification is a means of fighting identity theft. Behavioral biometrics encompasses a variety of techniques for modeling users' identities through their particular behavioral traits. Applying behavioral biometrics in sensitive systems can add an additional layer of protection against identity theft (Moskovitch et al., 2009). Keystroke dynamics is a behavioral biometric modality that focuses on users' unique patterns of typing on the keyboard. Considered transparent, noninvasive, and easy to apply, it has become increasingly popular for user identity verification (Teh, Teoh, & Yue, 2013). In contrast to more physical modalities, such as fingerprint or iris scanners (Al Solami, Boyd, Clark, & Islam, 2010), keystroke dynamics does not require the user to perform any additional tasks or necessitate additional sensors.
In a world in which users are increasingly reluctant to share their physical biometric information, keystroke dynamics offers an alternative that is less intrusive and privacy invasive (Hwang, Lee, & Cho, 2009). Keystroke dynamics is usually divided into two main types: (1) free text keystroke dynamics, in which the user's identity is verified based on the way the user types changeable, unexpected text that he/she probably hasn't typed before, and (2) fixed text keystroke dynamics, in which the user's identity is verified based on the way the user types a predefined text that he/she is already familiar with. Free text keystroke dynamics requires a larger text sample and a longer training period before a model can be built to verify a user's identity. Fixed text, on the other hand, requires only a handful of previous samples and a much shorter, repeatable text in order to build an accurate model. That is why techniques using fixed text keystroke dynamics are well suited to verifying users' identities when applied to usernames and passwords (Teh, Teoh, Tee, & Ong, 2010). While passwords are still the most commonly used authentication mechanism, keystroke dynamics can be used for identity verification in cases in which the password has been leaked, stolen, or guessed. Keystroke dynamics works by tracking the timestamps of the user's presses and releases on the keyboard. The presses and
releases tracked are processed and used for the extraction of descriptive features. Features that relate to the latency between two consecutive key events are usually referred to as di-graphs. To the best of our knowledge, di-graphs are the most widely used descriptive features in the keystroke dynamics literature, for both free text and fixed text (Banerjee & Woodard, 2012; Dowland & Furnell, 2004; Gunetti & Picardi, 2005; Sim & Janakiraman, 2007; Teh, Teoh, & Yue, 2013; Giot, El-Abed, & Rosenberger, 2009a; 2009b; 2009c). When considering the latency between two consecutive key events, there are four feature types: press to release, press to press, release to release, and release to press. Once the features are extracted, they serve as the input for a machine learning algorithm. The machine learning algorithm can, in turn, produce a model or a score for the user's identity based on the user's keystrokes, using previously learned instances in which the user engaged with the system. There are two main types of machine learning algorithms used for identity verification: supervised classification algorithms (i.e., supervised classifiers), such as KNN, SVM (Yu & Cho, 2003), and neural networks (Deng & Zhong, 2013), and one-class classification algorithms (i.e., one-class classifiers), such as self-detectors, fuzzy logic, outlier counting (Haider, Abbas, & Zaidi, 2000), M2005 (Magalhães & Santos, 2005), and GMM (Hosseinzadeh & Krishnan, 2008; Deng & Zhong, 2013). Both types of algorithms can produce accurate results for user identity verification, depending on the number of user instances and the user's stability. Together with the chosen features, the machine learning algorithm forms the keystroke dynamics algorithm.

Despite this level of performance, the use of keystroke dynamics algorithms for identity verification in sensitive online services, in which the user's exact key presses and releases are collected, can expose the user to credential theft. Even if the information is safely transmitted using cryptographic protocols, events, features, and models that are stored in databases could be vulnerable to theft by "outside attackers" (i.e., attackers that have penetrated the organization) or "inside attackers" (i.e., attackers that abuse their information privileges). The main reason for this vulnerability is that keystroke events, extracted features, and even the derived models cannot be stored hashed, as is usually done with passwords: the fact that they change slightly from session to session prevents them from being hashed. Further, as keystroke events, extracted features, and derived models contain the exact keys, those who steal them can easily reverse engineer the passwords. The potential harm is exacerbated when passwords are reused or used with minor modifications in other systems, as often happens (Das, Bonneau, Caesar, Borisov, & Wang, 2014). This shows how keystroke dynamics, which is intended to provide a secure means of identity verification, might expose users to a new set of risks. Nevertheless, user confidentiality can be dramatically improved if the user's identity could be verified without the need to know the exact keys, by building almost identical feature spaces and models instead.
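To make the four di-graph feature types described above concrete, here is a minimal sketch (ours, not the paper's); it assumes each keystroke event is a (key, press_time, release_time) triple in milliseconds:

```python
def digraph_features(events):
    """Extract the four classic di-graph latency features from a list of
    keystroke events; each consecutive pair of events yields one value of
    each feature type."""
    feats = {}
    for (k1, p1, r1), (k2, p2, r2) in zip(events, events[1:]):
        feats[(k1, k2, 'press_to_press')] = p2 - p1
        feats[(k1, k2, 'release_to_press')] = p2 - r1
        feats[(k1, k2, 'release_to_release')] = r2 - r1
        feats[(k1, k2, 'press_to_release')] = r2 - p1
    return feats

# e.g., 'a' pressed at 0 ms and released at 80 ms, then 'b' at 150/210 ms
print(digraph_features([('a', 0, 80), ('b', 150, 210)]))
```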
We propose an obfuscation in which the keys are grouped: instead of relying on the exact keys, the keys are divided into a small number of predefined groups, so that when a user logs in, only the user's grouped keys are collected instead of the exact keys. Collecting the events after they have been obfuscated can reduce the possibility of exposure later, when they are sent, processed, or preserved, as the grouping obfuscation makes reverse engineering the sensitive information computationally harder. This practice can significantly impair an adversary's ability to derive the password from the events, features, or models, and it allows us to control the level of security by changing the number of groups: the fewer groups used, the greater the level of security.
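A minimal sketch of this collection-time obfuscation, with a hypothetical key-to-group mapping (in practice the mapping is produced by the grouping method described later in the paper):

```python
# Hypothetical placeholder mapping of keys to vertex-group ids.
KEY_TO_GROUP = {'a': 1, 'b': 4, 'c': 2}  # ... one entry for every key

def obfuscate_events(events):
    """Replace each raw key with its group id before the event leaves the
    client; only (group, timestamp) pairs are transmitted and stored."""
    return [(KEY_TO_GROUP[key], t) for key, t in events]

print(obfuscate_events([('a', 1000), ('b', 1180), ('c', 1420)]))
# -> [(1, 1000), (4, 1180), (2, 1420)]
```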
The use of this grouping obfuscation also reduces the dimensionality, which improves complexity and reduces the memory required. The only remaining question is how to divide the keys into a minimal number of groups such that the accuracy of the keystroke dynamics algorithms is maximized. To solve this, we present in this paper a novel method for dividing the keys into groups. The method uses hierarchical bottom-up clustering with statistical heuristic scoring provided by an algorithm we developed as well. The hierarchical clustering is trained on a handful of representative users, and the resulting groups can be used with both existing and new users, without the need for retraining for each new user. The output groups are later integrated with the keystroke dynamics algorithm and improve its security. We evaluated our method's output groups on five state-of-the-art one-class classification algorithms with a public dataset. Our results demonstrate that our method produces better AUC (area under the ROC curve) than other, more intuitive divisions into groups, such as grouping based on type (i.e., characters, numbers, control buttons, etc.) or location (i.e., dividing the keyboard into geographical areas), as well as better AUC than hierarchical clustering with random heuristics.

The paper is organized as follows: in Section 2, we discuss related work. We explain the motivation for our new method in Section 3, and in Section 4, we provide a detailed description of our method and algorithms, along with a runtime analysis. The keystroke dynamics algorithms evaluated are presented in Section 5, and the dataset used is presented in Section 6. In Section 7, we describe our experiments and results. Finally, in Section 8, we present our conclusions and suggest future work.

2. Related work

Although much academic research has been performed in the field of keystroke dynamics, few studies have focused on the risks associated with it. Recently, however, some papers have discussed ways of attacking keystroke dynamics, and others have discussed ways to mitigate such attacks. These attacks can be roughly divided into three types according to the attacker's knowledge. The first and most common type is the zero-effort attack, in which the attacker is not familiar with the behavioral traits of the victim or of other users at all. In the second type, the statistical attack, the attacker has some knowledge of the general population's behavioral traits but not of the specific victim's traits. In the third type, the attacker has managed to eavesdrop and obtain knowledge of the victim's behavioral traits. Although the three attack types are based on different levels of attacker knowledge, the underlying assumption in each case is that the attacker has already obtained the victim's password or passphrase and plans to use a bot or manual hacking to impersonate the victim's behavioral traits.

Tey, Gupta, and Gao (2013) showed that attackers can be taught to act as a specific benign user by giving them proper intuitive feedback while they manually enter the benign user's credentials. Serwadda and Phoha (2013) forged users' keystroke dynamics without each user's specific data, eventually showing that their method increased the EER (the equal error rate, which reflects the point where the false positive and false negative rates are equal) of three keystroke dynamics algorithms.
Their work involved performing statistical analysis of general population users by exploiting several characteristics of the feature set, such as distributions and dependencies. Stefan, Shu, and Yao (2012) focused on identifying bots that forge real users' activity after collecting their specific keystroke behavioral traits by running a keylogger on the client side. They developed two types of bots that spoof users' traits, eventually showing that they could improve bot detection using supervised machine learning with dimensionality reduction. Rahman, Balagani, and Phoha (2013) developed an algorithm for keystroke dynamics attacks focusing on free text. The algorithm has three phases (snooping, forging, and replaying keystroke events) that create events which eventually appear to come from the benign user. The algorithm analyzes the user's previous keystrokes, and in cases in which events are missing, it completes them either with a very slow event, in the hope that the event will be eliminated by an outlier removal preprocess, or with general population data, in order to fly "under the radar". Stanciu, Spolaor, Conti, and Giuffrida (2016) discussed reducing keystroke dynamics password attacks in mobile environments using additional mobile sensors, such as an accelerometer or gyroscope. They implemented both zero-effort and statistical attacks and showed that the addition of motion sensors improved the verification accuracy of the algorithm.

In contrast to the abovementioned research, Monaco and Tappert (2016) did not focus on user behavioral traits but rather on the identity of the user. They were the first to discuss the possible implications of leaked keystroke dynamics and proposed two obfuscation methods to address them, aiming to improve users' anonymity in the system. Both of the proposed methods are generalizations of the Chaum mix. The first is a global mix that requires cooperation from all of the users in the system, and the second is a user mix that doesn't require the users' cooperation. Both methods focus on delaying and mixing events such that the user is not interrupted while the user's anonymity is improved. These methods differ from ours, as they focus on the identity of the user and not necessarily on the transferred or saved information. To the best of our knowledge, no prior research has focused on protecting keystroke dynamics from information theft, and credential theft in particular. In this paper, we discuss the vulnerability of keystroke dynamics to credential theft and suggest an obfuscation mechanism that maintains high verification accuracy while increasing protection against credential theft.

3. Motivation

A variety of obfuscation techniques have been developed with the aim of preventing humans from understanding the hidden (obfuscated) information. One-way hash functions are one such technique, commonly used to obfuscate passwords. However, applying obfuscation in applications that involve machine learning is not trivial. Machine learning techniques allow fuzziness between different samples, and the application of obfuscations such as hash functions is problematic, as a slight deviation between two samples results in totally different hash values. The alternative, preserving the information without obfuscation, can result in information theft or leakage, especially when the information is sensitive, such as passwords. Another possibility is to divide the information's building blocks into groups and use the groups instead of the building blocks. When working with groups, even if an attacker is able to intercept a session and is familiar with how the groups are formed, he/she will be unable to identify the exact building blocks. Using brute force to search for the exact building blocks in a group becomes computationally harder when larger groups must be searched.
Furthermore, with this type of obfuscation, information is lost, and this can degrade the verification accuracy of the keystroke dynamics algorithms. Thus, it is important to carefully select the grouping of building blocks so as to maximize the accuracy of the keystroke dynamics algorithms used.

Shimshon, Moskovitch, Rokach, and Elovici (2010) were the first to suggest using groups of di-graphs in the feature space instead of using the di-graphs explicitly. They originally did so to improve the complexity and memory footprint of free text keystroke dynamics algorithms. Their suggestion was to divide the di-graphs into groups specifically for each user in the dataset by applying a clustering algorithm to similar features once enough data has been collected for the user. By doing so, the authors obtained almost the same accuracy as with the original di-graphs, but in a much more efficient way. Although not discussed in their paper, we note that this grouping method can be used for obfuscation, since a potential attacker can only see the groups used and not the exact keys that comprise them; it is ultimately much more secure than using non-grouped di-graphs. Despite this benefit, the method isn't perfectly suitable for grouping obfuscation, for the following reasons:

1. It was designed for one user; applying it to several users does not scale. Clustering di-graphs based on their similarity is problematic when applied to several users, since the division into groups must also optimize the distance between every two users (described in Section 4.1).

2. Clustering di-graphs for each user limits the obfuscation in the user's first sessions. Dividing into groups specifically for each user requires that each user send several sessions of unobfuscated data for preprocessing. Repeating this process every time a new user registers, or even when an existing user changes his/her password, exposes the credentials.

3. It was designed for free text; applying it to fixed text is different. In free text, each user instance contains much longer and more varied text than in fixed text. Messerman, Mustafić, Camtepe, and Albayrak (2011) and Monaco, Stewart, Cha, and Tappert (2013) show that even state-of-the-art free text keystroke dynamics algorithms require each instance to contain hundreds of words and thousands of keystrokes in order to work well.

4. Clustering di-graphs can still expose the user to data leakage. The clustering of di-graphs into similar groups only seems to encapsulate them, but this is actually misleading. To demonstrate the idea, consider the following groups of di-graphs: G1 = {a→b, h→j} and G2 = {b→c, u→b}. If an attacker manages to eavesdrop on a session with the following sorted vector of groups and latencies, [G1: 235, G2: 192], the initial thought might be that with two possible di-graphs in G1 and two in G2, the attacker is faced with four possible passwords. However, it is clear that the password is "abc", since no continuous option other than a→b followed by b→c is possible among the four combinations.

Let's use an even more realistic example to emphasize the computational problem. Consider a keyboard with 62 possible keys and a user with an eight-character password containing one capital letter, which generates the following vector of nine press events (the shift key adds one event; for simplicity, we focus only on key press events), where si is a pressed key and ti is the timestamp of the event: [s1: t1, s2: t2, s3: t3, s4: t4, s5: t5, s6: t6, s7: t7, s8: t8, s9: t9]. If we use a clustering technique that equally divides the 3,844 (= 62²) press-to-press di-graphs into 36 equal-size groups, we get ~106.7 (≈ 3,844/36)¹ di-graphs per group for each of the groups {G1, G2, …, G36}. As is commonly assumed in cryptography, the groups are known both to the user and the attacker. According to Shimshon et al.
(2010), this vector will be processed into a sequence of groups and latencies, such as: [Gx1: Δt1, Gx2: Δt2, Gx3: Δt3, Gx4: Δt4, Gx5: Δt5, Gx6: Δt6, Gx7: Δt7, Gx8: Δt8], where ∀i, xi ∈ {1, 2, …, 36}. Now, if this sequence is exposed to an attacker, the immediate assumption is that the number of options for rebuilding the password is 106.7⁸ ≈ 1.68·10¹⁶, which is the average number of di-graphs in each group raised to the vector length. This number would indeed ensure protection against brute-force attacks, but it is actually incorrect. Although the first di-graph s1→s2 ∈ Gx1 can be any of the ~106.7 options, the second can only be a di-graph s2→s3 ∈ Gx2 that continues the previous s2. This observation helps the attacker rule out many possibilities and dramatically reduces the number of candidate di-graphs s2→s3 ∈ Gx2: in the second group, the attacker only needs to go over the di-graphs within Gx2 that start with the end of the previous key. Thus, the actual average number of options for s2→s3 ∈ Gx2 is the group size divided by the number of possible keys, which is only 106.7/62 ≈ 1.72. This means there are, on average, only 1.72 options for s2→s3 ∈ Gx2, only 1.72 options for s3→s4 ∈ Gx3, and so on. Eventually, the correct number of possibilities for a brute-force attack is only 4,752 (= 106.7 · 1.72⁷): 106.7 options for the first di-graph and 1.72 for each continuation. Unfortunately, 4,752 options are not considered sufficient protection against a brute-force attack, and longer, more complicated passwords won't be of much help, as the coefficient will always remain 1.72. Thus, through this example, we can see that grouping di-graphs in this way can provide dimensionality reduction but is not a sustainable option if effective obfuscation is the aim. Even though clustering di-graphs as done by Shimshon et al. (2010) creates serious problems when applied as an obfuscation, the use of groups and clustering inspired our solution for keystroke dynamics fixed text obfuscation, with the appropriate adaptations.

¹ Some of the numbers are rounded after the decimal point to avoid very long numbers in the text; this does not affect the order of magnitude of the results.
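The counting above can be reproduced in a few lines (our sketch, using the example's assumptions: 62 keys, 36 equal edge-groups, nine press events, i.e., eight di-graphs):

```python
s, k, n_digraphs = 62, 36, 8

group_size = s ** 2 / k        # ~106.8 di-graphs per edge-group
options_first = group_size     # any di-graph in the first group is possible
options_next = group_size / s  # ~1.72: must continue the previous key
total = options_first * options_next ** (n_digraphs - 1)
print(round(group_size, 1), round(options_next, 2), round(total))
# -> 106.8 1.72 4798 (the paper's 4,752 uses the rounded 106.7 and 1.72)
```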
4. Our method

Our goal was to design a method for keystroke dynamics obfuscation that addresses the four issues presented in the previous section. We suggest that instead of clustering per user, we select several representative users. In addition, instead of putting the di-graphs into groups, we put the keys into groups. Using the grouped keys, we extract the features according to the induced combinations of groups. This might seem somewhat analogous to the method developed by Shimshon et al. (2010), but as we will explain, it actually allows us to overcome the four issues raised in the previous section.

4.1. Modeling the solution using graphs

To demonstrate how our method effectively addresses the four issues, we model the solution using graphs. Let us define a graph G = (V, E) such that each vertex represents a key, and each directed edge from vi to vj (i.e., (vi, vj)) represents a di-graph si → sj. Shimshon et al. (2010) suggested grouping the edges into k groups for each user. Instead of grouping the edges, we propose grouping the vertices and using the induced groups of edges from one group of vertices to another: if Shimshon et al. use k groups of edges, we use √k groups of vertices, resulting in k induced groups of edges. By grouping the vertices instead of the edges, we ensure that all continuations of di-graphs are possible and thus prevent the attacker from eliminating options, which is what made the brute-force attack easier to perform.

Let's examine this solution in the same setup used by Shimshon et al. (2010), where k = 36 (they originally used 35, but we needed a number with a natural square root, and 36 is close enough). We cluster the vertices into 6 groups of vertices (vertex-groups), which in turn create 36 groups of edges (edge-groups), namely all ordered combinations of two vertex-groups, including each vertex-group with itself. Each vertex-group is denoted Ġi such that 1 ≤ i ≤ 6, and each induced edge-group is denoted Ġi → Ġj, or Gi,j for short. For example, all edges that start with a vertex of the Ġ2 vertex-group and end with a vertex of the Ġ5 vertex-group are in the same edge-group Ġ2 → Ġ5, which we shorten to G2,5. As there are 62 keys and 6 vertex-groups, each Ġi contains 10.3 (≈ 62/6) keys, on average.

For example, the vector presented in Section 3, [s1: t1, s2: t2, s3: t3, s4: t4, s5: t5, s6: t6, s7: t7, s8: t8, s9: t9], will now be obfuscated to [Ġx1: t1, Ġx2: t2, Ġx3: t3, Ġx4: t4, Ġx5: t5, Ġx6: t6, Ġx7: t7, Ġx8: t8, Ġx9: t9], where ∀i, xi ∈ {1, 2, 3, 4, 5, 6}. This event vector will eventually be turned into the following di-graph and delta-time vector based on the induced edge-groups: [Gx1,x2: Δt1, Gx2,x3: Δt2, Gx3,x4: Δt3, Gx4,x5: Δt4, Gx5,x6: Δt5, Gx6,x7: Δt6, Gx7,x8: Δt7, Gx8,x9: Δt8]. The game changer is the fact that we obfuscate the keys into groups first, rather than the di-graphs directly. This ensures that all continuations are possible, and each vertex-group Ġxi can only imply something about itself, not about the previous or next group. As there are, on average, 10.3 keys in a vertex-group, we have 126,677,008 (≈ 10.3⁸) possibilities in the same setup that previously yielded only 4,752 possibilities, making our method 26,657 times more resistant to a brute-force attack on the same eight-character password than the method suggested by Shimshon et al. (2010).

To generalize the comparison of the two methods, consider a password of length l characters, k different groups, and s possible keys. The number of possibilities a brute-force attack faces in each method is as follows:

Shimshon's method: (s²/k) · ((s²/k) · (1/s))^(l−2) = s^l / k^(l−1)

Our method: (s/√k)^l = s^l / k^(l/2)

Hence, we can see that our method is generally k^(l/2−1) times more resistant, meaning that the resistance gain grows polynomially (with degree l/2 − 1) as we increase k (i.e., the number of groups) and exponentially as we increase the password length l.
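A quick numeric check of these closed forms (our sketch; note that the worked example above counts nine press events for the di-graph method, the shift key adding one, which is why it reports 4,752 rather than the l = 8 value below):

```python
def shimshon_options(s, k, l):
    return s ** l / k ** (l - 1)   # (s^2/k) * (s/k)^(l-2)

def our_options(s, k, l):
    return s ** l / k ** (l / 2)   # (s/sqrt(k))^l

s, k, l = 62, 36, 8
print(f"di-graph grouping: {shimshon_options(s, k, l):.3g}")   # ~2.79e+03
print(f"key grouping:      {our_options(s, k, l):.3g}")        # ~1.3e+08
print(f"ratio = k^(l/2-1) = {k ** (l / 2 - 1):,.0f}")          # 46,656
```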
4.2. Three-dimensional hierarchical grouping

The only problem left unsolved is how to optimally divide the keys into groups such that the induced combinations of di-graphs maximize the accuracy of the keystroke dynamics algorithm that needs to verify the users' identity. An intuitive idea is to divide the keys into groups according to their location on the keyboard or according to their type (e.g., a group of characters and a group of numbers), but as we show in the results section, this can be improved upon. Another reasonable idea is to use clustering, but as mentioned in Section 3, traditional clustering techniques work on two-dimensional problems. For example, in Shimshon et al. (2010) the dimensions are di-graphs and instances (i.e., feature vectors), where all of the instances belong to the same user; thus, the di-graphs can be clustered according to their similarity within a cluster of di-graphs and their differences from other clusters. However, in our problem, we want to create one set of groups for all users, not one for each user. This means we have a three-dimensional problem in which the dimensions are di-graphs, instances, and users. If we want to use a traditional clustering algorithm, we must ignore one of the dimensions.
Fig. 1. A graph of four vertices, with two groups of vertices, and hence, four groups of edges.
One option is to use different datasets for different users, but then each user will get a different set of groups, instead of having the same output groups for everyone. Another option is to aggregate all instances of the same user into an "average user instance", but then we would cluster features that are similar among different users, which is exactly the opposite of what we are trying to achieve. Thus, the development of a new method for dividing the keys, one that addresses our three-dimensional clustering problem, is inevitable.

To demonstrate what we want to achieve, we use the following example. Consider a graph of only four keys {a, b, c, d}, as seen in Fig. 1, which is divided into two vertex-groups Ġ1 = {a, c} (in black) and Ġ2 = {b, d} (in white). From this grouping we receive the four induced edge-groups: G1,2 = {a→d, a→b, c→d, c→b}, G2,1 = {b→a, b→c, d→a, d→c}, G1,1 = {a→a, c→c, c→a, a→c}, and G2,2 = {b→b, d→d, b→d, d→b}. Consider two users in the system; for user1, the edges inside group G1,2 must be as similar as possible to each other and, at the same time, as different as possible from G1,1 and any other group. For user2, the same set of edges G1,2 has to be different from the set of edges G1,2 of user1. When generalizing this two-user problem to a multiple-user problem, it becomes NP-hard. As in such problems, our solution uses a greedy approach to find a possible output that is not guaranteed to be optimal.
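The induced edge-groups for this four-key example can be generated mechanically; a minimal sketch (ours):

```python
from itertools import product

# The four-key example of Fig. 1: two vertex-groups over the keys {a, b, c, d}.
vertex_groups = {1: {'a', 'c'}, 2: {'b', 'd'}}

# Every ordered pair of vertex-groups (including a group with itself)
# induces one edge-group holding all di-graphs between their members.
edge_groups = {
    (i, j): {(u, v) for u in vertex_groups[i] for v in vertex_groups[j]}
    for i, j in product(vertex_groups, repeat=2)
}
print(sorted(edge_groups[(1, 2)]))
# -> [('a', 'b'), ('a', 'd'), ('c', 'b'), ('c', 'd')]
```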
Our solution is based on hierarchical clustering, which is a procedure used to partition objects into mutually exclusive subsets that are maximally similar with respect to specified characteristics (Ward Jr., 1963). This is usually done using greedy algorithms, and once a selection has been made, it is not reversible. There are two main types of hierarchical clustering: agglomerative and divisive (Jain, Murty, & Flynn, 1999). Agglomerative hierarchical clustering has a bottom-up mechanism in which each object starts as its own cluster (i.e., a singleton), and in each step, two clusters are merged according to a certain heuristic score. Thus, after each step, there is one less cluster, starting with n clusters and stopping when there is one cluster or when another condition is satisfied. Divisive hierarchical clustering, on the other hand, is a top-down mechanism that starts with one cluster containing all objects; in each step, one cluster is divided into two according to a certain heuristic score, stopping when there are n clusters or when another condition is satisfied.

Our method is based on agglomerative hierarchical clustering. In each step, we score the loss, using a predefined heuristic, of each pair of groups Ġi, Ġj nominated for a merge; the pair with the minimal loss is merged at the end of the step. When nominating a pair of groups Ġi, Ġj (i.e., vertex-groups) to be merged into a new group Ġnew (= Ġi ∪ Ġj), we make a change that affects every induced group (i.e., edge-group) that contains Ġi or Ġj in the current set of groups and every combination that contains Ġnew in the new set of groups. No changes are made to the induced groups that do not contain Ġi or Ġj in the current set of groups or to the induced groups without Ġnew in the new set of groups. To score the loss, we developed an algorithm that assigns a heuristic loss score to each pair of groups and returns the pair with the minimal loss. We refer to this algorithm as the Full Best Combination algorithm; its pseudocode is presented in Algorithm 1. The algorithm goes over all possible pairs of groups Ġi, Ġj and nominates them for a merge. Every other group is checked against Ġi, Ġj, and a final loss score is computed; the pair Ġi, Ġj with the minimal loss is returned and later merged. Note that the algorithm uses the score of a Max Fisher algorithm that we define next.

The only missing piece is the Max Fisher algorithm, which evaluates the benefit of two ordered groups Ġi, Ġj containing the di-graphs s1 → s2 such that s1 ∈ Ġi and s2 ∈ Ġj. For this, we suggest an assistance algorithm that queries all the di-graphs s1 → s2 and performs three steps: (1) for each user (i.e., userm) of the representative users, calculate the AVG (average) and STD (standard deviation) of the s1 → s2 latencies; (2) for each pair of users (i.e., userm, usern) among the representative users, calculate the Fisher score using the AVG and STD; (3) score the two ordered groups Ġi, Ġj according to the maximal Fisher score. To avoid outliers, we only consider users with at least k instances. The pseudocode of this assistance algorithm is presented in Algorithm 2; we refer to it as the Max Fisher algorithm.
Algorithm 1. Pseudocode of the Full Best Combination algorithm.

Full Best Combination
Constants: none
Input: allGroups – the set of all groups so far in the run
Output: bestMerge – the pair {Ġi, Ġj} with the best gain (i.e., minimal loss) from being combined
1. bestMerge ← ∅
2. minimalLoss ← ∞
3. for each Ġi, Ġj in allGroups such that Ġi ≠ Ġj:
   3.1. otherGroups ← {Ġk ∈ allGroups | Ġk ≠ Ġi ∧ Ġk ≠ Ġj}
   3.2. oldScore ← Σ over Ġk ∈ otherGroups of (MaxFisher(Ġk, Ġi) + MaxFisher(Ġi, Ġk) + MaxFisher(Ġk, Ġj) + MaxFisher(Ġj, Ġk))
   3.3. oldScore ← oldScore + MaxFisher(Ġi, Ġi) + MaxFisher(Ġj, Ġj) + MaxFisher(Ġi, Ġj) + MaxFisher(Ġj, Ġi)
   3.4. newScore ← Σ over Ġk ∈ otherGroups of (MaxFisher(Ġk, Ġi ∪ Ġj) + MaxFisher(Ġi ∪ Ġj, Ġk))
   3.5. newScore ← newScore + MaxFisher(Ġi ∪ Ġj, Ġi ∪ Ġj)
   3.6. currentLoss ← oldScore − newScore
   3.7. if currentLoss < minimalLoss:
      3.7.1. minimalLoss ← currentLoss
      3.7.2. bestMerge ← {Ġi, Ġj}
4. return bestMerge
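For illustration, a compact Python sketch of Algorithm 1 (ours, not the authors' code); it treats vertex-groups as frozensets of keys and assumes a max_fisher function like the one sketched after Algorithm 2:

```python
from itertools import combinations

def full_best_combination(all_groups, max_fisher):
    """Return the pair of vertex-groups whose merge loses the least Max
    Fisher score, scoring every edge-group affected by the merge."""
    best_merge, minimal_loss = None, float('inf')
    for gi, gj in combinations(all_groups, 2):
        others = [g for g in all_groups if g not in (gi, gj)]
        merged = gi | gj
        # Score before the merge: all edge-groups touching gi or gj.
        old = sum(max_fisher(gk, gi) + max_fisher(gi, gk) +
                  max_fisher(gk, gj) + max_fisher(gj, gk) for gk in others)
        old += (max_fisher(gi, gi) + max_fisher(gj, gj) +
                max_fisher(gi, gj) + max_fisher(gj, gi))
        # Score after the merge: all edge-groups touching the union.
        new = sum(max_fisher(gk, merged) + max_fisher(merged, gk)
                  for gk in others)
        new += max_fisher(merged, merged)
        if old - new < minimal_loss:
            minimal_loss, best_merge = old - new, (gi, gj)
    return best_merge
```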
Algorithm 2. Pseudocode of the Max Fisher algorithm.

Max Fisher
Constants: k – minimal number of instances required for a user to be evaluated
Input: Ġ1, Ġ2 – ordered groups of keys to be evaluated; users – all the users in the set
Output: score – the evaluated score of the two groups on the data
1. usersAVGS ← []
2. usersSTDS ← []
3. for userm in users:
   3.1. userFeatureSet ← {feature ∈ featureSet | feature.user = userm}
   3.2. userReducedFeatureSet ← {feature ∈ userFeatureSet | feature.s1 ∈ Ġ1 ∧ feature.s2 ∈ Ġ2}
   3.3. if Size(userReducedFeatureSet) ≥ k:
      3.3.1. usersAVGS[userm] ← AVG({feature.latency | feature ∈ userReducedFeatureSet})
      3.3.2. usersSTDS[userm] ← STD({feature.latency | feature ∈ userReducedFeatureSet})
4. allUsersScores ← []
5. for each pair userm, usern in users:
   5.1. currentFisher ← (usersAVGS[userm] − usersAVGS[usern])² / (usersSTDS[userm] + usersSTDS[usern])
   5.2. allUsersScores ← allUsersScores ∪ [currentFisher]
6. score ← Max(allUsersScores)
7. return score
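A minimal Python sketch of Algorithm 2 (ours); the user_latencies structure and the default k = 5 are illustrative assumptions:

```python
import statistics

def max_fisher(g1, g2, user_latencies, k=5):
    """Score an ordered pair of key-groups by the maximal pairwise Fisher
    score over the representative users. `user_latencies` maps each user
    to a dict {(s1, s2): [latencies]} of observed di-graphs."""
    avgs, stds = {}, {}
    for user, feats in user_latencies.items():
        lats = [t for (s1, s2), ts in feats.items()
                if s1 in g1 and s2 in g2 for t in ts]
        if len(lats) >= k:  # only users with at least k instances
            avgs[user] = statistics.mean(lats)
            stds[user] = statistics.stdev(lats)
    scores = [(avgs[m] - avgs[n]) ** 2 / max(stds[m] + stds[n], 1e-9)
              for m in avgs for n in avgs if m != n]
    return max(scores, default=0.0)  # 0.0 if fewer than two users qualify
```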
The runtime analysis of the Full Best Combination algorithm within the hierarchical grouping is O(n³) Max Fisher runs, where n is the number of distinct keys (explained in Section 4.4). Because Max Fisher runs can be demanding, we developed a faster heuristic algorithm that scores each possible merge with a dramatically reduced number of Max Fisher runs. In the faster version, we only calculate the combinations assembled from Ġi, Ġj in the old set of groups and the combinations assembled from Ġnew in the new set of groups. In other words, we remove the Max Fisher runs of Ġi or Ġj with all the other groups in the old set, and the Max Fisher runs of Ġnew with all the other groups in the new set. Thus, we reduce the running time to O(n²) (as explained in Section 4.4). The pseudocode of the faster version is presented in Algorithm 3; we refer to it as the Fast Best Combination algorithm.
4.3. Location-based tie breaker

As we witnessed in our experiments, in many cases there is more than one possible merge with the minimal loss. In these cases, we suggest merging the groups that are closest on the keyboard. As there are several keys in each group, the term "close" can be interpreted in many ways; we chose to define this closeness as the Euclidean distance between the centers of the two groups. To calculate distances on the keyboard, we mapped the x,y coordinates of each key on the common QWERTY and AZERTY keyboards, so we can easily calculate the center of a group and the distance between groups. On the rare occasions of multiple combinations with both minimal loss and minimal distance, we choose one of them randomly for the merge.

4.4. Runtime analysis

The basic computation in all of our evaluations is running the Max Fisher algorithm. This algorithm receives two ordered groups Ġi, Ġj and calculates, for each of the representative users, the average and standard deviation of the latencies over the di-graphs s1 → s2 such that s1 ∈ Ġi and s2 ∈ Ġj. Then, the Fisher score of each pair of users is computed, and eventually the maximal Fisher score is returned.
Algorithm 3. Pseudocode of the Fast Best Combination algorithm.

Fast Best Combination
Constants: none
Input: allGroups – the set of all groups so far in the run
Output: bestMerge – the pair {Ġi, Ġj} with the best gain (i.e., minimal loss) from being combined
1. bestMerge ← ∅
2. minimalLoss ← ∞
3. for each Ġi, Ġj in allGroups such that Ġi ≠ Ġj:
   3.1. oldScore ← Max(MaxFisher(Ġj, Ġj), MaxFisher(Ġi, Ġi), MaxFisher(Ġi, Ġj), MaxFisher(Ġj, Ġi))
   3.2. newScore ← MaxFisher(Ġi ∪ Ġj, Ġi ∪ Ġj)
   3.3. currentLoss ← oldScore − newScore
   3.4. if currentLoss < minimalLoss:
      3.4.1. minimalLoss ← currentLoss
      3.4.2. bestMerge ← {Ġi, Ġj}
4. return bestMerge
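A matching sketch of Algorithm 3 (ours), which makes the reduction in Max Fisher calls explicit:

```python
from itertools import combinations

def fast_best_combination(all_groups, max_fisher):
    """Faster variant: only the edge-groups built from Gi and Gj themselves
    (before the merge) and from their union (after it) are scored. Caching
    max_fisher on its (frozenset, frozenset) arguments provides the
    memoization described in Section 4.4."""
    best_merge, minimal_loss = None, float('inf')
    for gi, gj in combinations(all_groups, 2):
        old = max(max_fisher(gj, gj), max_fisher(gi, gi),
                  max_fisher(gi, gj), max_fisher(gj, gi))
        new = max_fisher(gi | gj, gi | gj)
        if old - new < minimal_loss:
            minimal_loss, best_merge = old - new, (gi, gj)
    return best_merge
```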
As the ordered groups Ġi, Ġj can repeat numerous times throughout the hierarchical clustering, we maintain memoization for the Max Fisher calculation, so that when a nominated combination has already been calculated, we reuse the previously calculated score instead of running the algorithm again; a previously computed score only requires a simple query of a hash table.

In the Full Best Combination algorithm presented in Algorithm 1, there is extensive calculation of the Max Fisher, mainly in lines 3.2 and 3.4, where all of the other groups are traversed in order to evaluate their interaction with the nominated groups Ġi and Ġj. The Max Fisher is also used in lines 3.3 and 3.5 to evaluate the inner interaction of the nominated groups. The Full Best Combination algorithm thus evaluates how each nominated merge influences all of the other groups. To summarize the number of runs: in the first step of the hierarchical grouping, we calculate ∀i, j: MaxFisher(Ġi, Ġj) and ∀i, j, i ≠ j: MaxFisher(Ġi ∪ Ġj, Ġi ∪ Ġj), which is O(n²). In addition, we evaluate the combinations from every pair Ġi, Ġj to every Ġk among the other groups, i.e., ∀i, j, k: MaxFisher(Ġi ∪ Ġj, Ġk). Hence, the total number of runs for the first step is O(n³). Now consider that in the first step the groups Ġ1 and Ġ2 were chosen to be merged into Ġnew (= Ġ1 ∪ Ġ2); in the second step we are left with the evaluation of ∀i: MaxFisher(Ġnew ∪ Ġi, Ġnew ∪ Ġi) and ∀i, j: MaxFisher(Ġnew ∪ Ġi, Ġj), which results in O(n²) for this step and the subsequent steps, from n − 1 groups down to one, for an additional O(n³). Thus, the total runtime remains O(n³).

In the Fast Best Combination algorithm seen in Algorithm 3, we perform far fewer Max Fisher calculations. In the first step of the hierarchical grouping, we run Max Fisher for every two groups, ∀i, j: MaxFisher(Ġi, Ġj) and ∀i, j, i ≠ j: MaxFisher(Ġi ∪ Ġj, Ġi ∪ Ġj), which results in a total of O(n²). In the second and subsequent steps, most groups remain the same, and the calculations have already been performed, except for those needed for the new group Ġnew = Ġ1 ∪ Ġ2 after groups Ġ1, Ġ2 were merged. Hence, we are only left with the calculation of ∀i: MaxFisher(Ġnew, Ġi), MaxFisher(Ġi, Ġnew), and MaxFisher(Ġnew ∪ Ġi, Ġnew ∪ Ġi), which is O(n). As this runs for n − 1 iterations, it results in O(n²), and the total runtime remains O(n²).

5. Classification algorithms evaluated

Unlike the widely used di-graph feature space, there is no consensus on the best keystroke dynamics algorithm. In fact, many algorithms have been developed, and their evaluations have shown different results depending on the circumstances. As stated in Section 1, there are two main types of keystroke dynamics algorithms according to their base classifier: supervised classifiers and one-class classifiers. Supervised classifiers require tagged keystroke instances of both the benign user and masqueraders in order to build a model, whereas one-class classifiers only require the benign user's keystroke instances. The latter is more suited to real-world applications, as the availability of tagged masquerader instances cannot be assumed. Thus, we focused only on one-class classifiers in our work.

Killourhy and Maxion (2009) compared several one-class classifiers that are based on similarity, such as the Manhattan distance, Euclidean distance, and Mahalanobis distance, as well as other non-similarity-based approaches, such as one-class SVM and neural networks. The best results in terms of EER were obtained using the following: scaled Manhattan distance, nearest neighbor Mahalanobis distance, and outlier counting Z-score. The Mahalanobis-based algorithm cannot be used in our case, as the data used in our evaluations is not structured enough and some features may be
missing. However, we implemented and evaluated both the scaled Manhattan and outlier counting algorithms. The scaled Manhattan algorithm was originally described by Araújo, Sucupira, Lizarraga, Ling, and Yabu-Uti (2005). In this algorithm, the mean and absolute error of each feature are calculated to assemble a centroid for the entire training set. When a test instance appears, it is tested against the centroid instance: their Manhattan distance is calculated and scaled by the absolute error of each feature. The outlier counting Z-score algorithm was originally introduced by Haider et al. (2000) and was referred to as the "statistical technique"; the algorithm calculates the mean and standard deviation of each feature in the training set. Later, when a test instance appears, it counts the number of Z-scores that exceed a certain threshold (in their paper, the threshold was set at 1.96) after normalizing by each feature's standard deviation.

In addition, we implemented and evaluated algorithms mentioned in a more recent survey that compared several one-class classifiers in several configurations using smaller training sets. The survey showed that the best-performing algorithms were self-detector usage control R and M2005 double parallel (Pisani, Lorena, & de Carvalho, 2015). These algorithms performed best on several tested datasets and configurations; hence, we chose both for our evaluations as well. Self-detectors, in general, are one-class classifiers inspired by the way the immune system detects malicious cells in the body. These algorithms use a lazy modality: in the training phase, they only extract descriptive features and preserve them without building a model. When a test instance appears, the algorithm compares it to all the existing training instances using a similarity measure. If a match is found (i.e., the similarity score is higher than a specified threshold), the algorithm classifies it as "self" (i.e., benign); otherwise, the algorithm classifies it as "non-self" (i.e., masquerader). The threshold acts as a configurable radius surrounding each example. The advantage of using a lazy modality is that it is easily updated over time. In the specific implementation of usage control R used by Pisani et al. (2015), the algorithm is updated using instances classified as "self." The algorithm also maintains two values for each training instance, which are used to identify instances that are obsolete and should be removed: (1) usage_count: the number of occasions in which the instance was used to identify benign instances, and (2) recent_usage: the number of test instances identified by other instances since the last time this instance detected another instance. Another version of this algorithm uses negative instance selection; however, we didn't evaluate this approach, as previous work (Stibor, Mohr, Timmis, & Eckert, 2005) which examined the use of negative selection in self one-class classifiers showed that it does not work well.

M2005 was originally described by Magalhães and Santos (2005). It uses the median, average, and standard deviation of each feature. The algorithm was the best performing of several algorithms tested on the CMU dataset, which was collected in a controlled environment. The double parallel addition shown in Pisani et al. (2015) keeps two parallel models, one with an increasing window and one with a sliding window; when a test instance appears, the algorithm produces a score using the average of the two models.
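As an illustration of the distance-based verifiers described above, here is a minimal sketch of the scaled Manhattan scorer (our reading of Araújo et al., 2005; the class name and API are ours):

```python
import numpy as np

class ScaledManhattan:
    """The centroid is the per-feature mean of the training vectors; a test
    vector's Manhattan distance is scaled by each feature's mean absolute
    deviation."""
    def fit(self, X):                      # X: (n_train, n_features)
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.abs_err_ = np.abs(X - self.mean_).mean(axis=0) + 1e-9
        return self

    def score(self, x):                    # lower score = closer to the user
        return float(np.sum(np.abs(np.asarray(x, dtype=float) - self.mean_)
                            / self.abs_err_))
```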
The last one-class classifier we used is the Gaussian computation distance algorithm, used earlier for keystroke dynamics by Hocquet, Ramel, and Carbot (2006). The algorithm computes a model θ on the training set using a Gaussian: for each feature i, the average and standard deviation θi = (μi, σi) are computed. When a test instance x appears, the distance between the model and the test instance, d(x, θ), is calculated as follows:

d(x, θ) = 1 − (1/n) · Σ(i=1..n) e^(−|xi − μi| / σi)
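A direct transcription of this distance into code (the function name is ours):

```python
import numpy as np

def gaussian_distance(x, mu, sigma):
    """d(x, theta) as defined above: each feature contributes
    exp(-|x_i - mu_i| / sigma_i), averaged over the n features."""
    x, mu, sigma = (np.asarray(v, dtype=float) for v in (x, mu, sigma))
    return 1.0 - float(np.mean(np.exp(-np.abs(x - mu)
                                      / np.maximum(sigma, 1e-9))))
```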
To summarize, we evaluated our method on five keystroke dynamics algorithms that are based on one-class classifiers: (a) Gaussian computation distance, (b) outlier counting Z-score, (c) scaled Manhattan, (d) M2005 double parallel, and (e) self-detector usage control R.

6. Dataset

There are several public datasets for keystroke dynamics, which were summarized and compared by Giot, Dorizzi, and Rosenberger (2015). For our evaluations, we had to choose a dataset with as many different passwords (and usernames) as possible, so that the representative users would have different passwords than the tested users. The hierarchical clustering algorithm is run on the representative users to build the groups, and these groups are later applied to the tested users, who do not necessarily have the same passwords. In addition, we needed a dataset that included users entering their credentials and others trying to impersonate them. The only applicable dataset is the GREYC-web dataset (Giot et al., 2012). The GREYC-web dataset was collected during 2010–2012 at the University of Bordeaux in France and consists of 118 users. The dataset actually contains two sub-datasets: (1) passphrases, in which all of the users type the same login and password, and (2) passwords, in which each user has a unique username and password; users entered their own usernames and passwords several times in a row, and masqueraders tried to access their accounts. Due to the reasons discussed above, we only used the sub-dataset containing self-chosen usernames and passwords. The collection was formed from the data of users utilizing their own devices and different operating system and browser combinations. Each user entered his/her credentials in different bursts, where each burst contains several instances in a row. The "masqueraders" are actually other users who try to impersonate the users by entering the legitimate users' credentials several times in a row. In most cases, each masquerader entered the legitimate user's credentials 5–10 times in a row, which is more realistic given authentication systems' limited number of trials, yet still allows the masqueraders to gain experience in entering others' credentials. These factors make this dataset the best fit for our study, as it has password variance, long duration, an uncontrolled environment, and a total of ~9,000 genuine instances and ~10,000 fraudulent instances.

7. Experimentation

To evaluate the performance of our method in dividing the keys into optimal groups, we designed an experiment using the dataset presented in Section 6 and the five keystroke dynamics algorithms discussed in Section 5, which we refer to as classifiers to avoid confusion with the algorithms we developed. In our experiment, we examined the value of the groups produced by our hierarchical grouping for every desired number of groups. Within the hierarchical grouping, we used the fast best combination algorithm to produce the heuristic scores. We compared it to two other intuitive divisions and to several hierarchical groupings with random heuristic scores. Of the 118 users in the dataset, we used 30 randomly selected users as the representative users to help us find the groups, and from the rest, we randomly sampled half (i.e., 44 out of 88) to be the test users.
We start the hierarchical grouping run with singletons (i.e., each key in its own group), and in each step, we merge two groups based on the 30 representative users assigned for the grouping, until we are left with one group (i.e., a group that contains all of the keys). Simultaneously, after a predefined number of steps, we apply the classifiers described in Section 5 to the 44 test users with the current grouping.
The classifiers were trained using the first three bursts (i.e., an average of 30 instances) of each user, and the rest of the bursts were used for testing. We applied the parameters used in the original papers and did not change them during the experiments. In addition, to avoid the use of additional parameters, we did not update any of the classifiers over time.

There were 81 distinct keys in the dataset, including both writing keys (e.g., letters, numbers) and control keys (e.g., caps lock, tab). Thus, in our experiments, the hierarchical grouping started with exactly 81 groups and continued until all of the groups were merged into one. Each time we reached a number of groups N such that N % 10 = 0, we ran the five classifiers with the current N groups. In addition, we ran the classifiers for every N < 10 to see whether the hierarchical grouping produced a valuable group division when the number of groups is small and the constraints are harder.

First, we compared our hierarchical grouping with the heuristic scores given by the fast best combination algorithm to the hierarchical grouping with a random scoring heuristic. Random scoring means that in each step of the hierarchical grouping, we assign a random loss score to each nominated merge; we still merge the pair with the minimal score, but the score is meaningless. To avoid anomalous random minima/maxima, we ran the hierarchical grouping with random scoring using several different seeds {1, 2, 3, 4, 5}. The comparison of the different heuristic scores (i.e., fast best and random) showed that the differences for N > 20 were negligible; thus, we zoomed in on N ≤ 20, which is the more interesting case in terms of user privacy, as described in Section 3.

In Fig. 2 (a–e) and Fig. 3 (a–e), we present the results of the classifiers. In each case, the results of our method are indicated by a solid black line, and the results of the compared method are indicated by a gray dashed line representing the average result over the different seeds. The x-axis shows the number of groups; the y-axis shows the average AUC (area under the curve) in Fig. 2 (a–e) and the average EER in Fig. 3 (a–e). Each sub-figure presents the results of a different classifier, in the following order: (a) Gaussian computation distance, (b) outlier counting Z-score, (c) scaled Manhattan, (d) M2005 double parallel, and (e) self-detector usage control R.

The sub-figures of Fig. 2, showing the AUC, demonstrate that our method (black line) with the fast best heuristic either outperforms or equals the random heuristic with the different seeds (gray dashed line). The gap in favor of the fast best heuristic is small at N = 20 and grows until N = 3; after this point, the gap shrinks until it vanishes at N = 1. The average AUC for 1 < N < 10 of the hierarchical grouping with the random heuristic is 0.9262, whereas with the fast best heuristic it is 0.9364. The gap in AUC across the different classifiers is highest at N = 3, where the average AUC jumps from 0.9105 to 0.9310. The sub-figures of Fig. 3, which compare the EER, show complementary results: the black line representing our method with the fast best heuristic is either below or equal to the gray dashed line representing the random heuristic with different seeds. For the EER, there is an average reduction from 0.1175 to 0.1060, with the largest reduction seen at N = 3, from an average of 0.1373 to 0.1141.
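For reference, the AUC and EER reported throughout can be computed from per-instance verifier scores as follows (a standard sketch, not the authors' evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_eer(genuine_scores, impostor_scores):
    """AUC and EER from verifier scores, assuming higher = more likely
    genuine; the EER is taken where FPR and FNR (1 - TPR) cross."""
    y = np.r_[np.ones(len(genuine_scores)), np.zeros(len(impostor_scores))]
    s = np.r_[genuine_scores, impostor_scores]
    fpr, tpr, _ = roc_curve(y, s)
    i = np.argmin(np.abs(fpr - (1 - tpr)))
    return roc_auc_score(y, s), (fpr[i] + 1 - tpr[i]) / 2
```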
Next, we wanted to compare our method to other groupings that are not based on hierarchical grouping but are more intuitive. Thus, we used two other groupings: (1) dividing the keys into five groups according to their type: letters, special characters (i.e., slash, comma, brackets, etc.), control keys (i.e., shift, caps lock, tab, etc.), and two groups of numbers (one group consists of the numbers above the letters on the keyboard, and the second consists of the numbers on the numpad); and (2) dividing the keys into five groups based on the keyboard's location zones. The latter was suggested by Revett and Khan (2005), who showed that keystroke dynamics performance improves when users select a password assembled from multiple location zones on the keyboard.
Fig. 2. Average AUC for N number of groups.
We compared these two grouping methods to our hierarchical grouping with the fast best heuristic, halted at exactly five groups as well. The plots in Fig. 4 (a–c) illustrate the keyboard division produced by the three methods. A comparison of the performance of the classifiers using the three groupings is presented in Table 1 (average AUC) and Table 2 (average EER).
We can see that our method consistently outperforms the other two. Our method's grouping AUC was on average 10.8% higher than the key type grouping (average of per-classifier gaps of 7, 7, 6, 11, and 23) and on average 3.2% higher than the location zones grouping (average of 3, 2, 2, 3, and 6). We see complementary results for the average EER as well, with our method showing a 12% lower average EER than the key type grouping (average of 9, 8, 8, 14, and 21) and a 4.2% lower average EER than the location zones grouping (average of 5, 3, 3, 4, and 6).
Fig. 3. Average EER for N number of groups.
Table 1
Classifiers' average AUC of the five groups divided by the three grouping methods.

                  Gaussian   Outlier   Manhattan   M2005   Self-detector
Keys type           0.89       0.90       0.89      0.86        0.65
Location zones      0.93       0.95       0.93      0.94        0.82
Our method          0.96       0.97       0.95      0.97        0.88
Table 2
Classifiers' average EER of the five groups divided by the three grouping methods.

                  Gaussian   Outlier   Manhattan   M2005   Self-detector
Keys type           0.17       0.15       0.16      0.21        0.37
Location zones      0.13       0.10       0.11      0.11        0.22
Our method          0.08       0.07       0.08      0.07        0.16
Fig. 4. Keyboard plot of the five groups of keys based on the three grouping methods.
When running our hierarchical grouping method, we witnessed an interesting phenomenon. We initially suspected that with every step, the merging of two groups would result in a loss of information, and since we subtract the post-merge score from the pre-merge score, the heuristic score of the fast best combination algorithm would always be positive; hence, the accuracy of the classifiers applied to the groups would always decrease. This is a reasonable assumption, as with each merge of groups we must "pay" for the extra privacy we obtain. We naturally always aspire to the minimal loss score, which will hopefully yield a minimal reduction in accuracy. However, throughout the experiments, the heuristic score wasn't necessarily positive; in some cases it was zero, and in others it was even negative. From this we intuited that some merges might actually improve the accuracy of the classifiers, in addition to the privacy. Given this, we performed a more in-depth inspection, using one classifier (the Gaussian) and running it for each 1 ≤ N ≤ 81. The heuristic score and the classifier's AUC at each step are presented in Fig. 5 (a & b). Fig. 5 (a) presents the heuristic score (y-axis) of the best possible merge according to our method for each number of groups N (x-axis), whereas Fig. 5 (b) shows the actual AUC of the classifier (y-axis) for the same N
number of groups (x-axis). We square rooted the absolute value of the heuristics score to make it more visible. Examining Fig. 5 (a) we can see that when N < 12, the fast best algorithm predicts a positive heuristic score, which means that the predictions suggest that this merge, will cause some deterioration in the performance of the classifier, which is our initial assumption. Indeed, we can see that this prediction is realized in Fig. 5 (b) where the real AUC score is decreasing. But when moving forward to when 12 ≤ N < 40, we can see that the fast best algorithm predicts a zero heuristics score, which means that the prediction suggests that if we merge the groups with the zero score, the classifier’s performance won’t change. And here again we can see that the real AUC does not change much in this area, as reflected in a flat line. This can also happen due to merging pairs that are not very usable. The interesting phenomenon take place when N ≥ 40 in which the fast best algorithm give an unexpected negative heuristics score for the best merge, which means the heuristics suggest that merging of some groups is even better than using each group separately. And we can certainly see in Fig. 5 (b) there is almost a continuous increase in the classifier’s AUC when N ≥ 40. This means that the classifier’s performance does improve when the fast best algorithm, predicts a negative heuristics score. Although it is true that the predictions aren’t always correct, and we can also observe certain spikes where N = 70 and N = 50
Fig. 5. Heuristic score versus real AUC on the Gaussian classifier.
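For intuition, the following sketch shows the greedy loop described above: at each step, every candidate pair of groups is scored as (estimated loss after the merge) minus (loss before it), and the lowest-scoring merge is performed, so zero and negative scores correspond to merges predicted to be harmless or even beneficial. This is a deliberately simplified, one-level sketch; the paper's full mechanism is three-dimensional and its statistical heuristics are not reproduced here, so `heuristic_score` is a named placeholder, not the actual implementation.

```python
import itertools

def heuristic_score(groups, i, j, reference_keystrokes):
    """Placeholder for the paper's statistical heuristic: an estimate of
    (classifier loss after merging groups i and j) minus (loss before).
    Positive -> the merge is predicted to hurt accuracy, zero -> neutral,
    negative -> the merge is predicted to help."""
    raise NotImplementedError("depends on the chosen statistical heuristic")

def fast_best_grouping(keys, reference_keystrokes, target_n):
    """Greedy hierarchical grouping: start from singleton groups and
    repeatedly perform the candidate merge with the lowest heuristic score
    until only target_n groups remain."""
    groups = [{k} for k in keys]
    while len(groups) > target_n:
        i, j = min(itertools.combinations(range(len(groups)), 2),
                   key=lambda ij: heuristic_score(groups, ij[0], ij[1],
                                                  reference_keystrokes))
        groups[i] |= groups[j]  # merge group j into group i
        del groups[j]           # j > i, so index i is unaffected
    return groups
```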
A statistical evaluation of the same data shows that when the heuristic predicts a positive score, the AUC has a 91% chance of decreasing or staying the same, and when the heuristic predicts a negative score, the AUC has an 88% chance of increasing or staying the same. Thus, we can conclude that our method is not only capable of improving users' privacy; it can also improve the verification performance of the classifiers that use it, especially at larger values of N.
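As a precise statement of these conditional statistics, the short sketch below computes them from logged (heuristic score, AUC change) pairs; the `merge_log` input and variable names are illustrative assumptions.

```python
def prediction_reliability(merge_log):
    """merge_log: list of (heuristic_score, delta_auc) pairs, one per merge.
    Returns P(AUC decreases or stays the same | positive score) and
            P(AUC increases or stays the same | negative score)."""
    pos = [d for s, d in merge_log if s > 0]
    neg = [d for s, d in merge_log if s < 0]
    p_pos = sum(1 for d in pos if d <= 0) / len(pos)
    p_neg = sum(1 for d in neg if d >= 0) / len(neg)
    return p_pos, p_neg  # the paper reports roughly 0.91 and 0.88
```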
8. Conclusion and future work

In this paper, we presented one of the privacy risks involved in applying keystroke dynamics to sensitive information such as passwords: unlike passwords, which can be hashed, keystroke dynamics events, features, and models cannot be protected with common encryption techniques, and this leaves keystroke dynamics vulnerable. We considered the possibility of using existing grouping methods, such as the di-graph grouping proposed by Shimshon et al. (2010), but showed that they do not precisely fit the obfuscation task. We therefore presented a new obfuscation method based on key grouping and showed that it can efficiently improve the security of keystroke dynamics systems while maintaining high performance of the underlying keystroke dynamics algorithms. Our method is based on a three-dimensional hierarchical grouping mechanism with a heuristic scoring algorithm, which we refer to as the fast best algorithm, that identifies the best possible merge at each step and performs it in a greedy manner. We evaluated our method with five different state-of-the-art keystroke dynamics algorithms on a public dataset and showed that it performed better than several other methods, including two more intuitive groupings. Lastly, we presented the interesting phenomenon arising in the fast best scoring used in the hierarchical grouping: the heuristic can predict the consequences of a merge and assign a corresponding negative, positive, or neutral score. We found that when a negative score is given, the merge has the potential to improve the overall accuracy of the keystroke dynamics algorithm used. Hence, our method works well not only for obfuscation; it can also improve keystroke dynamics algorithms, particularly when the number of groups is high. Despite its strengths, our method is not without limitations. It requires a preprocessing phase in which the keystrokes of several representative users are needed for the hierarchical grouping, meaning that those implementing the method would probably need to set aside several users on which to base the grouping.
In addition, as with other methods aimed at many users, there will likely be users for whom the grouping does not fit; detecting these users is left for future work. We also found that in some cases there were "spikes" in the heuristic scoring that highlighted merges that were actually misleading. The fact that there were only two spikes out of the many possible merges across the 80 steps of the hierarchical clustering is encouraging, but it still suggests that more users are probably needed in the preprocessing phase to avoid them. Another limitation of our method is that we must ensure that the groups do not degenerate into singletons, since in that case the identity of one key would be revealed immediately. In the future, we plan to enhance our method by focusing on the creation of non-greedy algorithms. Although greedy algorithms solve hard computational problems quickly, they might converge to a local maximum, whereas we seek the global one. The use of non-greedy algorithms thus increases the chances of converging to the global maximum and eventually producing better groupings, which in turn will yield more accurate keystroke dynamics algorithms. For this reason, we believe our method is just the first of many possible approaches and will serve as a foundation for future improvements.

Declaration of Competing Interest

The first author (Itay Hazan) and the second author (Oded Margalit) are IBM employees. All authors approved the manuscript being submitted. The article is the authors' original work, has not received prior publication, and is not under consideration for publication elsewhere.

CRediT authorship contribution statement

Itay Hazan: Conceptualization, Methodology, Software, Investigation, Validation, Visualization, Writing - original draft. Oded Margalit: Conceptualization, Writing - review & editing, Supervision. Lior Rokach: Conceptualization, Writing - review & editing, Supervision.

References

Al Solami, E., Boyd, C., Clark, A., & Islam, A. K. (2010). Continuous biometric authentication: Can it be more practical? In High performance computing and communications (HPCC), 2010 12th IEEE international conference on (pp. 647–652). IEEE.
Araújo, L. C., Sucupira, L. H., Lizarraga, M. G., Ling, L. L., & Yabu-Uti, J. B. T. (2005). User authentication through typing biometrics features. IEEE Transactions on Signal Processing, 53(2), 851–855.
Banerjee, S. P., & Woodard, D. L. (2012). Biometric authentication and identification using keystroke dynamics: A survey. Journal of Pattern Recognition Research, 7(1), 116–139.
Das, A., Bonneau, J., Caesar, M., Borisov, N., & Wang, X. (2014). The tangled web of password reuse. In NDSS: 14 (pp. 23–26).
Deng, Y., & Zhong, Y. (2013). Keystroke dynamics user authentication based on Gaussian mixture model and deep belief nets. ISRN Signal Processing, 2013.
Dowland, P. S., & Furnell, S. M. (2004). A long-term trial of keystroke profiling using digraph, trigraph and keyword latencies. In Security and protection in information processing systems (pp. 275–289). Springer.
Giot, R., Dorizzi, B., & Rosenberger, C. (2015). A review on the public benchmark databases for static keystroke dynamics. Computers & Security, 55, 46–61.
Giot, R., El-Abed, M., & Rosenberger, C. (2009a). Greyc keystroke: A benchmark for keystroke dynamics biometric systems. In Biometrics: Theory, applications, and systems, 2009. BTAS'09. IEEE 3rd international conference on (pp. 1–6). IEEE.
Giot, R., El-Abed, M., & Rosenberger, C. (2009b). Keystroke dynamics authentication for collaborative systems. arXiv preprint arXiv:0911.3304.
Giot, R., El-Abed, M., & Rosenberger, C. (2009c). Keystroke dynamics with low constraints SVM based passphrase enrollment. In Biometrics: Theory, applications, and systems, 2009. BTAS'09. IEEE 3rd international conference on (pp. 1–6). IEEE.
Giot, R., El-Abed, M., & Rosenberger, C. (2012). Web-based benchmark for keystroke dynamics biometric systems: A statistical analysis. In Intelligent information hiding and multimedia signal processing (IIH-MSP), 2012 eighth international conference on (pp. 11–15). IEEE.
Gunetti, D., & Picardi, C. (2005). Keystroke analysis of free text. ACM Transactions on Information and System Security (TISSEC), 8(3), 312–347.
Haider, S., Abbas, A., & Zaidi, A. K. (2000). A multi-technique approach for user identification through keystroke dynamics. In Systems, man, and cybernetics, 2000 IEEE international conference on: 2 (pp. 1336–1341). IEEE.
Hocquet, S., Ramel, J. Y., & Cardot, H. (2006). Estimation of user specific parameters in one-class problems. In Pattern recognition, 2006. ICPR 2006. 18th international conference on: 4 (pp. 449–452). IEEE.
Hosseinzadeh, D., & Krishnan, S. (2008). Gaussian mixture modeling of keystroke patterns for biometric applications. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(6), 816–826.
Hwang, S. S., Lee, H. J., & Cho, S. (2009). Improving authentication accuracy using artificial rhythms and cues for keystroke dynamics-based authentication. Expert Systems with Applications, 36(7), 10649–10656.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264–323.
Killourhy, K. S., & Maxion, R. A. (2009). Comparing anomaly-detection algorithms for keystroke dynamics. In Dependable systems & networks, 2009. DSN'09. IEEE/IFIP international conference on (pp. 125–134). IEEE.
Magalhães, P. S. T., & Santos, H. D. D. (2005). An improved statistical keystroke dynamics algorithm.
Messerman, A., Mustafić, T., Camtepe, S. A., & Albayrak, S. (2011). Continuous and non-intrusive identity verification in real-time environments based on free-text keystroke dynamics. In Biometrics (IJCB), 2011 international joint conference on (pp. 1–8). IEEE.
Monaco, J. V., Stewart, J. C., Cha, S. H., & Tappert, C. C. (2013). Behavioral biometric verification of student identity in online course assessment and authentication of authors in literary works. In Biometrics: Theory, applications and systems (BTAS), 2013 IEEE sixth international conference on (pp. 1–8). IEEE.
Monaco, J. V., & Tappert, C. C. (2016). Obfuscating keystroke time intervals to avoid identification and impersonation. arXiv preprint arXiv:1609.07612.
Moskovitch, R., Feher, C., Messerman, A., Kirschnick, N., Mustafic, T., Camtepe, A., & Elovici, Y. (2009). Identity theft, computers and behavioral biometrics. In Intelligence and security informatics, 2009. ISI'09. IEEE international conference on (pp. 155–160). IEEE.
Pisani, P. H., Lorena, A. C., & de Carvalho, A. C. (2015). Adaptive positive selection for keystroke dynamics. Journal of Intelligent & Robotic Systems, 80(1), 277–293.
Rahman, K. A., Balagani, K. S., & Phoha, V. V. (2013). Snoop-forge-replay attacks on continuous verification with keystrokes. IEEE Transactions on Information Forensics and Security, 8(3), 528–541.
Revett, K., & Khan, A. (2005). Enhancing login security using keystroke hardening and keyboard gridding. IADIS.
Serwadda, A., & Phoha, V. V. (2013). Examining a large keystroke biometrics dataset for statistical-attack openings. ACM Transactions on Information and System Security (TISSEC), 16(2), 8.
Shimshon, T., Moskovitch, R., Rokach, L., & Elovici, Y. (2010). Clustering di-graphs for continuously verifying users according to their typing patterns. In Electrical and electronics engineers in Israel (IEEEI), 2010 IEEE 26th convention of (pp. 000445–000449). IEEE.
Sim, T., & Janakiraman, R. (2007). Are digraphs good for free-text keystroke dynamics? In Computer vision and pattern recognition, 2007. CVPR'07. IEEE conference on (pp. 1–6). IEEE.
Stanciu, V. D., Spolaor, R., Conti, M., & Giuffrida, C. (2016). On the effectiveness of sensor-enhanced keystroke dynamics against statistical attacks. In Proceedings of the sixth ACM conference on data and application security and privacy (pp. 105–112). ACM.
Stefan, D., Shu, X., & Yao, D. D. (2012). Robustness of keystroke-dynamics based biometrics against synthetic forgeries. Computers & Security, 31(1), 109–121.
Stibor, T., Mohr, P., Timmis, J., & Eckert, C. (2005). Is negative selection appropriate for anomaly detection? In Proceedings of the 7th annual conference on genetic and evolutionary computation (pp. 321–328). ACM.
Teh, P. S., Teoh, A. B. J., Tee, C., & Ong, T. S. (2010). Keystroke dynamics in password authentication enhancement. Expert Systems with Applications, 37(12), 8618–8627.
Teh, P. S., Teoh, A. B. J., & Yue, S. (2013). A survey of keystroke dynamics biometrics. The Scientific World Journal, 2013.
Tey, C. M., Gupta, P., & Gao, D. (2013). I can be you: Questioning the use of keystroke dynamics as biometrics.
Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244.
Yu, E., & Cho, S. (2003). GA-SVM wrapper approach for feature subset selection in keystroke dynamics identity verification. In Neural networks, 2003. Proceedings of the international joint conference on: 3 (pp. 2253–2257). IEEE.