Android-GAN: Defending against android pattern attacks using multi-modal generative network as anomaly detector


Sang-Yun Shin, Yong-Won Kang, Yong-Guk Kim∗

Department of Computer Engineering, Sejong University, 209, Neungdong-ro, Gwangjin-gu, Seoul, Korea

Article history: Received 24 March 2019; Revised 17 September 2019; Accepted 18 September 2019; Available online 19 September 2019.

Keywords: Android pattern; GAN; LSTM; Anomaly detection; Authentication

Abstract

The Android pattern lock system is a popular form of user authentication extensively used in mobile phones today. However, it is vulnerable to potential security attacks such as shoulder surfing, camera attacks and smudge attacks. This study proposes a new kind of authentication system based on a generative deep neural network that can defend against attacks by imposters while admitting only the registered user. The network adopts the anomaly detection paradigm, in which only normal data is used to train the network. For this purpose, we utilize both Generative Adversarial Networks, as an anomaly detector, and Long Short Term Memory, which processes the 1D time-varying signals converted from 2D Android patterns. To handle the stability problem of GANs during training, the Replay Buffer, which has been used effectively in Deep Q-Networks, is also utilized. The proposed method was evaluated thoroughly, and its accuracy reached 0.95 in terms of the Area Under the Curve (AUC). Although training this network requires extensive computing resources, it runs well on a mobile phone since the testing version is very light. Further experiments conducted with a group of mobile phone users, including a posture variation study, provided comparable performance as well. The results suggest that the proposed system has potential for real-world application.

1. Introduction

Today, smartphones are essential in our daily lives, as we use them for web browsing, e-mailing, financial transactions, and more. As the amount of valuable data within these phones increases, diverse malicious attempts are made to steal private information from them; therefore, demand for stronger security measures has also increased. Defending smartphones against diverse attacks is ultimately linked to the authentication issue. Though password and Personal Identification Number (PIN) systems have been widely used in the desktop environment, smartphone users seem to prefer user-friendly authentication methods such as pattern locks or biometric systems in which fingerprint, voice, face, or iris are used to preserve the identity of each user (Bicego, Lagorio, Grosso, & Tistarelli, 2006; Crouse, Han, Chandra, Barbello, & Jain, 2015; Heck & Larry, 2003; Khan, Zhang, & Wang, 2008; Kim, Chung, & Hong, 2010; Trewin et al., 2012; Xi, Ahmad, Fengling, & Hu, 2011; Yang & Verbauwhede, 2007). The Android pattern lock system has long been used for authentication on touchscreen devices such as smartphones, and 40% of Android users are known to use a pattern lock rather than a PIN for



∗ Corresponding author. E-mail addresses: [email protected] (S.-Y. Shin), [email protected] (Y.-W. Kang), [email protected] (Y.-G. Kim). https://doi.org/10.1016/j.eswa.2019.112964

their smartphone locks (Loge & Dybevik, 2015). However, it is vulnerable to smudge attacks (Kwon & Na, 2014) and shoulder surfing. Previous studies have shown that the security level under such attacks can be high or low depending on the conditions (Aviv, Davin, Wolf, & Kuber, 2017; Aviv, Gibson, Mossop, Blaze, & Smith, 2010). Moreover, a recent study shows that any pattern lock can be attacked using video content analysis of a user in the process of authentication (Ye et al., 2017). One way to handle such problems is for users to create more complicated patterns, reducing these vulnerabilities at some cost in usability (Harbach, Luca, & Egelman, 2016). As a new form of touch analytics for smartphone users, this study employs recent deep learning techniques to deal with the situation where the Android pattern lock system is actually compromised by one of these attacks: shoulder surfing, smudge attack or camera attack. Given that an imposter already knows the target pattern, all we can do to defend against the attack is to discriminate the real user from fake users. To deal with this problem, we consider the GAN the best candidate among diverse deep neural networks, mainly because its architecture is designed to discriminate real signals from fake ones. Note that if such a network carries out this task successfully, it plays the role of an anomaly detector (Mehrotra, Mohan, & Huang, 2017) that can segregate a genuine user from many imposters.


Fig. 1. Schematic diagrams of Generative Adversarial Network (GAN) (a), Deep Reinforcement Learning (b) and Actor-Critic reinforcement learning (c).

In general, recent deep learning techniques span at least three representative learning paradigms and their corresponding networks: the Convolutional Neural Network (CNN) (Lecun et al., 1995) and Long Short Term Memory (LSTM) for supervised learning, the GAN (Goodfellow et al., 2014) for unsupervised learning, and the Deep Q-Network (DQN) (Mnih et al., 2015) with its Replay Buffer (RB) for reinforcement learning, as shown in Fig. 1(a) and (b). Though every Android pattern swiped on a touch screen has a 2D trajectory shape, we transform it into a 1D temporal signal during the pre-processing step; to deal with such a signal, an LSTM is employed in the system. It is known that training a GAN is difficult since its stability is not always guaranteed. Inspired by the fact that the performance of DQN can be much improved by adding an RB within the feedback loop, as shown in Fig. 1(b), we introduce an RB into our GAN in this study. Although the model contains both the GAN and the RB during its training phase, it requires only the discriminator part when it actually runs on a target device, i.e. a smartphone, and is therefore fast and light. The contributions of this study are:

1. In contrast to conventional authentication methods that extract features from a group of users' data and use them to distinguish a registered user from others, we adopt a GAN-based anomaly detection paradigm that utilizes only one user's data and generates a vast amount of synthetic data, treating it as potential attackers during training. To the best of our knowledge, this is the first report that employs GANs in defending against Android pattern attacks.

2. To enhance stability during GAN training, we introduce an RB that has typically been used in reinforcement learning. We find that the RB stabilizes the generator of the GAN in producing high-quality synthetic data, which leads to improved system performance.

3. Although GANs have been used for anomaly detection in uni-modal settings before, the proposed system has a multi-modal GAN architecture in which two touch features, i.e. trajectory and pressure, are used to improve performance. We find that the contributions of the two features differ with the complexity of the Android pattern: the contribution of trajectory is high at lower complexity, whereas that of pressure is high at higher complexity.

4. One of the standing questions in touch analytics is what happens when users take different postures, in the present case while entering an Android pattern. A systematic investigation was carried out to address this issue by varying posture: sitting, lying down, lying prone and standing.

5. Our framework consists of touch data collection from users, model training and validation on the server station, and testing on mobile phones. The results can be used as a baseline for future research, and an application for touch data collection is released for public use.

2. Related work

Recently, there have been several studies on enhancing the performance of touch-based authentication such as the Android pattern lock system, either by adding more features to the authentication steps or by recommending a secure pattern (Cho et al., 2017; Tupsamudre, Vaddepalli, Banahatti, & Lodha, 2018). However, as it has been shown that users prefer simple patterns over complex ones (Andriotis, Oikonomou, Mylonas, & Tryfonas, 2016), the inevitable trade-off between usability and security has been an issue when adding more features to the authentication steps. A possible solution to this issue is adopting a machine learning model.

2.1. Machine learning approaches for keystroke dynamics

Xu, Zhou, and Lyu (2014) showed that features from the touch screen could be used as indicators for touch-data-based authentication schemes. In their study, several classifiers were trained to distinguish the right user using data from all users; since they defined the problem as binary classification, the classifier was to recognize the right user while classifying all other data as attackers. Using similar touch features, Antal et al. (2015) showed that machine learning models could be trained for keystroke dynamics, so that features such as touch pressure could be used to identify the registered user among others; Random Forest showed the best performance among several classifiers.

2.2. Machine learning approaches for android pattern lock

Angulo and Wästlund (2011) reported that classifiers such as SVM and Random Forest could learn features to distinguish the right user in a pattern lock authentication scheme, and found that the Random Forest classifier outperformed the others on their three Android patterns. Although there have been many studies employing machine learning, a fundamental issue resides in the training sets: because previous studies trained their models using data from both authenticated users and attackers, those models cannot guarantee consistent performance, especially when attackers come from outside their training set. To handle this issue, we adopt an anomaly detection scheme in which only the data of the authentic user is used for training, while testing uses all kinds of data, including that of attackers. Our study is an initiative for deep-learning-based mobile authentication, particularly one adopting a generative model.

3. Deep learning-based authentication

Deep-learning-based authentication methods have centered on face, facial expression, and speech recognition. In the present study, the proposed network utilizes the GAN architecture for anomaly detection. Brief explanations of the relevant networks are presented in Sections 3.1-3.4, followed by a description of our approach in Section 3.5.


Fig. 2. Our network during training (a), testing (b) and their input data (c). The general structure is similar to a GAN, with an LSTM feature descriptor for dealing with temporal 1D data and a replay buffer for GAN stability. Both Real (R) and Fake (F) data are accumulated in order within the replay buffer, which maintains 1,000 entries in FIFO style; 64 entries are then randomly sampled as a batch and given to the discriminator. The replay buffer prevents the generator from sticking to an undesired state while training, called the correlation problem. Once training is completed, the network runs without the replay buffer and the generators, as shown in (b), to tell whether the input comes from the authentic user or not. Touch data, consisting of trajectory and pressure, from 36 subjects are used for this experiment. Each subject swipes 20 times for each Android pattern; half of them (n = 10), enclosed with a red line, are used for training, and the other half (n = 10), together with all data from the remaining 35 subjects, enclosed with a green line, for testing, as illustrated in (c). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.1. Convolutional neural network

Lecun et al. (1995) proposed a deep neural network called the convolutional neural network. Within a trained convolutional layer, the feature maps use different weights to extract different features from a single piece of data. Together, these components yield features that are invariant to shifts and distortions of the data. After the features have been extracted from the input by several convolutional layers, a CNN typically has a softmax (Wiki, 2019) function in the last layer for general classification problems. Since the CNN is very effective at extracting important features from data, it is often used as a powerful classifier; indeed, the discriminator in our GAN contains a CNN, as shown in Fig. 2. A minimal sketch of such a classifier is given below.
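The following is an illustrative sketch, not the authors' code: a small 1D-CNN classifier in tf.keras whose convolutional stack loosely mirrors the discriminator layers listed later in Table 2 (kernel 4, stride 1, 40 and 2 feature maps); the pooling layer and the 2-class softmax head are our own placeholder choices.

```python
# Illustrative 1D-CNN classifier; conv layers loosely follow Table 2,
# while the pooling and softmax head are placeholder choices.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200, 1)),               # 200-step 1D signal
    tf.keras.layers.Conv1D(40, kernel_size=4, strides=1, activation='relu'),
    tf.keras.layers.Conv1D(2, kernel_size=4, strides=1, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation='softmax'),      # softmax output layer
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```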

3.2. Long short term memory

Along with the CNN, the Recurrent Neural Network (RNN) is a class of artificial neural network useful for dealing with time-varying signals. The hidden nodes of an RNN connect edges in a certain direction, forming a cycle. The reason the RNN is called recurrent is that the same task is performed for every element in a sequence, and the outputs are influenced by previous computations (Graves, Mohamed, & Hinton, 2013); the RNN thus has access to all previous computations up to the present one. The main difference between the RNN and other artificial neural networks is the presence of feedback loops, which create recurrent connections; it can learn contextual information within a sequence using these feedback loops. The Long Short Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) is a variant of the RNN designed for learning long temporal sequences. The traditional RNN typically suffers from the vanishing gradient problem, in which errors shrink or grow exponentially during back-propagation. Memory blocks in an LSTM are connected in a recurrent structure. Although a touch trajectory from a phone is 2D data consisting of x, y coordinates, we first transform it into 1D data that varies in the time domain, a step we call pre-processing (see Section 4.1). To process this data, we use LSTMs as feature descriptors as shown in Fig. 2(a): one LSTM is for the real touch data and another for the data from the generators. The discriminator then tells whether the generated data is real or fake by comparing the two.

3.3. Generative adversarial network (GAN)

Recently, Goodfellow et al. (2014) introduced the GAN, as shown in Fig. 1(a). They train a generative model that implements a two-player zero-sum game between a discriminator D, whose function is to identify the real from the fake, and a generator G, whose function is to generate input (from a latent vector) that fools the discriminator. In each step, the generator produces an example that has the potential to fool the discriminator; given the real data and the example produced by the generator, the discriminator must tell which is real and which is fake. Here, the discriminator is rewarded for answering


correctly, and the generator for generating examples that fool the discriminator. Both models are then updated, and the next cycle begins. This process can be formalized as follows. Let $X = x_1, \ldots, x_n$ be the dataset from a real user, labeled as 1, with dimensionality $I$ ($x \in \mathbb{R}^I$), and let $D$ denote the discriminative function and $G$ the generator function. Then $G$ maps latent vectors $z \in \mathbb{R}^Z$ to generated inputs $\tilde{x} = G(z)$, and $D$ predicts the probability that an example $x$ is present in the dataset. The objectives of $D$ and $G$ are:

$$\min_{G} \max_{D} \; \mathbb{E}_z[\log(1 - D(G(z)))] + \mathbb{E}_x[\log(D(x))] \tag{1}$$

where the latent vector $z$ has a uniform distribution. If both the discriminator and the generator are implemented as Deep Neural Networks (DNNs) and each network is trained separately, they can be trained by alternating stochastic gradient descent (SGD) steps on this objective. There are many variations of GANs, such as DC-GAN (Radford, Metz, & Chintala, 2015), CycleGAN (Zhu, Park, Isola, & Efros, 2017), and Wasserstein GAN (Arjovsky, Chintala, & Bottou, 2017).
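As a concrete illustration of the alternating SGD updates behind Eq. (1), the following is a minimal sketch, not the authors' code; it assumes `G` and `D` are tf.keras models, and borrows the optimizer choices listed later in Table 2 (SGD for the discriminator, Adam for the generator).

```python
# Minimal alternating-update sketch for Eq. (1); `G` and `D` are assumed
# tf.keras models. Optimizers follow Table 2 (SGD for D, Adam for G).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
opt_d = tf.keras.optimizers.SGD(learning_rate=0.001)
opt_g = tf.keras.optimizers.Adam(learning_rate=0.001)

def train_step(G, D, real_batch, z_dim=100):
    z = tf.random.uniform((tf.shape(real_batch)[0], z_dim))
    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0.
    with tf.GradientTape() as tape:
        d_real, d_fake = D(real_batch), D(G(z))
        d_loss = bce(tf.ones_like(d_real), d_real) + \
                 bce(tf.zeros_like(d_fake), d_fake)
    opt_d.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    # Generator step: push D(G(z)) toward 1 to fool the discriminator.
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones_like(D(G(z))), D(G(z)))
    opt_g.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
```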

3.4. Reinforcement learning and replay buffer

Reinforcement learning (RL) is a machine learning technique concerned with how a software agent takes actions in an environment so as to maximize some notion of cumulative reward. The environment is typically formulated as a Markov Decision Process (MDP). Q-learning is a model-free reinforcement learning technique that can nevertheless handle problems with stochastic transitions and rewards; it can find an optimal policy that maximizes the total reward over successive steps. The Deep Q-Network (DQN) uses a convolutional neural network with layers of tiled convolutional filters. In a sequential decision-making problem such as an MDP, an agent interacts with an environment over discrete time steps (Sutton & Barto, 1998), as shown in Fig. 1(b). For instance, in an Atari game, an agent observes frames from a video at time step $t$:

$$s_t = (x_{t-p+1}, \ldots, x_t) \in S, \qquad a_t \in A = \{1, \ldots, |A|\} \tag{2}$$

Eq. (2) shows the ingredients of the state, consisting of several video frames at time step $t$: here $x$ is a frame, $t$ the time step, and $|A|$ the number of actions; $s$ is the state the agent observes and $a$ an action the agent takes. The agent chooses an action $a_t$ from the set of possible ones and then receives a reward signal $r_t$. The goal of the agent is, of course, to maximize the cumulative reward over the time steps:

$$R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau - t} r_\tau \tag{3}$$

where $r_\tau$ is the reward at time step $\tau$ and $\gamma \in [0, 1]$ is the discount factor that makes the agent trade off immediate against future rewards. One of the issues in early deep reinforcement learning was the correlation of the data: performance deteriorates significantly because the gradient is updated with mini-batches that may contain only adjacent, highly correlated samples. The replay buffer addresses this by storing experiences

$$e_t = (s_t, a_t, r_t, s_{t+1}) \tag{4}$$

$$B = (e_1, e_2, \ldots, e_n) \tag{5}$$

Definitions (4) and (5) show how experiences are accumulated in the buffer $B$, where $s_t$ is a state, $a_t$ an action, $r_t$ a reward, and $s_{t+1}$ the next state caused by $a_t$.

3.5. Anomaly detection for android pattern lock

In contrast to typical applications of GANs, where the focus is usually on the generator, it has recently been shown that a GAN can be trained as an anomaly detector that can tell the difference between normal and abnormal tissue (Schlegl, Seebck, Waldstein, Schmidt-Erfurth, & Langs, 2017). Likewise, the discriminator of a GAN can act as an anomaly detector for anomalous objects in video (Ravanbakhsh, Sangineto, Nabi, & Sebe, 2019). In this study, we demonstrate that GANs can be trained to discriminate real Android patterns from fake ones, and that their performance becomes very strong when other deep learning techniques are utilized. Our proposed model, consisting of GANs, RBs and LSTMs, is shown in Fig. 2(a) in its training mode, whereas Fig. 2(b) shows the testing mode. Notice that the touch data collected from subjects is divided into normal and abnormal data. In Fig. 2(a), real data (normal) and fake data (generated) are used for training the network; after training, the abnormal data and the remaining half of the normal data are used for testing, as shown in Fig. 2(b) and (c). The architecture of our network during training faithfully follows that of the original GAN (Goodfellow et al., 2014). During training, 1D data with random noise added (see Section 4.1 for pre-processing) is fed into an LSTM feature descriptor, a necessary component for dealing with temporal data, as shown in Fig. 4. When the generator creates an output signal that mimics the real data, the discriminator evaluates the similarity between the output signal and the actual data. If the output of the generator is almost indistinguishable from the real data, the discriminator classifies the output as real; otherwise it classifies the output as fake. During this process, with feedback from the discriminator, the generator learns to generate signals that could be classified as real, while the discriminator, in turn, tries to make accurate classifications. Thus the discriminator becomes an anomaly detector: the generator generates a vast amount of data, and the discriminator learns to classify all of it as 'fake'. In fact, this process can also be explained via the generator's capability to imitate real data, as shown by Ravanbakhsh et al. (2019). Therefore, given the real data and a large amount of imperfect data from the generator, the discriminator learns to distinguish between the registered user and potential attackers very well. However, this process is not easily implemented, because the generator often fails to generate data that is nearly good enough for the discriminator to learn from. To overcome this problem, this study adopts an RB. It is well known that the RB plays an important role in deep reinforcement learning (Mnih et al., 2015), and a recent study has pointed out that reinforcement learning and GANs have several structural and functional similarities (Pfau & Vinyals, 2016), suggesting that a component of RL such as the replay buffer may solve some problems of GANs. Indeed, we find that the replay buffer plays an important role in the Android-GAN proposed in this study (see Section 4.3 for details). Once our network has completed training and works in testing mode, the generator and replay buffers are no longer necessary, as shown in Figs. 2(b) and 4.

4. The network and processes

The touch data we use for this study consists of touch trajectory and touch pressure, recorded from the screen sensor of the smartphone.


Fig. 3. Illustration of the 15 Android patterns chosen for the present study. Note that, for each pattern, 10 touch trajectories are shown on the left and one touch pressure on the right, acquired from one subject. According to the PSM, these patterns can be grouped into 5 categories by pattern complexity: (1) Very weak, (2)-(6) Weak, (7)-(10) Medium, (11)-(13) Strong, (14)-(15) Very strong.

Fig. 4. A flowchart of the proposed Android-GAN system. Each touch datum is converted into 1D temporal data, which is fed to an LSTM descriptor. Both the normal data from a designated user and the latent vector are used in training the model (top), and the trained model is then used to discriminate between that user and attackers for a given Android pattern (bottom).

The present system has three stages, as shown in Fig. 4: first, a pre-processing stage that transforms the raw 2D touch data, consisting of touch trajectory and touch pressure, into temporal 1D data; secondly, a prediction stage in which an LSTM predicts the next value of the input 1D data, which we call the LSTM feature descriptor; thirdly, a classification stage between the real data and the generated data using the GANs. In contrast to conventional anomaly detection methods such as one-class SVM, Isolation Forest and Elliptic Envelope (Liu, Ting, & Zhou, 2008; Rousseeuw & Driessen, 1999; Schölkopf et al., 2001), we propose an anomaly detection method in which the GANs utilize both artificially generated data and augmented real data. For instance, since the number of training samples a user has to draw should be reasonably small (n = 10) for usability, we add random noise to the real data after pre-processing during training. The amplitude of the random noise is uniformly distributed over the interval from −0.01 to +0.01; therefore the maximum value of the data is +1.01 and the minimum −0.01 (see the sketch after this paragraph). The network typically generates about 210,000 synthetic data points for a specific pattern during training. Using these data, our network learns to sharply distinguish the real data from data that could come from impostors using diverse attack methods. Note again that we do not use any Android pattern data from imposters during training, as shown in Fig. 4, but only during testing to evaluate performance, just as most anomaly detection algorithms do. The following sections describe the data pre-processing method, the generation of synthetic data, and testing on mobile devices.
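The augmentation just described can be written compactly; the following is a minimal sketch under the stated assumptions (pre-processed vectors with values in [0, 1], uniform noise in [-0.01, +0.01]); the function name and signature are our own.

```python
# Noise-based augmentation sketch: replicate each of the n = 10
# pre-processed vectors (values in [0, 1]) with uniform noise in
# [-0.01, +0.01], so augmented values stay within [-0.01, +1.01].
import numpy as np

def augment(real_vectors, copies_per_vector):
    out = []
    for v in real_vectors:                     # v: shape (200,) or (400,)
        noise = np.random.uniform(-0.01, 0.01,
                                  size=(copies_per_vector, v.size))
        out.append(v[None, :] + noise)
    return np.concatenate(out, axis=0)
```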

4.1. Pre-processing

An Android mobile device produces several touch events, such as time-stamp, event type, touch coordinates, touch pressure and touch area, when the user touches the display. Since the touch area is known to be often unstable (Kolly, Wattenhofer, & Welten, 2012), the touch events adopted for the present study are touch trajectory and touch pressure, as shown in Figs. 3 and 5. What we found during the preliminary study is that, given the same pattern, it is very difficult to distinguish two raw touch trajectories acquired from different subjects. To make this process easier, we adopt a data pre-processing scheme initially proposed by Meng, Wong, Schlegel et al. (2012), in which a touch trajectory is represented by multiplying the speed by a weight determined by the movement direction. Each trajectory then becomes a 1D-vector, as shown in Fig. 5 (top). Because the length of a touch trajectory varies with the Android pattern as well as the subject, normalization to a fixed input length is applied during pre-processing. The coding scheme for the present pre-processing is shown in Fig. 5. Here, $X_n$, $Y_n$ stand for the $n$th touch coordinates along the x and y axes, respectively.

$$D_n = w_{t=n} \cdot speed_{t=n} \tag{6}$$

where $D_n$ stands for the pre-processed data at time step $n$, $w_{t=n}$ is the weight indicating the direction in which $(x_{n-1}, y_{n-1})$ moved at time step $n$, and $speed_{t=n}$ is the speed at $t = n$.


Fig. 5. The pre-processing for transforming an Android touch data, touch trajectory (Top) and touch pressure (Bottom), respectively. The coding scheme for touch trajectory consists of two components: moving direction and its velocity. There are 8 possible directions and corresponding weights such as upper-left (1), up (2), upper-right (3), left (4), right (5), down-left (6), down (7), down-right (8), respectively. Note that every 1D-vector has 200 time-steps and the amplitude of signal at each time-step varies from 0 to 1. The coding scheme for touch pressure has only one component, i.e. amplitude of touch pressure and its normalization value varies from 0 to 1.

Note that the touch trajectory is drawn in x and y coordinates, whereas the 1D-vector varies along time-steps. Each vector is normalized to 200 time-steps, and its amplitude varies from 0 to 1. For the touch pressure, we use the pressure values measured along these 200 time-steps and represent them as amplitudes, as illustrated in Fig. 5 (bottom). This normalization is essential because the time taken to draw a pattern and its length differ across subjects; the 1D-vector must be normalized and re-sized into a fixed-size input for the network to be trainable and testable. Moreover, since two different mobile phones with different touch screen sensors are used for collecting data as well as for testing, such a normalization procedure is necessary. The benefit of this coding scheme and normalization is that the system can operate on diverse touch-based Android phones without any extra calibration. The Scipy library (Jones, Oliphant, Peterson et al., 2001) was used for this normalization. A sketch of the coding and normalization is given below.
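The following is a minimal sketch of the trajectory coding of Eq. (6), not the authors' code: the exact direction-to-weight mapping of Fig. 5 and the function signature are our own assumptions, and Scipy's resample is used for the fixed 200-step length as the text indicates.

```python
# Trajectory coding sketch for Eq. (6): D_n = w_n * speed_n, followed by
# resampling to 200 time-steps and amplitude normalization to [0, 1].
# The angle-to-weight quantization (8 directions, weights 1..8) is an
# illustrative stand-in for the mapping of Fig. 5.
import numpy as np
from scipy.signal import resample

def encode_trajectory(xs, ys, ts):
    dx, dy, dt = np.diff(xs), np.diff(ys), np.diff(ts)
    speed = np.hypot(dx, dy) / np.maximum(dt, 1e-6)
    angle = np.arctan2(dy, dx)
    # Quantize the movement direction into 8 weights (1..8), cf. Fig. 5.
    weight = np.digitize(angle, np.linspace(-np.pi, np.pi, 9)[1:-1]) + 1
    d = weight * speed                        # Eq. (6)
    d = resample(d, 200)                      # fixed 200 time-steps (Scipy)
    return (d - d.min()) / (d.max() - d.min() + 1e-9)   # amplitude in [0, 1]
```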

4.2. LSTM feature descriptor

It is well known that training GANs is difficult because of their stability problem. There have been many studies on improving stability during GAN training, mostly for 2D image datasets (Salimans et al., 2016), and our preliminary study also confirmed that training a GAN on 1D data is not easy. To handle this, we introduce two additional components into the vanilla GAN: (1) an LSTM as a feature descriptor and (2) the replay buffer. Given that our 1D-vector varies along the time domain, the LSTM is a crucial component, as shown in Fig. 6(a). It receives the pre-processed data described in Section 4.1 as input and provides its output via a sigmoid function. However, if the generator is not well trained, the accuracy of the discriminator deteriorates, which negatively affects the accuracy of the entire system. To overcome this problem, we used the 10 real pieces of data to train this LSTM network, which in turn assists the generator. The LSTM learns to predict the next value from a time-step-sized window of the input data. As each 1D-vector consists of 200 values varying along the time domain, the LSTM is trained using the 10 1D-vectors of an Android pattern to recognize the unique features exhibited in it. Results suggest that such training makes it possible to distinguish the Android pattern of the trained (authentic) user from those of untrained users (imposters); that is, it maximizes the difference between authenticated and un-authenticated patterns. Moreover, the LSTM feature descriptor acts as a filter that removes the noise typically contained in the real data. With its help, the generator is able to produce synthetic data that closely resembles the real data. Fig. 6 shows how the LSTM network predicts values. Once an input datum is pre-processed, it is sequentially divided into windows of the time-step size and the LSTM feature descriptor learns to predict the next value. The time-step of the LSTM was set to 10. The feed-forward error of the LSTM and the loss used for optimization, computed with a sigmoid function, are given as follows:



$$\text{Error: } \delta = y - \frac{1}{1 + e^{-LSTM_{out}}} \tag{7}$$

$$\text{Loss} = \begin{cases} \dfrac{1}{2}\,\delta^2 & \text{if } |\delta| < 1 \\[4pt] |\delta| - \dfrac{1}{2} & \text{otherwise} \end{cases} \tag{8}$$

where $y$ is the next value after the 10-time-step input. For the multi-modal case, we simply concatenate the two pre-processed vectors: since the lengths of the touch trajectory and touch pressure vectors are both 200, each training vector for the multi-modal case has length 400, as illustrated in Fig. 2. The LSTM feature descriptor is trained before the GANs, since both the generator and the discriminator make use of it during their training. A minimal sketch of this descriptor is given below.
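To make the descriptor concrete, the following sketch reflects our own assumptions about shapes: a sliding 10-step window over a 200-step vector, an LSTM of 8 units as in Table 2, a sigmoid output as in Eq. (7), and a Huber-style loss standing in for Eq. (8).

```python
# LSTM feature descriptor sketch: predict the next sample of a 1D-vector
# from a sliding 10-step window; sigmoid output (Eq. 7), Huber-style loss
# (Eq. 8). The 8-unit LSTM follows Table 2.
import numpy as np
import tensorflow as tf

def make_windows(vec, step=10):
    # Turn one 200-step vector into (window, next-value) training pairs.
    X = np.stack([vec[i:i + step] for i in range(len(vec) - step)])
    y = vec[step:]
    return X[..., None], y          # add a channel axis for the LSTM

lstm_f = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 1)),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
lstm_f.compile(optimizer='adam', loss=tf.keras.losses.Huber())  # cf. Eq. (8)
```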


Fig. 6. The structure of our generator for generating abnormal samples (touch trajectory) (a) and the effect of replay buffer size on training (b).

4.3. Replay buffer (RB)

As Goodfellow et al. (2014) noted, there are several unstable factors in training GANs, often called mode collapse. For example, when either the discriminator or the generator converges faster than the other, overall performance deteriorates rapidly, given that both try to learn from each other. Our network is inspired by the essential role of the replay buffer in recent progress in deep reinforcement learning: first, the replay buffer has been an important component in improving the performance of DQN; secondly, it has been suggested that the critic within the Actor-Critic architecture shown in Fig. 1(c) is analogous to the discriminator, and the actor network to the generator (Pfau & Vinyals, 2016). Note that the contents of our replay buffer are much simpler than those of DQN or an AC network, since it stores only the training data and label, whereas a typical RB in reinforcement learning stores the state, action, reward, and next state. We utilize the RB not only to take advantage of data generated in the past but also to obtain a regularization effect by retrieving data randomly. That is, we initially have 10 training samples of touch trajectory and pressure; during training, we keep adding data up to the maximum size of the buffer, treating samples with random noise added as distinct data. With this approach, we are able to augment the real samples (positive) along with the fake samples (negative) from the generator, as shown in Fig. 2(b). Defining $x_{tr}$ as touch trajectory data, $x_{pr}$ as touch pressure data, and $LSTM_f$ as the LSTM feature descriptor, the training data $X$ for the discriminator with both features, i.e. touch trajectory and touch pressure, is given as:

$$X = \begin{cases} LSTM_f(x_{pr} \oplus x_{tr}) & \text{if real\_user\_data} \\ LSTM_f(G_{pr}(z_{pr}; \theta_{pr}) \oplus G_{tr}(z_{tr}; \theta_{tr})) & \text{if generated\_data} \end{cases} \tag{9}$$

and a transition in the RB, defined as in Eq. (4), is made as:

$$e_t = \begin{cases} (X_t, 1) & \text{if real\_user\_data} \\ (X_t, 0) & \text{if generated\_data} \end{cases} \tag{10}$$

where $\oplus$ stands for an operation of data concatenation. The replay buffer operates in FIFO (First-In First-Out) style, sequentially keeping a pre-defined amount of data; the data is nevertheless retrieved randomly, as in the reinforcement learning case, and goes into the discriminator as shown in Fig. 2. The more data stored in the replay buffer, up to 1,000 entries, the better the synthetic data generated, i.e. the closer it is to the real pattern, as shown in Fig. 6. The replay buffer is used for the real data pathway as well as for the generator pathway, as shown in Fig. 2(a). However, when the buffer size is decreased from 1,000 toward 0, the performance of the network deteriorates rapidly, as illustrated in Fig. 6, suggesting that the replay buffer plays a crucial role in training our GAN. A minimal sketch of such a buffer is shown below.
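The buffer stores (data, label) pairs only, as the text notes; the class and method names here are our own.

```python
# FIFO replay buffer sketch: at most 1,000 labeled examples
# (real = 1, generated = 0), served as random mini-batches of 64.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, max_size=1000):
        self.buf = deque(maxlen=max_size)    # FIFO: oldest entries drop out

    def store(self, x, label):
        self.buf.append((x, label))

    def sample(self, batch_size=64):
        xs, labels = zip(*random.sample(self.buf, batch_size))
        return list(xs), list(labels)
```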









4.4. Training and testing of the networks

The training of our network consists of two parts, training the discriminator and the generator, as illustrated in Algorithm 1, which shows the uni-modal training case for clarity. For the multi-modal case, two generators are used, but their training processes are identical to the uni-modal case. The discriminator (a binary classifier) is trained using output from the LSTM feature descriptor, which takes the 10 pieces of 1D data as input. These data are labeled 'true', whereas the data created by the generator are labeled 'false'. The generator receives the latent vector as input, and its output goes into the LSTM feature descriptor described in Section 4.2; after the pattern data has passed through the LSTM, the output is given to the discriminator as input. Let $\theta_{pr}$ denote the feed-forward parameters of the final-layer node used for estimating touch pressure, and $\theta_{tr}$ those of the final node used for estimating touch trajectory. Then our objective function for optimizing the discriminator network shown in Fig. 2 is:

$$\underset{D}{\text{maximize}}\; \mathbb{E}_x\left[\log D(x_{real}; \theta_{pr}, \theta_{tr}) + \log\left(1 - D(x_{fake}; \theta_{pr}, \theta_{tr})\right)\right] \tag{11}$$


Algorithm 1. Training and Testing of Our Network.

Training:
1: Input: X, half of the data of a registered user
2: Input: LSTM_f, the LSTM feature descriptor; N_b, the buffer size
3: Input: B_d, the buffer for the discriminator
4: Output: trained discriminator model
5: Initialize discriminator D and generator G
6: for each training step do
7:   x ← X + random noise
8:   Store (LSTM_f(x), 1) in B_d
9:   Store (LSTM_f(G(m latent vectors)), 0) in B_d
10:  e, label ← sample mini-batch of size m from B_d
11:  if |B_d| > N_b then
12:    delete the oldest row vectors from B_d
13:  end if
14:  Update D with gradient:
15:    ∇ (1/m) Σ_{i=1..m} [label_i · log D(e_i) + (1 − label_i) · log(1 − D(e_i))]
16:  z ← LSTM_f(G(m latent vectors))
17:  Update G with gradient:
18:    ∇ (1/m) Σ_{i=1..m} log(1 − D(z_i))
19: end for

Testing:
20: Input: X, the remaining half of the registered user's data, or data from non-registered users
21: Input: D, the trained discriminator; LSTM_f, the LSTM feature descriptor
22: Authentication_score = Pass if D(LSTM_f(X)) > threshold; Reject otherwise
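The following is a condensed Python rendering of the discriminator-side loop of Algorithm 1 together with the testing rule (lines 20-22); it is a sketch, not the authors' code. `D`, `G`, and `lstm_f` are assumed to be compiled tf.keras models with compatible shapes, `buffer` is the FIFO buffer sketched in Section 4.3, and the latent vectors follow Table 2 (dimension 100, mean 1, sd 1.25). The generator update (lines 16-18) would follow the alternating scheme sketched in Section 3.3.

```python
# Discriminator-side training loop of Algorithm 1 plus the testing rule.
# `D`, `G`, `lstm_f` are assumed compiled tf.keras models; `buffer` is the
# FIFO replay buffer from Section 4.3.
import numpy as np

def train_discriminator(X_real, D, G, lstm_f, buffer,
                        steps=2000, m=64, z_dim=100):
    for _ in range(steps):
        x = X_real + np.random.uniform(-0.01, 0.01, X_real.shape)   # line 7
        for v in lstm_f.predict(x):              # real data, label 1 (line 8)
            buffer.store(v, 1)
        z = np.random.normal(1.0, 1.25, (m, z_dim))                 # Table 2
        for v in lstm_f.predict(G.predict(z)):   # fake data, label 0 (line 9)
            buffer.store(v, 0)
        e, label = buffer.sample(m)              # random mini-batch (line 10)
        D.train_on_batch(np.array(e), np.array(label))      # lines 14-15

def authenticate(x, D, lstm_f, threshold=0.5):   # testing (lines 20-22)
    score = float(D.predict(lstm_f.predict(x)))
    return score > threshold
```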

where $x_{real}$ and $x_{fake}$ are data retrieved from the RB: $x_{real}$ refers to data with label 1, whereas $x_{fake}$ refers to data with label 0, as given in Eq. (10). As the goal of the generators is to produce data that the discriminator would classify as 'true', the objective function of the generator $G_{pr}$ for touch pressure is:









$$\underset{G_{pr}}{\text{minimize}}\; \mathbb{E}_z\left[\log\left(1 - D\left(LSTM_f\left(G_{pr}(z_{pr}) \oplus G_{tr}(z_{tr})\right); \theta_{pr}\right)\right)\right] \tag{12}$$

and the objective function of the generator $G_{tr}$ for touch trajectory is:

$$\underset{G_{tr}}{\text{minimize}}\; \mathbb{E}_z\left[\log\left(1 - D\left(LSTM_f\left(G_{pr}(z_{pr}) \oplus G_{tr}(z_{tr})\right); \theta_{tr}\right)\right)\right] \tag{13}$$

where $z_{pr}$ and $G_{pr}$ stand for the latent vector and the generator for touch pressure, and $z_{tr}$ and $G_{tr}$ for those of touch trajectory. Note that our objective function in Eq. (11) is based on the binary cross-entropy loss. Using the two outputs of the discriminator as training data, we train a simple logistic regression FC (fully connected) layer, shown in Fig. 2, to produce a single final score indicating whether the given user is authentic:

$$\underset{FC}{\text{maximize}}\; \mathbb{E}_x\left[\log FC\left(D(x_{real}; \theta_{pr}, \theta_{tr})\right) + \log\left(1 - FC\left(D(x_{fake}; \theta_{pr}, \theta_{tr})\right)\right)\right] \tag{14}$$

Eq. (14) shows the objective function for optimizing the FC layer. Here, the target is 1 when $X$ comes from the real user and 0 when $X$ comes from the generator. For the case of training with only one feature, i.e. either touch trajectory or touch pressure, the objective function is given as:

$$\min_{G} \max_{D}\; \mathbb{E}_{x,z}\left[\log D(LSTM_f(x)) + \log\left(1 - D(LSTM_f(G(z)))\right)\right] \tag{15}$$

where $x$ is either $X_{pr}$ or $X_{tr}$ retrieved from the RB. Since the network for the single-feature case has only one output node, we use it directly, without the FC layer, as the final classification output; note that only one generator is used in this case, as shown in Eq. (15). Through these optimization processes, the discriminator is trained to classify between 'true' data, labeled 1, and 'false' data, labeled 0. With a sigmoid function at the last layer of the discriminator network, the network output ranges from 0 to 1, depending on how similar the test data is to what the network has learned: if the test data is similar to the 'true' data, the output is close to 1, whereas it is close to 0 when the input is dissimilar to the trained data. In this study, because we assume that the impostor already knows the password pattern via the diverse attacks mentioned above, what our network has to do is separate the registered user from the imposters according to what it has learned from the given data. Table 2 shows the parameters used in training our network; the settings for Algorithm 1 are as follows: maximum buffer size $N_b$ = 1,000, mini-batch size $m$ = 64, training steps = 2,000. To visualize how the generator evolves to follow the discriminator, i.e. how the generated data catches up with the real data during training, a pre-processed Z-pattern is drawn in green by averaging 10 samples with their maximum and minimum values, and the corresponding generated pattern, drawn in red by averaging 10 samples, is superposed on the green real pattern as the number of trained samples increases through 2,520, 74,760, 143,430 and 209,790, as shown at the top of Fig. 7. The generated pattern in the left-most panel, where the red area is a straight line, starts to spread into the green pattern in the central two panels, and the red color almost covers the green area in the right-most panel. The discriminator's performance on the test set at each step, measured in AUC, also increases dramatically: 0.017, 0.3213, 0.6156 and 0.9387, respectively. A similar trend can be seen for the touch pressure case at the bottom of Fig. 7, although the AUC for the pressure case is lower than that for the trajectory case. See Section 5.2.1 and Fig. 9 for further analysis of how these two features contribute to system performance. In any case, this result suggests that the number of samples is critical for successful training of the network. When training goes well, the generated samples are very similar to the real samples, indicating that the network can be a high-performing anomaly detector against Android pattern attacks. Training is terminated using an early-stopping method: the system decides whether to terminate by monitoring the discriminator loss given by Eq. (11), and if the loss does not improve for a certain number of training epochs, training stops. In this study, this number of epochs is set to 20, with the maximum training step set to 2,000. Once training has been completed, a compact version of the network, consisting only of the LSTM feature descriptor, the discriminator, and the FC layer (see Figs. 2(b) and 4), is transferred to a mobile platform for real-time testing with mobile users, as shown in Fig. 13.
Here, the real data from the user goes into the discriminator via the LSTM feature descriptor, and the output value ranges from 0 to 1 depending on how similar it is to the trained data: it should approach 1 if the user is the authentic person, and be near 0 otherwise.
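On the device, the decision itself is just a threshold on the compact model's score; a minimal sketch follows, with the threshold value being an illustrative placeholder rather than the paper's tuned setting.

```python
# On-device decision sketch: the compact model (LSTM descriptor +
# discriminator + FC) emits a score in [0, 1]; a threshold accepts/rejects.
def is_authentic(touch_vector, compact_model, threshold=0.5):
    score = float(compact_model.predict(touch_vector[None, :]))
    return score > threshold          # close to 1 -> registered user
```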

5. Experiment

To evaluate whether our proposed scheme works well against potential attacks, we tested our system on the server and on the mobile platform.


Fig. 7. Visualization of the data generated (red) by our network for touch trajectory (top) and touch pressure (bottom), as the number of training samples increases, with RB size 1,000. The real data, in green, are drawn by averaging 10 Z-patterns of a specific user with their maximum and minimum values at each time step. In the top row, the generated data in red appears as a horizontal line when the number of samples is 2,520, but thickens as the samples increase; at 209,790 samples the red color almost covers the green. The bottom row shows a similar trend as the number of samples increases, suggesting that the number of samples is critical for generating abnormal samples that look like real samples. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

5.1. Evaluation metric

For testing, data from the authentic user is considered positive and that from others negative, which inevitably leads to an imbalanced dataset, as there are always more attackers than authentic users. To deal with this, we utilize several evaluation metrics: AUC, Precision, Recall, Equal Error Rate (EER), and F1 score. AUC represents the degree of separability, i.e. how capable the model is of distinguishing between classes; it ranges from 0 to 1, with higher values corresponding to better performance. Precision is the ratio of correctly predicted positive observations to all predicted positive observations; high precision corresponds to a low false positive rate. Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. In other words, Precision and Recall both concern correct prediction of the positive label. EER is the point where the False Acceptance Rate (FAR) intersects the False Rejection Rate (FRR); a lower EER means a more accurate system. The F1 score takes both Recall and Precision into account, and therefore provides a useful indicator, especially when the classes are unevenly distributed. A sketch of these computations is given below.
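These metrics can be computed with scikit-learn; the following sketch reads the EER off the ROC curve at the point where FAR (fpr) and FRR (1 - tpr) cross, and the 0.5 decision threshold is our own placeholder.

```python
# Reported metrics via scikit-learn; EER is approximated at the ROC point
# where FAR (fpr) and FRR (1 - tpr) cross. Threshold 0.5 is a placeholder.
import numpy as np
from sklearn.metrics import (roc_auc_score, roc_curve,
                             precision_score, recall_score, f1_score)

def evaluate(y_true, scores, threshold=0.5):
    fpr, tpr, _ = roc_curve(y_true, scores)
    eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]
    y_pred = (np.asarray(scores) > threshold).astype(int)
    return {
        'AUC': roc_auc_score(y_true, scores),
        'EER': eer,
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1': f1_score(y_true, y_pred),
    }
```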

Table 1. PSM scores for the 15 different patterns used in this study.

Pattern(1)  10.0     Pattern(2)  18.094   Pattern(3)  18.095
Pattern(4)  19.401   Pattern(5)  19.401   Pattern(6)  19.401
Pattern(7)  19.401   Pattern(8)  24.0     Pattern(9)  27.0
Pattern(10) 27.0     Pattern(11) 33.944   Pattern(12) 35.575
Pattern(13) 32.773   Pattern(14) 40.25    Pattern(15) 44.01

5.2. What was its performance on the server?

The experiments on the server side were designed to see how pattern complexity, touch features, and posture variations affect system performance. First, we describe how touch features such as trajectory and pressure affect the performance of the anomaly detector; secondly, the impact of diverse body postures on the authentication process; lastly, the comparison between our anomaly detector and other anomaly detection algorithms. The workstation used for development was an Nvidia Devbox equipped with an i7 18-core CPU and 4 Titan X GPUs running Ubuntu 16.04, as shown in Fig. 13 (left). The programming languages were Python and CUDA. We trained the LSTM feature descriptor and the GAN using the TensorFlow library.

5.2.1. What were the impacts of pattern complexity and touch features?

For our experiments, 15 Android patterns were chosen: the 10 most popular Android patterns from Loge and Dybevik (2015) and 5 complex patterns of our own design. The complexity of each pattern was determined by the Pattern Strength Meter (PSM) (Sun, Wang, & Zheng, 2014):

$$PS_p = S_p \cdot \log_2(L_p + I_p + O_p) \tag{16}$$

where $PS_p$ is the strength score of pattern $P$, and $S_p$, $L_p$, $I_p$ and $O_p$ are its size, physical length, number of intersections, and number of overlaps, respectively. The PSM values calculated by Eq. (16) for our 15 Android patterns are given in Table 1; a direct transcription of the formula is sketched below. These patterns are grouped into 5 categories by complexity: (1) Very weak, (2)-(6) Weak, (7)-(10) Medium, (11)-(13) Strong, (14)-(15) Very strong, as shown in Fig. 3. Using these 15 patterns, we investigated how the two touch features, together or individually, contribute to the performance of our system: (1) training with touch trajectory, (2) training with touch pressure, (3) training with touch trajectory and pressure together. A uni-modal model with one GAN was used for experiments (1) and (2), and a multi-modal model with two generators was employed for experiment (3).
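Eq. (16) transcribes directly into code; the following sketch assumes the size, physical length, intersection count and overlap count have already been measured from the drawn pattern.

```python
# Direct transcription of Eq. (16): PS_p = S_p * log2(L_p + I_p + O_p).
import math

def pattern_strength(size, length, intersections, overlaps):
    return size * math.log2(length + intersections + overlaps)
```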

Table 2. Detailed specification of our network.

Operation | Kernel | Stride | FeatureMaps/Units | BN | Activ.

MLP in Generator:
Dense-FC | -      | -      | 512   | N | ELU
Dense-FC | -      | -      | 1024  | Y | ELU
Dense-FC | -      | -      | 2048  | Y | ELU
Dense-FC | -      | -      | 1900  | N | Sigmoid

Discriminator:
1D-Conv  | 4      | 1      | 40    | N | ReLU
1D-Conv  | 4      | 1      | 2     | N | ReLU
1D-Conv  | 4      | 1      | 1     | N | ReLU
LSTM     | -      | -      | 42    | N | Sigmoid

LSTM Feature Descriptor:
1D-Conv  | 4      | 1      | 40    | Y | ReLU
1D-Conv  | 4      | 1      | 1     | N | Linear
LSTM     | -      | -      | 8     | N | Sigmoid

D Optimizer: SGD (α = 0.001, β = 0.9)
G Optimizer: Adam (α = 0.001, β = 0.9), decay = 0
Latent z dimension: 100 (μ = 1, σ = 1.25)
Weight, Bias init: Normal (μ = 0, σ = 0.02), Constant(0)
ELU α: 0.99

FC - Fully Connected layer; BN - Batch Normalization; ReLU - Rectified Linear Unit; ELU - Exponential Linear Unit

Fig. 8. Participation of subjects in the present study. In total, 41 subjects participated in our 3 experiments: 36 subjects in the server platform experiment, 10 subjects in the mobile platform experiment, and 5 subjects in the body posture experiment. Each subject was treated as the right user once, with all others as attackers at that time. The touch data of each subject is divided into training and testing. However, for the mobile experiments (Section 5.3), subjects 37-41 participated only as attackers, whereas subjects 32-36 had already participated in training our model. For the body posture experiment, the procedure was identical to that of the server case.

In addition, since the Android patterns used in these experiments had different pattern strengths in terms of PSM, as shown in Fig. 3 and Table 1, it was possible to determine how pattern complexity affects system performance. To conduct these empirical evaluations, we collected touch data from 36 men and women, ranging in age from their teens to their fifties, as shown in Fig. 8, using two Android phones: an LG G4 and a Samsung Galaxy S7. The sensors in these devices were used to detect and store both touch trajectory and touch pressure. Participants were asked to swipe the patterns while seated, as shown in Figs. 3 and 11(b). Each subject swiped 20 times for each pattern; therefore, each subject produced 15 (patterns) × 20 touch records in total. These data were saved on our workstation and used for the experiments. The data collection session for each subject typically took 40 to 70 min.

Of the 20 touch records for each pattern, half (n = 10) were used for training and the remaining half (n = 10) for testing. For training, all abnormal data were generated by the two generators, one for touch trajectory and the other for touch pressure, as shown in Fig. 2(a). For testing, the remaining 10 touch records from the authentic user were used as normal data, and every touch record from all other subjects was used as abnormal data, as illustrated in Fig. 2(b) and (c). Since we chose 15 Android patterns and collected corresponding touch data from 36 subjects using two mobile phones, 15 × 36 networks were trained individually on their corresponding training sets. For example, when a network was trained for pattern (1) from subject (1), the training set was the first half (n = 10) of that subject's 20 swipe records; the testing set then consisted of the other half (n = 10) of the swipe records from the same subject as positive samples, plus 350 (35 × 10) swipe records on pattern (1) from all other subjects as negative samples. In addition, we performed cross-validation on the dataset by swapping the first and second halves of the data. Given that we used only half (n = 10) of the records for each Android pattern for training, extensive data augmentation was necessary, since GAN training often requires a lot of data; indeed, our preliminary results suggested that the network needs about 210,000 touch data points to guarantee decent performance. Data augmentation was carried out by adding random noise to each 1D sequential datum. These data were stored in the RB, which constantly keeps 1,000 touch data in FIFO style; among these 1,000, 64 were randomly selected as a mini-batch to provide the actual input to the discriminator. After testing, we obtained the ROC curves shown in Fig. 12 (left), drawn from the output scores of all networks on their testing sets. The AUC for touch trajectory was 0.85 and that for touch pressure was 0.78, whereas that for both features in the multi-modal case was 0.95, clearly suggesting that the multi-modal case outperforms the uni-modal cases. In addition, Fig. 9 analyzes the relationship between pattern complexity and touch features in terms of performance. A noticeable fact is that the AUC for touch pressure increased as the patterns became more complex, whereas that for touch trajectory decreased.


Table 3. Performances for subjects.

Subject | AUC   | F1    | Precision | Recall || Subject | AUC   | F1    | Precision | Recall
1       | 0.945 | 0.87  | 0.952     | 0.81   || 19      | 0.966 | 0.899 | 0.942     | 0.86
2       | 0.939 | 0.878 | 0.983     | 0.793  || 20      | 0.946 | 0.864 | 0.959     | 0.787
3       | 0.988 | 0.877 | 0.926     | 0.833  || 21      | 0.9   | 0.777 | 0.918     | 0.673
4       | 0.987 | 0.951 | 0.989     | 0.907  || 22      | 0.964 | 0.921 | 0.98      | 0.86
5       | 0.975 | 0.926 | 0.985     | 0.873  || 23      | 0.929 | 0.858 | 0.944     | 0.787
6       | 0.955 | 0.852 | 0.903     | 0.807  || 24      | 0.962 | 0.875 | 0.939     | 0.82
7       | 0.963 | 0.936 | 0.98      | 0.88   || 25      | 0.924 | 0.9   | 0.969     | 0.84
8       | 0.898 | 0.837 | 0.942     | 0.753  || 26      | 0.918 | 0.885 | 0.99      | 0.793
9       | 0.964 | 0.966 | 0.991     | 0.933  || 27      | 0.963 | 0.888 | 0.969     | 0.82
10      | 0.93  | 0.853 | 0.951     | 0.773  || 28      | 0.969 | 0.91  | 0.949     | 0.873
11      | 0.924 | 0.895 | 0.984     | 0.82   || 29      | 0.887 | 0.816 | 0.91      | 0.74
12      | 0.946 | 0.889 | 0.961     | 0.827  || 30      | 0.936 | 0.875 | 0.975     | 0.793
13      | 0.975 | 0.966 | 0.986     | 0.933  || 31      | 0.964 | 0.955 | 0.984     | 0.913
14      | 0.938 | 0.877 | 0.96      | 0.807  || 32      | 0.868 | 0.742 | 0.939     | 0.613
15      | 0.972 | 0.926 | 0.985     | 0.873  || 33      | 0.914 | 0.814 | 0.98      | 0.687
16      | 0.915 | 0.848 | 0.929     | 0.78   || 34      | 0.958 | 0.87  | 0.894     | 0.847
17      | 0.969 | 0.911 | 0.977     | 0.853  || 35      | 0.911 | 0.842 | 0.99      | 0.727
18      | 0.996 | 0.955 | 0.986     | 0.927  || 36      | 0.957 | 0.905 | 0.962     | 0.853

Consequently, the contribution of touch trajectory was at its maximum when the pattern complexity was at the medium level, while the contribution of touch pressure increased monotonically up to the 'very strong' level. Despite the fact that the contributions of the two touch features varied, the overall performance of the multi-modal system improved as the complexity of the Android patterns increased. To better understand why the multi-modal case performs better than the uni-modal cases, the prediction scores of the networks are analyzed in Fig. 10, where red bars indicate abnormal samples and green bars normal samples. The separation between the two color groups is much clearer for the multi-modal case than for the uni-modal cases, confirming that the multi-modal case outperforms the uni-modal cases. For the multi-modal case with RB size 1,000, performance for each subject in terms of AUC, F1, Precision, and Recall is shown in Table 3. Here, we first separate the networks into 36 groups based on their corresponding authentic users; each user therefore has 15 networks, one per pattern, with their corresponding performances.

Fig. 9. Performance of average AUCs by varying complexity of 15 Android patterns used for the present study. There were 5 different categories depending on their PSM values: very weak, weak, medium, strong, and very strong. Testing was carried out for 3 different features cases: touch trajectory (blue), touch pressure (green) and the combined features (red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

12

S.-Y. Shin, Y.-W. Kang and Y.-G. Kim / Expert Systems With Applications 141 (2020) 112964

Fig. 10. Performance comparison between uni-modal and multi-modal cases in terms of the networks' prediction scores. The reddish area in each graph indicates abnormal samples and their predicted scores, whereas the greenish area shows normal samples and their predicted scores. The y-axis indicates predicted frequency, normalized between 0 and 10. The x-axis denotes the score of the samples: a low score means abnormal and a high score normal. The left panel shows the pressure case, the center panel the trajectory case, and the right panel the multi-modal case combining the two features. Note that the separation between normal and abnormal is clearer in the trajectory case than in the pressure case, and clearest in the multi-modal case.

Fig. 11. Illustration of a subject participating in the experiments in 4 postures: (a) standing up, (b) sitting, (c) lying down, (d) lying prone. Subjects were recruited via advertisement in the university with proper payment and were asked to swipe patterns using the finger they primarily used for unlocking. All procedures for this experiment were approved by the Institutional Review Board (IRB) of the university. Our Android application (.apk) for gathering touch data can be found at: https://github.com/yunshin/DataGatheringApp.

Fig. 12. ROC curves of our proposed system for the two uni-modal cases, touch trajectory (red) and touch pressure (yellow), and the multi-modal case (black) combining the two features on the server, together with its mobile platform version (blue) (left); comparison between ours (black) and conventional anomaly detection algorithms: one-class SVM (green), Elliptic Envelope (blue), and Isolation Forest (sky blue) (right). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)



Fig. 13. Development platform within a server workstation and deployment/test platform within a mobile phone. In the former, the GAN network and replay buffer are implemented with TensorFlow and Python, running Ubuntu on a GPU workstation. In the latter, a compact version of the network runs on Android with TensorFlow and Java within a mobile phone. The dataset collected from subjects is divided into training/validation and testing sets.

Table 4
Impact of RB size on training time, memory usage, and accuracy (AUC). AUC was maximum when the size of the replay buffer was 1000. AUCs for the combined-feature (T+P) case were higher than those for the single-feature cases, touch trajectory (T) or touch pressure (P).

Performance / RB size    0      200     400     600     800     1000    1200
Training Time (min)      134    57      40      32      27      20      22
Memory Usage (MiB)       440    1015    1413    2377    3351    4320    5274
AUC (T)                  0.72   0.78    0.80    0.81    0.83    0.85    0.84
AUC (P)                  0.69   0.73    0.74    0.76    0.77    0.78    0.78
AUC (T+P)                0.91   0.93    0.93    0.94    0.94    0.95    0.94

5.2.2. How does replay buffer affect the performance?
An experiment was designed to test the impact of the size of RB on performance, in terms of training time and memory usage during the training. For this experiment, we used the same dataset from 36 subjects as described in Section 5.2.1. Table 4 shows the performance obtained by varying the size of RB. AUC increased with the size of RB and was maximum when the size of RB was 1000. Note, however, that AUC declined when the size of RB was 1200, suggesting that storing too much data in RB during training is not a good strategy. A similar result was reported by Zhang and Sutton (2017) in reinforcement learning, where the performance of their system deteriorated as the size of RB was increased greatly. It was also found that the training time became shorter as the size of RB increased. The likely reason is that the early-stopping method was installed during the training, and a large RB in general led to faster convergence. On the other hand, the memory usage during the training inevitably increased as a larger RB stored more data. Fig. 6 shows that the outputs generated with a large RB are smoother than those generated with a small RB and look like the touch data acquired from human subjects, supporting that RB improves the GAN training as well as its performance.
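To make the mechanism concrete, the following is a minimal Python sketch of the kind of fixed-size replay buffer used here, in the spirit of DQN; the class and method names are illustrative and not taken from our implementation.

import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of generated samples; the oldest entries are evicted."""
    def __init__(self, capacity=1000):   # 1000 gave the best AUC in Table 4
        self.buffer = deque(maxlen=capacity)

    def push(self, sample):
        self.buffer.append(sample)       # store a generated (fake) sample

    def sample(self, batch_size):
        # Draw a random mini-batch of past generated samples for the discriminator;
        # the deque is copied to a list because random.sample requires a sequence.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

Because the deque evicts the oldest entries automatically, the discriminator keeps seeing a mixture of recent and slightly older generated samples, which is what stabilizes the adversarial training.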

5.2.3. How does body posture affect the performance?
To investigate the impact of body posture variation on performance, we chose 4 postures typically taken by mobile users, as shown in Fig. 11. As illustrated in Fig. 8, 5 subjects participated in this experiment, and we chose 5 Android patterns, i.e. (2), (4), (7), (9) and (11) from Fig. 3, by varying the pattern complexity. We trained each network using the data collected for one posture of one subject and tested it against the data collected for the other 3 postures. To collect the training touch data, the 5 subjects were asked to swipe 10 times for each pattern as well as for each posture. We trained 100 (5 subjects × 5 patterns × 4 postures) different networks with RB size 1000. Of course, each network required only 10 touch samples for its training, as in the experiment in Section 5.2.1. For testing, a subject was asked to make 3 trials for each posture; a subject therefore made 60 (3 trials × 4 postures × 5 patterns) trials in total. These trials were used to see whether the networks trained with his own data produced values close to 1 (granted), whereas the data from the 4 remaining subjects were used to see whether the networks gave values close to 0 (rejected). For example, to evaluate a model trained on pattern (2) with a sitting posture against a different posture, i.e. lying down, we arranged the testing set to contain only the data collected in the lying down posture on pattern (2). That is, 3 trials made by the authentic user while lying down became positive samples, and 12 samples (3 trials × 4 subjects) from the others while they were lying down became negative samples. The overall performance in terms of AUC was calculated by considering all scores of these networks for each posture and their corresponding labels. A posture confusion matrix drawn from the experimental results is shown in Table 5. Undoubtedly, AUCs for the same-posture cases were always higher than AUCs for the different-posture cases. Among the same-posture cases, the AUCs were ranked in the order of standing up (0.98), lying prone (0.97), lying down (0.96), and sitting (0.95), indicating which postures provide a more solid body form than the others while swiping an Android pattern on a mobile device.
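The per-cell computation can be sketched as follows in Python, assuming the trial scores have been collected; the numeric scores below are dummy values for illustration only.

import numpy as np
from sklearn.metrics import roc_auc_score

# One cell of Table 5: a network trained on one posture, tested on another.
# 3 positive trials by the authentic user, 12 negative trials (3 x 4 others).
labels = np.array([1] * 3 + [0] * 12)
scores = np.concatenate([[0.91, 0.88, 0.94],                 # dummy authentic scores
                         np.random.uniform(0.0, 0.4, 12)])   # dummy imposter scores
print('AUC:', roc_auc_score(labels, scores))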



Table 5
Confusion matrix for the posture variation experiment. We trained 4 posture models with RB size 1000 (rows) and tested each against all 4 postures (columns), reporting AUCs. When a network was trained and tested with the same posture, the AUC was high, indicating less confusion, as shown along the diagonal, whereas when the trained posture differed from the tested posture, the AUC was lower, indicating more confusion.

Training \ Testing    Standing up    Sitting    Lying down    Lying prone
Standing up           0.98           0.94       0.92          0.93
Sitting               0.95           0.95       0.93          0.94
Lying down            0.91           0.89       0.96          0.91
Lying prone           0.95           0.91       0.93          0.97

Overall, the results suggest that body posture variation has a certain impact on performance, yet the deterioration caused by posture variation was not significant, since the average of the different-posture cases was 0.925 in terms of AUC.

5.2.4. Does it outperform the previous anomaly detection algorithms?
Given that the proposed method is designed as an anomaly detector that takes the data of the designated user as normal and regards all other data as anomalous, it is necessary to compare its performance with those of conventional anomaly detection algorithms, such as one-class SVM (Schölkopf et al., 2001), Elliptic Envelope (Rousseeuw & Driessen, 1999), and Isolation Forest (Liu et al., 2008), under the same conditions. First, the Support Vector Machine (SVM) is well known as a supervised learning model since it deals with a set of training examples labeled as belonging to several different classes. In one-class SVM, the support vector model is trained on data that has only one class, the normal class. This is useful for anomaly detection precisely because the scarcity of training examples is what defines anomalies, as in network intrusion or fraud detection. Secondly, Isolation Forest is based on the fact that anomalies are data points that are few and different, which makes them susceptible to a mechanism called isolation. It therefore introduces isolation as a more effective and efficient means to detect anomalies than the commonly used distance and density measures. In addition, this algorithm has low, linear time complexity and a small memory requirement: it builds a well-performing model with a small number of trees using small sub-samples of fixed size, regardless of the size of the dataset. Thirdly, Elliptic Envelope models the data as a high-dimensional Gaussian distribution with possible covariance between feature dimensions. The covariance matrix with the smallest determinant over all sub-samples forms an ellipse that embraces a fraction of the original data; data within the ellipse surface are labeled as normal and data outside of it are labeled as anomalous. Although there are many applications utilizing more recent techniques, such as GANs, for anomaly detection, most of them focus on images and cannot be applied to our 1-D dataset. Therefore, these 3 algorithms, which are still widely used for anomaly detection on 1-D data, were chosen as baselines for comparison against our proposed scheme. Utilizing the training and testing data from the 36 subjects described in Section 5.2.1, we conducted a similar experiment using these anomaly detection algorithms. To ensure that the training sets for the baseline algorithms were equal to those of our networks, we saved the outputs of the present LSTM feature descriptor, denoted as real_user_data in Eq. (9), for the discriminator during the training. Only the data labeled as 'real' were saved and used for training the existing algorithms, because these algorithms accept only real (normal) data for training: negatively labeled real-world data should not be given under the anomaly detection paradigm. Note that our networks are also given only real data for training, although they generate fake data from the real data by means of the generators.
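For reference, a minimal Python sketch of how these baselines can be fitted on normal-only features with Scikit-Learn is given below; the feature matrix is synthetic, its shape is illustrative, and the hyper-parameter values follow the best configurations reported later in this section.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X_normal = rng.randn(200, 8)   # stand-in for saved LSTM features of genuine swipes
X_test = rng.randn(20, 8)      # stand-in for unseen trials (genuine and imposter)

detectors = {
    'OCSVM': OneClassSVM(kernel='rbf', tol=0.001, nu=0.47),
    'IForest': IsolationForest(n_estimators=34, random_state=0),
    'EEnvelope': EllipticEnvelope(contamination=0.2, random_state=0),
}
for name, det in detectors.items():
    det.fit(X_normal)  # trained on normal (genuine) data only
    # decision_function: higher values mean more normal, lower mean anomalous
    print(name, det.decision_function(X_test)[:3])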

Table 6
Performance comparison between the conventional anomaly detection algorithms, OCSVM (one-class SVM), IForest (Isolation Forest), and EEnvelope (Elliptic Envelope), and ours, in terms of average training time, model size, execution time, AUC, and EER. Each model was trained with trajectory (T), pressure (P), and trajectory + pressure (T+P) data, respectively. In addition, the effect sizes in terms of Cohen's d and its denominator SD are shown for the comparison between these algorithms and ours. Here, the effect sizes were measured by setting the conventional algorithms' results as the first group and our networks' results as the second group.

                        OCSVM        IForest      EEnvelope    Our Net.     d/SD
Training Time (min)     17           0.4          5            20           -1.7/7.2
Model Size (MiB)        9.29         1.2          5.2          7.3          -0.7/2.9
Execution Time (ms)     20           84           3            112          -2.4/31.3
AUC/EER (T)             0.57/0.45    0.51/0.49    0.57/0.44    0.85/0.24    -7.2/0.11
AUC/EER (P)             0.54/0.48    0.53/0.48    0.51/0.49    0.78/0.32    -6.3/0.09
AUC/EER (T+P)           0.61/0.44    0.55/0.47    0.62/0.43    0.95/0.10    -7.9/0.1

For every subject and pattern, the above-mentioned training set was built and used for each existing algorithm, then evaluated with the corresponding testing set, in the same manner as for our networks described in Section 5.2.1. The Scikit-Learn (Pedregosa et al., 2011) package was used for training these conventional algorithms. To deploy the conventional algorithms' models on mobile phones, we transferred the saved models in .pkl format and used Termux (Fornwall, Grimler, & Plyushch, 2016) to run them on Android with Python. It is also known that parameter tuning of such algorithms is important, as it can make a huge difference in their performance. We tried to find the best hyper-parameters for the 3 algorithms by adopting irace (López-Ibáñez, Dubois-Lacoste, Pérez Cáceres, Stützle, & Birattari, 2016), with each training set remaining the same as above. The results suggest that Elliptic Envelope showed the worst performance (AUC 0.46) with the contamination set to 0.33 and the best (AUC 0.62) with the contamination set to 0.2. Similarly, one-class SVM with an RBF kernel showed the best performance (AUC 0.61) with the tolerance and nu set to 0.001 and 0.47, respectively, whereas the worst performance (AUC 0.51) was found when they were set to 0.2 and 0.17, respectively. For the Isolation Forest case, 34 estimators (trees) with fraction 0.99 gave the best performance (AUC 0.55), and the worst (AUC 0.53) was found with 52 estimators and fraction 0.64. Table 6 and Fig. 12 show the comparison between the conventional algorithms and ours in detail. The effect sizes in Cohen's d show that our network is somewhat slower and heavier, i.e. -1.7 and -2.4 compared to the conventional ones during training and testing. However, the larger effect size in AUC, i.e. -7.9, indicates that its accuracy is far better. Given that the computing power of mobile devices is increasing fast, it is expected that deep neural network based applications like ours will be widely usable in the near future.

5.3. What is the performance in the mobile environment?
Although it has been shown that our network serves as an anomaly detector very well on the server, it is essential to see whether it shows similar performance on mobile platforms. In addition, it is necessary to confirm whether it is capable of handling completely unseen data. Another possible concern is that the network might be overfitted to the given data, since we use only 10 swipe records for its training. For the mobile experiment, 10 subjects were recruited as shown in Fig. 8: half of them had already participated in the server experiment, and the remaining half were newly recruited as attackers only. To see if the data collection interval had any effect on the performance, the mobile experiment was conducted 2 months after the initial data collection session.



Fig. 14. Illustration of memory consumption during training on the server (left) and testing on the mobile device (right). During training with RB size 1000, the consumption of GPU memory increased from 500 to 4300 MiB as the training data accumulated in the replay buffer. Depending on the given network, the training time varied from a minimum of 15 to a maximum of 24 min, as the early-stopping method was utilized (see Section 4.4). On the mobile device, the memory consumption increased as the user started to swipe a pattern. Memory allocation reached its peak right after the deployed model received the input pattern data, indicated as a red dot (Trial). After the deployed model produced the score, the memory allocation decreased. Here, the user made two trials. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

TensorFlow 1.2.1 (Abadi et al., 2015) was used for training our network on the development platform, and the trained model was then transferred to our deployment platform, i.e. a mobile phone, as shown in Fig. 13. The compact version of the trained model was saved as a .pb file, and this model was compiled into an Android pattern lock application written in Java using Android Studio 2.3 and NDK 12b. The application saved the touch trajectory and touch pressure while a user swiped an Android pattern on the touch screen; it was set to record the x, y coordinates and pressure every 10 milliseconds. Both an LG G4 and a Samsung Galaxy S7 were used for this experiment. The LG G4 ran Android 5.1 Lollipop with a 1.8 GHz hexa-core processor and 3 GB RAM. The Samsung Galaxy S7 ran Android 7.0 Nougat with two octa-core CPUs, a Samsung Exynos M1 at 2.3 GHz and an ARM Cortex-A53 at 1.6 GHz, and 4 GB RAM. Among the 10 subjects, since 5 had attended the server experiment (Section 5.2.1) and their Android patterns had been used for the training, anomaly detectors for their enrolled patterns were ready. The 5 newly recruited subjects played the role of attackers. The experiment proceeded twice using the 2 different Android phones. During each session, the 5 authentic subjects made 10 trials per Android pattern to see if their networks produced scores close to 1, whereas the 5 attackers made 10 trials per pattern against the trained models to see if the networks produced scores close to 0. Each authentic user made 15 (patterns) × 10 trials, and the attackers made 5 (authentic users) × 15 (patterns) × 10 trials in total. The performance of our mobile system was measured by regarding the trials of the authentic users as positive samples and those of the attackers as negative samples, respectively. Note that all subjects made their trials while seated. The performance measured in the mobile experiment was 0.13 (EER) and 0.93 (AUC), respectively. The AUC was slightly lower than that of the server experiment, as shown in Fig. 12 (left) and Table 6. The average execution times measured during the experiment on the 2 mobile phones were 179 ms for the LG G4 and 44 ms for the Samsung Galaxy S7, respectively. The execution time was measured as the gap between the moment a user detached his finger from the touch screen and the moment the network produced a value. We also measured the memory usage for the server and the mobile environment, as shown in Fig. 14.
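As a hedged sketch of the .pb export step described above, the following Python snippet freezes a trained TensorFlow 1.x graph into a single file; the output node name and file path are illustrative assumptions, not the actual names used in our application.

import tensorflow as tf
from tensorflow.python.framework import graph_util

def export_frozen_graph(sess, output_node='score', path='android_gan.pb'):
    # Bake the trained variables into constants so the mobile side needs
    # neither the replay buffer nor the generator, only the inference path.
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), [output_node])
    with tf.gfile.GFile(path, 'wb') as f:
        f.write(frozen.SerializeToString())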

It is known that one of the critical computing resources in the mobile environment is memory. Fig. 14 (right) shows how the memory usage changes from 70 to 118 MiB whenever a user swipes an Android pattern on the touch screen. Fig. 15 demonstrates how our Android-GAN application runs on a mobile phone, defending against diverse Android pattern attacks using the proposed anomaly detection method.

6. Discussion
Despite the fact that the security level of the Android pattern lock is relatively low, many Android phone users prefer it, mainly because of its high usability. Extensive research has been done on how to improve its security level by designing better patterns that are robust against diverse attacks. In such cases, however, the user has to sacrifice usability to some degree. In this study, we assume that an imposter already knows the pattern of the user through one of the aforementioned attacks. Once a pattern lock has been attacked, what remains is to discriminate the genuine swipes among many fake ones, which is not a trivial job. We have noticed that the discriminator of a GAN always plays a similar role, i.e. segregating the genuine data from many fake ones. In a preliminary study, we found that even the standard GAN cannot easily discriminate between two kinds of touch trajectories when they are presented to it as 2D images. What is required is a well-designed pre-processing step by which the touch trajectory and touch pressure are transformed into 1D signals; a sketch of this step is given below. In other words, the input data is transformed from a spatial signal into a temporal signal. To deal with a temporal signal, it is necessary to introduce a deep neural network that can process temporal data. That is why we bring LSTM into our network: the first LSTM is the feature descriptor that processes the pre-processed pattern data before the discriminator; the second post-processes the data following a CNN within our discriminator. Because our network accepts a 2D touch trajectory and pressure as input, it looks like a typical GAN; however, as it actually processes 1D temporal data internally, the LSTMs are the components needed to handle that data well. We have also shown that the performance of our GAN network is much better than those of the three conventional anomaly detection algorithms.
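As a hedged illustration of the 2D-to-1D pre-processing referenced above, the Python sketch below resamples a raw swipe, recorded as x, y, and pressure every 10 ms, into fixed-length 1D signals; the sequence length of 100 is an assumption for illustration, not the value used in our network.

import numpy as np

def to_1d_signals(xs, ys, ps, length=100):
    # Resample x(t), y(t) and pressure(t) onto a common, fixed time grid
    # so that every swipe becomes a (3, length) array of 1D temporal signals.
    t_old = np.linspace(0.0, 1.0, len(xs))
    t_new = np.linspace(0.0, 1.0, length)
    return np.stack([np.interp(t_new, t_old, np.asarray(s, dtype=float))
                     for s in (xs, ys, ps)])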



Fig. 15. Demonstration of how our mobile application works against potential attacks (a). When a registered user swipes an S-pattern on the phone, the access is granted (b), whereas the access is denied when an imposter tries to attack the system (c). Our demo video can be found at https://youtu.be/EQWrvUn-O6A.

A potential concern with such a transformation is that some information may be lost while converting the touch data from 2D to 1D. As briefly mentioned before, the primary reason for this transformation is that distinguishing between the genuine user and diverse imposters is much easier with 1D signals. In addition, given that data augmentation is essential for GAN training, dealing with a 1D signal is very handy. Nevertheless, it would be interesting to investigate, in the future, what kind of information loss may occur during such a transformation. The replay buffer adopted in our network was inspired by recent reinforcement learning work such as DQN and the actor-critic composition. In particular, the latter has been thought of as being analogous to the generator-discriminator architecture of a GAN, and indeed an attempt was made to see whether a replay buffer plays any significant role in solving the stability problem of GANs for the image generation task (Pfau & Vinyals, 2016). Though our replay buffer is not full-fledged, its data retrieval and memory management method is identical to that of RL. What we found is that the replay buffer working in our GAN is crucial for the successful generation of data and for stability, as shown in Fig. 6 and Table 4. Another issue is how one can actually observe what is happening while training the network for a certain pattern. If the generated data within the generator pathway can be visualized during the network training, it is much more helpful for an observer. Superposing the generated data upon the real data along the time domain is useful, since the overlapping area between the two tells exactly how well the network has been trained; a minimal sketch of this visualization is given at the end of this section. Given that the proposed authentication system ultimately aims to serve users on a mobile platform, it has to be light and run without consuming many resources. A major advantage of our system is that the replay buffer and the generator are not necessary when the network runs on a mobile device. As a result, a Java application that implements the authentication GAN runs well even on a smartphone, which typically has limited memory, as shown in Fig. 14 (right). Compared to the baseline methods, such as one-class SVM, Isolation Forest, and Elliptic Envelope, it is clear that our network has an advantage in accuracy, yet requires more execution time and memory, as shown in Table 6. The execution time could be further reduced by adopting recent network optimization techniques, and similarly, the longer training time could be reduced by installing more GPUs in the server. In any case, given that accuracy is one of the primary factors in deciding whether a new biometric system can be introduced to the market, as well as in evaluating its academic merit (Tape, 2015), our system has a definite advantage, because such decisions tend to weigh system accuracy over hardware issues such as training time and memory usage.
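The monitoring idea referenced above can be sketched in Python as follows, assuming a real and a generated 1D signal of equal length; the function name is illustrative.

import numpy as np
import matplotlib.pyplot as plt

def overlay_real_and_generated(real, fake):
    # Superpose a generated 1D signal on the real one along the time axis;
    # the more the two curves overlap, the better the generator is trained.
    t = np.arange(len(real))
    plt.plot(t, real, color='green', label='real')
    plt.plot(t, fake, color='red', alpha=0.7, label='generated')
    plt.xlabel('time step')
    plt.ylabel('signal value')
    plt.legend()
    plt.show()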

Another issue is the time interval between data collection points. The initial touch data collection session was completed within 2 weeks, and the additional data collection for the mobile experiment was carried out 2 months later. There was no sign of a distinguishable difference between them. A further potential issue is overfitting, from two different perspectives. The first is that our network depends heavily on a limited number of samples, namely 10 touch records. If we used more than 10 touch records for training, the network would certainly perform better and more stably, but the usability for the user would have to be sacrificed to some degree. Since our results so far outperform the conventional anomaly detection algorithms, we believe that such overfitting is not yet occurring. The second perspective is overfitting to a specific finger. If a user has registered using an index finger, can he use his thumb for authentication? For this reason, our subjects were asked to use only one primary finger during all data collection sessions. If a user injures his registered finger, for instance, he is required to register again with another finger.

7. Conclusion
A GAN-based generative model is proposed to defend against potential Android pattern attacks. Though it is based upon the GAN architecture, two components are added: the first is a replay buffer from deep reinforcement learning to solve the stability problem; the second is an LSTM network to deal with temporal signals. Since GAN is a special kind of unsupervised network, its training method differs from other machine learning techniques. Because the network is trained as an anomaly detector using a user's specific pattern data, extensive data augmentation is necessary, and the generative nature of GANs has exactly this characteristic. Another important ingredient of our network during training is the replay buffer; as far as we are aware, this is the first case in which a GAN works with a replay buffer for its stability. We expand a uni-modal network into a multi-modal one where the touch trajectory feature is combined with the touch pressure feature, yielding a powerful anomaly detector that works against diverse Android pattern attacks and reaches 0.95 AUC in accuracy. A subsequent mobile experiment confirms that our network indeed works against different Android pattern attacks with 0.93 AUC. In addition, given the expectation that posture variation while entering a pattern on a mobile device has a certain impact, our results are interesting and valuable since they provide some clues to the following questions: which posture is more stable than the others, and how much does posture variation degrade system performance.


This is a straightforward end-to-end study, from the conceptual design of a new authentication system to an empirical evaluation on the mobile platform. The present study demonstrates that our network is very useful for protecting the Android pattern lock system. We expect that such networks can be applied to similar authentication systems as well as to biometric authentication.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement
Sang-Yun Shin: Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. Yong-Won Kang: Software, Formal analysis, Investigation, Resources, Data curation, Writing - original draft. Yong-Guk Kim: Conceptualization, Project administration, Supervision, Writing - original draft, Writing - review & editing, Funding acquisition.

Acknowledgment
This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2016-0-0498: User behavior-based authentication and anomaly detection using deep learning techniques).

References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from https://www.tensorflow.org/.
Andriotis, P., Oikonomou, G., Mylonas, A., & Tryfonas, T. (2016). A study on usability and security features of the Android pattern lock screen. Information & Computer Security, 24(1), 53–72.
Angulo, J., & Wästlund, E. (2011). Exploring touch-screen biometrics for user identification on smart phones. In IFIP PrimeLife international summer school on privacy and identity management for life (pp. 130–143). Springer.
Antal, M., Szabó, L. Z., & László, I. (2015). Keystroke dynamics on Android platform. Procedia Technology, 19, 820–826.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th international conference on machine learning, PMLR 70 (pp. 214–223). arXiv:1701.07875.
Aviv, A. J., Davin, J. T., Wolf, F., & Kuber, R. (2017). Towards baselines for shoulder surfing on mobile authentication. In Proceedings of the 33rd annual computer security applications conference (ACSAC 2017) (pp. 486–498). New York, NY, USA: ACM. doi:10.1145/3134600.3134609.
Aviv, A. J., Gibson, K., Mossop, E., Blaze, M., & Smith, J. M. (2010). Smudge attacks on smartphone touch screens. In Proceedings of the 4th USENIX conference on offensive technologies (WOOT'10) (pp. 1–7). Berkeley, CA, USA: USENIX Association.
Bicego, M., Lagorio, A., Grosso, E., & Tistarelli, M. (2006). On the use of SIFT features for face authentication. In Computer vision and pattern recognition workshop (p. 35).
Cho, G., Huh, J., Cho, J., Oh, S., Song, Y., & Kim, H. (2017). SysPal: System-guided pattern locks for Android. In IEEE symposium on security and privacy (SP) (pp. 338–356).
Crouse, D., Han, H., Chandra, D., Barbello, B., & Jain, A. K. (2015). Continuous authentication of mobile user: Fusion of face image and inertial measurement unit data. In 2015 international conference on biometrics (ICB) (pp. 135–142). doi:10.1109/ICB.2015.7139043.
Fornwall, F., Grimler, H., & Plyushch, L. (2016). Termux: Android terminal and Linux environment.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Bengio, Y., et al. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).
Graves, A., Mohamed, A.-R., & Hinton, G. E. (2013). Speech recognition with deep recurrent neural networks. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6645–6649).
Harbach, M., De Luca, A., & Egelman, S. (2016). The anatomy of smartphone unlocking: A field study of Android lock screens. In Proceedings of the 34th annual ACM conference on human factors in computing systems.
Heck, L. (2003). Voice authentication system having cognitive recall mechanism for password verification.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.


Jones, E., Oliphant, T., & Peterson, P. (2001). SciPy: Open source scientific tools for Python. http://www.scipy.org/ [Online; accessed 24 March 2019].
Khan, M. K., Zhang, J., & Wang, X. (2008). Chaotic hash-based fingerprint biometric remote user authentication scheme on mobile devices. Chaos, Solitons & Fractals, 35(3), 519–524. doi:10.1016/j.chaos.2006.05.061.
Kim, D., Chung, K., & Hong, K. (2010). Person authentication using face, teeth and voice modalities for mobile device security. IEEE Transactions on Consumer Electronics, 56(4), 2678–2685. doi:10.1109/TCE.2010.5681156.
Kolly, S. M., Wattenhofer, R., & Welten, S. (2012). A personal touch: Recognizing users based on touch screen behavior. In Proceedings of the third international workshop on sensing applications on mobile phones (PhoneSense '12) (pp. 1:1–1:5). New York, NY, USA: ACM. doi:10.1145/2389148.2389149.
Kwon, T., & Na, S. (2014). TinyLock: Affordable defense against smudge attacks on smartphone pattern lock systems. Computers & Security, 137–150.
LeCun, Y., et al. (1995). Convolutional networks for images, speech, and time series. In The handbook of brain theory and neural networks: 3361 (p. 10).
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In IEEE international conference on data mining (pp. 413–422).
Løge, M. D. (2015). Tell me who you are and I will tell you your unlock pattern. Master's thesis.
López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Stützle, T., & Birattari, M. (2016). The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3, 43–58. doi:10.1016/j.orp.2016.09.002.
Mehrotra, K. G., Mohan, C. K., & Huang, H. (2017). Anomaly detection principles and algorithms (1st ed.). Springer Publishing Company, Incorporated.
Meng, Y., Wong, D. S., Schlegel, R., et al. (2012). Touch gestures based biometric authentication scheme for touchscreen mobile phones. In International conference on information security and cryptology (pp. 331–350). Springer.
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pfau, D., & Vinyals, O. (2016). Connecting generative adversarial networks and actor-critic methods. In NIPS workshop on adversarial training. arXiv:1610.01945.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th international conference on learning representations (ICLR). arXiv:1511.06434.
Ravanbakhsh, M., Sangineto, E., Nabi, M., & Sebe, N. (2019). Training adversarial discriminators for cross-channel abnormal event detection in crowds. In 2019 IEEE winter conference on applications of computer vision (WACV) (pp. 1896–1904). IEEE.
Rousseeuw, P. J., & Driessen, K. V. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In Advances in neural information processing systems (pp. 2234–2242).
Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., & Langs, G. (2017). Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In M. Niethammer et al. (Eds.), IPMI, Lecture Notes in Computer Science 10265 (pp. 146–157). Springer.
Schölkopf, B., et al. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13.
Sun, C., Wang, Y., & Zheng, J. (2014). Dissecting pattern unlock: The effect of pattern strength meter on pattern selection. Journal of Information Security and Applications, 19(4), 308–320. doi:10.1016/j.jisa.2014.10.009.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
Tape, T. G. (2015). The area under an ROC curve. University of Nebraska Medical Center. http://gim.unmc.edu/dxtests/roc3.htm.
Trewin, S., Swart, C., Koved, L., Martino, J., Singh, K., & Ben-David, S. (2012). Biometric authentication on a mobile device: A study of user effort, error and task disruption. In Proceedings of the 28th annual computer security applications conference (ACSAC '12) (pp. 159–168). New York, NY, USA: ACM. doi:10.1145/2420950.2420976.
Tupsamudre, H., Vaddepalli, S., Banahatti, V., & Lodha, S. (2018). TinPal: An enhanced interface for pattern locks. In Workshop on usable security (USEC '18).
Wikipedia (2019). Softmax function. https://en.wikipedia.org/w/index.php?title=Softmax_function&oldid=889150318 [Online; accessed 24 March 2019].
Xi, K., Ahmad, T., Fengling, H., & Hu, J. (2011). A fingerprint based bio-cryptographic security protocol designed for client/server authentication in mobile computing environment. Security and Communication Networks, 4, 487–499. doi:10.1002/sec.225.
Xu, H., Zhou, Y., & Lyu, M. R. (2014). Towards continuous and passive authentication via touch biometrics: An experimental study on smartphones. In Proceedings of the tenth symposium on usable privacy and security.
Yang, S., & Verbauwhede, I. (2007). Secure iris verification. In IEEE international conference on acoustics, speech and signal processing (ICASSP).
Ye, G., Tang, Z., Chen, X., Kim, K., Taylor, B., & Wang, Z. (2017). Cracking Android pattern lock in five attempts. In The network and distributed system security symposium (NDSS).
Zhang, S., & Sutton, R. S. (2017). A deeper look at experience replay. arXiv:1712.01275.
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223–2232).