Accepted Manuscript

Title: Score normalization applied to adaptive biometric systems
Authors: Paulo Henrique Pisani, Norman Poh, André C. P. L. F. de Carvalho, Ana Carolina Lorena
PII: S0167-4048(17)30155-4
DOI: http://dx.doi.org/10.1016/j.cose.2017.07.014
Reference: COSE 1180
To appear in: Computers & Security
Received date: 18-8-2016
Revised date: 25-7-2017
Accepted date: 26-7-2017

Please cite this article as: Paulo Henrique Pisani, Norman Poh, André C. P. L. F. de Carvalho, Ana Carolina Lorena, Score normalization applied to adaptive biometric systems, Computers & Security (2017), http://dx.doi.org/10.1016/j.cose.2017.07.014.
Score Normalization applied to Adaptive Biometric Systems

Paulo Henrique Pisani (a,*), Norman Poh (b), André C. P. L. F. de Carvalho (a), Ana Carolina Lorena (c)

(a) Universidade de São Paulo, Instituto de Ciências Matemáticas e de Computação, Av. Trabalhador São Carlense, 400, São Carlos, Brazil
(b) University of Surrey, Department of Computing, Faculty of Engineering and Physical Sciences, Guildford, United Kingdom
(c) Universidade Federal de São Paulo, Instituto de Ciência e Tecnologia, Rua Talim, 330, São José dos Campos, Brazil

Abstract

Biometric authentication systems have certain limitations. Recent studies have shown that biometric features may change over time, which can entail a decrease in the recognition performance of the biometric system. An adaptive biometric system addresses this problem by adapting the biometric reference/template over time, thereby tracking the changes automatically. However, the use of these systems usually requires the adoption of a high threshold value to avoid the inclusion of impostor patterns into the genuine biometric reference. In this study, we hypothesize that score normalization procedures, which have been used to improve the recognition performance of biometric systems through a better refinement of their decision, can also improve the overall performance of adaptive systems. With such a normalization, a better threshold choice could also be made, which would then increase the number of genuine samples used for adaptation. To the best of our knowledge, this is the first investigation into the use of score normalization to enhance adaptive biometric systems dealing with the change of user features over time. Through a systematic experimental design tested on two behavioral biometric traits, the obtained results indeed support our conjecture. Moreover, the experimental results show that the performance gain brought by adaptation can have a higher overall impact than score normalization alone.
*Corresponding author
Email addresses:
[email protected] (Paulo Henrique Pisani),
[email protected] (Norman Poh),
[email protected] (André C. P. L. F. de Carvalho),
[email protected] (Ana Carolina Lorena)
Keywords: adaptive biometric systems, template update, score normalization, keystroke dynamics, accelerometer biometrics

1. Introduction

Recognizing humans using biometric features such as fingerprint, iris and handwritten
signature is often challenging due to limited enrollment samples. A biometric system, which recognizes users by their biometric features, is then expected to deal with large intra-class variability, including changes in the operating environment and the natural aging process. A promising solution to this problem is to adapt the biometric reference (also known as template or model) using additional query samples acquired while the system is in operation. The resultant system is known as an adaptive biometric system [1, 2]. However, a key problem associated with adaptive biometric systems is how to decide when a query sample should be considered for adaptation [3].

At the core of a biometric system is the operation that compares a biometric reference with a query sample, which produces a score (this paper considers it a similarity score). If the score is higher than a given threshold, the query sample is classified as belonging to the genuine user; otherwise, it is classified as belonging to an impostor. Usually, to avoid the inclusion of impostor samples in the biometric reference, a more stringent threshold is adopted. However, this also excludes potential genuine samples that have a lower score. As a result, the adaptation process may not perform as well as expected.

To this end, we investigate score normalization procedures as a way of refining the decision mechanism for adaptation. Our conjecture is that, once normalized, a better threshold can be chosen for the adaptive biometric system, allowing more genuine samples to be used for adaptation while avoiding impostor samples. Consequently, we hypothesize that the refined score should lead to overall better adaptive systems. A preliminary study on the use of score normalization for supervised adaptation to cope with different acquisition conditions was conducted in [4].
However, to the best of our knowledge, there is no prior work that deals with adaptation in conjunction with score normalization in the unsupervised setting. The current study aims to fill this gap. The use of score normalization in adaptive biometric systems is not straightforward: several aspects concerning how to obtain the normalization terms in such a dynamic context deserve a discussion in their own right.
The primary goal of this study is to investigate the possibility of using score normalization procedures [5], such as T-Norm, F-Norm and Z-Norm, to improve the performance of an adaptive biometric system. This study deals with two behavioral biometric traits, namely, gait movement captured by an accelerometer, referred to as accelerometer biometrics, and keystroke dynamics. Using behavioral traits introduces several additional challenges. First, due to their lower discriminative power compared to physical traits such as face, fingerprint and iris, it is harder to decide whether or not samples should be used for adaptation. Second, behavioral traits are usually more prone to changes over time [6]. Therefore, our secondary goal is to investigate which of the two strategies, adaptation or score normalization, has a stronger influence on the overall recognition performance for these traits.

More specifically, this study follows a systematic experimental methodology that progressively asks more complex questions, leading to the plausibility that score normalization can improve the performance of an adaptive biometric system:

1. Can score normalization improve the recognition performance of an adaptive biometric system?
2. Does an adaptive system warrant an improvement over any non-adaptive system, even if the non-adaptive system is enhanced by score normalization?
3. Does the combined use of score normalization and an adaptation strategy improve the results over all possible combinations of their constituent systems, enumerated as follows?
   (a) The baseline system, without score normalization and adaptation;
   (b) A non-adaptive system with score normalization;
   (c) An adaptive system without score normalization.
4. Is the best score normalization different among users in the same dataset?
In essence, the variants of biometric systems enumerated in the questions are based on two dichotomies: adaptive versus non-adaptive systems; and with versus without score normalization. In particular, it is interesting to compare 3(b) to 3(c) because it can show whether the impact of adaptation is larger than that of score normalization, or vice-versa.

We advance the state of the art of adaptive biometric systems in the following ways:

• A proposal to apply score normalization in adaptive biometric systems;
• A study of important questions regarding the application of score normalization in this context, such as whether score normalization or biometric reference adaptation has a greater impact on the biometric system performance;
• An investigation showing that the best score normalization may differ depending on the user, even within the same dataset;
• A systematic and rigorous evaluation of the framework using two behavioral biometric modalities.

This paper is organized as follows: Section 2 presents an overview of adaptive biometric systems, including adaptation strategies, classification algorithms and score normalization procedures; Section 3 describes the evaluation methodology adopted in the experiments; Section 4 presents the experimental setup; Section 5 discusses the obtained results; and, finally, Section 6 presents the main conclusions and future directions of this work.
2. Related work

Recently, some papers have suggested that the biometric reference obtained during enrollment may become outdated over time. This leads to a decrease in the recognition performance of the biometric system, which can potentially result in a worse false non-match rate (FNMR), i.e., the rate of rejecting a genuine request. Such intra-class variation has been reported for several biometric modalities, e.g., fingerprint and face recognition [1]. In view of this fact, adaptive biometric systems (sometimes also referred to as template update systems) have been proposed to deal with intra-class variation by updating the biometric reference over time [2]. Adaptive biometric systems have also been applied to improve the biometric reference when the enrollment data is limited [7]. An adaptive biometric system can be defined as a system that obtains a biometric reference from a set of labeled user samples and continuously adapts this biometric reference as new unlabeled samples arrive.

However, the literature on adaptive biometric systems is limited: few studies have investigated the impact of intra-class variation over time in biometrics, which highlights the need for new studies in this field. A possible explanation is the lack of appropriate public datasets for this kind of study [1]. Experiments involving adaptive biometric systems require a dataset that has several samples obtained from the users over time, ideally acquired in different sessions.
Most studies on adaptive biometric systems have been carried out for physiological modalities, such as fingerprint and face recognition. This paper focuses on behavioral modalities, particularly keystroke dynamics and accelerometer biometrics. It is known that behavioral modalities are subject to more changes over time than physiological ones. This requires an adaptive system able to add new patterns to the biometric reference and also to forget outdated patterns. An adaptive biometric system is basically formed by a classification algorithm and an adaptation strategy. The next sections provide an overview of some key adaptation strategies proposed in previous studies, along with classification algorithms employed in this context. Finally, there is a discussion regarding the role of score normalization in adaptive biometric systems.
2.1. Adaptation strategies

Some of the most studied adaptation strategies reported in the literature for unimodal adaptation are Self-Update and Graph min-cut [8]. Self-Update keeps adding new samples to the biometric reference as long as they score above a given adaptation threshold (which should be equal to or above the decision threshold), while Graph min-cut uses a graph-based approach to also add new samples over time. Growing works similarly to Self-Update and was applied to keystroke dynamics in previous work [9]. Although both approaches have performed well for physiological modalities, this may not be the case for behavioral modalities, because they do not contain a forgetting mechanism to remove outdated patterns from the biometric reference. A variation of Growing containing a forgetting mechanism is Sliding (also named Moving window), which has been successfully applied to keystroke dynamics [9, 10]. Sliding works similarly to Growing; however, it also removes the oldest samples from the biometric reference whenever a new one is added. Later, another adaptation strategy, named Double Parallel (DB) [11], combined the Growing and Sliding strategies. The concept is to manage two references, one adapted by Growing and another by Sliding; the output score is then the average of the scores obtained from both references. An extension of this strategy, making use of an incremental approach to improve memory efficiency, was investigated in [12]. This extension, called Improved Double Parallel (IDB), is designed to work with the M2005 classification algorithm (described in the next section).
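As an illustration, the difference between Growing and Sliding can be sketched as follows. This is a minimal sketch using a plain list as the biometric reference; the function names are our own, not the original implementations:

```python
def growing_update(reference, query):
    """Growing: every accepted query is appended to the biometric
    reference; nothing is ever forgotten."""
    reference.append(query)
    return reference

def sliding_update(reference, query):
    """Sliding (moving window): the accepted query is appended and the
    oldest sample is removed, keeping the reference size constant."""
    reference.append(query)
    reference.pop(0)  # forget the oldest pattern
    return reference
```

Under this sketch, Double Parallel would simply maintain one reference per strategy and average the two resulting scores.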
Some adaptation strategies based on the usage of detectors were proposed in [10, 13, 12]. The general concept of these strategies, named Usage Control, is to keep the most used patterns in the biometric reference, since they would represent the current user behavior.
2.2. Classification algorithms used in adaptive biometric systems

This section describes two classification algorithms employed in previous papers dealing with behavioral biometric modalities in the context of adaptive biometric systems [11, 13, 12]. These algorithms are: Self-Detector [14] and M2005 [15]. Self-Detector is an instance-based immune algorithm and M2005 is a statistical-based algorithm. They are described in the next sections.

2.2.1. Self-Detector

The standard Self-Detector [14] stores all genuine enrollment samples as detectors and assigns a radius to each of them. Later, in the test phase, when a query is presented to the algorithm, it is tested against all detectors. If at least one detector matches the query, it is classified as genuine; otherwise, it is classified as impostor. In the implementation used here, matching occurs if the distance between the detector and the query sample is smaller than the detector radius (also known as self-radius). The current study defines the radius using the methodology described in Section 3.4. Self-Detector does not output a score. However, since a score is needed by the score normalization procedures, the correlation between the closest detector and the query (i.e., the maximum correlation value) is taken as its score. As an example, the Sliding adaptation strategy [9, 10] applied to Self-Detector is shown in Algorithm 1, where D_j(t) denotes the set of detectors in the biometric reference of user j at instant t.

In the first part of Algorithm 1, Self-Detector checks whether the input query q is classified as genuine or impostor. Afterward, if the query is classified as genuine, the detector set is adapted using the Sliding strategy, i.e., the oldest detector is removed and the query is added as a new detector. In the end, the result of the classification (isGenuine) and the updated set of detectors, D_j(t+1), are returned.
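A minimal sketch of this classify-then-adapt loop is given below. The Euclidean distance and the tuple-based detector representation are illustrative assumptions, not necessarily those of the original implementation:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_detector_sliding(detectors, self_radius, query):
    """Sketch of Algorithm 1: Self-Detector classification followed by
    Sliding adaptation. Returns (isGenuine, updated detector set)."""
    # The query is genuine if it falls within the radius of at least one detector.
    is_genuine = any(euclidean(d, query) < self_radius for d in detectors)
    if is_genuine:
        # Sliding: drop the oldest detector and add the query as a new one.
        detectors = detectors[1:] + [query]
    return is_genuine, detectors
```

Note that an impostor-classified query leaves the detector set unchanged, which is exactly why a stringent threshold starves the adaptation process of genuine samples.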
2.2.2. M2005

M2005 is a classification algorithm designed for keystroke dynamics, proposed by [15]. This algorithm performs classification based on statistics extracted from the enrollment samples (mean, median and standard deviation), which are computed for each dimension of the feature vector. During the test phase, when a query sample q is received, the algorithm checks whether each dimension i of the feature vector meets the condition defined in Equation 1, in which q(i) is the value of the dimension i in the given query sample q, and mean_i, median_i and std_i are the mean, median and standard deviation, respectively, of the dimension i over the enrollment samples E_j of the user j:

L(i) ≤ q(i) ≤ U(i)
where:
L(i) = min(mean_i; median_i) × (0.95 − std_i / mean_i)    (1)
U(i) = max(mean_i; median_i) × (1.05 + std_i / mean_i)

For each dimension i of the query sample that satisfies the condition in Equation 1, the algorithm accumulates a sum according to Algorithm 2. After checking all the dimensions of the query, the algorithm computes a score according to Equation 2, in which max_sum is defined as 1.0 + 1.5 × (dimension_count − 1.0):

Score = sum / max_sum    (2)
If the obtained score value is higher than a given threshold, the query is classified as genuine; otherwise, as impostor. Note that M2005 is not used for accelerometer biometrics since this algorithm was designed for keystroke dynamics. For example, M2005 increases the score for consecutive feature/key matches, which is a reasonable procedure for keystroke dynamics.
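A sketch of the M2005 scoring described above is given below. The 1.0/1.5 increments mirror the max_sum definition (first match scores 1.0, each consecutive match 1.5), and the list-of-vectors enrollment format is our own assumption:

```python
from statistics import mean, median, stdev

def m2005_score(enrollment, query):
    """Sketch of M2005 scoring (Equations 1 and 2). `enrollment` is a list
    of feature vectors; each query dimension inside the [L(i), U(i)] band
    contributes to the sum, with consecutive matches rewarded more."""
    n = len(query)
    total, prev_match = 0.0, False
    for i in range(n):
        values = [x[i] for x in enrollment]
        m, med, sd = mean(values), median(values), stdev(values)
        lower = min(m, med) * (0.95 - sd / m)   # L(i), Equation 1
        upper = max(m, med) * (1.05 + sd / m)   # U(i), Equation 1
        if lower <= query[i] <= upper:
            total += 1.5 if prev_match else 1.0  # consecutive matches score more
            prev_match = True
        else:
            prev_match = False
    max_sum = 1.0 + 1.5 * (n - 1.0)
    return total / max_sum                       # Equation 2
```

A query matching every dimension therefore reaches a score of 1.0, while any mismatch resets the consecutive-match bonus.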
The M2005 classification algorithm has been successfully used with the Double Parallel adaptation strategy [11] and its extension Improved Double Parallel [12].

2.3. Score normalization applied to adaptive biometric systems

Defining which samples should be used for adaptation is a key aspect of adaptive biometric systems. Usually, samples classified as genuine are used for this purpose, so a more rigorous decision threshold would be a good choice to avoid the inclusion of impostor samples in the biometric reference. However, at the same time, a high threshold means that several genuine samples with a low score may be rejected and not used for adaptation. In light of this fact, score normalization can play an important role. Score normalization procedures aim to reduce undesirable variation in the output score. As a result of employing these procedures, the class separability between genuine and impostor samples is often increased.
Consequently, a better threshold could be chosen, which can, in turn, result in the use of a higher number of genuine samples for adaptation while avoiding the undesirable inclusion of impostor samples.

A preliminary study on the use of score normalization for supervised adaptation to cope with different acquisition conditions was conducted in [4]. That study used the BANCA dataset [16], which has face and speech data for 52 users acquired in 12 sessions. Sessions 1 to 4 contain samples captured under a controlled condition, sessions 5 to 8 under an adverse condition and sessions 9 to 12 under a degraded condition. With this dataset, adaptation was performed using the first session of each condition (sessions 1, 5 and 9), leaving the remaining ones for the test (sessions 2-4, 6-8, 10-12). In that paper, the score normalization terms were computed for each condition, allowing the system to perform both reference and score adaptation. According to the reported results, both score and reference adaptation improve the recognition performance over a non-adaptive biometric system, with reference adaptation performing better than score adaptation. However, to the best of our knowledge, there is no prior work that deals with adaptation in conjunction with score normalization in an unsupervised setting. The current study aims to fill this gap.

The next section describes some score normalization procedures. They are also part of the experiments shown in this paper.

2.3.1. Score normalization procedures

The T-Norm, or Test Normalization, is a widely used score normalization procedure [17]. It uses data from other known users, who serve as cohort users. To apply T-Norm, as shown in Equation 3 [5], the raw output score y from the classifier is subject to z-scoring: the average cohort score μ_c is subtracted from y, and the resultant term is divided by σ_c, the standard deviation of the cohort scores. In this paper, the cohort scores are those obtained by comparing the input query with the biometric references of all other users registered in the system at runtime (known as cohort models). Depending on the number of cohort users in the system, the computation of T-Norm can result in a high processing cost.

y_T = (y − μ_c) / σ_c    (3)
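A minimal sketch of T-Norm under these definitions (the function name is ours):

```python
from statistics import mean, stdev

def t_norm(y, cohort_scores):
    """T-Norm (Equation 3): z-scoring of the raw score y using the mean
    (mu_c) and standard deviation (sigma_c) of the cohort scores, i.e. the
    scores of the query against all other registered users' references."""
    return (y - mean(cohort_scores)) / stdev(cohort_scores)
```

Because the cohort scores are recomputed for every query, this normalization adapts to each test sample but pays the runtime cost of matching against all cohort models.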
Another related method is Z-Norm [18], which uses a development dataset to estimate the normalization parameters offline. As a result, unlike T-Norm, Z-Norm does not incur any significant additional computation at runtime. Although Z-Norm also applies z-scoring in the same way as T-Norm, it uses a separate set of impostor development samples to derive the average impostor score μ_d^{I,j}, along with its corresponding standard deviation σ_d^{I,j}, with respect to the reference user j. Z-Norm is defined as:

y_Z = (y − μ_d^{I,j}) / σ_d^{I,j}    (4)

According to [19], both T-Norm and Z-Norm are considered impostor-centric because they only use data from impostor users. On the contrary, F-Norm is considered client-impostor centric, as it also takes the genuine samples into account (the genuine user would be the client in this case). Although F-Norm contains the same subtraction term, the mean impostor score μ_d^{I,j}, its scaling factor (the denominator) depends on the difference between the means of the genuine and the impostor samples, i.e.,

y_F = (y − μ_d^{I,j}) / (μ(γ, j) − μ_d^{I,j})    (5)

where μ_d^{I,j} is the same as the Z-Norm impostor mean parameter. The crux in making F-Norm work is to find a stable estimate of the expected value (mean) of the genuine samples. To this end, it appeals to the maximum a posteriori adapted mean, given by μ(γ, j) and computed as:

μ(γ, j) = γ μ_d^{G,j} + (1 − γ) μ_d^{G}    (6)

where μ_d^{G,j} is the average of the genuine scores for the reference user j and μ_d^{G} is the average over all references. These terms are obtained offline, using a development dataset, similarly to Z-Norm.

There is also an adaptive version of F-Norm which uses cohort scores, similarly to T-Norm [5]. The Adaptive F-Norm uses μ_c, the mean of the cohort scores, instead of μ_d^{I,j}. The Adaptive F-Norm is defined as:

y_AF = (y − μ_c) / (μ(γ, j) − μ_c)    (7)
The advantage of using Adaptive F-Norm over the standard F-Norm is that the mean is dependent on each query sample, hence adapted to the context of the operational environment.
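Under the definitions above, the three procedures can be sketched as follows. The parameter names and the default γ value are our own illustrative choices; in practice these terms are estimated on a development set:

```python
from statistics import mean, stdev

def z_norm(y, impostor_dev_scores):
    """Z-Norm (Equation 4): impostor mean and standard deviation estimated
    offline on a development set, with respect to reference user j."""
    return (y - mean(impostor_dev_scores)) / stdev(impostor_dev_scores)

def adapted_genuine_mean(genuine_scores_j, genuine_scores_all, gamma):
    """Equation 6: mu(gamma, j), the adapted mean of the genuine scores."""
    return gamma * mean(genuine_scores_j) + (1 - gamma) * mean(genuine_scores_all)

def f_norm(y, impostor_dev_scores, genuine_scores_j, genuine_scores_all, gamma=0.5):
    """F-Norm (Equation 5): scale by the gap between the adapted genuine
    mean and the impostor mean."""
    mu_i = mean(impostor_dev_scores)
    mu_gj = adapted_genuine_mean(genuine_scores_j, genuine_scores_all, gamma)
    return (y - mu_i) / (mu_gj - mu_i)

def adaptive_f_norm(y, cohort_scores, genuine_scores_j, genuine_scores_all, gamma=0.5):
    """Adaptive F-Norm (Equation 7): the offline impostor mean is replaced
    by the mean of the cohort scores of the current query."""
    mu_c = mean(cohort_scores)
    mu_gj = adapted_genuine_mean(genuine_scores_j, genuine_scores_all, gamma)
    return (y - mu_c) / (mu_gj - mu_c)
```

With F-Norm, a normalized score near 0 sits at the impostor mean and a score near 1 sits at the (adapted) genuine mean, which makes a single global threshold easier to choose.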
3. Evaluation methodology

Before describing the experimental methodology, it is necessary to introduce the terms registered versus non-registered users, due to the need to consider internal and external attacks. An internal attack is one carried out by a registered user attempting to impersonate the reference user. On the other hand, an external attack is carried out by a non-registered user attempting to impersonate the reference user. In essence, our experiments are designed to consider adaptation in the presence of internal and external attacks. The simulation of external attacks (by users not known to the system) becomes even more important when the system takes into account information from all registered users, as in the case of some score normalization procedures.

Consistent with the previous discussion, a reference or target user is denoted as j (j ∈ J, where J is the set of registered users). With reference to this user, a sample of a registered user i ≠ j from the registered set J is considered an impostor sample. Similarly, any sample that belongs to any non-registered user is considered an impostor sample. The only admissible genuine samples are those belonging to the reference user. In short, a genuine sample is one that can be used for adaptation, whereas an impostor sample, i.e., any sample not belonging to the reference user, whether registered or not, should not be used for adaptation. The correct inclusion of genuine samples and the correct rejection of impostor samples are the premise for an effective adaptation strategy. The term impostor refers to any non-registered user or to any registered user who is not the reference user. A genuine user is synonymous with the reference or target user.
3.1. User cross-validation: motivation

This study assumes that the biometric system can share data among all registered users, similarly to [20]. This is useful for several reasons, mainly for obtaining impostor-like data. For instance, apart from the genuine user data, data from other registered users can be accessed as impostor data to improve the biometric reference. This additional data also helps to tune system parameters. This approach is referred to as a database-aware biometric system in our work.

However, the database-aware assumption inadvertently introduces an additional challenge during the experimental evaluation. Assuming a hypothetical dataset containing 100 users, a straightforward evaluation protocol would consider that the biometric system has access to data from all 100 users. Consequently, during enrollment, user 1 could adjust its biometric reference to protect itself from the other 99 users. Nonetheless, this experimental protocol is biased, since all possible impostors seen during the test are previously known by the system. In other words, it would only be evaluating insider attacks. In order to avoid such an experimental bias, not all users should be considered known to the biometric system. To illustrate the idea, users 81 to 100 could be considered unknown. As a consequence, during the enrollment of user 1, only data from users 2 to 80 can be used. By doing this, there are still 20 users (81 to 100) that can be used for simulating external attacks (from previously unknown users). The user cross-validation for biometric data streams methodology was introduced to simulate this scenario, as reported in [12, 21].
3.2. User cross-validation: methodology

An overview of the user cross-validation is shown in Figure 1. The first step is the random division of the user indexes J into k sets of similar size. The set J contains the user indexes of all users in the dataset. For example, let us assume a dataset containing six users, with indexes J = {1, 2, ..., 6}. In this example, if k = 3, three sets of user indexes are required. Hence, a possible outcome would be J_1 = {2, 5}, J_2 = {1, 3}, J_3 = {4, 6}.
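The user-level split can be sketched as follows. This is illustrative: the round-robin partition and the fixed seed are our own choices, and any division into k sets of similar size would do:

```python
import random

def user_cross_validation_folds(user_ids, k, seed=0):
    """Sketch of the user-level split: shuffle the user indexes, divide
    them into k sets of similar size, and build k folds where one set
    plays the non-registered users and the union of the remaining k - 1
    sets plays the registered users."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    sets = [ids[i::k] for i in range(k)]  # k sets of similar size
    folds = []
    for i in range(k):
        non_registered = sets[i]
        registered = [u for j, s in enumerate(sets) if j != i for u in s]
        folds.append((registered, non_registered))
    return folds
```

Each user thus appears exactly once as a non-registered user across the k folds, mirroring k-fold validation at the user level rather than the sample level.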
Based on these sets, the experiments are performed. For each experiment, one set of users is employed as non-registered users and the remaining k − 1 sets form the registered users. The registered users are those known by the system, and they form the registered set. In our simple example, the first experiment would consider J_2 ∪ J_3 = {1, 3, 4, 6} as the set of registered users and J_1 = {2, 5} as the set of non-registered users. The subsequent experiments would then evaluate the other combinations of sets until all the k combinations are tested, so that every user is considered once as a non-registered user. As can be seen, this part of the user cross-validation works similarly to the well-known k-fold validation, except that the sampling occurs at the level of the user index set J rather than at the level of the data samples directly. A sample-level k-fold validation is not suitable in our case because we need to keep the order of the samples to study the behavior change. The experiments reported later in this paper considered k = 5.

After the definition of the registered and the non-registered sets of users, an execution is performed for each registered user j ∈ J. An execution consists of two phases: enrollment and test+adaptation. The first samples from the registered user are employed to perform enrollment. Afterward, a biometric data stream is generated to assess the recognition performance. Each sample in the data stream is first tested and, depending on the adaptive biometric system, it may also be used for adaptation. This process of test + adaptation is repeated sample by sample until the data stream is over. The next section describes the composition of this data stream.
3.3. Biometric data stream generation

The notion of a data stream is necessary to simulate a scenario where data samples arrive continuously rather than in sessions, although the actual data may have been collected in different sessions. In this way, the decision to perform adaptation or not is up to the adaptive biometric system rules. As no information on sessions is passed to the system, this choice of data stream better represents a practical scenario, since session information may not be available. Regarding the composition of the data stream, there are several aspects to be considered:

• Chronological order: As this study deals with adaptation to changes of the features over time, the biometric samples from the same user must be presented in chronological order. In addition, the enrollment should be performed with samples acquired earlier in time than those used for testing/adaptation.

• Ratio of genuine versus impostor samples: The biometric data stream used for the test should contain both genuine and impostor samples to properly evaluate the biometric system performance. However, which ratio of genuine/impostor samples should be considered? A balanced case (50% of samples for each class) may not correctly reproduce a practical scenario. Usually, in an access control scenario, the majority of authentication attempts are likely to come from the genuine user. Indeed, some previous studies have considered this scenario [11, 12, 21], adopting the ratio of 70% genuine samples / 30% impostor samples. The current study adopts the same ratio in the experiments.

• Source of the impostor samples: This study generates a data stream with 30% of impostor samples. There are two sources for these samples: the registered users in the dataset (excluding the genuine one) or a separate set of non-registered users. This study uses a combination of these two sources. As described in the previous section, the adopted methodology defines two sets of users for each of the k folds: registered and non-registered users. In the user cross-validation methodology, when an impostor sample is selected for the data stream, it has a 50% chance of being drawn from a registered user and a 50% chance of being drawn from a non-registered user.
• Sequence of samples: After setting the ratio of genuine/impostor samples, one should define how they will be presented. In one extreme, all genuine samples are presented first, followed by all impostor samples. In the other extreme, all impostor samples are presented first, followed by all genuine samples. However, it is unlikely that either extreme scenario happens in practice. Hence, the alternative adopted here is to interleave genuine and impostor samples. There are many ways to interleave the samples; this work interleaves them randomly and repeats the experiments 30 times.

• True labels: Several studies on data streams [22] assume that the true label is given to the classifier some time after the prediction. This is a feasible assumption for some applications, such as stock market forecasting. Nevertheless, for biometric systems, the true label is not provided in most implementations. Hence, in this study, the true label is never provided to the classifier during the experiments. Therefore, an adaptive system should decide whether or not to adapt using a given sample without ever knowing its true label.

Figure 2 shows how the generated data stream may look. It presents the execution for a single user j, where an enrollment set (E_j) and a test biometric data stream (DataStream_j) are provided. The enrollment data (E_j) has N genuine samples, while the test biometric data stream (DataStream_j) has M genuine samples and K impostor samples. Since it was considered that there are 30% of impostor samples in the test, K / (K + M) = 0.3. Each individual sample in DataStream_j is presented one after another. Genuine and impostor samples are randomly interleaved.
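Taken together, the composition rules above can be sketched as follows. The 70/30 ratio and the 50/50 impostor-source split follow the text, while the seed, the sample encoding and the function name are our own illustrative choices:

```python
import random

def build_data_stream(genuine, registered_imp, non_registered_imp,
                      impostor_ratio=0.3, seed=0):
    """Sketch of the biometric data stream for one user: genuine samples
    stay in chronological order, the number of impostor samples K satisfies
    K / (K + M) = impostor_ratio, each impostor sample is drawn with a 50%
    chance from registered and 50% from non-registered users, and genuine
    and impostor positions are randomly interleaved."""
    rng = random.Random(seed)
    m = len(genuine)
    k = round(m * impostor_ratio / (1 - impostor_ratio))  # K/(K+M) = ratio
    impostors = [rng.choice(registered_imp if rng.random() < 0.5
                            else non_registered_imp) for _ in range(k)]
    # Shuffle only the label positions, so the genuine order is preserved.
    labels = ['G'] * m + ['I'] * k
    rng.shuffle(labels)
    g_iter, i_iter = iter(genuine), iter(impostors)
    return [(next(g_iter) if lab == 'G' else next(i_iter), lab) for lab in labels]
```

The label 'G'/'I' attached to each sample here is only for evaluation; as stated above, it is never shown to the classifier.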
Parameter setting Before running the tests, the parameters of the classification algorithms need to be tuned.
For Self-Detector, the parameter is the self-radius, while for M2005 it is the decision threshold (they were described in Section 2.2). To tune these parameters, the user cross-validation procedure previously described in this paper is used to evaluate several combinations of parameter values (similar to a grid search method). However, instead of using the complete dataset, only the first samples of each user (enrollment samples) are used for parameter tuning. The best combination of parameter values in terms of balanced accuracy ( B A cc = 1.0 ( F M R F N M R ) / 2.0 ) is selected for the test on the complete dataset, noting that FMR is false match rate, i.e., the rate at which an impostor claim is classified as genuine;
whereas FNMR is the false non-match rate, i.e., the rate at which a genuine claim is classified as impostor. Note that the current study only reports the performance measures for a given threshold obtained using enrollment data from all users. The adopted procedure follows guidelines from [23], which argues that reporting results in terms of the EER (Equal Error Rate), the error rate at which FMR and FNMR are equal, can be misleading. This is because the EER is usually obtained by testing several threshold values on the test data (until false acceptance equals false rejection), which may not be feasible in a practical application scenario. A more realistic procedure is to tune the parameters on the enrollment data and apply the obtained values to the test data. Parameter tuning is performed only on the static/non-adaptive versions of the biometric systems; their adaptive counterparts assume the same parameter values. This allows a proper comparison of the impact of using adaptive systems under the same conditions as their non-adaptive counterparts.

3.5. Score normalization terms

This section describes how the score normalization terms are obtained in the user cross-validation for biometric data streams methodology. An overview is shown in Figure 3. Note that some score normalization procedures (e.g., Z-Norm and F-Norm) require an offline set of scores; this section describes how they are obtained in the adopted methodology. Three sets of scores are defined: c_{j,query}, G_j and I_j.
The first one is the set of cohort scores for the user j considering a query: c_{j,query} is the set of scores obtained by comparing the query to all biometric references other than the claimed one (j), as defined in Equation 8. As the name suggests, the function match(q, m) returns the match score between the query q for a given claimed user and the reference m. In Equation 8, m_i is the biometric reference of index i and U is the set of user indexes known by the biometric system. The cohort score set needs to be recalculated for every new query.

c_{j,query} = { match(q, m_i) | i ∈ U, i ≠ j }    (8)
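As a minimal sketch of Equation 8 (the references mapping and the match function are assumed interfaces, not names from the paper):

```python
def cohort_scores(query, references, claimed_j, match):
    """Equation 8: compare the query against every enrolled reference
    other than the claimed one; must be recomputed for each new query.

    references: mapping from user index i to biometric reference m_i.
    match(q, m): returns the match score of query q against reference m.
    """
    return [match(query, m_i) for i, m_i in references.items() if i != claimed_j]
```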
The second one is the set of genuine scores for user j, used to calculate the F-Norm terms, obtained from Equation 9. This set of scores is obtained only from the enrollment samples of the user j, which are the first samples from the user j in the user cross-validation evaluation methodology. These scores are obtained by applying the leave-one-out method over the enrollment samples: the classification algorithm is trained using all enrollment samples from user j except one (E_j \ s_n), and the remaining enrollment sample (s_n) is used to obtain a genuine matching score, match(s_n, m_j^{E_j \ s_n}), where m_j^{E_j \ s_n} denotes the reference built without s_n. This process is repeated until all enrollment samples have been used once for matching, as described in Equation 9.

G_j = { match(s_n, m_j^{E_j \ s_n}) | s_n ∈ E_j }    (9)
The third set of scores consists of the impostor scores for user j, which are needed to compute the F-Norm and Z-Norm terms. This set is obtained from Equation 10, using only the enrollment samples from all other users (except those belonging to the reference user j): the biometric reference m_j is compared to all enrollment samples h_n from the remaining registered users.

I_j = { match(h_n, m_j) | h_n ∈ E_i, i ∈ U, i ≠ j }    (10)
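Equations 9 and 10 can be sketched as follows (a minimal sketch; train and match are assumed interfaces standing in for the classification algorithm and its scoring function):

```python
def genuine_scores(enrollment_j, train, match):
    """Equation 9: leave-one-out genuine scores for user j.

    train(samples): builds a biometric reference from the given samples.
    match(sample, reference): returns a match score.
    """
    scores = []
    for n, s_n in enumerate(enrollment_j):
        held_in = enrollment_j[:n] + enrollment_j[n + 1:]
        m_j = train(held_in)            # reference built without s_n
        scores.append(match(s_n, m_j))  # genuine score for the held-out sample
    return scores

def impostor_scores(m_j, enrollment_others, match):
    """Equation 10: scores of the enrollment samples from all other
    registered users against user j's biometric reference m_j."""
    return [match(h_n, m_j) for h_n in enrollment_others]
```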
From these score sets, the terms needed for calculating T-Norm, Adaptive F-Norm, Z-Norm and F-Norm can be obtained as shown in Equations 11 to 16, where mean(·) and std(·) denote the mean and the standard deviation of a set of scores:

μ_{c_{j,query}} = mean(c_{j,query})    (11)

σ_{c_{j,query}} = std(c_{j,query})    (12)

μ_{G,j} = mean(G_j)    (13)

μ_G = mean({ μ_{G,j} | j ∈ U })    (14)

μ_{I,j} = mean(I_j)    (15)

σ_{I,j} = std(I_j)    (16)
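The terms of Equations 11 to 16 can be computed directly from the three score sets. The sketch below uses the population standard deviation, since the paper does not specify the estimator, and the dictionary layout is ours. For illustration, the standard T-Norm and Z-Norm transformations, which these terms feed into (see Section 2), are also included:

```python
from statistics import mean, pstdev

def normalization_terms(cohort, genuine_by_user, impostor_by_user, j):
    """Terms of Equations 11-16 for the claimed user j.

    cohort: c_{j,query}; genuine_by_user[j] = G_j; impostor_by_user[j] = I_j.
    """
    return {
        'mu_c': mean(cohort),                                     # Eq. 11
        'sigma_c': pstdev(cohort),                                # Eq. 12
        'mu_G_j': mean(genuine_by_user[j]),                       # Eq. 13
        'mu_G': mean(mean(g) for g in genuine_by_user.values()),  # Eq. 14
        'mu_I_j': mean(impostor_by_user[j]),                      # Eq. 15
        'sigma_I_j': pstdev(impostor_by_user[j]),                 # Eq. 16
    }

def t_norm(score, terms):
    # Standard T-Norm: centre and scale by the cohort statistics.
    return (score - terms['mu_c']) / terms['sigma_c']

def z_norm(score, terms):
    # Standard Z-Norm: centre and scale by user j's impostor statistics.
    return (score - terms['mu_I_j']) / terms['sigma_I_j']
```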
4. Experimental setup

4.1. Datasets

The biometric datasets used in the experiments are described in Table 1. They are all public datasets, which allows the experiments to be reproduced. Seven datasets are used: four for keystroke dynamics and three for accelerometer biometrics. This paper adopts the term accelerometer biometrics to refer to gait recognition using data from an accelerometer. Datasets suitable for studying adaptive biometric systems need to contain several samples per user. Moreover, they should ideally be acquired at different sessions. All datasets we are
aware of that can be used to study adaptive biometric systems for keystroke dynamics and for accelerometer biometrics are used in our experiments.

For keystroke dynamics, the feature flight time type 1 [24] was extracted, which is the time difference between the instant when a key is released and the instant when the next key is pressed, as shown in Figure 4. According to [25], this is one of the most used features in previous keystroke dynamics studies. Self-Detector uses an order-based rank transformation to process the keystroke dynamics features, as in [21]. For accelerometer biometrics, features are extracted from the series of accelerations measured in each of the three axes: x, y and z. Each series is divided into windows and the magnitudes resulting from the application of a Fast Fourier Transform are used as input features for classification, as in [26].

4.2. Biometric systems

The experiments considered two classification algorithms which have been used in the context of adaptive biometric systems for keystroke dynamics and accelerometer biometrics: Self-Detector and M2005. These algorithms were described in Section 2.2. Together with these algorithms, several previously proposed adaptation strategies were evaluated. These adaptation strategies were presented in Section 2.1; additional details on them can be found in [12, 11, 10, 13]. Their combinations with the classification algorithms are shown in Table 2.

5. Experimental results

The experiments in this section are designed to answer the four questions asked in Section 1.

5.1. Question 1: score normalization improves performance of adaptive biometric systems

In order to answer this question, the relative performance of the biometric systems with
and without score normalization is tested. The performance is measured in terms of the balanced accuracy, BAcc = 1.0 - (FMR + FNMR) / 2.0, according to Equation 17, where FMR stands for False Match Rate and FNMR stands for False Non-Match Rate. Several studies on biometric systems attempt mainly to decrease FMR, since this means that fewer impostor samples are wrongly accepted by the system. However, one of the main benefits of adaptive biometric systems is obtaining a lower FNMR than a non-adaptive system, without necessarily compromising FMR. This occurs because an adaptive biometric system adapts the biometric reference to the data from the genuine user, thus decreasing false non-matches. The challenge of adaptive systems is then to decrease FNMR without negatively affecting FMR when compared
to a non-adaptive biometric system. Since both FMR and FNMR play a key role in adaptive biometric systems, this paper adopted the balanced accuracy to measure the performance. In Equation 17, BAcc_a is the performance of the system with score normalization and BAcc_b is the performance of the baseline (without normalization). Note that these rates are obtained from fixed parameter values, as described in Section 3.4. Figure 5, which is based on the plots of [27], shows the relative performance for keystroke dynamics and accelerometer biometrics.

RP_a = (BAcc_a - BAcc_b) / BAcc_b    (17)
According to the graphs, the results change for each biometric modality. While T-Norm and Adaptive F-Norm generally enhance the performance of the keystroke dynamics systems, F-Norm and Adaptive F-Norm do so only for certain accelerometer-based systems. T-Norm, which was the best score normalization for keystroke dynamics, appears to give a lower performance for accelerometer biometrics. The performance improvement also depends on the choice of the classification algorithm. These experiments suggest that score normalization improves adaptive biometric systems for keystroke dynamics, but the picture is less clear for accelerometer biometrics due to the higher variance. Hence, regarding question 1, the experimental results partially confirm that score normalization improves performance.

For keystroke dynamics, which most benefited from the use of score normalization, an additional plot is included. Figure 6 shows the performance considering FMR and FNMR separately for each dataset. This new plot considers the absolute performance difference for the adaptive systems, as in Equation 18, where Metric can be either FNMR or FMR. The idea is to compare which performance metric (FMR or FNMR) was more affected by score normalization in our experiments with adaptive biometric systems. For this, the absolute difference was considered instead of the relative performance.

PD_a = Metric_a - Metric_b    (18)
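The performance measures used in this section reduce to simple arithmetic; a minimal sketch:

```python
def balanced_accuracy(fmr, fnmr):
    # BAcc = 1.0 - (FMR + FNMR) / 2.0
    return 1.0 - (fmr + fnmr) / 2.0

def relative_performance(bacc_a, bacc_b):
    # Equation 17: gain of the normalized system (a) over the baseline (b).
    return (bacc_a - bacc_b) / bacc_b

def performance_difference(metric_a, metric_b):
    # Equation 18: difference on the metric's own scale (FMR or FNMR).
    return metric_a - metric_b
```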
By comparing the two plots (FMR and FNMR), it is clear that the highest benefit was for FNMR. Indeed, FNMR was decreased in almost all cases, while FMR increased. However, the increase in FMR is not as high as the decrease in FNMR. In practice, the threshold chosen by the optimization method is lower (less rigorous) when a score normalization procedure is used. It
means that the adaptive system was able to accept a larger number of genuine samples that would have been rejected if a more stringent threshold were used to adapt the biometric reference over time.

5.2. Questions 2 and 3: comparing the performance impact of adaptation and score normalization

In order to check whether adaptation or normalization has the higher impact, three experimental scenarios are assessed using the relative performance box plots (the best normalization was selected from the plots of the last section: T-Norm and Adaptive F-Norm were the best normalizations, in terms of median, for the keystroke dynamics and accelerometer datasets, respectively):

• Non-adaptive system (without score normalization) vs. adaptive systems (without score normalization);

• Non-adaptive system (with the best score normalization) vs. adaptive systems (without score normalization);

• Non-adaptive system (with the best score normalization) vs. adaptive systems (with the best score normalization).

All in all, we hypothesize that the best biometric system is one with both adaptation and score normalization, while the worst case would be the system with neither adaptation nor normalization. These scenarios are checked experimentally, as illustrated in Figure 7.

Figure 8 adopts the same methodology for visualizing the results as Figure 5. The box plot at the bottom is the baseline, whereas the remaining box plots above it present the performance relative to the baseline. This figure shows the three comparisons in terms of balanced accuracy. The first row of plots (figures 8 (a), (b) and (c)) does not use any score normalization for any biometric system, so it shows the impact of using adaptation alone. Almost all adaptive biometric systems present a clear performance improvement, as discussed in previous papers [12]. The second row of plots (figures 8 (d), (e) and (f)) is perhaps the most interesting one, since it compares the non-adaptive (static) biometric system using the best score normalization against the adaptive counterparts without any score normalization. For keystroke dynamics (M2005) and accelerometer (Self-Detector), adaptive systems still perform better in almost all tests.
This suggests that adaptation has a higher impact on the performance than score
normalization. However, for keystroke dynamics (Self-Detector), the non-adaptive version with normalization attains better performance than several adaptive systems.

The second question asked in the introduction was: does an adaptive system warrant an improvement over any non-adaptive system, even if the non-adaptive system is enhanced by score normalization? In short, the results presented in this section answer this question by suggesting that adaptation may have a higher impact on performance than score normalization alone. An additional question was then asked: does the combined use of score normalization and an adaptation strategy improve the results over all possible combinations of their constituent systems? In the third row of plots (figures 8 (g), (h) and (i)), score normalization is applied to both non-adaptive and adaptive systems. The results answer this question by showing that the combined use of score normalization and adaptation further improves the recognition performance.

5.3. Question 4: the best score normalization is different among users in the same dataset

To answer this question, graphs of balanced accuracy per user are shown in figures 9 and 10. Although only two of them are shown here, similar conclusions can be obtained from the plots of other systems/datasets. There are two types of graphs. The first is a heat map of the performance per user for each score normalization procedure; the brighter the green, the better the performance. The second simply highlights which score normalization was the best per user in terms of balanced accuracy. These graphs show that the best score normalization can differ among users in the same dataset. In some datasets, such as CMU, most users share the same best score normalization, but this is not the case for all datasets and biometric systems. Sometimes, a user achieves better performance without any score normalization procedure. In addition, the heat maps show that the performance difference among the score normalization procedures may be high, which can be observed in most datasets. These findings suggest that, compared to a system that chooses a single/common score normalization for all users, an adaptive biometric system able to choose the score normalization for each user can achieve better overall recognition performance.

6. Conclusion
This paper addressed some of the open questions regarding the use of score normalization in conjunction with adaptive biometric systems. These questions were enumerated in the introduction and answered throughout the text. First, the experimental findings confirm that score normalization can improve the recognition performance of baseline biometric systems, both adaptive and non-adaptive. Although this is expected, to the best of our knowledge, its effectiveness had not been systematically evaluated in an unsupervised scenario, using several score normalization procedures, for keystroke dynamics and accelerometer biometrics. Second, the experimental results suggest that adaptation has a higher impact on performance than score normalization alone, even though both can improve the recognition performance over a baseline biometric system. Moreover, according to the obtained results, the combined use of adaptation and score normalization can further enhance the recognition performance. Finally, it has been observed that the best score normalization can differ from one user to another, even in the same dataset. This finding highlights that, compared to a system that adopts a single/common score normalization for everyone, an adaptive biometric system able to choose the most appropriate score normalization for each user could achieve a higher performance.

In future work, the investigation performed here for keystroke dynamics and accelerometer biometrics can be extended to additional biometric modalities, as well as to other adaptation strategies. Furthermore, as the adaptive system adapts the biometric reference over time, the F-Norm terms obtained at enrollment time may become sub-optimal and, as a result, may need to be updated. Therefore, future studies could explore alternatives for updating mechanisms, considering both score and reference adaptation.

7. Acknowledgments

The authors would like to thank the Brazilian agencies CAPES, CNPq and FAPESP (projects 2012/25032-0, 2012/22608-8 and 2013/07375-0) for financial support.

Dr. Paulo Henrique Pisani finished his undergraduate course in Data Processing at Faculdade de Tecnologia de São Paulo (FATEC-SP), Brazil, in 2008. He received his master's degree in Information Engineering from Universidade Federal do ABC (UFABC), Brazil, in 2012. He received his doctorate in Computer Science and Computational Mathematics at the Instituto de Ciências Matemáticas e de Computação (ICMC) - Universidade de
São Paulo (USP), Brazil, in 2017. His main research interests are Biometrics and Machine Learning.

Dr. Norman Poh is a Lecturer in Computational Intelligence in the Department of Computer Science, University of Surrey. He received the Ph.D. degree in computer science in 2006 from the Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland. Prior to the current appointment, he was a Research Fellow with the Centre for Vision, Speech, and Signal Processing (CVSSP) and a research assistant at the IDIAP research institute. His research objective is to advance pattern recognition techniques with applications to biometrics and healthcare informatics. In these two areas, he has published more than 90 publications, which include five award-winning papers (AVBPA05, ICB09, HSI 2010, ICPR 2010 and Pattern Recognition Journal 2006). He is the principal investigator of the MRC Modelling CKD project (www.modellingCKD.org). He is an Associate Editor of the IET Biometrics Journal, a member of the European Reference Network for Critical Infrastructure Protection (ERNCIP), and a member of the Education Committee of the IEEE Biometrics Council.

Prof. André C. P. L. F. de Carvalho received his B.Sc. and M.Sc. degrees in Computer Science from the Universidade Federal de Pernambuco, Brazil. He received his Ph.D. degree in Electronic Engineering from the University of Kent, UK. Prof. André de Carvalho is Full Professor at the Department of Computer Science, Universidade de São Paulo, Brazil. He has published around 90 journal and 200 conference refereed papers. He has been involved in the organization of several conferences and journal special issues. His main interests are Data Science, Machine Learning, Data Mining, Bioinformatics, Evolutionary Computation and Bioinspired Computing.

Dr. Ana Carolina Lorena has bachelor (2001) and Ph.D. (2006) degrees in Computer Science from the University of São Paulo - São Carlos.
She is currently an Assistant Professor at the Federal University of São Paulo. She has experience in Computer Science, working mainly on the following topics: data mining, supervised machine learning, hybrid intelligent systems and support vector machines.

References

[1] F. Roli, L. Didaci, G. Marcialis, Adaptive biometric systems that can improve with use, in: N. Ratha, V. Govindaraju (Eds.), Advances in Biometrics, Springer London, 2008, pp. 447–471. doi:10.1007/978-1-84628-921-7_23.
[2] N. Poh, A. Rattani, F. Roli, Critical analysis of adaptive biometric systems, IET Biometrics 1 (4) (2012) 179–187. doi:10.1049/iet-bmt.2012.0019. [3] N. Poh, J. Kittler, A. Rattani, Handling session mismatch by fusion-based co-training: An empirical study using face and speech multimodal biometrics, in: 2014 IEEE Symposium on Computational Intelligence in Biometrics and Identity Management (CIBIM), IEEE, 2014, pp. 81–86. doi:10.1109/CIBIM.2014.7015447. [4] N. Poh, J. Kittler, S. Marcel, D. Matrouf, J.-F. Bonastre, Model and score adaptation for biometric systems: Coping with device interoperability and changing acquisition conditions, in: 20th International Conference on Pattern Recognition (ICPR), IEEE, 2010, pp. 1229–1232. doi:10.1109/ICPR.2010.306. [5] N. Poh, A. Merati, J. Kittler, Adaptive client-impostor centric score normalization: A case study in fingerprint verification, in: IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems, IEEE, 2009, pp. 1–7. doi:10.1109/BTAS.2009.5339033. [6] R. Giot, C. Rosenberger, B. Dorizzi, Performance evaluation of biometric template update, in: International Biometric Performance Testing Conference, 2012, pp. 1–4. [7] Q. Yu, Y. Yin, G. Yang, Y. Ning, Y. Li, Face and gait recognition based on semi-supervised learning, in: C.-L. Liu, C. Zhang, L. Wang (Eds.), Pattern Recognition, Vol. 321 of Communications in Computer and Information Science, Springer Berlin Heidelberg, 2012, pp. 284–291. doi:10.1007/978-3-642-33506-8_36. [8] A. Rattani, G. Marcialis, F. Roli, Temporal analysis of biometric template update procedures in uncontrolled environment, in: G. Maino, G. Foresti (Eds.), Image Analysis and Processing ICIAP 2011, Vol. 6978 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2011, pp. 595–604. doi:10.1007/978-3-642-24085-0_61. [9] P. Kang, S.-s. Hwang, S. Cho, Continual retraining of keystroke dynamics based authenticator, in: S.-W. Lee, S. 
Li (Eds.), Advances in Biometrics, Vol. 4642 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2007, pp. 1203–1211. doi:10.1007/978-3-540-74549-5_125. [10] P. H. Pisani, A. C. Lorena, A. C. P. L. F. Carvalho, Adaptive positive selection for keystroke dynamics, Journal of Intelligent & Robotic Systems 80 (1) (2015) 277–293. doi:10.1007/s10846-014-0148-0.
[11] R. Giot, C. Rosenberger, B. Dorizzi, Hybrid template update system for unimodal biometric systems, in: 2012 IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems (BTAS), IEEE, 2012, pp. 1–7. doi:10.1109/BTAS.2012.6374539. [12] P. H. Pisani, A. C. Lorena, A. C. P. L. F. Carvalho, Adaptive approaches for keystroke dynamics, in: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, 2015, pp. 1–8. doi:10.1109/IJCNN.2015.7280467. [13] P. H. Pisani, A. C. Lorena, A. C. P. L. F. Carvalho, Adaptive algorithms in accelerometer biometrics, in: 2014 Brazilian Conference on Intelligent Systems (BRACIS), IEEE, 2014, pp. 336–341. doi:10.1109/BRACIS.2014.67. [14] T. Stibor, J. Timmis, Is negative selection appropriate for anomaly detection?, ACM GECCO (2005) 321–328. [15] S. T. Magalhães, K. Revett, H. M. D. Santos, Password secured sites - stepping forward with keystroke dynamics, in: Proceedings of the International Conference on Next Generation Web Services Practices, NWESP ’05, IEEE Computer Society, 2005, pp. 293–298. doi:10.1109/NWESP.2005.62. [16] E. Bailly-Bailliére, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, J.-P. Thiran, The banca database and evaluation protocol, in: Proceedings of the 4th international conference on Audio- and video-based biometric person authentication, AVBPA’03, Springer-Verlag, 2003, pp. 625–638. [17] R. Auckenthaler, M. Carey, H. Lloyd-Thomas, Score normalization for text-independent speaker verification systems, Digital Signal Processing 10 (13) (2000) 42–54. doi:10.1006/dspr.1999.0360. [18] D. Reynolds, Comparison of background normalization methods for text-independent speaker verification, in: 1997 Eurospeech, 1997, pp. 963–966. [19] N. Poh, S. Bengio, F-ratio client-dependent normalisation for biometric authentication tasks, in: 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 
1, IEEE, 2005, pp. 721–724. doi:10.1109/ICASSP.2005.1415215. [20] S. Bengio, J. Mariéthoz, Biometric person authentication is a multiple classifier problem, in: M. Haindl, J. Kittler, F. Roli (Eds.), Multiple Classifier Systems, Vol. 4472 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2007, pp. 513–522. doi:10.1007/978-3-540-72523-7_51.
[21] P. H. Pisani, R. Giot, A. C. P. L. F. Carvalho, A. C. Lorena, Enhanced template update: Application to keystroke dynamics, Computers & Security 60 (2016) 134–153. doi:10.1016/j.cose.2016.04.004. [22] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, G. Holmes, Evaluation methods and decision theory for classification of streaming data with temporal dependence, Machine Learning 98 (3) (2015) 455–482. doi:10.1007/s10994-014-5441-4. [23] S. Bengio, J. Mariéthoz, M. Keller, The expected performance curve, in: International Conference on Machine Learning (ICML), Workshop on ROC Analysis in Machine Learning, 2005, pp. 1–8. [24] P. S. Teh, A. B. J. Teoh, S. Yue, A survey of keystroke dynamics biometrics, The Scientific World Journal (2013) 1–24. doi:10.1155/2013/408280. [25] P. H. Pisani, A. C. Lorena, A systematic review on keystroke dynamics, Journal of the Brazilian Computer Society 19 (4) (2013) 573–587. doi:10.1007/s13173-013-0117-7. [26] P. H. Pisani, A. C. Lorena, A. C. P. L. F. Carvalho, Adaptive algorithms applied to accelerometer biometrics in a data stream context, Intelligent Data Analysis 21 (2) (2017) 353–370. doi:10.3233/IDA-150403. [27] N. Poh, M. Tistarelli, Customizing biometric authentication systems via discriminative score calibration, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2681–2686. doi:10.1109/CVPR.2012.6247989. [28] R. Giot, M. El-Abed, C. Rosenberger, Greyc keystroke: a benchmark for keystroke dynamics biometric systems, in: IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS 2009), IEEE Computer Society, 2009, pp. 419–424. [29] K. Killourhy, R. Maxion, Why did my detector do that?! predicting keystroke-dynamics error rates, in: S. Jha, R. Sommer, C. Kreibich (Eds.), Recent Advances in Intrusion Detection, Vol. 6307 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2010, pp. 256–276. doi:10.1007/978-3-642-15512-3_14. [30] R. Giot, M.
El-Abed, C. Rosenberger, Web-based benchmark for keystroke dynamics biometric systems: A statistical analysis, in: 2012 Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), IEEE, 2012, pp. 11–15. doi:10.1109/IIH-MSP.2012.10.
[31] J. Frank, S. Mannor, D. Precup, Data sets: Mobile phone gait recognition data (2010). URL http://www.cs.mcgill.ca/~jfrank8/data/gait-dataset.html [32] J. R. Kwapisz, G. M. Weiss, S. A. Moore, Activity recognition using cell phone accelerometers, ACM SIGKDD Explorations Newsletter 12 (2) (2011) 74–82. doi:10.1145/1964897.1964918. [33] J. W. Lockhart, G. M. Weiss, J. C. Xue, S. T. Gallagher, A. B. Grosner, T. T. Pulickal, Design considerations for the wisdm smart phone-based sensor mining architecture, in: Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, SensorKDD '11, ACM, 2011, pp. 25–33. doi:10.1145/2003653.2003656.

Figure 1: User cross-validation for biometric data streams evaluation methodology (overview).

Figure 2: Enrollment samples and biometric data stream for user j. The enrollment data (E_j) has N genuine samples. The biometric data stream (DataStream_j) has M genuine samples and K impostor samples.

Figure 3: Sets used to compute the score normalization terms in the user cross-validation evaluation methodology (overview).

Figure 4: Keystroke dynamics - flight time type 1 (figure adapted from [10]).

Figure 5: Score normalization in the adaptive biometric systems context (relative performance). There are 600 observations per keystroke dynamics box plot and 450 observations per accelerometer box plot (each dataset results in 150 observations; keystroke dynamics has four datasets, while there are three datasets for accelerometer biometrics). As these plots are based on balanced accuracy, the higher the values on the horizontal axis, the better. In most cases, score normalization improved performance, although for accelerometer biometrics the benefit of using normalization is not clear.

Figure 6: Score normalization for the adaptive systems (FMR and FNMR absolute performance difference - 1200 observations per box plot). As these plots are based on error rates (FMR and FNMR), the lower the values on the horizontal axis, the better. Overall, the use of score normalization has a higher impact on the FNMR.

Figure 7: Ranking of the biometric systems. The best case occurs when both adaptation and score normalization are used in the system, while using neither results in the worst performance among the possibilities shown in the diagram.
Figure 8: Relative performance: score normalization vs. adaptation. There are 600 observations per keystroke dynamics box plot and 450 observations per accelerometer box plot (each dataset results in 150 observations; keystroke dynamics has four datasets, while there are three datasets for accelerometer biometrics). There are two different classification algorithms for keystroke dynamics, so there is one set of graphs for each one: M2005 and Self-Detector.

Figure 9: Score normalization performance per user - CMU - Self-Detector (Usage Control R). The first plot presents the balanced accuracy per user, while the second plot highlights (in black) the best score normalization for each user.

Figure 10: Score normalization performance per user - CMU - M2005 (IDB). The first plot presents the balanced accuracy per user, while the second plot highlights (in black) the best score normalization for each user.

Table 1: Summary of keystroke dynamics [28, 29, 30] and accelerometer biometrics [31, 32, 33] datasets after pre-processing.

Keystroke                        GREYC               CMU                        GREYC-Web (L)        GREYC-Web (P)
No. of users                     100                 51                         35                   29
No. of samples (avg per user)    67.49               400                        213.26               194.97
Expression                       "greyc laboratory"  ".tie5Roanl" + Enter key   "laboratoire greyc"  "SÉSAME"
No. of characters                16                  11                         17                   6

Accelerometer                    McGill      WISDM 1.1 (Activity Prediction)    WISDM 2.0 (Actitracker)
No. of users                     20          33                                 131
No. of samples (avg per user)    1245.65     180.55                             213.34
Activity                         "walking"   "walking"                          "walking"
Table 2: Biometric systems considered in the experiments.

Biometric system                    Type      Reference
Self-Detector                       Static    [14]
Self-Detector (Usage Control)       Adaptive  [10]
Self-Detector (Usage Control 2)     Adaptive  [12]
Self-Detector (Sliding)             Adaptive  [9, 11, 10]
Self-Detector (Growing)             Adaptive  [9, 11, 10]
Self-Detector (Usage Control R)     Adaptive  [10]
Self-Detector (Usage Control S)     Adaptive  [13]
M2005                               Static    [15]
M2005 (DB)                          Adaptive  [11]
M2005 (IDB)                         Adaptive  [11, 12]