
FEATURE

The great debate: study proves whether people or algorithms are best at facial ID

P Jonathon Phillips

Alice J O’Toole

P Jonathon Phillips, NIST, and Alice J O'Toole, University of Texas at Dallas

Biometric Technology Today, October 2018

Face recognition is a vital and controversial area, with its use by police and government under scrutiny worldwide. Accurate use of the technology is paramount, and in law enforcement the responsibility for identification falls primarily to professional forensic facial experts. Yet despite the societal importance of face ID applications, remarkably little is known about the accuracy of professionals relative to others with no training – and nothing is known about how human experts compare to state-of-the-art face recognition technology.

A recent study we undertook tested the accuracy of forensic facial examiners as they conducted their work in normal forensic work conditions. Their performance was compared to the accuracy of 'super-recognisers', fingerprint examiners and untrained subjects, and to four of the latest face recognition algorithms based on deep learning and convolutional neural networks.

The catalyst for the research was the knowledge that whenever face identification is required in legal and law enforcement scenarios, professional forensic facial examiners are considered the 'gold standard' for accuracy. Yet over the last decade, the assumption that well-trained forensic examiners are highly accurate at face identification has come under close scrutiny. The reason? Numerous cases where people wrongly convicted of crimes have been exonerated by DNA-based evidence, overturning the findings of forensic analysis that was not supported by science.

In response to this spate of DNA exonerations, the US National Research Council (NRC) commissioned a study to review forensic science practices across America. Its resulting report – 'Strengthening Forensic Science in the United States: A Path Forward' – concluded there was "a pressing need for systematic scientific research" and recognised the need to ensure forensic practice is based on scientific evidence.
Our study, 'Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms'1, responds to this call for scientific support for the use of face identification in forensic practice. (The paper, which appeared in Proceedings of the National Academy of Sciences, is open access and can be downloaded for free.)

This is the first study to measure the accuracy of forensic facial examiners performing face identification in normal forensic work conditions. It is also the first research to consider potential 'competitors' to facial examiners in assuring the highest levels of identification accuracy. Indeed, in recent years emerging evidence has suggested that trained human experts may have competition from three sources: super-recognisers, machines and the wisdom-of-crowds. So let's consider each in turn.

"In recent years emerging evidence has suggested that trained human experts may have competition from three sources: super-recognisers, machines and the wisdom-of-crowds"

Super-recognisers are people with superior face recognition ability who have no professional training in this area. These are individuals who possess an innate talent for face identification and who, as a group, have been 'discovered' in numerous studies that use standard tests of face recognition ability. Super-recognisers now contribute to face recognition decisions made in law enforcement, though their accuracy has never been compared systematically to that of forensic examiners.

Since 2014, machine learning approaches to face identification have made impressive strides. The newest algorithms are based on the latest advances in artificial intelligence and are called deep convolutional neural networks (DCNNs). These networks have been at the heart of a revolution in computer-based approaches to face identification. They are trained using millions of face images of thousands of people. Deep convolutional neural networks have significantly advanced face recognition technology and can now recognise faces from highly variable, low-quality images2,3,4,5,6.

The term 'wisdom-of-crowds' refers to combining the judgments of multiple individuals in order to make a decision. Studying the wisdom-of-crowds is in part motivated by the forensic peer review process. It is established that combining human judgments can increase face recognition accuracy. A separate line of investigation has shown that fusing humans and algorithms can improve accuracy7. But the effect of fusing the judgments of professionals and algorithms has not been explored.

Do we need experts?

Our study had two key goals. The first was to measure the ability and accuracy of face specialists, and to compare them to the general population. The second was to identify the most accurate method for face identification – whether it was by using people, machines or both, working alone or in collaboration. We compared the accuracy of forensic facial examiners, forensic facial reviewers and super-recognisers against two control groups – namely, fingerprint examiners and university students.

Using our AUC scoring system (see the 'How the study worked' box for details), we found that the professionals were the most accurate. The median scores for facial examiners, facial reviewers and super-recognisers were 0.93, 0.87 and 0.83 respectively. Statistically, there was no difference in accuracy between the three groups. With a median AUC of 0.76, fingerprint examiners were less accurate than the face specialists. The median AUC of the students was 0.68, which was moderately above chance. In other words, the three face specialist groups were superior to the fingerprint examiners, and fingerprint examiners were better than the students.

So we can now answer the first goal of the study: the trained professionals did significantly better than the fingerprint and student control groups. This result established the superior ability of the trained examiners, thus providing for the first time a scientific basis for their testimony in court.

We then looked at the AUC for all members of each of the five groups. For each participant we computed their score on the test. There was a large range of scores within each group. Remarkably, in all but the student group, at least one individual performed the test with no errors. In addition, there were fingerprint examiners and students whose scores were comparable to the best face specialists. At the other end of the spectrum, all groups contained individuals whose score was below that of the median student. Because all groups contained individuals with ability comparable to the best, it is possible that we could find a large number of people with superior face recognition ability by testing the general population.

Turning to the performance of the algorithms, our results highlight the potential for machines to contribute beneficially to forensic decisions. There was a rapid increase in the performance of all four algorithms over the two years from 2015 to 2017. Algorithm A2015 had the same score as the median of the students (0.68). A2016 performed at the level of the fingerprint examiners (0.76). A2017a performed at a level comparable to super-recognisers (0.85). Finally, the AUC for A2017b was 0.96, which is comparable to facial examiners. In summary, all four algorithms performed at or above the level of students. Two algorithms scored in the range of the facial specialists, and one algorithm matched the performance of the facial examiners.

How the study worked

To provide a comprehensive assessment of human accuracy, we tested three face specialist groups (forensic facial examiners, forensic facial reviewers and super-recognisers) and two control groups (fingerprint examiners and university students).

The two groups of forensic facial professionals are people trained to identify faces in images and videos, using a set of tools and procedures that vary across forensic laboratories. The first group consisted of 57 examiners from five continents. Examiners have extensive training and their identifications involve a rigorous and time-consuming process. Their decisions are described in written documents that support legal actions, prosecutions and expert testimony in court. The second group consisted of 30 reviewers. Reviewers are trained to perform faster and less rigorous identifications that support law enforcement investigations. Reviewers' identifications can assist in generating leads in criminal cases.

We also tested 13 super-recognisers8. For the purpose of our study, a person qualified as a super-recogniser if they met one of two requirements: they were employed professionally as a super-recogniser (eg, in the Super-recogniser Unit at the London Metropolitan Police); or they passed a standard face recognition test at the super-recogniser level9.

Fingerprint examiners are trained forensic professionals who perform fingerprint comparisons. We tested 53 of them, who did not have experience comparing faces. This allowed us to determine whether face identification is a general forensic ability or an ability specific to facial examiners. To represent the general population, we tested 31 university students.

To compare humans with face recognition algorithms, all four DCNNs were tested on the same face images given to humans. The four algorithms were all developed between 2015 and 2017, and we refer to them as A2015, A2016, A2017a and A2017b.

In our study, all five human groups and the algorithms were given the same 20 pairs of face images. The images were taken in classrooms, hallways, atriums and outdoors. They were selected to be very challenging, and each participant's task was to judge whether the faces pictured in each pair showed the same person or different people (the picture opposite shows two example pairs). The humans rated each pair on a seven-point scale that varied from high confidence that the images were of the same person to high confidence that the images were of different people. Facial examiners, reviewers, super-recognisers and fingerprint examiners had three months to complete the test. Each algorithm returned a score that measured the similarity between the two faces.

Face recognition accuracy was measured using the 'area under the receiver operating characteristic' (AUC). The values of AUC range between 0 and 1, where 1 indicates perfect performance, 0 indicates 100% incorrect performance, and 0.5 represents random performance – flipping a coin would have been just as effective.
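As a concrete illustration, the AUC statistic used throughout these comparisons can be computed directly from a participant's ratings: it is the probability that a randomly chosen same-person (mated) pair receives a higher rating than a randomly chosen different-person (non-mated) pair, with ties counted as half. The sketch below uses made-up ratings, not data or code from the study.

```python
def auc(same_scores, diff_scores):
    """Area under the ROC curve, computed as the probability that a
    mated pair outscores a non-mated pair (ties count half)."""
    wins = 0.0
    for s in same_scores:
        for d in diff_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same_scores) * len(diff_scores))

# Hypothetical ratings on the study's seven-point scale
# (+3 = high confidence 'same person', -3 = high confidence 'different').
same = [3, 2, 3, 1, -1, 2, 3, 2, 1, 3]         # mated pairs
diff = [-3, -2, 1, -3, -1, -2, 2, -3, -2, -3]  # non-mated pairs

print(auc(same, diff))  # → 0.93 for this imaginary rater
```

A rater who gives every mated pair a higher rating than every non-mated pair scores 1.0; a coin-flipper lands at 0.5, regardless of how the rating scale is calibrated.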

Humans and machines together

Are these two faces the same person? The study found trained examiners were significantly better at making such judgements, providing for the first time a scientific basis for their testimony in court. Credit: J Stoughton/NIST



Next, we looked at whether collaboration among the humans, through statistical fusion of the scores, could improve performance over people working as individuals. There are a number of methods for combining human judgments. In this study, we investigated combining judgments that were made independently by different individuals. All the participants compared two faces by themselves and returned their judgments to us for analysis. In the analysis phase of the study, we combined these independent judgments by simply averaging the individual judgments.

For all five groups, our fusion method increased accuracy. Fusing four examiners produced a median accuracy of 1.0 – no errors. Fusing three super-recognisers produced the same result. Fusing student judgments increased student accuracy; however, there are limits to the improvement in accuracy achieved by combining participants. We found that fusing 10 students was not as accurate as a median examiner. This suggests that a strategy for achieving optimal accuracy is to fuse people in the most accurate group of humans.

A second method for combining human judgments is social fusion. Here, groups of people make the decision collaboratively while viewing pairs of face images. A 2018 study showed that for two untrained people, social fusion and combining independent judgments were equally effective10.

A third method is peer review, which is a common strategy in facial forensics. In peer review, the first step is for the primary examiner to complete a forensic facial comparison. The results of this comparison are then reviewed by a second examiner, and they discuss any discrepancy in the comparison. In some forensic labs, or for critical cases, there may be additional peer reviews. The effectiveness of peer review needs to be investigated further.
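The independent-judgment fusion used in the study – simply averaging ratings across participants – can be sketched in a few lines. The three raters below are invented for illustration; only the averaging rule comes from the study.

```python
def auc(same_scores, diff_scores):
    """Probability a mated pair outscores a non-mated pair (ties count half)."""
    wins = sum(1.0 if s > d else 0.5 if s == d else 0.0
               for s in same_scores for d in diff_scores)
    return wins / (len(same_scores) * len(diff_scores))

def fuse(ratings_per_rater):
    """Average each face pair's ratings across raters (independent-judgment fusion)."""
    n = len(ratings_per_rater)
    return [sum(col) / n for col in zip(*ratings_per_rater)]

# Three hypothetical raters judging 6 mated and 6 non-mated pairs
# on the seven-point scale (+3 = same person, -3 = different people).
raters_same = [[3, 1, -1, 2, 3, 0], [2, 2, 1, -1, 3, 1], [1, 3, 0, 2, 2, -1]]
raters_diff = [[-3, 0, -2, 1, -3, -1], [-2, -1, 1, -2, -3, 0], [-1, -2, -3, 0, -2, 1]]

individual = [auc(s, d) for s, d in zip(raters_same, raters_diff)]
fused = auc(fuse(raters_same), fuse(raters_diff))

print([round(a, 3) for a in individual])  # → [0.875, 0.903, 0.875]
print(round(fused, 3))                    # → 0.972
```

In this toy example the averaged 'crowd' outscores its best member, because the raters' occasional mistakes fall on different pairs and are diluted by the average – the same mechanism that lets four fused examiners reach a median accuracy of 1.0.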

"The superiority of human-machine fusion over human-human fusion suggests that humans and machines have different strengths and weaknesses. So there is a strong case for combining human and machine judgments to make face identification more accurate in forensic applications"

An important question we examined is whether it is possible to improve face identification accuracy by combining the judgments of expert humans and high-performing machines. This test produced the most surprising finding in our study. Indeed, fusing the most accurate algorithm (A2017b) with a single facial examiner increased accuracy substantially. For facial examiners, the median accuracy increased from 0.93 to 1.0, a perfect score. This combination of a human and a machine yielded higher accuracy than the combination of two forensic facial examiners. Similar results were observed for facial reviewers and super-recognisers.

The superiority of human-machine fusion over human-human fusion suggests that humans and machines have different strengths and weaknesses that can be combined more effectively than the strengths and weaknesses of two examiners (which may be similar). Thus, there is a strong case for combining human and machine judgments to make face identification more accurate in forensic applications.

Where do we go from here?

The challenge now is to translate what we have learned about combining examiners and machines into forensic practice. In forensic applications, examiners are required to write detailed reports that explain their identification decisions. For the algorithms in our study, a detailed explanation of a face identification judgment is not possible. Indeed, the result of a machine comparison between two images is just a number that indicates the machine's estimate of the similarity between two faces.

The issue can be summed up like this: when the machine and the human disagree on an identification, which judgment should be believed? The examiner produces an explanation; the machine produces only a number. To overcome this problem, methods need to be developed that allow an algorithm to explain how its decision was derived.

In our study's fusion experiments, algorithms and humans were given equal weight. Because of the wide range of accuracy for individual facial examiners, algorithm A2017b was substantially more accurate than at least some of the examiners. For these low-performing examiners, the best decision may be to rely on the algorithm's decision. For other examiners, the weight given to their judgment will be between zero and one. Investigating methods for finding the best weights for combining humans and machines is a future avenue of investigation.
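A weighted fusion of the kind proposed above can be sketched as follows. Scores are first standardised so that an examiner's seven-point ratings and an algorithm's similarity scores live on a comparable scale; a weight w then sets how much the human judgment counts. All function names and data here are hypothetical – the study itself tested only the equal-weight case (w = 0.5).

```python
def zscore(xs):
    """Standardise scores so ratings and similarity values are comparable."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

def weighted_fusion(human, machine, w):
    """w = 1.0 trusts the examiner alone, w = 0.0 the algorithm alone.
    The study's fusion experiments correspond to w = 0.5."""
    return [w * h + (1 - w) * m for h, m in zip(zscore(human), zscore(machine))]

def auc(same_scores, diff_scores):
    """Probability a mated pair outscores a non-mated pair (ties count half)."""
    wins = sum(1.0 if s > d else 0.5 if s == d else 0.0
               for s in same_scores for d in diff_scores)
    return wins / (len(same_scores) * len(diff_scores))

# Hypothetical scores for 4 mated then 4 non-mated pairs: an examiner's
# seven-point ratings and an algorithm's similarity scores (its own scale).
human = [3, -1, 2, 1, -3, 1, -2, -1]
machine = [0.91, 0.62, 0.55, 0.88, 0.20, 0.31, 0.70, 0.12]

fused = weighted_fusion(human, machine, w=0.5)
print(auc(human[:4], human[4:]))      # → 0.875 (examiner alone)
print(auc(machine[:4], machine[4:]))  # → 0.875 (algorithm alone)
print(auc(fused[:4], fused[4:]))      # → 1.0 (fused)
```

In this made-up example the examiner and the algorithm err on different pairs, so even equal weighting repairs both mistakes; in practice one would tune w on validation data, per examiner, and keep the value that maximises AUC.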

"It is well-established that people are better at recognising faces from their own race than faces from different races. A similar effect has been established for algorithms"

For all three face specialist groups, and for the fusion of humans and machines, there were numerous perfect or near-perfect scores. This strongly suggests there is the potential for highly accurate facial identification under more challenging and varied conditions. These conditions include identification across changes in pose and lighting, from blurry images and in video. A goal for future studies would be to find the upper limits of highly accurate identification. They could also investigate accuracy on a wider range of face recognition tasks, including recognition across viewpoint and with low-quality images and video, as well as recognition of faces from diverse demographic categories.

It is well-established that people are better at recognising faces from their own race than faces from different races. A similar effect has been established for algorithms11. For facial examiners and super-recognisers, the effect of race and other demographics has not been measured; because of their role in law enforcement and society, it is necessary to measure and understand these effects.



We found the accuracy of super-recognisers and face examiners did not differ statistically, despite the fact that these two groups have fundamentally different experiences and background knowledge. Face examiners receive extensive training and complete an apprenticeship that can span two years or more. At the end of the apprenticeship, the examiner is qualified to perform detailed comparisons and to write reports that articulate the reasons for their decisions. The ability of super-recognisers is present without the benefits of training, and they are not required to justify their decisions. This suggests that both talent and training may underlie the high accuracy seen in the two groups of facial professionals.

Training and memory

Looking ahead, one of the critical open questions is the role of training in the performance of examiners. We still do not know whether or how training and on-the-job experience affect accuracy, the use of the rating scale, and the ability to write credible reports.

In law enforcement, super-recognisers identify people that they know from their beat. This is a memory task – whereas face examiners perform comparison, or matching, tasks. So does a face examiner with superior ability in comparing faces also have superior ability in remembering faces? And vice versa? The evidence in the literature points to the finding that superior ability in one of these tasks does not guarantee superior ability in the other. Knowing the answer to this question is important when giving face examiners or super-recognisers assignments.

"For all humans, including the face specialists, there is a wide range of accuracy between individuals, with all the groups containing participants whose accuracy is comparable to the best, and participants whose accuracy is moderately above random"

For all humans, including the face specialists, there is a wide range of accuracy between individuals, with all the groups containing participants whose accuracy is comparable to the best, and participants whose accuracy is moderately above random. We have examined two possible explanations for this. First, face recognition has many aspects, and a participant's particular area of expertise (eg, remembering faces) may not have been covered by our study. Second, a participant may simply not have had superior ability. One way to resolve this would be to develop a battery of tests to measure face recognition ability. These tests would represent different aspects of superior ability: each test would look at a different aspect of face recognition ability and determine whether a person had superior ability in that aspect.

In summary, the study is the most comprehensive examination to date of the face identification performance of humans with superior ability. We compared the accuracy of state-of-the-art face recognition algorithms to humans, and found that the most accurate face identification performance came from statistical collaborations between the 'best humans' (professional forensic facial examiners) and the best (ie, most recent) algorithm.

These results give us an evidence-based roadmap for increasing face identification accuracy in critical applications. They demonstrate the benefits of combining the judgments of humans and machines, and establish the importance of allowing humans and machines to work together in statistical collaborations that take advantage of the strengths of each 'system'. Finally, we have outlined a roadmap for future investigations: this roadmap will provide answers to critical questions for understanding the skills of forensic facial examiners, super-recognisers and modern AI-based face recognition algorithms.

About the authors

Dr P Jonathon Phillips is an electronic engineer at the US National Institute of Standards and Technology's Information Technology Laboratory. He is a leading researcher in computer vision, face recognition, biometrics and forensics. Jonathon pioneered the development of a number of competitions in these areas, including the Iris Challenge Evaluations (ICE), the Face Recognition Vendor Test (FRVT) 2002 and 2006, the Face Recognition Grand Challenge and FERET. He won the inaugural IEEE Mark Everingham Prize and is a Fellow of the IEEE and the IAPR. He received his PhD in operations research from Rutgers University.

Alice J O'Toole is a Professor in the School of Behavioral and Brain Sciences at the University of Texas at Dallas. Her research interests include human perception, memory and cognition. In 2007, she was named Aage and Margareta Moller Endowed Professor. She currently serves as an Associate Editor of Psychological Science and the British Journal of Psychology, and served as Program Chair of the 2017 IEEE Meeting on Automatic Face and Gesture Recognition. Alice received a BA in Psychology (1983) from The Catholic University of America, Washington, DC, and an MS (1985) and PhD (1988) in Experimental Psychology from Brown University, Providence, RI.

References

1. P J Phillips et al. 'Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms'. Proceedings of the National Academy of Sciences, 2018: 201721355. http://www.pnas.org/content/early/2018/05/22/1721355115.
2. O M Parkhi, A Vedaldi and A Zisserman. 'Deep face recognition'. Proceedings of the British Machine Vision Conference, 2015. https://www.robots.ox.ac.uk/~vgg/publications/2015/Parkhi15/parkhi15.pdf.
3. J-C Chen, V M Patel and R Chellappa. 'Unconstrained face verification using deep CNN features'. IEEE Winter Conference on Applications of Computer Vision (WACV), 2016. https://arxiv.org/abs/1508.01722.
4. R Ranjan, S Sankaranarayanan, C D Castillo and R Chellappa. 'An all-in-one convolutional neural network for face analysis'. 12th IEEE International Conference on Automatic Face and Gesture Recognition, 2017. https://arxiv.org/abs/1611.00851.
5. R Ranjan, C D Castillo and R Chellappa. 'L2-constrained softmax loss for discriminative face verification'. arXiv:1703.09507, 2017. https://arxiv.org/abs/1703.09507.
6. Y Taigman, M Yang, M Ranzato and L Wolf. 'Deepface: closing the gap to human-level performance in face verification'. IEEE Conference on Computer Vision and Pattern Recognition, 2014. https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf.
7. A J O'Toole, H Abdi, F Jiang and P J Phillips. 'Fusing face recognition algorithms and humans'. IEEE Transactions on Systems, Man and Cybernetics Part B, 37:1149-1155, 2007. https://www.utdallas.edu/~herve/abdi-oajp07.pdf.
8. R Russell, B Duchaine and K Nakayama. 'Super-recognisers: people with extraordinary face recognition ability'. Psychonomic Bulletin & Review, 16(2):252-257, 2009. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3904192/.
9. E Noyes, P J Phillips and A J O'Toole. 'What is a super-recogniser?'. In Face Processing: Systems, Disorders and Cultural Differences, Nova, New York, NY, USA, 2017.
10. G Jeckeln, C A Hahn, E Noyes, J G Cavazos and A J O'Toole. 'Wisdom of the social versus non-social crowd in face identification'. British Journal of Psychology, 2018. https://www.ncbi.nlm.nih.gov/pubmed/29504118.
11. P J Phillips, F Jiang, A Narvekar, J Ayyad and A J O'Toole. 'An other-race effect for face recognition algorithms'. ACM Transactions on Applied Perception (TAP), 8(2), p14, 2011. https://ws680.nist.gov/publication/get_pdf.cfm?pub_id=906254.
