Physiology & Behavior 95 (2008) 333–340
Combining physiological measures in the detection of concealed information
Matthias Gamer a,⁎, Bruno Verschuere b,⁎, Geert Crombez b, Gerhard Vossel c
a University Medical Center Hamburg-Eppendorf, Germany
b Ghent University, Belgium
c Johannes Gutenberg-University Mainz, Germany
ARTICLE INFO
Article history: Received 29 April 2008; received in revised form 12 June 2008; accepted 23 June 2008
Keywords: Polygraph test; Concealed information test; Multiple measures; Classification; Validity
ABSTRACT
Meta-analytic research has confirmed that skin conductance response (SCR) measures have high validity for the detection of concealed information. Furthermore, cumulating research has provided evidence for the validity of two other autonomic measures: heart rate (HR) and respiration line length (RLL). In the present report, we compared SCR detection efficiency with HR and RLL, and investigated whether HR and RLL provide incremental validity to electrodermal responses. Analyses were based on data from 7 different samples covering 275 guilty and 53 innocent examinees. Results revealed that the area under the ROC curve was significantly higher for SCR than for HR and RLL. A weighted combination of these measures using a logistic regression model yielded slightly larger validity coefficients than the best single measure. These results proved to be stable across different protocols and various samples.
© 2008 Elsevier Inc. All rights reserved.
In the first empirical demonstration of the Concealed Information Test (CIT), participants were confronted with a set of questions concerning a mock crime that they had committed [1]. For example, participants in the theft scenario were asked whether they knew where the thief had hidden the stolen goods (Was it in the bathroom? On the coat rack? In the locker? On the windowsill? In the office?). It was reasoned that recognition of the correct answer by guilty participants would result in a bodily response (we use the term “guilty” throughout the entire manuscript to indicate that examinees were able to distinguish relevant from irrelevant items in the CIT examination). Using skin conductance as the sole response measure in the above-mentioned study, 100% of the innocent examinees and 88% of the guilty examinees were correctly classified. Ever since, many studies have examined the validity of electrodermal measures for the detection of concealed information. A quantitative review on the Concealed Information Test has shown that electrodermal measures allow for a correct classification of about 76% of guilty and 83% of innocent examinees [2]. However, these percentages depend on a single, arbitrary cutoff, meaning that they can vary depending on where the cutoff is set. Alternatively, signal-detection measures have been used to give an estimate of an instrument's validity across all possible cutoff points [3]. Both derived statistics of this recent meta-analysis, the effect size d [4] and the area under the receiver
⁎ Corresponding authors. M. Gamer is to be contacted at Department of Systems Neuroscience, University Medical Center Hamburg-Eppendorf, Martinistr. 52, 20248 Hamburg, Germany. Tel.: +49 40428037160; fax: +49 40428039955. B. Verschuere, Department of Psychology, Ghent University, Henri Dunantlaan 2, B-9000 Ghent, Belgium. Tel.: +32 92648622; fax: +32 92646489. E-mail addresses:
[email protected] (M. Gamer),
[email protected] (B. Verschuere). 0031-9384/$ – see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.physbeh.2008.06.011
operating curve (see Results section), confirmed the high validity of electrodermal measures for the detection of concealed information. Although skin conductance has been investigated most extensively, researchers have also examined the validity of several other measures: behavioral, as well as measures of the central and the autonomic nervous system. Several studies have examined, with variable success, whether reaction-times can be used to detect concealed information [5,6]. More consistent results have been obtained with brain potentials recorded at the scalp, such as the P300. The P300 is a positive deflection in the EEG that appears approximately 300 ms after stimulus presentation, and is related to the relevance of the stimulus. It has been repeatedly demonstrated that the amplitude of the P300 to concealed information is larger than that to control information [7–9]. The measurement of the P300, however, requires several adjustments to the standard Concealed Information Test. For example, a very high number of item presentations (typically several hundreds) is needed for reliable recording of the P300. These methodological aspects hinder the concurrent measurement of electrodermal measures and the P300 because skin conductance amplitude shows a gradual decline throughout repeated presentation. The concurrent measurement of autonomic measures does not encounter this problem. Several response measures have been looked at (e.g., pupil dilation, finger pulse amplitude, and blood pressure), but heart rate and respiration have been examined most extensively. For respiration, depth and rate of breathing are usually combined in a single measure called respiration line length (RLL) [10]. For heart rate, the greatest decline in a certain time period (often 15 s) has been repeatedly used [11]. However, inconsistent results regarding these measures were obtained. For example, in a Concealed Information Test administered after a mock theft procedure in 20 male community
M. Gamer et al. / Physiology & Behavior 95 (2008) 333–340
volunteers, heart rate and respiration did not discriminate guilty from innocent examinees, but skin conductance alone allowed correct classification of 90% of the examinees [12]. A second study also found that skin conductance, but not heart rate, was highly valid in the detection of concealed mock crime information [13]. Some years later, significant detection rates for both measures, skin conductance (74%) and heart rate (64%) were reported [11]. Using only guilty examinees, it was recently found that skin conductance, but also respiration and heart rate differed between relevant and irrelevant CIT items [14]. From these data, it is clear that skin conductance outperforms heart rate and respiration in the detection of concealed information. Nonetheless, differential heart rate and respiratory responses were also reported and may contribute to the detection efficiency of electrodermal responses. The validity of a combination of response measures was investigated in a number of studies. Most studies examined the combined validity of skin conductance and respiration line length [15–18]. These studies found that combining respiration line length and skin conductance resulted in better detection efficiency than skin conductance alone. In a field study, for example, it was found that 75% of the examinees could be correctly classified using electrodermal measures and respiration line length, respectively, but the combination measure increased classification accuracy to 85% [18]. On the other hand, a number of studies failed to find incremental validity for the respiration line length [19,20]. In a very recent study [20], an autobiographical Concealed Information Test was administered in prisoners and community volunteers while measuring skin conductance, heart rate, and respiration line length. Although detection efficiency of heart rate and respiration line length was significantly above chance, both measures did not add to skin conductance validity. 
However, in most studies, physiological responses were combined by simply averaging them, which may not be optimal. By contrast, a weighted additive combination of different physiological channels was suggested by research on a different questioning technique of forensic psychophysiology, the Comparison Question Test (CQT) [21]. This approach was recently transferred to the CIT and a logistic regression equation was proposed to combine skin conductance, heart rate, and respiration line length for the detection of concealed information [22]. Blood pressure, which was shown to be valid in differentiating guilty and innocent examinees when using the CQT [23], was not included in the classification function because it had no validity in the CIT. The resulting model assigned comparable weights to skin conductance and respiration line length, and a somewhat lower weight to heart rate. Using this model, 93% of the guilty and 97% of the innocent examinees were correctly classified. Testing a model on the data it was built on, however, often leads to higher validity coefficients than cross-validation on new data [24]. The present report aimed at testing the validity of the classification function on a new set of data. In a first study, we tested the model on a conceptual replication of the original study [22]. Next, we applied the model to data from 6 studies that comprised different samples (students, community volunteers, and prisoners) and different experimental protocols (autobiographical and mock crime paradigm). The present report aimed at comparing the validity of skin conductance, respiration line length, and heart rate in the detection of concealed information. Moreover, it was examined whether a weighted combination of these measures would lead to larger validity coefficients than each single measure.
1. Conceptual replication study
1.1. Method
1.1.1. Participants
Sixty German men participated in this study at the University of Mainz.
Informed consent was obtained from each participant prior to the experiment, which was carried out according to the Declaration of Helsinki. It was indicated that participation was voluntary and that each participant could withdraw from the experiment at any time. The mean age of the sample was 24.3 years (SD = 3.2 years) with a range of
19 to 36 years. Most of the participants were students of different fields. They were promised a reward of 10 € for being classified as innocent in the CIT examination. The 17 psychology students who participated in the experiment additionally received course credit.
1.1.2. Instruments
Skin conductance, thoracic and abdominal respiration were registered by the Computerized Polygraph System (CPS, Stoelting Company) [21]. Skin conductance was measured by a constant voltage system (0.5 V) using a bipolar recording with two Hellige Ag/AgCl electrodes (surface area = 1 cm²) filled with 0.05 M NaCl electrolyte. The electrodes were placed on the thenar and hypothenar surfaces of the participant's left hand. Respiration was recorded by two piezoelectric Pneumotrace II transducers attached around the chest and the abdomen with Velcro straps. Electrocardiogram (ECG) measurement was accomplished by two Hellige Ag/AgCl electrodes filled with electrode paste and attached to the manubrium sterni and the left lower rib cage. The reference electrode was placed at the right lower rib cage. ECG data were registered by the Varioport device (Vitaport system, Becker Meditec) with a sampling rate of 512 Hz. The measurement was conducted in an air-conditioned, sound-attenuated chamber with participants seated in a semi-reclining chair. All recording and programming equipment was located outside the chamber, but the participants could be observed via a video system. A conventional personal computer controlled the stimulus presentation and the timing of the measurement equipment.
1.1.3. Design
A mock crime procedure was used and participants were randomly allocated to one of the following two conditions: (a) 30 guilty examinees performed a mock theft and acquired knowledge of six critical details during this simulated offence; (b) 30 innocent examinees carried out a specific instruction in the same building, but remained ignorant of the relevant details of the mock crime scene.
1.1.4.
Procedure On the arrival for the experimental session, written informed consent was obtained and participants were randomly assigned to one of two experimental conditions: participants in the guilty condition were instructed to enter an open access departmental library, where they should find a cloth bag hanging at a hat stand beneath the librarian's desk. The bag contained a red sweater with a small key in its front pocket. An eye-catching heart shaped pendant was affixed to the key that had to be stolen from the library. Afterwards, participants were told to find a laboratory room on a different floor of the building. Inside this room, they had to unlock a cash box and steal everything they would find inside (20 € and a wristwatch). Finally, participants were instructed to return to the examination room. During the course of the mock crime, they became familiar with six critical details (printed in italics above) that were used in the subsequent CIT examination. Participants in the innocent condition were also instructed to enter the library, but they were told to find a specific book and briefly memorize the contents of one particular page. Having completed this instruction they should return to the examination room. It took approximately the same amount of time to accomplish the instructions of both conditions (guilty and innocent). Prior to the interrogation, all participants were informed that an important key was stolen from the library. They were told to be suspects of this offence because someone noticed that they visited the library during the last hour. All participants were encouraged to convince a condition-blind polygraph examiner of their innocence and a financial reward of 10 € was offered for a successful performance on the test. Then they were brought to the laboratory, where the experimenter attached the polygraph devices and conducted the examination. All
participants were interrogated using a CIT with six multiple choice questions. Each question consisted of one buffer item following the presentation of the question, four irrelevant items and one crime related relevant item. The position of the relevant item within each question was randomly determined but remained constant across participants. The multiple-choice questions were presented as pre-recorded audio samples along with corresponding pictures that were displayed for 8 s on a 17'' screen placed 120 cm in front of the participant at eye height. The inter-stimulus interval amounted to approximately 22 s and the examinees were requested to immediately respond “no” to every item. After the examination, all participants received a provisional test result based on their skin conductance and respiration recordings. They were paid 10 € when they had successfully concealed crime related knowledge and were finally debriefed. 1.1.5. Response scoring and analysis The scoring of all physiological measures was similar to [22] to allow for an optimal cross-validation of the classification function that was proposed in this recent study. 1.1.5.1. Electrodermal responses. The artifact free amplitude of the largest skin conductance increase that occurred between 0.5 s and 10.5 s after item onset was calculated by the CPS software [21]. 1.1.5.2. Respiration. Respiration line length (RLL) during the interval 0 to 10 s following item onset was calculated by the CPS software for each respiration channel [21]. 1.1.5.3. Heart rate. First, R-waves were detected from the ECG data and R–R intervals were converted to HR (in beats per minute). Afterwards, a second-by-second sampling was applied [25] resulting in one HR value per time epoch (i.e., 1 s). The HR in the last second prior to item onset represented the prestimulus baseline. Poststimulus difference scores (ΔHR) were derived by subtracting the prestimulus baseline value from the HR-score of each poststimulus-second. 
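The baseline correction used for heart rate can be illustrated with a minimal Python sketch. It assumes that HR has already been resampled to one value per second, with the first array element holding the last prestimulus second. The function name `score_heart_rate`, the array layout, and the example values are hypothetical. The largest deceleration is taken here as the most negative ΔHR, which is an assumption about the scoring rather than a detail stated in the text.

```python
import numpy as np

def score_heart_rate(hr_bpm, window_s=15):
    """Baseline-correct second-by-second HR values for one CIT item.

    Assumed layout (hypothetical): hr_bpm[0] is the HR in the last second
    before item onset, hr_bpm[1:] are the poststimulus seconds.
    """
    hr = np.asarray(hr_bpm, dtype=float)
    baseline = hr[0]
    # Poststimulus difference scores: HR of each second minus the baseline.
    delta_hr = hr[1:window_s + 1] - baseline
    return {
        "delta_hr": delta_hr,
        # Assumption: "largest deceleration" = most negative difference score.
        "max_deceleration": float(delta_hr.min()),
        "mean_change": float(delta_hr.mean()),
    }

# Made-up HR values (beats per minute): 1 baseline second + 15 poststimulus seconds.
example = [70, 69, 67, 66, 68, 70, 71, 70, 70, 69, 70, 71, 70, 70, 70, 70]
scores = score_heart_rate(example)
```

In this made-up series the HR drops by at most 4 beats per minute below baseline, so both summary values are negative, which is the direction expected for relevant items in guilty examinees.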
Finally, two HR measures were derived: the largest HR deceleration within 15 s following item onset [22], and the average of all ΔHR values [20].
1.1.5.4. Application of the classification functions. First, standard difference scores between relevant and irrelevant items were calculated according to the procedure described recently [22]. Within each participant, the responses to each item were z-standardized based on the mean and the standard deviation of the responses to all irrelevant items of the CIT. Difference scores between the response to the relevant item and the mean of the four irrelevant items within each of the six multiple-choice questions were calculated and the mean of these measures was computed as an overall index of the differential responsivity in each physiological measure. The resulting values will be considered in the following analyses as estimates of the corresponding validity coefficients. Guilty participants should have positive (in the case of electrodermal responses) or negative values (for respiratory and HR responses) because they are supposed to show larger SCRs, stronger respiratory suppression and HR deceleration to relevant as compared to irrelevant items. Innocents, on the other hand, should not respond systematically to the different item types and thus produce standard difference scores around zero. In a second step, the proposed logistic regression model was applied [22]. Both respiration channels were averaged during this step. The respective weights that were assigned to each measure are displayed in Table 1. Because some studies successfully relied on the mean HR change following stimulus onset instead of the largest HR deceleration [14,20], we calculated a second classification function using the original data of the model-building study [22]. Standard difference scores of SCR, mean of thoracic and abdominal RLL and the mean HR change of all 15 poststimulus seconds were included in this new logistic regression equation.
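The two scoring steps described above (within-participant standardization followed by the logistic regression equation) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, buffer items are assumed to have been removed beforehand, the sample standard deviation (ddof = 1) is assumed, and the output of the logistic function is interpreted as a score that increases with the likelihood of guilt. The coefficients are those listed in Table 1 for the function using the largest HR deceleration.

```python
import numpy as np

# Coefficients copied from Table 1 (function using largest HR deceleration).
COEF = {"scr": 3.91, "rll": -6.90, "hr": -1.71, "const": -3.81}

def standard_difference_score(responses, is_relevant, question_ids):
    """Mean z-standardized relevant-minus-irrelevant difference for one
    participant and one physiological measure (buffer items excluded)."""
    r = np.asarray(responses, dtype=float)
    rel = np.asarray(is_relevant, dtype=bool)
    q = np.asarray(question_ids)
    irrelevant = r[~rel]
    # Standardize on the mean and SD of all irrelevant items.
    # Assumption: sample SD (ddof=1); the source does not specify this.
    z = (r - irrelevant.mean()) / irrelevant.std(ddof=1)
    diffs = [z[(q == k) & rel].mean() - z[(q == k) & ~rel].mean()
             for k in np.unique(q)]
    return float(np.mean(diffs))

def classification_score(scr, rll, hr, coef=COEF):
    """Logistic combination of the three standard difference scores."""
    logit = (coef["const"] + coef["scr"] * scr
             + coef["rll"] * rll + coef["hr"] * hr)
    return float(1.0 / (1.0 + np.exp(-logit)))

# Made-up responses for two questions, each with 4 irrelevant and 1 relevant item.
resp = [1, 2, 3, 4, 6, 1, 2, 3, 4, 6]
rel = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
qid = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
scr_score = standard_difference_score(resp, rel, qid)
```

With all three difference scores at zero, as expected on average for innocent examinees, the score is driven by the constant alone and stays close to 0; positive SCR scores combined with negative RLL and HR scores push it toward 1.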
As can be seen from Table 1, the detection efficiency of this classification function
proved to be largely similar to the original function but we also applied the resulting weights to the data from the current study to compare the detection efficiency of both functions on new data. It is important to note that both classification functions that are displayed in Table 1 were fitted to the data of the same published study [22]. Thus, when applying these weights to the data of the current study, both functions can be crossvalidated using an independent sample of participants. 1.2. Results and discussion In a first step, standard difference scores of each single measure and the classification scores of both logistic regression equations were compared between both groups using t-tests. As can be seen from Table 2, guilty and innocent participants differed significantly for each measure. Moreover, the expected response pattern of guilty participants consisting of positive values for skin conductance and negative scores for respiratory and HR responses was confirmed. Thus, guilty examinees showed larger skin conductance responses and heart rate decelerations combined with a stronger respiratory suppression to relevant as compared to irrelevant CIT items. Innocent examinees responded in a non-systematic way to both these item types as indicated by their average values that vary around 0. To examine the diagnostic utility of each single measure and to compare these values to the classification functions, we computed separate ROC curves and estimated the area under the curves as a measure of detection efficiency [3]. Before the computation, the sign of the RLL and HR measures was reversed in order to score into the same direction as all other measures. The area statistic varies between 0 and 1; an area of 0.5 can be understood as random classification. In the context of the current study, values close to 1 could be interpreted as indicating high validity for the detection of concealed information. 
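The area statistic can be computed without choosing a cutoff, because it equals the probability that a randomly chosen guilty examinee obtains a higher score than a randomly chosen innocent examinee (ties counted as one half). The following Python sketch shows this under the assumption that one score per examinee is available; the function name `roc_area` and the example values are hypothetical, and the confidence intervals [26] and area comparisons [27] used in the study are not reproduced here.

```python
import numpy as np

def roc_area(guilty, innocent):
    """Area under the ROC curve as the proportion of guilty/innocent pairs
    in which the guilty score is higher (ties count one half)."""
    g = np.asarray(guilty, dtype=float)[:, None]    # column vector
    i = np.asarray(innocent, dtype=float)[None, :]  # row vector
    wins = (g > i).sum()
    ties = (g == i).sum()
    return float((wins + 0.5 * ties) / (g.size * i.size))

# Made-up RLL difference scores; guilty examinees tend toward negative values.
rll_guilty = np.array([-0.5, -0.2, -0.6])
rll_innocent = np.array([0.1, -0.3, 0.0])

# Reverse the sign so that larger values indicate guilt, as for SCR.
area_rll = roc_area(-rll_guilty, -rll_innocent)
```

An area of 0.5 results when both groups have identical score distributions, which corresponds to the chance level mentioned above.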
In addition to this descriptive value, 95% confidence intervals for the area statistic were computed [26]. As can be seen from Table 2, the area statistics of all single measures differed significantly from a chance area of 0.5. The validity of skin conductance responses significantly exceeded the area statistics of respiratory responses (mean RLL), z = 2.00, p < .05, maximal HR deceleration, z = 3.72, p < .001, and mean HR, z = 3.42, p < .001 (statistically compared according to the procedure described in [27]). Both classification functions similarly achieved very high validity coefficients resembling the values from the original study [22] (cf. Table 1). Results indicate that both classification functions could be effectively used to combine several physiological measures into one highly valid predictor of an examinee's truth status. It has to be admitted, however, that the current cross-validation study was conducted under very similar conditions as the original study [22]. A mock crime procedure was used in both settings, the CIT was similar with respect to the number of questions and items and the same equipment was used for measuring the physiological responses. Thus, it would be interesting to see whether these results generalize across different settings, samples and CIT variations.
2. Application of the cross-validation functions on data from other studies
2.1. Method
2.1.1. Samples
Data from 6 studies were used to estimate the validity of both classification functions in different contexts. There was no participant overlap between these studies. For full procedural details, we refer the reader to the published reports of these studies [14,20,28,29]. All these studies were approved by the ethical committee of Ghent University. We successively number the studies starting from 2, because the corresponding results will be pooled with the conceptual replication study described above which is referred to as study 1.
Study 2: in this study [14], 27 (mostly female) guilty undergraduates were examined using an autobiographical CIT with 4 relevant items:
Table 1
Coefficients of both logistic regression equations for the prediction of the truth status and the area under the ROC-curves with corresponding 95% confidence intervals (CI)
Classification function | SCR | RLL | HR | (Constant) | ROC area (95% CI)
Function using largest HR deceleration | 3.91 (1.27) | −6.90 (2.38) | −1.71 (0.83) | −3.81 (1.13) | 0.97 (0.94, 1.00)
Function using mean HR | 4.24 (1.38) | −6.31 (2.18) | −1.97 (0.88) | −3.92 (1.11) | 0.97 (0.93, 1.00)
Note. Entries for SCR, RLL, HR, and the constant are regression coefficients with corresponding standard errors in parentheses. SCR = skin conductance response, RLL = respiration line length, HR = heart rate. Functions were estimated using the data of [22].
First name, last name, first name of the father, and first name of the mother. Participants were asked to beat the polygraph test by hiding recognition of personal information. All items were presented on a computer screen for 6 s each with an inter-stimulus interval ranging from 26 to 30 s. The polygraph test in this study served to answer theoretical questions, and thereby deviated from the standard test format by using a 1 by 1 proportion of relevant and control stimuli. Moreover, no motivational instructions were given and participants did not respond verbally. The whole question set consisting of 4 relevant and 4 control items was presented twice. Because of substantial habituation effects, we narrowed the cross-validation analysis to the first block of the experiment (8 items; 4 personal and 4 control). Study 3: forty guilty male inmates of a maximum security prison in Belgium were examined using the setup described in study 2. A full description of the study is given by [14]. As in study 2, the crossvalidation analysis was confined to the first block of the experiment. Study 4: in this study [20], 31 guilty male community volunteers were examined using an autobiographical CIT. Participants were asked to beat the polygraph test by hiding recognition of personal information. The CIT consisted of five question sets (i.e., first name, last name, first name of the mother, first name of the father, and birthday). Several conditions that are known to optimize detection efficiency were realized [3]. A 1 by 4 proportion of relevant and control stimuli were used. Thus, each question set started with one buffer item that was not included in the analyses (e.g., Is your name GEERT?), followed by four control items (e.g., Is your name MATTHIAS?), and one personal item (e.g., Is your name BRUNO?), resulting in a total of 5 personal and 20 control items. The position of the personal item within each question set was randomly determined. 
Moreover, participants were instructed to respond deceptively (“no”) to all items. Finally, a monetary incentive of 7.50 € was given for appearing innocent. Because of technical problems, the data of 1 participant were lost, reducing the sample size to 30 participants. Study 5: forty-eight guilty male inmates of a maximum security prison in Belgium were examined using an identical setup as study 4 [20]. A monetary incentive of 4 € was given for appearing innocent. Study 6: seventy-seven guilty male community volunteers were examined using a pictorial mock crime CIT [28]. Participants watched a video of a mock crime. They were asked to pay close attention to the video and to try to imagine that they were the person in the video (for a similar procedure see [30]). There were two mock crime scenarios (theft of 10 € or theft of a camera), counterbalanced across individuals. Participants were asked to try to appear innocent in the polygraph examination by trying to hide recognition of the crime details. The CIT consisted of five question sets, each having one buffer item that was not included in the analyses (e.g., a necklace), one crime related relevant item (e.g., 10 €) and four control items (e.g., a cell phone, a digital camera, a bank card, a watch). All items were presented on a computer screen for 6 s, with an inter-stimulus interval of 25 s. Participants were instructed to respond deceptively (“no”) to all items, and a monetary incentive of 7.50 € was given for appearing innocent. Study 7: in this study [30, Experiment 2], 23 guilty and 23 innocent (mostly female) undergraduates were examined using a mock crime CIT under similar experimental conditions as in study 6. Guilty participants acquired crime related knowledge by watching a video of a mock crime, and innocent participants watched a videotape of someone walking through a building. There were five mock crime scenarios (theft of digital
camera, cell phone, 100 €, MP3-player, or jewels), counterbalanced across individuals. Participants were asked to try to appear innocent in the polygraph examination by hiding recognition of the crime details. The number of question sets and items was similar to study 6. All items were presented on a computer screen for 6 s, with an inter-stimulus interval of 25 s. Participants were instructed to respond deceptively (“no”) to all items. Motivational instructions aimed at self-esteem were given, informing participants that only intelligent people can cope with the polygraph [3].
2.1.2. Response scoring and analysis
2.1.2.1. Electrodermal responses. In studies 2 and 3, skin conductance amplitude was calculated by subtracting the skin conductance of the peak onset (i.e., the lowest value in the 0.5–4 s window after stimulus onset) from the peak amplitude (i.e., the highest value in the 0.5–4 s after peak onset). A baseline (one second before stimulus onset) to peak (maximal skin conductance amplitude in 0.5 to 5 s after stimulus onset) difference method was used in the other studies.
2.1.2.2. Respiration. Total respiration line length (RLL) was calculated for 8 s following item onset in studies 2 and 3, and for 15 s in the other studies.
2.1.2.3. Heart rate. Heart rate, derived from the ECG, was calculated for 8 s following item onset in studies 2 and 3, and for 15 s in the other studies. Poststimulus difference scores (ΔHR) were computed by subtracting the HR in the last second prior to item onset (prestimulus baseline) from the HR-score of each poststimulus-second. Finally, the largest HR deceleration and the mean HR within the poststimulus period were calculated.
2.1.2.4. Application of the classification functions. First, the physiological responses were standardized for each participant and each measure as in the first experiment [22].
Subsequently, both above mentioned logistic regression equations were applied using the largest HR deceleration or the mean HR, respectively. 2.1.3. Simulation of innocent's data In the studies 2–6 no innocent participants were examined. Thus, the validity of the CIT in differentiating guilty from innocent participants cannot be directly determined for these data. Given the assumptions of the CIT, however, it is possible to simulate data of an innocent sample (for a similar procedure see [31,32]). Specifically, innocents are supposed to respond in a non-systematic way to relevant and irrelevant items because they cannot differentiate between these item types (see the scores of the innocent participants of study 1 in Table 2). Thus, in principle, an innocent's response profile can be simulated by randomly drawing values for each item from a unimodal distribution. Because of its simplicity, a Gaussian distribution with a mean of 0 and a standard deviation of 1 was recently used to simulate skin conductance data of innocent participants [32]. With respect to the present studies, however, the situation is slightly more complex because we examined electrodermal, respiratory, and cardiovascular measures that are correlated to a certain degree. Therefore, we had to rely on the responses of guilty participants to
Table 2
Conceptual replication study: means and standard deviations of the standardized response differences of single measures and the classification scores as a function of experimental condition. Differences between groups were assessed by means of t-tests and the computation of receiver operating characteristics (ROC) curves.
Measure | Guilty (n = 30), M (SD) | Innocent (n = 30), M (SD) | t(58) | ROC area (95% CI)
Single measures
SCR | 1.20 (0.84) | −0.14 (0.34) | 8.13⁎⁎⁎ | 0.97 (0.94, 1.00)
Thoracic RLL | −0.53 (0.34) | −0.07 (0.38) | 4.95⁎⁎⁎ | 0.83 (0.72, 0.93)
Abdominal RLL | −0.49 (0.44) | 0.02 (0.37) | 4.82⁎⁎⁎ | 0.83 (0.73, 0.93)
Mean RLL | −0.51 (0.35) | −0.03 (0.31) | 5.71⁎⁎⁎ | 0.86 (0.78, 0.95)
Max. HR deceleration | −0.38 (0.58) | 0.01 (0.42) | 2.98⁎⁎ | 0.70 (0.57, 0.83)
Mean HR | −0.54 (0.50) | −0.06 (0.42) | 4.06⁎⁎⁎ | 0.77 (0.64, 0.89)
Classification functions
Function using max. HR deceleration | 0.87 (0.25) | 0.09 (0.22) | 12.78⁎⁎⁎ | 0.98 (0.95, 1.00)
Function using mean HR | 0.88 (0.23) | 0.09 (0.21) | 13.93⁎⁎⁎ | 0.98 (0.95, 1.00)
Note. For the respiration line length (RLL) and both heart rate (HR) measures, the signs of each value were reversed before the computation of the ROC statistics in order to score in the same direction as the other measures. SCR = skin conductance response. CI = confidence interval. ⁎⁎p < .01, ⁎⁎⁎p < .001.
irrelevant items to simulate data of innocents. This procedure presumes these responses to be unimodally distributed; an assumption that seems reasonable given prominent theoretical underpinnings of the CIT [33,34]. We used the following bootstrapping algorithm to simulate responses of innocents: in a first step, responses to irrelevant items were extracted for each channel from the data of guilty participants. Afterwards, Ni values were randomly drawn with replacement from the values 1 to Ni, with Ni depicting the number of irrelevant items. These values were used to identify a subset of irrelevant trials that was subsequently selected from the data pool. Values of each channel were averaged separately and constituted a vector of simulated responses for one test item. Random selection and averaging were repeated for each item resulting in a matrix of simulated responses for all data channels and items for one innocent participant. This method has the main advantage that correlations between data channels remain unchanged in the simulated data set. The procedure was applied repeatedly to simulate data for one innocent for each examined guilty participant in the studies 2–6. The simulated values were standardized and entered into the classification functions in the same manner as the responses of guilty participants. 2.2. Results and discussion In a first step, separate ROC curves contrasting the standardized responses of guilty and innocent participants were computed for each
individual measure in each study. As can be seen from Table 3, electrodermal measures achieved the largest validity coefficients throughout all studies, with an average area value of A = 0.86. To a lesser extent, respiratory and HR measures also discriminated significantly between guilty and innocent participants in most studies (the area statistic of 0.5, representing a chance classification, was not included in the respective confidence intervals). Because a descriptive comparison of the confidence intervals has low statistical power [35], we used an algorithm described in [27] to statistically compare the area statistics of all measures across studies. The validity of skin conductance responses significantly exceeded the area statistics of respiratory responses, z = 5.85, p < .001, maximal HR deceleration, z = 6.29, p < .001, and mean HR, z = 6.23, p < .001. No other pairwise comparison reached statistical significance. With the exception of the two HR measures, which were highly correlated with each other, the standardized response differences of all physiological variables were only weakly correlated in both groups of participants (see Table 4).
In a second step, we computed the ROC curves for both classification functions (see Table 5) and compared the corresponding validity coefficients to the best individual measure (the electrodermal responses, see Table 3). With the exception of one study [29], both classification functions numerically outperformed SCR validity. HR measures did not significantly differ between guilty and innocent participants in that study, possibly explaining why they could not provide incremental validity to the electrodermal measure. Across all studies, the validity coefficient of the classification function using mean HR significantly exceeded the area statistic of the SCR measure (z = 2.41, p < .05, statistically compared according to [27]).
Using the same algorithm, the classification function using maximal HR deceleration also tended to outperform the electrodermal measure (z = 1.88, p < .10). Fig. 1 displays the ROC curves of skin conductance, respiration and mean HR together with the curve depicting their combination in terms of a classification score. It can be seen that the classification function outperformed each single measure in the diagnostically most relevant region at the left side of the figure (high hit rates and tolerable false positive rates). That is, for any fixed false positive rate below 0.50, the hit rate using the classification function was larger than the corresponding hit rates of each single measure.
Further evidence for the incremental validity of respiratory and HR measures was provided by a stepwise logistic regression analysis using data from all studies. The mean standardized responses of the different physiological measures were used as potential predictors, and a step-by-step inclusion procedure following the Wald statistic was applied. Probabilities for inclusion and exclusion were fixed at 0.05 and 0.10, respectively. Three measures were included in the regression model: skin conductance, β = 2.64 (SE = 0.25), p < .001; RLL, β = −0.86 (SE = 0.21), p < .001; and mean HR, β = −1.03 (SE = 0.19), p < .001. Nagelkerke's R² increased from .53 to .59 and .62, respectively,
Table 3 Areas under the ROC curves and associated 95% confidence intervals (CI) of each physiological measure in the considered studies

| Study | nGuilty | nInnocent | SCR: Area (95% CI) | RLL: Area (95% CI) | Max. HR deceleration: Area (95% CI) | Mean HR: Area (95% CI) |
|---|---|---|---|---|---|---|
| 1. Students and community sample (conceptual replication of [22]) | 30 | 30 | 0.97 (0.94, 1.00) | 0.86 (0.78, 0.95) | 0.70 (0.57, 0.83) | 0.77 (0.64, 0.89) |
| 2. Students [14] | 27 | 27 S | 0.82 (0.71, 0.94) | 0.66 (0.50, 0.81) | 0.72 (0.58, 0.86) | 0.74 (0.61, 0.87) |
| 3. Prisoners [14] | 40 | 40 S | 0.80 (0.70, 0.89) | 0.68 (0.56, 0.79) | 0.70 (0.59, 0.82) | 0.73 (0.61, 0.84) |
| 4. Community sample [20] | 30 | 30 S | 0.94 (0.88, 0.99) | 0.78 (0.66, 0.90) | 0.74 (0.62, 0.87) | 0.75 (0.63, 0.88) |
| 5. Prisoners [20] | 48 | 48 S | 0.86 (0.78, 0.94) | 0.72 (0.61, 0.82) | 0.73 (0.64, 0.83) | 0.68 (0.57, 0.79) |
| 6. Community sample [28] | 77 | 77 S | 0.84 (0.78, 0.90) | 0.71 (0.63, 0.79) | 0.76 (0.68, 0.83) | 0.77 (0.69, 0.84) |
| 7. Students [29, experiment 2] | 23 | 23 | 0.94 (0.88, 1.00) | 0.68 (0.53, 0.84) | 0.61 (0.44, 0.78) | 0.63 (0.47, 0.79) |
| Across studies | 275 | 275 | 0.86 (0.83, 0.89) | 0.71 (0.67, 0.75) | 0.70 (0.65, 0.74) | 0.71 (0.66, 0.75) |
Note. Data for a group of innocents were simulated in studies that only examined guilty participants (denoted by the letter S in the cell referring to the number of participants). The simulation procedure is described in detail in the main text. SCR = skin conductance response, RLL = respiration line length, HR = heart rate.
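The bootstrap simulation of innocent examinees described in the main text can be sketched as follows. This is a minimal illustration in Python, not the authors' original implementation: the function name `simulate_innocent`, the list-of-lists data layout (one row per irrelevant trial, one column per channel), and the toy numbers are assumptions made for the example. Drawing one set of trial indices and applying it to all channels at once is what preserves the correlations between channels.

```python
import random
from statistics import fmean


def simulate_innocent(irrelevant_trials, n_items, rng):
    """Simulate one innocent examinee from a guilty examinee's irrelevant trials.

    irrelevant_trials: list of trials, each a list with one value per channel
                       (hypothetical layout, e.g. [SCR, RLL, HR]).
    n_items:           number of test items to simulate.
    rng:               random.Random instance (seeded for reproducibility).
    Returns a list of n_items rows, each with one simulated value per channel.
    """
    n_irrelevant = len(irrelevant_trials)
    n_channels = len(irrelevant_trials[0])
    simulated = []
    for _ in range(n_items):
        # Draw N_i trial indices with replacement from the N_i irrelevant trials.
        picks = [rng.randrange(n_irrelevant) for _ in range(n_irrelevant)]
        # Average each channel over the SAME resampled trials, so that
        # between-channel correlations carry over to the simulated data.
        row = [fmean(irrelevant_trials[i][c] for i in picks)
               for c in range(n_channels)]
        simulated.append(row)
    return simulated


# Toy data: 8 irrelevant trials x 3 channels (made-up numbers).
rng = random.Random(42)
pool = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(8)]
innocent = simulate_innocent(pool, n_items=5, rng=rng)
print(len(innocent), len(innocent[0]))
```

Each simulated item is a mean over resampled irrelevant trials, so it can never fall outside the range of the guilty participant's own irrelevant responses.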
Table 4 Intercorrelations between the standardized response differences of all single physiological measures for guilty and innocent examinees

| | SCR | RLL | Max. HR deceleration |
|---|---|---|---|
| Guilty (n = 275) | | | |
| RLL | .09 | | |
| Max. HR deceleration | .02 | .15⁎ | |
| Mean HR | .03 | .24⁎⁎⁎ | .87⁎⁎⁎ |
| Innocent (n = 53) | | | |
| RLL | −.12 | | |
| Max. HR deceleration | .18 | .00 | |
| Mean HR | .18 | .12 | .84⁎⁎⁎ |
Note. SCR = skin conductance response, RLL = respiration line length, HR = heart rate. ⁎p < .05, ⁎⁎⁎p < .001.
when RLL and mean HR were included in the regression model in addition to the SCR measure. Thus, RLL and mean HR accounted for an additional amount of variance in the criterion (truth status) beyond that explained by electrodermal measures.
3. General discussion
The present report aimed at comparing the validity of electrodermal, respiratory and heart rate measures in the detection of concealed information. Moreover, the incremental validity of these measures was tested using weighted combination rules in terms of logistic regression equations taking into account all three response systems. Study 1, a conceptual replication of [22], found that SCR performed better than both RLL and HR. This finding was further corroborated by the data from several other studies (Studies 2–7). Across these studies, electrodermal measures achieved larger validity coefficients than respiratory and heart rate measures. On average, the area under the ROC curve was 0.86 for skin conductance responses, which is very similar to recently reported values for mock crime conditions (A = 0.87) and the autobiographical paradigm (A = 0.84) [3].
Regarding the RLL, it should be noted that a considerably lower validity (A = 0.71) was obtained than in previous studies (e.g., A = 0.84 in [22]; see also [10]). Variations in measurement equipment and scoring techniques may explain this discrepancy. First, Studies 2–7 measured only thoracic respiration, whereas other studies [e.g., 22] also used abdominal respiration. It is possible that averaging both respiratory measures produces more reliable results. Second, the pressure sensor used in Studies 2–7 may be less sensitive or more prone to non-
Table 5 Areas under the ROC curves and associated 95% confidence intervals (CI) of the parameter combination according to both classification functions

| Study | nGuilty | nInnocent | Classification function using max. HR deceleration: Area (95% CI) | Classification function using mean HR: Area (95% CI) |
|---|---|---|---|---|
| 1. Students and community sample (conceptual replication of [22]) | 30 | 30 | 0.98 (0.95, 1.00) | 0.98 (0.95, 1.00) |
| 2. Students [14] | 27 | 27 S | 0.83 (0.70, 0.95) | 0.85 (0.73, 0.96) |
| 3. Prisoners [14] | 40 | 40 S | 0.89 (0.82, 0.96) | 0.90 (0.83, 0.97) |
| 4. Community sample [20] | 30 | 30 S | 0.95 (0.90, 1.00) | 0.95 (0.91, 1.00) |
| 5. Prisoners [20] | 48 | 48 S | 0.88 (0.81, 0.95) | 0.89 (0.82, 0.95) |
| 6. Community sample [28] | 77 | 77 S | 0.88 (0.83, 0.93) | 0.89 (0.83, 0.94) |
| 7. Students [29, experiment 2] | 23 | 23 | 0.85 (0.72, 0.98) | 0.86 (0.73, 0.99) |
| Across studies | 275 | 275 | 0.89 (0.86, 0.92) | 0.90 (0.87, 0.93) |
Note. Data for a group of innocents were simulated in studies that only examined guilty participants (denoted by the letter S in the cell referring to the number of participants). The simulation procedure is described in detail in the main text. HR = heart rate.
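How a weighted combination of the three channels turns into a single classification score can be illustrated with the regression weights reported for the stepwise analysis across all studies (β = 2.64 for SCR, −0.86 for RLL, −1.03 for mean HR). The sketch below is an illustration under stated assumptions: the intercept is not given in this section, so it is a hypothetical parameter defaulting to 0; the inputs are assumed to be z-standardized response differences with their original signs; and these weights are not necessarily those of the classification functions evaluated in Table 5.

```python
import math

# Weights as reported for the stepwise logistic regression across all studies.
BETA = {"scr": 2.64, "rll": -0.86, "mean_hr": -1.03}


def classification_score(z, intercept=0.0):
    """Linear predictor from standardized response differences.

    z:         dict with keys 'scr', 'rll', 'mean_hr' (hypothetical layout).
    intercept: NOT reported in the text; 0.0 is a placeholder assumption.
    """
    return intercept + sum(BETA[name] * z[name] for name in BETA)


def guilt_probability(z, intercept=0.0):
    """Logistic transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-classification_score(z, intercept)))


# Made-up examinees: larger SCR, shorter RLL and lower HR for relevant items
# versus no differential responding at all.
guilty_like = {"scr": 1.0, "rll": -0.5, "mean_hr": -0.5}
innocent_like = {"scr": 0.0, "rll": 0.0, "mean_hr": 0.0}

print(classification_score(guilty_like), guilt_probability(guilty_like))
print(classification_score(innocent_like), guilt_probability(innocent_like))
```

Because the logistic transform is monotonic, the intercept shifts the predicted probabilities but does not change the rank order of examinees, and therefore leaves the area under the ROC curve unaffected.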
Fig. 1. ROC curves contrasting the distribution of z-standardized response differences in skin conductance, mean heart rate (HR) and respiration between guilty and innocent examinees. Additionally, the result of the classification function combining all three physiological channels is depicted.
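The area statistic reported in Tables 3 and 5 and visualized in Fig. 1 has a simple interpretation: it equals the probability that a randomly chosen guilty examinee scores higher than a randomly chosen innocent examinee, with ties counted as one half (the Mann–Whitney formulation; cf. [26]). The following sketch computes this area and the hit rate at a fixed false positive rate from two lists of scores. The function names and the toy scores are hypothetical, and the confidence intervals and correlated-ROC tests [27] used in the paper are not reproduced here.

```python
def roc_area(guilty, innocent):
    """Area under the ROC curve as P(guilty > innocent) + 0.5 * P(tie)."""
    wins = 0.0
    for g in guilty:
        for i in innocent:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5
    return wins / (len(guilty) * len(innocent))


def hit_rate_at_fpr(guilty, innocent, max_fpr):
    """Best hit rate among cutoffs whose false positive rate is <= max_fpr.

    An examinee is classified as guilty when the score is >= cutoff.
    """
    best = 0.0
    for cutoff in sorted(set(guilty) | set(innocent)):
        fpr = sum(i >= cutoff for i in innocent) / len(innocent)
        if fpr <= max_fpr:
            hits = sum(g >= cutoff for g in guilty) / len(guilty)
            best = max(best, hits)
    return best


# Made-up classification scores for five guilty and five innocent examinees.
guilty_scores = [2.1, 1.4, 0.9, 0.3, 1.8]
innocent_scores = [0.2, -0.5, 1.0, 0.0, -1.1]

area = roc_area(guilty_scores, innocent_scores)
strict = hit_rate_at_fpr(guilty_scores, innocent_scores, max_fpr=0.0)
lenient = hit_rate_at_fpr(guilty_scores, innocent_scores, max_fpr=0.2)
print(area, strict, lenient)
```

Comparing hit rates at the same false positive rate, as done for Fig. 1, shows where along the curve a combined score gains over a single measure, which a single area value cannot reveal.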
linearities during signal conversion than the piezo-electric respiration transducers which were previously used [22]. Third, different scoring intervals for the RLL measure were used across studies, and they may have been too short in some cases (e.g., 8 s in Studies 2 and 3). However, it seems unlikely that this methodological aspect can fully account for the lower RLL validity in Studies 2–7. In the first study [q.v., 22], where high RLL validity was found, the scoring interval amounted to 10 s, which is considerably shorter than the interval of 15 s that was used in Studies 4–7. Clearly, there may be other differences between studies, and it remains for future research to systematically examine the determinants of differential RLL validity.
Across studies, a combination of electrodermal, respiratory, and heart rate measures by means of both classification functions, relying on the maximal HR deceleration or the mean HR, respectively, outperformed the best single measure: the SCR amplitude. However, this difference was significant only for the function using the mean HR. This slight difference between the two functions may be due to a larger variance of the maximal HR deceleration as compared to the mean HR, which is based upon 15 data points instead of 1. Moreover, the mean HR might better reflect different processes that are temporally distributed across the measurement window and do not necessarily accumulate in a single second showing the maximal HR deceleration.
The pattern of activation that we observed in our studies (SCR increase, HR deceleration, respiratory suppression) results from a simultaneous or at least closely proximate coactivation of the sympathetic and the parasympathetic nervous system when recognizing crime-related details [36]. On the one hand, this indicates that differential responding to relevant and irrelevant items in a CIT examination is more closely related to the orienting response [37] than to the largely sympathetically driven fight-or-flight response [38].
On the other hand, this coactivation of both branches of the ANS provides the basis for the incremental validity of HR decelerations that are primarily mediated by the parasympathetic nervous system [39] over electrodermal responses that are solely modulated by the sympathetic nervous system [40]. Taken together, our results indicate that skin conductance is the most valid single measure for the detection of concealed information. Respiration and heart rate can, however, provide incremental validity.
We propose a classification function and thus provide a standardized framework for processing and combining different physiological measures to allow for a more precise diagnosis of the truth status in CIT examinations. This algorithm may be integrated into the software of computerized polygraph systems, which are currently optimized mainly for CQT examinations [21].
Future research should examine whether other measures can further improve detection efficiency. It seems unlikely that other autonomic measures can provide incremental validity because both branches of the ANS are already taken into account here. Central nervous system measures may be more promising. One candidate is the amplitude of the P300, which has already been found to differ between relevant and irrelevant items in groups of guilty examinees [7–9]. Another is functional magnetic resonance imaging (fMRI) [41]. Behavioral measures (e.g., response times [6] or symptom validity testing [32]) might also be taken into account.
Some limitations of the current report should be acknowledged. First, there are other ways to combine several physiological measures for the detection of concealed information besides the logistic regression model that was applied to the current data. Recently, a discrimination model based on latent class analysis was suggested, which allows for a consideration of interindividual differences in physiological response patterns [42]. However, this approach is more complex than multivariate analyses such as logistic regression, and it requires pretest data from each participant (e.g., from a stimulation or card test). Moreover, the crucial assumption of that approach is that the physiological response pattern does not change between pretest and CIT.
This assumption is challenged by studies showing that electrodermal measures have a higher detection efficiency in mock crime as compared to card test conditions [3] and by the observation of specific respiration patterns that occur frequently in CIT investigations conducted in the field [43]. The application of the logistic regression model in the current studies also depends on the assumption of comparable physiological responding within the training sample (here [22]) and the test samples. However, our data suggest that this assumption holds for different versions of the CIT (autobiographical and mock crime paradigm) and various samples (students, community volunteers, and prisoners). Moreover, the consistent pattern of results across studies suggests that the classification functions are robust across slight variations in scoring algorithms.
Second, no innocents were examined in most of the considered studies. Therefore, we had to rely on simulated instead of real innocents' data. These simulated data, however, seem plausible, because the results are largely in line with comparable studies that actually incorporated a sample of innocents (see Studies 1 and 7, and [22]). Nevertheless, future studies on the physiological response profile of innocent participants in CIT examinations are certainly desirable to further support the findings of the current study.
Third, conclusions are based upon laboratory research only. Future research will have to determine whether the proposed classification function can also be effectively applied in more realistic situations where the recall rate of relevant items is low or where emotional and motivational conditions are supposed to differ substantially from laboratory research [37]. Therefore, an application of the classification function to field data, e.g., from Japan, where concealed information tests are frequently used [44,45], is desirable.
Acknowledgement
We are indebted to Ralph Hofmann for his help and support during data collection and analysis for the conceptual replication study. Furthermore, we thank Gershon Ben-Shakhar, John J. Furedy, and an anonymous reviewer for their constructive comments on an earlier version of this article.
References
[1] Lykken DT. The GSR in the detection of guilt. J Appl Psychol 1959;43:385–8.
[2] MacLaren VV. A quantitative review of the guilty knowledge test. J Appl Psychol 2001;86:674–83.
[3] Ben-Shakhar G, Elaad E. The validity of psychophysiological detection of information with the Guilty Knowledge Test: a meta-analytic review. J Appl Psychol 2003;88:131–51.
[4] Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale: Erlbaum; 1988.
[5] Gronau N, Ben-Shakhar G, Cohen A. Behavioral and physiological measures in the detection of concealed information. J Appl Psychol 2005;90:147–58.
[6] Seymour TL, Seifert CM, Shafto MG, Mosmann AL. Using response time measures to assess ‘guilty knowledge’. J Appl Psychol 2000;85:30–7.
[7] Allen JJ, Iacono WG, Danielson KD. The identification of concealed memories using the event-related potential and implicit behavioral measures: a methodology for prediction in the face of individual differences. Psychophysiology 1992;29:504–22.
[8] Farwell LA, Donchin E. The truth will out: interrogative polygraphy (lie detection) with event-related brain potentials. Psychophysiology 1991;28:531–47.
[9] Rosenfeld JP, Biroschak JR, Furedy JJ. P300-based detection of concealed autobiographical versus incidentally acquired information in target and nontarget paradigms. Int J Psychophysiol 2006;60:251–9.
[10] Timm HW. Analyzing deception from respiratory patterns. J Police Sci Adm 1982;10:47–51.
[11] Bradley MT, Janisse MP. Accuracy demonstrations, threat, and the detection of deception: cardiovascular, electrodermal, and pupillary measures. Psychophysiology 1981;18:307–15.
[12] Podlesny JA, Raskin DC. Effectiveness of techniques and physiological measures in the detection of deception. Psychophysiology 1978;15:344–59.
[13] Balloun KD, Holmes DS. Effects of repeated examinations on the ability to detect guilt with a polygraph examination: a laboratory experiment with a real crime. J Appl Psychol 1979;64:316–22.
[14] Verschuere B, Crombez G, De Clercq A, Koster EHW. Psychopathic traits and autonomic responding to concealed information in a prison sample. Psychophysiology 2005;42:239–45.
[15] Ben-Shakhar G, Dolev K. Psychophysiological detection through the guilty knowledge technique: effects of mental countermeasures. J Appl Psychol 1996;81:273–81.
[16] Ben-Shakhar G, Elaad E. Effects of questions' repetition and variation on the efficiency of the guilty knowledge test: a reexamination. J Appl Psychol 2002;87:972–7.
[17] Ben-Shakhar G, Gronau N, Elaad E. Leakage of relevant information to innocent examinees in the GKT: an attempt to reduce false-positive outcomes by introducing target stimuli. J Appl Psychol 1999;84:651–60.
[18] Elaad E, Ginton A, Jungman N. Detection measures in real-life criminal guilty knowledge tests. J Appl Psychol 1992;77:757–67.
[19] Bradley MT, Ainsworth D. Alcohol and the psychophysiological detection of deception. Psychophysiology 1984;21:63–71.
[20] Verschuere B, Crombez G, Koster EHW, De Clercq A. Antisociality, underarousal and the validity of the Concealed Information Polygraph Test. Biol Psychol 2007;74:309–18.
[21] Kircher JC, Raskin DC. Computer methods for the psychophysiological detection of deception. In: Kleiner M, editor. Handbook of polygraph testing. San Diego: Academic Press; 2002. p. 287–326.
[22] Gamer M, Rill H, Vossel G, Gödert HW. Psychophysiological and vocal measures in the detection of guilty knowledge. Int J Psychophysiol 2006;60:76–87.
[23] Podlesny J, Kircher J. The finapres (volume clamp) recording method in psychophysiological detection of deception examinations. Forensic Sci Commun 1999;1:1–18.
[24] Copas J, Corbett P. Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika 2002;89:315–31.
[25] Velden M, Wölk C. Depicting cardiac activity over real time: a proposal for standardization. J Psychophysiol 1987;1:173–5.
[26] Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975;12:387–415.
[27] Metz CE, Wang P, Kronman HB. A new approach for testing the significance of differences between ROC curves measured from correlated data. In: Deconinck F, editor. Information Processing in Medical Imaging. The Hague: Martinus Nijhoff; 1984. p. 432–45.
[28] Vandenbossch K, Verschuere B, Crombez G, De Clercq A. On the validity of finger pulse line length in the detection of concealed information. Submitted for publication.
[29] Verschuere B, Crombez G. Déja Vu! The effect of previewing test items on the validity of the Concealed Information polygraph Test. Psychol Crim Law in press.
[30] Iacono WG, Boisvenu GA, Fleming JA. Effects of diazepam and methylphenidate on the electrodermal detection of guilty knowledge. J Appl Psychol 1984;69:289–99.
[31] Carmel D, Dayan E, Naveh A, Raveh O, Ben-Shakhar G. Estimating the validity of the guilty knowledge test from simulated experiments: the external validity of mock crime studies. J Exp Psychol Appl 2003;9:261–9.
[32] Meijer EH, Smulders FT, Johnston JE, Merckelbach HL. Combining skin conductance and forced choice in the detection of concealed information. Psychophysiology 2007;44:814–22.
[33] Ben-Shakhar G, Furedy JJ. Theories and applications in the detection of deception: A psychophysiological and international perspective. New York: Springer; 1990.
[34] Lykken DT. Psychology and the lie detector industry. Am Psychol 1974;29:725–39.
[35] Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.
[36] Gamer M, Gödert HW, Keth A, Rill HG, Vossel G. Electrodermal and phasic heart rate responses in the Guilty Actions Test: comparing guilty examinees to informed and uninformed innocents. Int J Psychophysiol 2008;69:61–8.
[37] Verschuere B, Crombez G, De Clercq A, Koster EHW. Autonomic and behavioral responding to concealed information: differentiating orienting and defensive responses. Psychophysiology 2004;41:461–6.
[38] Turpin G. Effects of stimulus intensity on autonomic responding: the problem of differentiating orienting and defense reflexes. Psychophysiology 1986;23:1–14.
[39] Gianaros PJ, Quigley KS. Autonomic origins of a nonsignal stimulus-elicited bradycardia and its habituation in humans. Psychophysiology 2001;38:540–7.
[40] Wallin BG. Sympathetic nerve activity underlying electrodermal and cardiovascular reactions in man. Psychophysiology 1981;18:470–6.
[41] Gamer M, Bauermann T, Stoeter P, Vossel G. Covariations among fMRI, skin conductance, and behavioral data during processing of concealed information. Hum Brain Mapp 2007;28:1287–301.
[42] Matsuda I, Hirota A, Ogawa T, Takasawa N, Shigemasu K. A new discrimination method for the Concealed Information Test using pretest data and within-individual comparisons. Biol Psychol 2006;73:157–64.
[43] Suzuki R, Nakayama M, Furedy JJ. Specific and reactive sensitivities of skin resistance response and respiratory apnea in a Japanese concealed information test (CIT) of criminal guilt. Can J Behav Sci 2004;36:202–19.
[44] Hira S, Furumitsu I. Polygraphic examinations in Japan: applications of the guilty knowledge test in forensic investigations. Int J Police Sci Manage 2002;4:16–27.
[45] Nakayama M. Practical use of the concealed information test for criminal investigation in Japan. In: Kleiner M, editor. Handbook of polygraph testing. San Diego: Academic Press; 2002. p. 49–86.