Clinical Neurophysiology 124 (2013) 928–940
Measuring target detection performance in paradigms with high event rates

Alexandra Bendixen a,b,*, Søren K. Andersen c

a Institute for Psychology, University of Leipzig, Germany
b Institute for Psychology, Hungarian Academy of Sciences, Budapest, Hungary
c Department of Neurosciences, University of California at San Diego, USA
Article info

Article history: Accepted 18 November 2012. Available online 21 December 2012.

Keywords: Signal detection theory (SDT); Continuous stimulus presentation; Discrimination; Sensitivity (d′); Response bias; Undefined observation interval
Highlights

- There are caveats with employing signal detection theory (SDT) when presenting stimuli at a fast rate (e.g., more than one stimulus per second).
- We demonstrate the pitfalls of commonly used approaches and suggest a new solution.
- Our approach is easy to implement and yields unbiased estimates of classical SDT parameters.
Abstract

Objectives: Combining behavioral and neurophysiological measurements inevitably implies mutual constraints, such as when the neurophysiological measurement requires fast-paced stimulus presentation and hence the attribution of a behavioral response to a particular preceding stimulus becomes ambiguous. We develop and test a method for validly assessing behavioral detection performance in spite of this ambiguity.

Methods: We examine four approaches taken in the literature to treat such situations. We analytically derive a new variant of computing the classical parameters of signal detection theory, hit and false alarm rates, adapted to fast-paced paradigms.

Results: Each of the previous approaches shows specific shortcomings (susceptibility towards response window choice, biased estimates of behavioral detection performance). Superior performance of our new approach is demonstrated for both simulated and empirical behavioral data. Further evidence is provided by reliable correspondence between behavioral performance and the N2b component as an electrophysiological indicator of target detection.

Conclusions: The appropriateness of our approach is substantiated by both theoretical and empirical arguments.

Significance: We demonstrate an easy-to-implement solution for measuring target detection performance independent of the rate of event presentation. Thus overcoming the measurement bias of previous approaches, our method will help to clarify the behavioral relevance of different measures of cortical activation.

© 2012 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
1. Introduction

Signal detection theory (SDT) (Green and Swets, 1966/1988; Macmillan and Creelman, 2005) models situations in which an observer continuously monitors a technical (e.g., radar) system for incidents requiring active intervention. The observer's performance in distinguishing true incidents (signals) from background noise is evaluated in terms of his capacity for discrimination between the two events (sensitivity index d′) and his tendency to report an incident rather than to refrain from doing so (response bias or criterion c). Because of its potential to disentangle sensory from strategic components of the decision process, SDT has proven useful in a wide variety of fields, including the assessment of vigilance in clinical populations (Mussgay and Hertwig, 1990; Nestor et al., 1990), recognition memory (e.g., Snodgrass and Corwin, 1988), and many more (for a compilation of very diverse applications, see Hutchinson, 1981).

* Corresponding author at: Institute for Psychology, University of Leipzig, Seeburgstr. 14-20, D-04103 Leipzig, Germany. Tel.: +49 341 973 5907; fax: +49 341 973 5969. E-mail address: [email protected] (A. Bendixen).
1388-2457/$36.00 © 2012 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved. http://dx.doi.org/10.1016/j.clinph.2012.11.012
Fig. 1. Illustration of ambiguity in detection paradigms with high event rates. Participants are instructed to press a button whenever they detect a deviation on a given feature (depicted on the y axis). Due to the fast presentation, participants can hardly respond to a deviant event before subsequent standard events have been presented. Consequently, stimulus–response mapping becomes ambiguous. For some button presses, it will be unclear whether they reflect actual responses to the deviant event (hits) or erroneous responses to a standard event (false alarms). For instance, the third response depicted above would be classified as a hit when setting a wide response window (e.g., up to 1200 ms after deviant onset), yet as a false alarm when setting a narrow response window (e.g., up to 900 ms after deviant onset). Criteria for identifying late and premature responses must be specified. The evaluation of participants' detection performance (i.e., calculation of d′) should be based on a method that is largely independent of these criteria settings.
Although conceived for continuous monitoring situations, SDT is mostly applied to discrete series of events. Specifically, SDT lends itself to situations where the next event is only presented after a response was obtained, or where the rate of event presentation is slow enough (typically 1/s or below) to ensure that responses can be made before the next event is presented. This is usually the case in behavioral paradigms applying forced-choice tasks such as two-alternative forced choice (2-AFC) or two-interval forced choice (2-IFC) discrimination (e.g., Jang et al., 2009; Oglesbee and Kewley-Port, 2009; Tyler and Chen, 2000). Response intervals of about a second between consecutive events allow for an unambiguous mapping of stimuli and responses, and thus for an unambiguous classification of responses as hits, false alarms, misses, or correct rejections. Such situations are thus perfectly suited for calculating SDT parameters. The optimal situation for applying SDT is often compromised when combining behavioral paradigms with neuroscientific tools. In neuroscientific measurements, it is often necessary to speed up stimulus presentation, be it for reasons of efficiency (e.g., to obtain enough trials to reduce background noise in electroencephalographic measurements) or for motivational reasons (e.g., to keep participants' attention highly focused on a behavioral task). There are also phenomena which in and of themselves only occur with a fast presentation rate, such as the omission mismatch negativity (MMN) component of the event-related potential (ERP), which is only elicited at stimulus onset asynchronies shorter than about 170 ms (Yabe et al., 1997, 1998). Complementing omission MMN studies with performance measurements thus necessarily implies high event rates. In recordings of steady-state visual evoked potentials (for a recent review, see Andersen et al., 2011), target events are often embedded in an ongoing stimulation rather than a fast sequence of background stimuli.
When, for any of these reasons, the sequence of events needs to be speeded up or even become continuous, responses necessarily overlap with subsequent events and thus no longer have a unique reference point (Fig. 1). In a pragmatic attempt to overcome this problem, the sequences used for deriving neuroscientific measurements are often separated from those used for evaluating behavioral discrimination performance between the relevant events. The sequence can then be slowed down in the behavioral part (Paavilainen et al., 2001), or it can be kept at a fast pace but be stopped occasionally to ask participants to judge the final stimulus (Bendixen et al., 2008). Such modifications may, however, heavily affect the difficulty of the discrimination task. Additionally, the temporal separation of neurophysiological and behavioral measurements removes the direct trial-by-trial correspondence of the two. Thus comparability between the conditions with behavioral and neuroscientific measures may be lost.
When keeping the fast presentation rate, data analysis by means of SDT is not straightforward. Statistical solutions to this problem were suggested some 50 years ago (Egan et al., 1961; Luce, 1966), but have rarely been applied. Two main problems occur. First, how can responses be assigned to preceding stimuli and hence be classified as hits or false alarms? Second, what is the correct denominator for calculating the proportion of false alarms? The most common solution to the first problem is to pre-specify a window after each signal (e.g., 150–1200 ms) within which responses are attributed to the preceding signal (and thus considered hits) and outside of which responses are considered false alarms. The choice of the appropriate response window differs widely between studies, especially so at the upper end, with values ranging from 700 ms (e.g., Kimura et al., 2006) up to 3000 ms (e.g., van Zuijen et al., 2006) after the onset of the relevant event, but also at the lower end, with minimum correct response times ranging from 52.5 ms (e.g., Horváth and Winkler, 2010) up to 300 ms (e.g., Paavilainen et al., 2007). Although this may in part be explained by differences in the complexity of the decision, it remains somewhat arbitrary and prone to misinterpretations of the data. In particular, prolonging the response window will artificially increase the number of hits and decrease the number of false alarms. It is desirable to develop an analysis strategy that compensates for this dependency. Even after specifying an appropriate response window, the computation of discrimination sensitivity d′ needs further consideration. While the proportion of hits can now easily be computed (dividing the number of hits by the overall number of targets), the denominator for calculating the proportion of false alarms is not straightforward. On one extreme, the number of false alarms could be divided by the number of standards (i.e., non-signals).
Yet this assumes that participants could have responded to each stimulus. As this is practically impossible, a false alarm rate of or close to 1 cannot be reached, which artificially increases the sensitivity estimate. On the other extreme, the number of false alarms could be divided by the number of targets. Yet this can – in the case of frequently responding participants – lead to false alarm rates above 1, which is obviously meaningless (and, on the more practical side, precludes application of the d′ formula). An intermediate solution was proposed recently by Andreou and colleagues (2011), who suggest dividing the entire sequence into 300-ms intervals and using the number of these intervals as the denominator for calculating the proportion of false alarms. The value of 300 ms was chosen as it is assumed to reflect the average time needed for the execution of a simple response (Andreou et al., 2011). The recency of this approach reflects the fact that no consensual solution has been reached on the issue. In fact, some experimental papers report d′ values without mentioning the reference point for calculating false alarm rates; others report the number
Table 1
Five approaches for calculating signal detection performance in fast-paced paradigms. Method A disregards false alarms and evaluates detection performance solely based on hit rates. Methods B–E are all based on calculating the sensitivity index d′; they differ only in the way they compute the false alarm rate.

Denominator for the false alarm rate:
  Method A: none
  Method B: N′_D = N_D
  Method C: N′_D = N_T
  Method D: N′_D = T_S / 300 ms
  Method E: N′_D = T_S / T_R − N_T

Notes: N′_D – number of "virtual" non-target events for false alarm rate calculation; N_D – actual number of non-target events; N_T – number of target events; T_S – duration of the entire stimulus sequence [ms]; T_R – length of the response window [ms]; 300 ms – average time assumed for executing a simple response.
of false alarms as negligible and do not analyze false alarm rates at all.¹ Some notable exceptions have devised specific measures for paradigms typically more complex than classical detection tasks (see, e.g., Saupe et al., 2010, whose measure relates the number of hits and two different types of false alarms to the overall number of responses, but is not easily transferable to SDT). Another approach has been to include distinct distractor events and only analyze responses occurring within a pre-defined time window after the onset of target or distractor events (e.g., Andersen et al., 2008). These solutions are tailored to specific experimental designs and thus do not constitute a general means for computing SDT parameters in fast-paced detection paradigms. In the following, we will present an approach that circumvents these issues and allows for the unequivocal computation of the classical parameters of SDT largely independent of response window choice. Our approach is directly derived from the basic formulas of SDT and from the prerequisite that a random observer be assigned d′ = 0. The denominator for calculating the proportion of false alarms is defined as the number of time intervals not containing a signal that are of the same length as the time intervals that are searched for responses to the signal (i.e., the acceptable response windows). Note that this method is similar to the one proposed by Egan and colleagues (1961), but uses all portions of the stimulus sequence without a signal. The overall time of the sequence is reduced by the summed length of the response windows for hits and then divided by the response window length to obtain the number of equivalent time intervals. This number serves as the baseline for calculating the proportion of false alarms.
Hits and false alarms are thus probed in comparable time intervals, which should make the method robust towards changes of the response window (i.e., a higher proportion of hits caused by prolonging the response window is compensated for by a likewise higher proportion of false alarms). We will contrast this method (subsequently denoted method E) with four methods frequently used in the literature as outlined above: analyzing only hit rates (method A), calculating false alarm rates based on the number of non-targets (method B), calculating false alarm rates based on the number of targets (method C), and calculating false alarm rates based on the maximum number of executable responses during the sequence (method D; Andreou et al., 2011). Note that method E is the only one that provides a compensation for the response-window dependency of the hit rate. By necessity, the results of methods A–D will depend on the chosen response window. Moreover, it was already pointed out above that method B will underestimate the false alarm rate and method C will overestimate it, such that both will yield biased estimates of sensitivity (too high or too low d′ values, respectively). Finally, method A suffers from the very limitation that was the reason for developing SDT (Green and Swets, 1966/1988; Macmillan and Creelman, 2005): the failure to distinguish between sensory and strategic components of the decision process in perceptual judgments. A performance estimate based on hit rates alone can be artificially increased by responding more often (i.e., adopting a more liberal response criterion).² Thus methods A–C have known issues that can be derived from logical arguments. Nevertheless, we do not exclude these methods a priori but contrast them with the remaining two (D and E) in order to obtain a quantitative picture of their implications, providing the reader with an estimate of their potentially biasing impact in past publications. In the following, we mathematically derive method E using the basic formulas of SDT. Subsequently, we illustrate the performance of methods A–E (cf. Table 1) with respect to simulated data and an empirical dataset taken from an auditory target detection experiment (Bendixen et al., 2012).

¹ In an attempt to assess the significance of the problem, we identified 17 papers published in Clinical Neurophysiology since 2000 that employed variants of the classical target-detection paradigm with fast (⩽1 s SOA) or continuous stimulus presentation (Bertoli et al., 2002; Egner and Gruzelier, 2004; Gomes et al., 2008; Gomes et al., 2012; Jongsma et al., 2006; Kida et al., 2004; Lepistö et al., 2006; Määttä et al., 2005; Maekawa et al., 2005; Matthews et al., 2007; Michalewski et al., 2005; Müller and Hillyard, 2000; Neelon et al., 2006; Pihko et al., 2005; Sandmann et al., 2010; Shafer et al., 2007; Sussman et al., 2002). Out of these 17 studies, four chose the strategy to evaluate behavioral discrimination separately from the neuroscientific measurement, such that sequence parameters could be modified to meet the requirements of SDT (which is mathematically correct but compromises comparability between behavioral and neuroscientific measures). The majority of studies, however, did not note the problem of ambiguity in the attribution of responses and thus based their d′ calculation on wrong assumptions, or else did not calculate d′ at all and instead reported only percentages for one type of event (hits or "errors") or the raw number of events.

2. Methods

2.1. Mathematical derivation

According to SDT (Green and Swets, 1966/1988; Macmillan and Creelman, 2005), based on the Gaussian model assuming homogeneous variances of the noise and signal-plus-noise distributions, the sensitivity index d′ can be derived from the hit and false alarm rates as:

d′ = Z(h) − Z(f)    (1)
where Z is the inverse of the cumulative normal distribution, and h and f are given by dividing the number of hits H or false alarms F by the overall number of targets N_T or distractors N_D, respectively:

h = H / N_T    (2)

f = F / N_D    (3)
By definition of SDT, the expected sensitivity E(d′) of an observer who is entirely unable to discriminate target from non-target events and who thus performs at chance level is zero:
² Another, often neglected disadvantage of calculating only hit rates is that hit and false alarm rates are measured on a non-linear scale (i.e., the performance difference between 40% and 50% is considerably smaller than the difference between 90% and 100%). This does not affect the order of participants' discrimination capabilities, but the numerical relations between them, which may impede statistical testing. Especially in factorial designs, the compression of the hit-rate scale at the upper and lower bounds can lead to spurious interactions. For instance, when task difficulty is modulated to result in hit rates of 65% (difficult) vs. 95% (easy), the influence of a second factor (e.g., resulting in hit rates of 92% vs. 98% in the easy condition and 56% vs. 74% in the difficult condition) would be underestimated in the easy condition, possibly leading to an artificial interaction between the two factors when only hit rates are considered.
E(d′) = 0    (4)
To apply this approach to fast-paced paradigms, a response window after each target stimulus is specified. Responses are classified as hits or false alarms depending on whether they fall into any such response window or not. While the calculation of the hit rate h is straightforward, the correct denominator N′_D for computing the false alarm rate f remains to be determined. Note that here, the number of distractor events N′_D is a hypothetical entity used for calculating the false alarm rate; it does not correspond to the number of actually presented non-target stimuli. Combining the definition that the expected sensitivity E(d′) of an observer at chance level is zero (4) with (1)–(3) yields:
N′_D = (E(F) / E(H)) · N_T    (5)
where E(H) and E(F) are the expected numbers of hits and false alarms, respectively. Given a total number of responses R, E(F) = R − E(H). Inserting this into (5) yields:

N′_D = ((R − E(H)) / E(H)) · N_T = (R / E(H) − 1) · N_T    (6)
The probability of a random response (i.e., one executed at a random point in time) being a hit (p_H) is given by:
p_H = (T_R · N_T) / T_S    (7)
where T_R denotes the length of the time window after each target in which responses are classified as hits, and T_S is the total length of the sequence in which responses are recorded. Hence, the expected number of hits E(H) of an observer at chance level is given by:
E(H) = p_H · R = (N_T · T_R / T_S) · R    (8)
Inserting (8) into (6) yields:

N′_D = T_S / T_R − N_T    (9)
and hence the false alarm rate f is given by:

f = F / (T_S / T_R − N_T)    (10)
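To make the contrast between the five denominators of Table 1 concrete, they can be evaluated for one illustrative parameter set; this is our own sketch (all variable names are ours, not the authors'), using the sequence parameters of the simulations reported below (900-s sequence, 150-ms SOA, 100 targets) and an assumed 695-ms response window.

```python
# Illustrative comparison of the false-alarm-rate denominators in Table 1.
# Parameter values follow the simulation described later in the paper;
# the variable names are this sketch's own.
T_S = 900_000           # duration of the entire stimulus sequence [ms]
SOA = 150               # stimulus onset asynchrony [ms]
N_T = 100               # number of target events
N_D = T_S // SOA - N_T  # actual number of non-target events (5900)
T_R = 695               # assumed response window length [ms]

denominators = {
    "B (non-targets)":      N_D,              # very large -> f near 0
    "C (targets)":          N_T,              # very small -> f can exceed 1
    "D (300-ms intervals)": T_S / 300,        # Andreou et al. (2011)
    "E (proposed)":         T_S / T_R - N_T,  # eq. (9)
}
for method, n in denominators.items():
    print(f"Method {method}: {n}")
```

Method B's huge denominator drives f towards 0 (inflating d′), method C's small denominator can push f above 1, and method E's value lies in between and scales with the chosen window.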
Thus having calculated the false alarm rate, the sensitivity index d′ can be derived from the hit and false alarm rates according to (1). If the response window T_R is chosen to be identical with the stimulus onset asynchrony (SOA) of the sequence, N′_D will be identical to the number of actually presented non-targets, and thus the current approach reduces to the classical SDT approach. Importantly, (9) is the only solution for N′_D that generally, and irrespective of the particular definition of the response window, satisfies the requirement that chance performance be assigned d′ = 0. Accordingly, methods B–D violate this basic prerequisite of SDT, except for the special cases in which these approaches yield the same number for N′_D as (9). In all other cases, these methods assign either too large or too small numbers for N′_D and hence lead to biased estimates of false alarm rates and d′. In deriving (7) and (8) we implicitly assumed that there are no double responses within any of the designated response windows. In practical application, two problems can occur with this assumption. First, an observer could issue some unintentional double responses, that is, two responses in close temporal succession that were not meant to be distinct from each other. This can be remedied by defining a minimum separation of consecutive responses (e.g., 200 ms). Any two responses whose temporal distance is less than this minimum value should be treated as one erroneous double response, and the second response should be eliminated from
further analysis. Second, the observer could intentionally respond more than once within one response window. In this case, only one of these responses (arbitrarily, the first one) should be counted as a hit, while all following responses should be classified as false alarms. The likelihood of multiple response occurrences within one designated window increases with the length of the response window. In extreme cases, this can lead to a number of "leftover" false alarms that is larger than the number of designated false alarm windows (i.e., the hypothetical number of distractor events N′_D), which implies a false alarm rate f larger than 1. As f represents a probability estimate, values exceeding 1 are impossible by definition. Therefore, the length of the response window for analyzing a given dataset must be assigned a sufficiently small value. At the very least, the response window length must be smaller than the average inter-response interval of the participant who responded most frequently. In the application of our approach to simulated and real data, we explored a wide range of possible response windows that did not always fulfill this requirement. Whenever this led to false alarm rates exceeding 1, we corrected the number of false alarms down to the number of designated false alarm windows. This corresponds to a truncation of the false alarm distribution, and introduces a bias by making the d′ estimates appear more homogeneous than they are. (Note that the alternative solution – excluding these impossible false alarm values from consideration – would have introduced bias as well, in this case due to incomplete sampling of the response window space; and would have implied the practical problem that d′ could not be calculated in every case.) To compare all d′ calculation methods on equal grounds, the same correction was applied to methods B–D as well.
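The two response-cleaning rules just described (debouncing at a 200-ms minimum separation, and counting only the first response inside each target's window as the hit) can be sketched as follows; this is a minimal illustration under our own naming, not the authors' implementation.

```python
def classify_responses(response_times, target_times, t_low, t_high, min_sep=200):
    """Classify button presses as hits or false alarms (all times in ms).

    Rule 1: a response following another by less than min_sep ms is
    discarded as an unintentional double response.
    Rule 2: only the first response inside a target's response window
    [target + t_low, target + t_high] counts as a hit; any further
    response in that window is a false alarm.
    """
    # Rule 1: debounce double responses.
    cleaned = []
    for t in sorted(response_times):
        if not cleaned or t - cleaned[-1] >= min_sep:
            cleaned.append(t)

    hits, false_alarms = 0, 0
    claimed = set()  # indices of targets whose window already yielded a hit
    for t in cleaned:
        hit_target = None
        for i, tt in enumerate(target_times):
            if tt + t_low <= t <= tt + t_high and i not in claimed:
                hit_target = i
                break
        if hit_target is not None:
            claimed.add(hit_target)
            hits += 1
        else:
            false_alarms += 1
    return hits, false_alarms
```

For example, with targets at 1000 and 4000 ms, responses at 1300, 1350, 4500 and 4800 ms, and a 215–910-ms window, the 1350-ms press is debounced and the 4800-ms press falls into an already-claimed window, giving two hits and one false alarm.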
For later application of the method, we recommend that false alarm rates exceeding 1 be avoided by using appropriate response window boundaries (see Section 4 for a more detailed consideration of this issue). After applying the above correction, the obtained values for the hit rate h and the false alarm rate f are limited to the range from 0 to 1, inclusive. When h or f equal 0 or 1, d′ is still not computable due to infinite values of the inverse cumulative normal distribution. This is a known issue with probability estimates in SDT, and can be corrected following general recommendations for classical SDT (Hautus, 1995; Macmillan and Creelman, 2005). According to this correction procedure, 0.5 is added to the number of observed events in each category (hits, false alarms, misses, correct rejections), and thus 1 is added to the number of targets N_T and to the hypothetical number of distractor events N′_D for each calculation (i.e., regardless of whether a hit or false alarm rate of 0 or 1 would have resulted or not). This leads to the following modifications of the above formulas (2) and (10):
h = (H + 0.5) / (N_T + 1)    (11)

f = (F + 0.5) / (T_S / T_R − N_T + 1)    (12)
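Putting equations (1), (9), (11) and (12) together, method E fits in a few lines; this is a hedged sketch (the function name and argument names are ours), using the standard library's inverse normal CDF for Z.

```python
from statistics import NormalDist

def d_prime_method_e(hits, false_alarms, n_targets, t_seq, t_win):
    """Sensitivity d' via method E; t_seq and t_win in ms.

    The 'virtual' number of non-target events is T_S/T_R - N_T (eq. (9));
    adding 0.5 to the counts and 1 to the denominators follows the
    correction of Hautus (1995), so h and f never reach 0 or 1 exactly.
    """
    n_virtual = t_seq / t_win - n_targets       # eq. (9)
    h = (hits + 0.5) / (n_targets + 1)          # eq. (11)
    f = (false_alarms + 0.5) / (n_virtual + 1)  # eq. (12)
    z = NormalDist().inv_cdf                    # inverse cumulative normal, Z
    return z(h) - z(f)                          # eq. (1)
```

With the simulation parameters used below (900-s sequence, 695-ms window, 100 targets), 50 hits and 100 false alarms yield a moderate d′ of roughly 1.4; choosing t_win equal to the SOA recovers the classical denominator, as noted in the text.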
Again, to compare methods A–E on equal grounds, the same correction was applied to all methods. All d′ calculations in the present study are based on the assumption of normal (Gaussian) noise and signal-plus-noise distributions, and on the two distributions having homogeneous variance. Note that the modified false alarm rate formula (10) can easily be applied for computing alternatives to d′ (Macmillan and Creelman, 2005; Snodgrass and Corwin, 1988; Stanislaw and Todorov, 1999), such as the area measure A′ given by:
A′ = 1/2 + [(h − f)(1 + h − f)] / [4h(1 − f)]   if h ⩾ f

A′ = 1/2 − [(f − h)(1 + f − h)] / [4f(1 − h)]   if h ⩽ f    (13)
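A direct transcription of (13) might look as follows (our own sketch; the explicit guard returning 0.5 when the two rates are equal avoids division by zero at h = f = 0 or 1 and is a common convention, not stated in the text).

```python
def a_prime(h, f):
    """Area measure A' from eq. (13); h = hit rate, f = false alarm rate."""
    if h == f:
        return 0.5  # chance-level case (also avoids 0/0 at the boundaries)
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))
```

As sanity checks, perfect discrimination (h = 1, f = 0) gives A′ = 1, chance performance gives 0.5, and swapping h and f mirrors the value around 0.5.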
2.2. Simulated data

In order to illustrate the differences between method E and frequently used approaches in the literature (methods A–D outlined in Section 1; see Table 1 for a summary of all five methods), we simulated datasets for a target detection experiment by performing 10,000 iterations of the following procedure. We generated a stimulus sequence of 900 s duration with an SOA of 150 ms comprising 100 target stimuli (and thus 5900 non-target stimuli). Target and non-target stimuli were randomly distributed with the constraint of consecutive target stimuli being separated by at least 2 s. We then determined the shape and scale parameters of a gamma distribution to simulate the response time characteristics of the observer on this iteration, randomly choosing from a range of 8–12 for the shape parameter k and from a range of 40–60 for the scale parameter θ. These values correspond to mean response times in the range of 320–720 ms, constituting a fairly large amount of variation between the simulated observers. Based on the gamma distribution for the current iteration (i.e., observer), we then distributed 50 target-related responses into the sequence. These responses were placed after target stimuli with a delay chosen randomly for each of the 50 responses, governed by the parameters of the gamma distribution. Next, we put an additional 100 random responses into the sequence. These responses were distributed randomly throughout the sequence, i.e., they could fall into the designated response intervals as well (and would in this case be classified as hits, representing an incidentally correct guess). These fixed values of 50 target-related and 100 additional random responses imply that the simulated observer has a constant, moderate discrimination capacity.
Finally, we specified a range of acceptable response windows to be used for the analysis (lower bound ranging from 0 to 500 ms in steps of 50 ms; upper bound ranging from 50 to 2000 ms in steps of 50 ms; only windows with the lower bound smaller than the upper bound were included, i.e., the response windows were at least 50 ms wide). For each of the resulting response windows, and for each of the 10,000 iterations, we computed the hit rate, the false alarm rate, and the sensitivity value d′ according to the five methods A–E (see Table 1 for formulas). All computations included three correction procedures as described above: (1) any response following another response by less than 200 ms was discarded as an erroneous double click, (2) when the number of events classified as false alarms was larger than the denominator for the false alarm rate, the number of false alarms was corrected down to the value of the denominator, (3) the number of hits and number of false alarms were increased by 0.5, and the denominators for calculating the false alarm and hit rates were increased by 1 (Hautus, 1995; Macmillan and Creelman, 2005). While correction (3) was applied as a standard procedure regardless of the necessity for correction, for corrections (1) and (2) we report how often they were needed. We then averaged the hit rates (method A) and the d′ values (methods B–E) across the 10,000 iterations. The resulting measures of discrimination performance (hit rate in method A; d′ in methods B–E) for the different choices of the response window boundaries were evaluated with respect to their susceptibility to the chosen response window settings. In view of the observed differences in susceptibility towards response window choice, a fixed response window was to be chosen for further analysis that would allow for comparing the validity of the five methods on equal grounds.
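The response-window grid just described can be enumerated in a couple of lines (our own sketch; variable names are illustrative):

```python
# Enumerate the response-window grid: lower bounds 0-500 ms and upper
# bounds 50-2000 ms, both in 50-ms steps, keeping only windows whose
# lower bound is smaller than the upper bound (i.e., at least 50 ms wide).
windows = [(lo, hi)
           for lo in range(0, 501, 50)
           for hi in range(50, 2001, 50)
           if lo < hi]
print(len(windows))  # 385 candidate windows
```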
The fixed response window was chosen to be 215–910 ms, as this window contains 95% of the responses based on the parameters of the gamma distribution specified above. The chosen window thus corresponds to the simulated observers’ response time characteristics, and should yield optimal results for each of the five methods.
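One iteration of the simulation described above might be sketched as follows. This is a simplified illustration under our own naming; in particular, target placement uses plain rejection sampling and targets are drawn on the 150-ms stimulus grid, details the paper does not specify.

```python
import random

random.seed(1)  # reproducible sketch
T_S, SOA, N_T = 900_000, 150, 100  # 900-s sequence, 150-ms SOA, 100 targets [ms]

# Place 100 targets on the stimulus grid with at least 2 s between any two.
targets = []
while len(targets) < N_T:
    t = random.randrange(0, T_S, SOA)
    if all(abs(t - u) >= 2000 for u in targets):
        targets.append(t)
targets.sort()

# Observer-specific gamma response-time distribution:
# shape k in [8, 12], scale theta in [40, 60] -> mean RT k*theta in [320, 720] ms.
k = random.uniform(8, 12)
theta = random.uniform(40, 60)

# 50 target-related responses: a random target plus a gamma-distributed delay.
target_responses = [random.choice(targets) + random.gammavariate(k, theta)
                    for _ in range(50)]
# 100 additional responses placed uniformly at random across the sequence.
random_responses = [random.uniform(0, T_S) for _ in range(100)]
responses = sorted(target_responses + random_responses)
```

Feeding `responses` and `targets` into the hit/false-alarm classification and the d′ formulas of methods A–E would then reproduce one cell of the simulation.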
In the next step, we manipulated the simulated observer's discrimination capacity by changing the number of target-related and additional random responses in the sequence. Our aim was to evaluate whether the discrimination performance estimate in each of the five methods validly reflects the known manipulation of discrimination capacity. The number of target-related responses in the sequence was manipulated in 5 steps (0, 25, 50, 75 or 100). The number of additional random responses ranged from 0 to 200 in steps of 5. Both types of responses were distributed into the sequence according to the same principles as specified above. For each combination of a given number of target-related and a given number of random responses, 10,000 iterations were performed. For each iteration, we computed hit rates, false alarm rates, and sensitivity values d′ according to the five methods A–E (see Table 1 for formulas), using 215–910 ms as the acceptable response window. Correction procedures were applied as detailed above. We then averaged the hit rates (method A) and the d′ values (methods B–E) across the 10,000 iterations. The performance measures obtained with the five methods (hit rate in method A; d′ in methods B–E) were assessed as to whether they validly reflect the known manipulation of discrimination capacity, with particular emphasis on whether random response patterns (i.e., 0 target-related responses with any number of additional random responses) are correctly identified as being random.

2.3. Empirical data

For testing the application of the five methods on empirically obtained data, we chose an auditory oddball experiment (Bendixen et al., 2012) where participants were instructed to press a button whenever they detected a rare target tone (deviant) amongst frequently occurring other tones (standards) presented at a fast pace.
The auditory oddball paradigm is one of the prototypical situations in which the problem of d′ calculation with fast stimulus presentation arises. Following the response characteristics of the auditory system (Cowan, 1984; Grimm et al., 2006; Winkler et al., 1998; Yabe et al., 1997, 1998; Zwislocki, 1969), tones are very often presented at a rate of 4 Hz or faster. The pre-attentive discrimination between two types of tones under such fast presentation can readily be evaluated by electrophysiological indicators such as the MMN component of the ERP (Näätänen et al., 2007). Complementing MMN studies by an investigation of attentive discrimination capacities is often desirable but, as outlined in Section 1, faces the problem of transferring the rapid pre-attentive design to an attentive condition. In the oddball experiment (Bendixen et al., 2012), participants were presented with four blocks, each comprising 1912 task-relevant tones presented at an SOA of 150 ms. In each block, 68 tones (3.56%) were deviants in the otherwise regular time–frequency structure of the tone sequence. Participants were instructed to respond to these deviants by button press. Speed and accuracy were equally emphasized in the instructions. The randomization was constrained so that the minimum interval between two successive deviant tones would be 2700 ms; participants were not informed of this constraint. The time–frequency structure of the sequence was very difficult to follow due to informational masking; hence, low discrimination sensitivity was to be expected, with many participants performing near chance level (d′ ≈ 0). While participants performed the deviant detection task, electrophysiological recordings were made with nose reference from seven scalp electrodes (F3, Fz, F4, C3, Cz, C4, Pz), two mastoid electrodes (LM, RM), and four ocular electrodes (LO1, LO2, SO1, IO1; bipolarized off-line to yield horizontal and vertical electroocular activity).
Electrophysiological data were sampled at 250 Hz and filtered off-line (0.5–30 Hz band pass). ERPs elicited by standard and deviant tones were derived by averaging epochs of 500 ms duration including a 50 ms pre-stimulus baseline, excluding epochs with amplitude changes exceeding 100 µV on any channel. In order to quantify the N2b component as an electrophysiological indicator of target detection (Kasai et al., 2002; Novak et al., 1990), ERP amplitudes from 230 to 270 ms after stimulus onset were measured in the deviant-minus-standard difference waves at C3, Cz, and C4. The MMN component occurred earlier in this paradigm (peaking at around 160 ms, as determined by polarity inversion at the mastoid electrodes); the latency range of 230–270 ms is therefore assumed to contain mainly N2b, though superimposed on a more sustained negativity representing a processing negativity or a late discriminative negativity. Although the limited number of scalp electrodes did not permit detailed topographical analyses to distinguish between these deviance-related negativities, they all provide electrophysiological evidence for discrimination of the deviant tones. For simplicity, we will refer to deviance-related activity in the chosen latency range (230–270 ms) as N2b. For further details on the methodology, see Bendixen and colleagues (2012).

Button presses made by participants were continuously recorded. Data of one participant who responded only once during the entire experiment were excluded from the analysis. For the remaining participants (N = 15), the total number of button presses was highly variable (Mean = 227; Min = 38; Max = 404; note that there were 272 deviant tones in total). This high variability poses additional challenges for the analysis method and stresses the importance of using an approach like SDT, which can isolate the influence of response bias. In deciding whether button presses might represent hits (i.e., correct identifications of deviant tones), we limit ourselves to button presses occurring within the minimum inter-deviant interval (2700 ms). Any response that was issued more than 2700 ms after the onset of a deviant tone is regarded as a false alarm.

Fig. 2. Empirical data – histogram of response times. Responses of all participants (N = 15) issued from 0 to 2700 ms after the onset of a deviant tone are plotted in bins of 100 ms.
This ensures that responses are unambiguously attributed to the immediately preceding deviant tone. Fig. 2 shows the distribution of all participants' response times relative to the onset of the deviant tone, including only button presses from 0 to 2700 ms (see previous paragraph). The peak between 400 and 700 ms suggests that deviant tones were detected correctly by some participants or on some trials. The high noise floor throughout the investigated time range suggests that many of the responses represent false alarms. Fig. 2 further indicates that in this dataset, it is practically impossible to unambiguously set a window for accepting responses as correct in a data-driven manner based on response-time distributions alone.
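The d′ computation in the spirit of method E, with the false alarm rate referenced to the number of window-length intervals that do not follow a target, can be sketched as follows. This is a minimal illustration of the convention described in the text, not a reproduction of the exact Table 1 formulas; the correction procedures for rates of 0 or 1 are omitted, and all example numbers are hypothetical:

```python
from statistics import NormalDist  # standard-normal inverse CDF (z-transform)

def dprime_method_e(n_hits, n_fa, n_targets, total_ms, win_ms):
    """d' with the false alarm rate referenced to the number of
    window-length intervals without a target (method E's convention)."""
    z = NormalDist().inv_cdf
    # number of intervals without signal that are as long as the window
    n_fa_windows = (total_ms - n_targets * win_ms) / win_ms
    h = n_hits / n_targets
    fa = n_fa / n_fa_windows
    # standard corrections for rates of exactly 0 or 1 would be applied here
    return z(h) - z(fa)

# Hypothetical sequence: 50 targets in 300 s, 500 ms window,
# 40 hits and 55 false alarms → h = 0.8, fa = 0.1
d = dprime_method_e(40, 55, 50, 300_000, 500)  # ≈ 2.12
```

A purely random responder scatters responses in proportion to the time covered, so hit rate and this false alarm rate coincide in expectation and d′ comes out at zero, which is the defining property argued for in the text.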
Fig. 3. Simulated data – results for the five methods depending on response window boundaries with fixed numbers of target-related (50) and random (100) responses. Each panel represents one of the five methods. The x axes of each panel show the lower bound of the response window (min-RT); the y axes show the upper bound (max-RT), both given in ms. In panel A, color indicates hit rate, with red colors indicating higher hit rates and blue indicating hit rates close to zero. In panels B–E, color indicates d′, with red colors indicating good discrimination (d′ > 0), green indicating zero sensitivity (d′ = 0), and blue colors indicating below-zero discrimination (d′ < 0). Note that performance estimates increase with wider response windows (i.e., towards the upper left corner of the plots) in all methods except for method E. The color description refers to the online version of the journal. In the print version, white indicates moderate hit rates (h = 0.4) or zero sensitivity (d′ = 0), respectively. Increasingly darker colors indicate successively higher and successively lower hit rate and d′ values, with additional contour lines to mark lower values (see legend).
In the same way as for the simulated data, hit rates, false alarm rates, and sensitivity values d′ were computed according to the five methods A–E (see Table 1 for formulas) with the specified correction procedures. All possible response windows within the borders of 0 to 2700 ms were explored, with two constraints imposed: (1) the lower bound of the response window must be smaller than the upper bound of the response window; and (2) the lower bound of the response window must be smaller than 600 ms in view of the peak in the overall response time distributions (Fig. 2). Response window boundaries were again chosen in steps of 50 ms. The measures of discrimination performance (hit rate in method A; d′ in methods B–E) were calculated for different choices of the response window boundaries, and were first compared against each other with respect to their stability under these different choices. Second, the relations between the different methods' performance estimates were explored by correlation analyses. Third, the validity of the performance estimates was assessed based on their relation with the amplitude of the N2b component as an electrophysiological measure of discrimination performance (Kasai et al., 2002; Novak et al., 1990).

3. Results

3.1. Simulated data

Fig. 3 shows the results of the simulation with fixed numbers of target-related and random responses under different choices of the response window boundaries. The more inhomogeneous an individual color plot is, the larger the influence of the response window on the performance estimates. Fig. 3 shows that the results obtained with method E are more homogeneous than those with methods A–D. Inhomogeneity in methods B–E was quantified as the standard deviation between the d′ calculations for all combinations of response window boundaries. The resulting inhomogeneity amounts to 0.616 (method B), 1.061 (method C), 0.623 (method D), and 0.325 (method E).
Inhomogeneity in method A is difficult to compare with the other methods numerically due to the different range of observable values. Yet visual inspection of Fig. 3 suggests similar or even higher inhomogeneity than in methods B–D. Fig. 3 further shows that the observed inhomogeneity in methods A–D stems from a tendency of the performance estimate to increase with the length of the response window (i.e., when moving towards the upper left corner of the plots). In contrast, method E is the only one in which the maximum performance estimate lies in the middle of the range of response windows. Note that the "optimal" response window for method E is approximately 250–700 ms, which corresponds very well with the actual distribution of response times. Prior to the calculations, 2.74% of the responses were discarded as erroneous double clicks (this pertains to all methods). Correction of false alarm rates larger than one was necessary in 46.65% of the cases for method C. No such correction was needed for the other methods, because they do not calculate false alarm rates (method A) or because their false alarm rate estimates are considerably lower (methods B, D, and E). The resulting d′ estimates for method C are thus for a large part based on truncated false alarm data. Without this correction, it would have been impossible to compute d′ with method C in many regions of the response window space. Fig. 3 also clearly illustrates another serious issue with method C: the estimate for d′ is below zero for all choices of response windows, although the data were generated to reflect above-chance performance.

Fig. 4 shows the results of the simulation separately for the five methods (in terms of hit rate in method A; in terms of d′ in methods B–E), using a fixed response interval (215–910 ms) with variable numbers of target-related and random responses. Prior to the calculations, 2.91% of the responses were discarded as erroneous double clicks (again pertaining to all methods). Again, correction of false alarm rates larger than one was necessary only for method C (in 38.91% of the cases), but not for any of the other methods.

Fig. 4. Simulated data – results for the five methods applying a fixed response interval (215–910 ms) with variable numbers of target-related and random responses. Hit rates (method A, top left) and d′ values (method B, middle left; method C, middle right; method D, bottom left; method E, bottom right) for five different numbers of target-related responses (marked by different line styles). The number of additionally inserted random responses is shown on the x axes of each panel. These responses were put into the sequence at random points, thus they could accidentally fall within the specified response intervals. Note that for sequences containing only random responses (strong dotted line), method E is the only one that correctly yields a performance estimate of zero (indicating chance performance), while methods B and D overestimate and method C underestimates performance. Method A is the only method in which, for a fixed number of target-related responses, the resulting performance measure can be artificially increased by adding random responses. In all other methods, performance estimates correctly decrease when adding random responses, with the exception of methods B and D for sequences containing only random responses. In method C, false alarm rates above one result when many random responses are given; the applied correction by truncation of the false alarm rate leads to artificially constant sensitivity values.
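The central diagnostic in these simulations, namely that a purely random responder must come out at chance, can be checked with a small Monte Carlo sketch using the false alarm convention of method E as described in the text. All parameters below are illustrative, not the ones used in the actual simulations:

```python
import random

random.seed(1)
T, n_targets, w, n_resp = 300_000, 50, 500, 150  # ms; illustrative values
# hypothetical target layout: onsets on a 1000 ms grid, so the 500 ms
# acceptance windows never overlap
targets = sorted(random.sample(range(0, T, 1000), n_targets))

hit_rates, fa_rates = [], []
for _ in range(200):
    # a purely random responder: responses uniform over the whole sequence
    resp = [random.uniform(0, T) for _ in range(n_resp)]
    hits = sum(any(t <= r <= t + w for t in targets) for r in resp)
    fas = n_resp - hits
    hit_rates.append(hits / n_targets)
    # method E: false alarms relative to the number of window-length
    # intervals without a target
    fa_rates.append(fas / ((T - n_targets * w) / w))

mean_h = sum(hit_rates) / len(hit_rates)
mean_fa = sum(fa_rates) / len(fa_rates)
# mean_h and mean_fa coincide, so z(h) - z(fa) is centered on zero
```

The random responder produces a non-zero hit rate just by chance, but the method E false alarm rate matches it in expectation, so the resulting d′ is centered on zero, which is exactly the behavior Fig. 4 shows for method E and the other methods fail to show.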
Truncation of the false alarm rate leads to constant d′ estimates with method C when adding many random responses, such that there are more responses than targets in the sequence (cf. right-hand part of panel C in Fig. 4). Note that a simulated sequence of only random responses will, just by chance, contain some responses within the acceptable response window after a target stimulus, and thus will yield a hit rate above 0. If compensated by a proper measure of the false alarm rate, this simulated sequence will nevertheless be identified as reflecting zero discrimination capability (i.e., random responses). It is evident in Fig. 4 that only method E results in zero sensitivity (d′ = 0) for such random sequences, while methods B and D overestimate sensitivity, and method C underestimates sensitivity in this situation. This overestimation (underestimation) bias translates into higher (lower) d′ estimates along the whole continuum of response distributions. As expected, Fig. 4 also clearly shows that hit rates alone (method A) can be artificially increased by adding random responses (i.e., guessing). In methods B–D, performance estimates correctly decrease when adding random responses to a fixed, non-zero number of target-related responses. With zero target-related responses, the performance estimate should not be influenced by the number of random responses, yet method C yields a performance decrease in this case, and methods B and D even a slight increase of performance with a larger number of random responses. Again, only method E behaves in the expected way. In sum, the properties of the simulated data are best captured by the performance estimate according to method E.

3.2. Empirical data

Fig. 5 displays the results of evaluating performance (in terms of hit rate in method A; in terms of d′ in methods B–E) in the 15 participants' data for different choices of the response window boundaries.

Fig. 5. Empirical data – results for the five methods depending on response window boundaries. Each subplot represents one of the fifteen participants. The x axes of each sub-plot show the lower bound of the response window (min-RT); the y axes show the upper bound (max-RT), both given in ms. In panel A, color indicates hit rate, with red colors indicating higher hit rates and blue indicating hit rates close to zero. In panels B–E, color indicates d′, with red colors indicating good discrimination (d′ > 0), green indicating zero sensitivity (d′ = 0), and blue colors indicating below-zero discrimination (d′ < 0). The color description refers to the online version of the journal. In the print version, white indicates moderate hit rates (h = 0.4) or zero sensitivity (d′ = 0), respectively. Increasingly darker colors indicate successively higher and successively lower hit rate and d′ values, with additional contour lines to mark lower values (see legend).

As for the simulated data, the more inhomogeneous an individual color plot is, the larger the influence of the response window on the performance estimates. Again, the performance estimate in methods A–D increases with the length of the response window (i.e., when moving towards the upper left corner of the plots), while this is not the case for method E. Inhomogeneity in
methods B–E was quantified as the standard deviation of the d′ calculations within a participant across all combinations of response window boundaries. The mean across participants of the resulting inhomogeneity measure was compared between the four methods in a repeated-measures ANOVA with the 4-level factor calculation method (B–E). The method of calculation was shown
to have a significant impact on inhomogeneity [F(3, 42) = 28.537, p < .001, Greenhouse–Geisser ε = 0.499, partial η² = 0.671]. The highest inhomogeneity was observed for method C (1.045), followed by methods D (0.649), B (0.637), and E (0.316). Post hoc analyses with Bonferroni correction of the confidence level for multiple comparisons revealed significant differences in inhomogeneity for all pairwise comparisons (all p values < .01). Again, inhomogeneity in method A is difficult to compare with the other methods statistically due to the different range of observable values. Yet visual inspection of Fig. 5 suggests similar, if not higher, inhomogeneity than in methods B–D. In sum, d′ calculations in method E were least influenced by the choice of the acceptable response window.

No responses had to be discarded as erroneous double clicks in the empirical data. Correction of false alarm rates larger than one was not necessary for any participant with methods B and D. With method C, false alarm truncation was needed for 6 out of 15 participants, who responded much more often than there were deviants in the sequence (most noticeable for participant 9, who responded 404 times overall). In these 6 participants, an average of 37.47% of the investigated response windows was affected (maximum of 89.58% for participant 9; average across all participants: 14.99%). Thus, without correction, it would have been impossible to compute d′ with method C in many regions of the response window space, as had been observed for the simulated data. Method E also requires correction in some cases, albeit to a much smaller extent. It can treat 'frequent responders' without correction in larger regions of the response window space, but correction was necessary in 3 out of 15 participants, with an average of 6.35% of the investigated response windows being affected (maximum of 14.74% for participant 9; average across all participants: 1.27%). In method E, false alarm rates larger than one occur when the response window is chosen so that there are more false alarms remaining for a participant than there are designated false alarm windows. Note that choosing response windows small enough to make more than one response within any response window unlikely was pointed out above as one of the prerequisites for method E (see Section 2).

Fig. 6. Empirical data – comparison of discrimination performance estimates across subjects in the five calculation methods. Hit rates (top panel) and sensitivity values d′ (bottom panel) averaged across the tested response windows are plotted for each subject. Values in both panels are sorted by decreasing sensitivity in method E. Note that the hit rates (method A) are not well correlated with the sensitivity values obtained in methods B–E. The relative order of the subjects' sensitivity values is roughly similar in methods B–E. The absolute values are, however, heavily affected by the calculation method.

So far, we have considered the variability and computability of the performance measures. We now turn to comparing the actual results between the five methods in order to assess their correspondence to each other and to the putative real discrimination capabilities (i.e., their validity). Unlike for the simulated data, it was not feasible to unambiguously choose a meaningful response window for the comparison of the five methods, in view of the large amount of truncated false alarm data with method C, particularly for participant 9. Therefore, hit rates and sensitivity values were averaged across all response windows within each participant, separately for the five calculation methods. The resulting mean hit rate and sensitivity for each participant are displayed in Fig. 6. It is obvious that the performance measures obtained in method A are the only ones that show pronounced qualitative differences in the order of participants' discrimination performance as compared to the other four methods. This was evidenced by highly significant pairwise correlations between the performance measures from any two of the methods B–E (all r values > .695, all p values < .01). In contrast, hit rates (method A) showed much lower and partially even non-significant correlations with the performance estimates obtained with the other methods (methods A and B: r = .688, p < .01; A and C: r = .022, p = .938; A and D: r = .635, p < .05; A and E: r = .359, p = .189).
Although methods B–E yield roughly the same order of sensitivity between participants, the sensitivity values themselves are at highly different levels (cf. also Figs. 3–5). Sensitivity estimates in methods B and D are generally higher than in the other two methods, with d′ > 0.5 for all participants. In contrast, method C yields generally low sensitivity estimates, with d′ < −0.5 for many participants. This indicates that the majority of participants performed worse than would be expected had they simply responded by chance. Method E yields an intermediate pattern, with some participants reaching d′ values clearly above 0 (indicating successful discrimination) and many others showing d′ values around 0 (indicating chance performance). In sum, sensitivity estimates in methods B–E do not differ much on an ordinal scale, but differ considerably on a metrical scale, i.e., in the absolute sensitivity values. Our analytical conclusions and simulations, which argue for serious distortions of detection performance estimates in method A, an overestimation of d′ in methods B and D, and an underestimation of d′ in method C, are thus in full agreement with the patterns observed in the empirical data (Figs. 5 and 6). In a final analysis, we investigated how the behavioral discrimination measures in the five methods correspond with the N2b component of the ERP as an electrophysiological indicator of successful discrimination (Kasai et al., 2002; Novak et al., 1990). Fig. 7 displays scatter plots of individual participants' N2b amplitude (averaged across C3, Cz, and C4) and their behavioral performance estimate according to the five methods. In order to quantify the relation between behavioral and electrophysiological measures of discrimination, a linear regression model was fitted to predict the behavioral performance measure from the N2b amplitude.
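Such a regression fit can be sketched with ordinary least squares; an intercept near zero then indicates that zero behavioral sensitivity coincides with a vanishing N2b. The data below are hypothetical, chosen only to illustrate the computation:

```python
def ols_fit(x, y):
    """Least-squares line y = a + b*x with R^2 (no external dependencies)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical data: N2b amplitudes in microvolts (more negative = larger
# N2b) and behavioral d' estimates; values are illustrative, not the study's.
n2b = [-2.0, -1.5, -1.0, -0.5, 0.0]
dprime = [1.6, 1.2, 0.8, 0.4, 0.0]
intercept, slope, r2 = ols_fit(n2b, dprime)
# Here the intercept is zero: a participant with no N2b shows d' = 0.
```

Testing whether the fitted intercept's confidence interval contains zero is then the criterion applied to the five methods below.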
These two measures should be clearly correlated and, under ideal conditions (i.e., if both N2b and d′ were unbiased and equally sensitive measures of discrimination performance), the regression line should cross the intersection of the axes, reflecting the fact that participants with zero behavioral discrimination capacity (d′ = 0) do not show an N2b component. This condition is best satisfied by method E, with a significant fit of the regression model (R² = 0.34, p < .05) and a non-significant regression intercept of −0.05 (confidence interval, CI: −0.40 to 0.31).

Fig. 7. Empirical data – correlation of discrimination performance measures with the N2b component. Upper right panel: grand-average ERPs elicited by standard (dotted line) and deviant (dashed line) tones and the deviant-minus-standard ERP difference wave (strong solid line), averaged across the C3, Cz, and C4 electrodes. The latency range of the N2b component is marked in gray. For illustration, ERPs are additionally plotted separately for high-performing (d′ above median) and low-performing (d′ below median) participants. The median split was performed on the d′ estimates derived with method E, but the order of d′ results is highly similar in methods B–D. Upper left panel: relation between single-subject N2b amplitudes and hit rates with regression line. Middle and lower panels: relation between single-subject N2b amplitudes and sensitivity as measured by methods B–E with regression lines. Note that ideally (i.e., if N2b and d′ were unbiased and equally sensitive measures of discrimination performance), the regression line should cross the intersection of the d′ and N2b axes.

The goodness of fit of the regression model in methods B–D is comparable to that of method E (method B: R² = 0.29; method C: R² = 0.29; method D: R² = 0.31; all p values < .05), but for all three methods, the intercept values significantly deviate from zero (method B: 1.31 [CI 1.05–1.56]; method C: −1.47 [CI −2.22 to −0.71]; method D: 1.02 [CI 0.75–1.28]). Thus, under the assumption that N2b and d′ are equally sensitive measures in this paradigm, the N2b results corroborate the conclusion that method C underestimates and methods B and D overestimate sensitivity values. The regression model for hit rates (method A) was not significant (R² = 0.08; p = .30). Taken together, the N2b results support the view that methods B–E provide valid measures of discrimination performance on an ordinal scale, yet among these, method E gives the most veridical picture of participants' discrimination capacities.

4. Discussion

We analytically derived a method for measuring behavioral performance in perceptual detection paradigms with high rates of event presentation, and contrasted this method with four approaches frequently used in the literature (methods A–D). The most simple approach (method A), based on analyzing only
successfully detected signals (i.e., hits), is highly dependent upon the definition of response windows, can be distorted by the overall response rate, and shows no meaningful relation with an electrophysiological indicator of discrimination, the N2b. All methods that additionally take responses to non-signals (i.e., false alarms) into account for calculating observer sensitivity d′ are superior in that they are less affected by the overall response rate and show relations with the N2b. Of these, method E as proposed here (evaluating false alarms relative to the number of time intervals without signal that are equally long as the accepted response intervals) is superior to the three other measures, as these are all highly dependent upon the choice of response windows and yield biased estimates of observer sensitivity. The results obtained with method A (based on discarding false alarms and simply taking hit rate as a measure of performance) are in striking contrast to the more consistent pattern of results obtained with methods B–E (based on different ways of calculating sensitivity d′ through defining a reference point for the false alarm rate). This was expected based on the myriad of theoretical arguments that can be put forward against a pure hit-rate analysis, first and foremost the failure to account for strategic components of the decision process in perceptual judgments (Green and Swets, 1966/1988; Macmillan and Creelman, 2005; see Section 1). Nevertheless,
hit rates are easy to compute and thus often preferred. The present data show how such a procedure can obscure the relation between behavioral and electrophysiological indicators of discrimination performance and thus lead to erroneous conclusions about the underlying processes. Moreover, our data demonstrate an alarming extent to which hit rates are susceptible to the specification of the acceptable response window and to the participant's frequency of responding. Therefore, consistent with classical SDT, we strongly discourage the use of discrimination performance measures based on hit rates alone. Instead, following the principles of SDT (Green and Swets, 1966/1988; Macmillan and Creelman, 2005), we strongly recommend taking false alarms into account and evaluating performance in terms of sensitivity d′. More specifically, we recommend calculating false alarm rates according to method E, based on evaluating false alarms relative to the number of time intervals without signal that are equally long as the specified response intervals. Although the pattern of results was roughly similar in the four methods of calculation, at least two arguments can be put forward favoring method E over the other three. Method B (based on evaluating false alarms relative to the number of non-signals), method C (based on evaluating false alarms relative to the number of signals), and method D (based on evaluating false alarms relative to the number of simple responses that could maximally be executed when assuming a typical response time) all show undesirable properties in that they are heavily dependent on the choice of the acceptable response window. Such dependency rules out a method as a valid measure of discrimination performance unless some means is provided to unequivocally determine "the" suitable response window for a given dataset, which to the best of our knowledge does not exist.
This problem is even further aggravated in cases where reaction times differ markedly between participants, as "the" suitable response window might vary individually. In contrast, method E does not suffer from strong response-window dependency. Therefore, choosing response window boundaries that do not match the participants' response time profile would artificially increase the advantage of method E over the other methods (up to a certain point: the d′ estimates of method E will also be erroneous if the mismatch between the response profile and the chosen window boundaries is too large). Knowing that a moderate amount of mismatch between response window boundaries and RT profile is more detrimental to methods B–D than to method E, we devised compensatory measures to compare all methods on fair grounds. Specifically, our analysis was favorable to methods B–D by basing all calculations on a window chosen to reflect the known response time distribution (for the simulated data) or on the mean sensitivity values across a wide range of investigated response windows (for the empirical data). This stabilized the results and made them more comparable with those of method E. In spite of the optimized window (simulated data) or the averaging procedure (empirical data), methods B–D still lead to biased estimates of true sensitivity. The simulated data demonstrate that method E reflects participants' discrimination capabilities quite well (in particular, it is the only one that faithfully measures chance-level performance), while methods B and D overestimate and method C underestimates participants' discrimination capabilities.
The real data are also suggestive of this conclusion, with all participants estimated to perform above chance (d′ > 0.5) in methods B and D and many participants estimated to perform below chance (d′ < −0.5) in method C, while method E yields a more realistic pattern, with some participants showing reasonable sensitivity and many others exhibiting sensitivity near chance (d′ ≈ 0), as is to be expected in a highly difficult discrimination task. Finally, the relation of the N2b component as an electrophysiological correlate of detection performance (Kasai et al., 2002; Novak et al., 1990) with the behavioral performance estimate speaks in favor of method E. The goodness
of fit when regressing the behavioral performance estimate of method E on the N2b amplitude was remarkable (cf. Fig. 7). This close correspondence suggests that the N2b is equally sensitive as the behavioral performance estimate d′ in this paradigm, although future studies are required to verify this observation. It should be noted that the superiority of method E over methods B–D is substantiated not only by our simulated and empirical data but, more importantly, by the logical and mathematical arguments outlined in Sections 1 and 2. For instance, the response-window dependency is immediately obvious from the computation formulas (cf. Table 1). Nevertheless, we contrasted all methods here to give the reader an impression of the extent of the bias that the other methods incur. It is perhaps comforting to note that methods B–D reflect the order of participants' discrimination capabilities quite well, suggesting that correlation analyses in previous studies employing one of these methods would still hold. Any analysis based on the absolute sensitivity values would, however, suffer from substantial bias (differences in d′ estimates of up to ±1), and previous studies should be reviewed for erroneous conclusions resulting from these biases (such as the presence of electrophysiological but not behavioral signs of successful discrimination, or vice versa, depending on the direction of the bias). Our empirical data were drawn from a dataset with low sensitivity (d′ ≈ 0 for many participants). This situation was chosen because biases in assessing detection performance become particularly apparent when performance is close to chance level, for three reasons. First, biases introduced by an improper handling of false alarms are more dramatic when there is a higher number of false alarms.
Second, the data-driven choice of meaningful response window boundaries is more difficult in close-to-chance performance situations because the response time distribution inseparably contains both hits and false alarms. Third, the correct value for d′ is most obvious in the case of chance performance. We would like to clarify that the mathematical derivation of the d′ formula in method E is entirely unrelated to the range of empirically observed d′ values. In particular, formula (4), requiring d′ = 0 for the random observer, is a prerequisite which, by definition of SDT, any valid d′ estimate has to fulfill – it was introduced for theoretical, not for empirical, reasons. There is no reason to believe that the superiority of method E would not hold for situations with above-chance performance. In fact, we have successfully applied our approach in two experimental studies with much easier discrimination tasks and thus clearly above-chance performance (Rimmele et al., 2012; Scharinger et al., 2012). Some further comments are warranted on method D, given the recency of its suggestion (Andreou et al., 2011). According to this method, the denominator for computing the false alarm rate is the number of simple responses that could have been executed during the entire time of the stimulus sequence. This approach bears higher face validity than the ''extreme'' solutions offered by methods B and C. Nevertheless, the quality of its outcome is comparable to that of method B (using the number of non-targets as the denominator for the false alarm rate). How similar the results of the two methods are will depend on the stimulation rate. In any case, method D (just like methods B and C) does not solve the issue of response-window dependency, and it additionally introduces target-probability dependency, as the false alarm rate is defined relative to the entire time of the stimulus sequence, without excluding those intervals in which responses would be counted as hits.
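The dependence of the false alarm rate (and hence of d′) on the chosen denominator can be illustrated with a short sketch. This is a schematic illustration based on the verbal descriptions of methods B, D, and E in this paper; the session counts below are hypothetical, and the paper's exact formulas (10)–(12) and Table 1 remain authoritative.

```python
# Schematic comparison of false-alarm-rate denominators for methods B, D, E.
# All session numbers here are hypothetical, chosen only for illustration.
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Classical SDT: d' = z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical session: 60 targets, 540 non-targets, SOA 0.2 s,
# response window of 1.0 s, 45 hits, 30 false alarms.
n_targets, n_nontargets = 60, 540
soa, window = 0.2, 1.0
hits, false_alarms = 45, 30

hit_rate = hits / n_targets
seq_duration = (n_targets + n_nontargets) * soa  # 120 s of stimulation

# Method B: false alarms relative to the number of non-targets.
fa_rate_b = false_alarms / n_nontargets
# Method D: relative to the number of simple responses that could have been
# executed during the entire stimulus sequence.
fa_rate_d = false_alarms / (seq_duration / window)
# Method E (as described in the text): relative to the number of
# window-length intervals without signal.
fa_rate_e = false_alarms / ((seq_duration - n_targets * window) / window)

for label, fa in [("B", fa_rate_b), ("D", fa_rate_d), ("E", fa_rate_e)]:
    print(f"method {label}: FA rate {fa:.3f}, d' {d_prime(hit_rate, fa):.2f}")
```

With these (made-up) counts, methods B and D yield smaller false alarm rates and thus larger d′ estimates than method E, in line with the overestimation bias discussed above. Note that when the window length equals the SOA, the method-E denominator reduces to the number of non-targets, i.e., to classical SDT.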
These properties, in addition to the observed bias (overestimation of sensitivity values), must lead us to discourage the use of this method with the same clarity as the (more obviously infeasible) alternative methods B and C. Thus, based on theoretical and empirical arguments, method E is the preferable alternative for measuring target detection performance in paradigms with high event rates. The calculation of d′
is straightforward with the above formulas (11) and (12) as soon as a response window is specified and responses are thus classified into hits and false alarms. The specific response window setting has very little influence on the resulting performance estimates as long as meaningful constraints are applied. If, for example, the lower bound of the response window is set higher than the average response time of hits, d′ values will obviously be obscured. To prevent such unusual circumstances, we recommend calculating d′ values across a range of response windows and identifying unstable regions based on plots such as the ones displayed in Figs. 3 and 5. Some previous studies have already used the strategy of investigating a range of response windows to show that the pattern of results does not change (e.g., Woods et al., 2001); we clearly recommend a systematic investigation along these lines. This will also help in avoiding those response window settings that require truncation of the false alarm rate to correct for values above one. Interestingly, in method E, the response window regions requiring correction are preceded by regions in which the otherwise relatively constant sensitivity measure drops (cf. Fig. 5). This pattern should signal caution in using the affected response window specifications. Method E thereby seems to provide a means to determine a feasible range of response windows in a data-driven manner. This empirical verification of appropriate response windows for a given dataset will further increase the informative value of discrimination performance estimates with method E. We would like to emphasize, however, that the issue of finding ''the'' suitable response window is not, and probably cannot be, unequivocally resolved with our proposed method E. Importantly, method E is conservative in that an inappropriate choice of response window leads to an estimate of d′ that is closer to chance level (Fig. 3).
Accordingly, such errors of application may obscure effects but will not create spurious effects, unlike methods B–D, which tend to yield extreme values for d′ in this situation. When the rate of event presentation is decreased to the extent that the SOA itself can be used as the response window boundary, the formula used for calculating the false alarm rate (10) merges into the classical SDT approach (see Section 2). In that sense, method E does not constitute a distinct novel approach, but a generalization of classical SDT (Green and Swets, 1966/1988; Macmillan and Creelman, 2005). Given the simplicity of its implementation, it should prove useful for a variety of fields. Although our simulated and empirical data pertain to rapidly presented but still discrete events, the same logic applies to situations with continuous presentation, such as uninterrupted auditory streams with hidden patterns (Berti et al., 2000) or ongoing visual stimulation in which participants monitor changes at the fixation point in order to withdraw their attention from the periphery (Müller et al., 2010) or from background images superimposed on the task-relevant foreground (Hindi Attar et al., 2010). Note also that, although our simulations and empirical data were based on a fixed SOA value, the formulas do not incorporate the SOA value (neither for method E nor for any of the other methods, cf. Table 1), and there is no theoretical or mathematical reason to believe that the results would not be transferable to situations with variable SOA or continuous presentation. This is because the logic of the calculation in method E is based on defining concrete windows around the target stimuli (whenever they occur) and fictitious windows around the ''distractor stimuli'' (which are a hypothetical entity and thus have no time-point of occurrence). The calculation is completely independent of the actual temporal distance between the stimuli.
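The recommended stability check across a range of response windows can be sketched as follows. The event times, window settings, and the simple correction for extreme proportions are all illustrative assumptions on our part; the classification rule and the false-alarm denominator follow the verbal description of method E in this paper, not its exact formulas (10)–(12).

```python
# Sketch of the stability check: compute d' (method E, as described in the
# text) for several hypothetical response window settings and compare.
from statistics import NormalDist

def classify(target_times, response_times, lo, hi):
    """Count hits (response lands [lo, hi] s after some target) and false alarms."""
    hits = fas = 0
    for r in response_times:
        if any(lo <= r - t <= hi for t in target_times):
            hits += 1
        else:
            fas += 1
    return hits, fas

def d_prime_method_e(target_times, response_times, lo, hi, seq_duration):
    z = NormalDist().inv_cdf
    n_targets = len(target_times)
    window = hi - lo
    hits, fas = classify(target_times, response_times, lo, hi)
    # False alarm rate relative to window-length intervals without signal,
    # truncated at one (cf. the truncation discussed in the text).
    n_noise_windows = (seq_duration - n_targets * window) / window
    hit_rate = hits / n_targets
    fa_rate = min(fas / n_noise_windows, 1.0)
    # Clamp extreme proportions so z() stays finite (a common correction;
    # the specific form here is an illustrative choice).
    eps = 0.5 / max(n_targets, n_noise_windows)
    hit_rate = min(max(hit_rate, eps), 1 - eps)
    fa_rate = min(max(fa_rate, eps), 1 - eps)
    return z(hit_rate) - z(fa_rate)

# Hypothetical 20-s sequence: five targets, responses ~0.4-0.5 s after each,
# plus one spurious response at 7.1 s.
targets = [2.0, 5.4, 9.8, 14.2, 18.0]
responses = [2.45, 5.85, 10.3, 14.6, 18.5, 7.1]
for lo, hi in [(0.2, 0.7), (0.3, 0.8), (0.1, 1.1)]:
    dp = d_prime_method_e(targets, responses, lo, hi, seq_duration=20.0)
    print(f"window [{lo}, {hi}] s: d' = {dp:.2f}")
```

Plotting such d′ values over a systematically varied set of windows, as recommended above, makes unstable regions (and regions requiring truncation) directly visible.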
Indeed, our proposed method has been successfully applied in a study with jittered SOA values (Rimmele et al., 2012). In conclusion, we recommend that perceptual sensitivity d′ in paradigms with fast or continuous stimulus presentation be calculated (1) by taking false alarms into account, (2) by deriving the rate of false alarms relative to the number of time intervals without signal that are equally long as the specified response intervals for hits, and (3) by applying classical SDT for combining false alarm and hit rates. This approach, outlined as method E in this paper, appears promising for deriving true estimates of discriminative capabilities in perception and cognition and thus for investigating their relation with measures of cortical activation.

Acknowledgments

Both authors contributed equally to all aspects of this work. This work was supported by the German Research Foundation (DFG, BE 4284/1-1 to A.B., AN 841/1-1 to S.K.A.). The authors thank Zsuzsanna D'Albini and Orsolya Szálardy for collecting the data as well as Erich Schröger, Andreas Widmann, and three anonymous reviewers for helpful comments.

References

Andersen SK, Hillyard SA, Müller MM. Attention facilitates multiple features in parallel in human visual cortex. Curr Biol 2008;18:1006–9.
Andersen SK, Müller MM, Hillyard SA. Tracking the allocation of attention in visual scenes with steady-state evoked potentials. In: Posner MI, editor. Cognitive neuroscience of attention. 2nd ed. New York: Guilford; 2011.
Andreou L-V, Kashino M, Chait M. The role of temporal regularity in auditory segregation. Hear Res 2011;280:228–35.
Bendixen A, Prinz W, Horváth J, Trujillo-Barreto NJ, Schröger E. Rapid extraction of auditory feature contingencies. Neuroimage 2008;41:1111–9.
Bendixen A, Schröger E, Ritter W, Winkler I. Regularity extraction from non-adjacent sounds. Frontiers Psychol 2012;3:143.
Berti S, Schröger E, Mecklinger A. Attentive and pre-attentive periodicity analysis in auditory memory: an event-related brain potential study. Neuroreport 2000;11:1883–7.
Bertoli S, Smurzynski J, Probst R. Temporal resolution in young and elderly subjects as measured by mismatch negativity and a psychoacoustic gap detection task. Clin Neurophysiol 2002;113:396–406.
Cowan N. On short and long auditory stores. Psychol Bull 1984;96:341–70.
Egan JP, Greenberg GZ, Schulman AI. Operating characteristics, signal detectability, and the methods of free response. J Acoust Soc Am 1961;33:993–1007.
Egner T, Gruzelier JH. EEG biofeedback of low beta band components: frequency-specific effects on variables of attention and event-related brain potentials. Clin Neurophysiol 2004;115:131–9.
Gomes H, Barrett S, Duff M, Barnhardt J, Ritter W. The effects of interstimulus interval on event-related indices of attention: an auditory selective attention test of perceptual load theory. Clin Neurophysiol 2008;119:542–55.
Gomes H, Duff M, Ramos M, Molholm S, Foxe JJ, Halperin J. Auditory selective attention and processing in children with attention-deficit/hyperactivity disorder. Clin Neurophysiol 2012;123:293–302.
Green DM, Swets JA. Signal detection theory and psychophysics. New York: Wiley; 1966/1988.
Grimm S, Roeber U, Trujillo-Barreto NJ, Schröger E. Mechanisms for detecting auditory temporal and spectral deviations operate over similar time windows but are divided differently between the two hemispheres. Neuroimage 2006;32:275–82.
Hautus M. Corrections for extreme proportions and their biasing effects on estimated values of d′. Behav Res Methods Instrum Comput 1995;27:46–51.
Hindi Attar C, Andersen SK, Müller MM. Time course of affective bias in visual attention: convergent evidence from steady-state visual evoked potentials and behavioral data. Neuroimage 2010;53:1326–33.
Horváth J, Winkler I. Distraction in a continuous-stimulation detection task. Biol Psychol 2010;83:229–38.
Hutchinson TP. A review of some unusual applications of signal detection theory. Qual Quant 1981;15:71–98.
Jang Y, Wixted JT, Huber DE. Testing signal-detection models of yes/no and two-alternative forced-choice recognition memory. J Exp Psychol Gen 2009;138:291–306.
Jongsma MLA, Eichele T, Van Rijn CM, Coenen AML, Hugdahl K, Nordby H, et al. Tracking pattern learning with single-trial event-related potentials. Clin Neurophysiol 2006;117:1957–73.
Kasai K, Nakagome K, Hiramatsu K-I, Fukuda M, Honda M, Iwanami A. Psychophysiological index during auditory selective attention correlates with visual continuous performance test sensitivity in normal adults. Int J Psychophysiol 2002;45:211–25.
Kida T, Nishihira Y, Wasaka T, Nakata H, Sakamoto M. Passive enhancement of the somatosensory P100 and N140 in an active attention task using deviant alone condition. Clin Neurophysiol 2004;115:871–9.
Kimura M, Katayama J, Murohashi H. Probability-independent and -dependent ERPs reflecting visual change detection. Psychophysiology 2006;43:180–9.
Lepistö T, Silokallio S, Nieminen-von Wendt T, Alku P, Näätänen R, Kujala T. Auditory perception and attention as reflected by the brain event-related potentials in children with Asperger syndrome. Clin Neurophysiol 2006;117:2161–71.
Luce RD. A model for detection in temporally unstructured experiments with a Poisson distribution of signal presentations. J Math Psychol 1966;3:48–64.
Määttä S, Pääkkönen A, Saavalainen P, Partanen J. Selective attention event-related potential effects from auditory novel stimuli in children and adults. Clin Neurophysiol 2005;116:129–41.
Macmillan NA, Creelman CD. Detection theory: a user's guide. 2nd ed. Mahwah, NJ: Erlbaum; 2005.
Maekawa T, Goto Y, Kinukawa N, Taniwaki T, Kanba S, Tobimatsu S. Functional characterization of mismatch negativity to a visual stimulus. Clin Neurophysiol 2005;116:2392–402.
Matthews N, Todd J, Budd TW, Cooper G, Michie PT. Auditory lateralization in schizophrenia – mismatch negativity and behavioral evidence of a selective impairment in encoding interaural time cues. Clin Neurophysiol 2007;118:833–44.
Michalewski HJ, Starr A, Nguyen TT, Kong Y-Y, Zeng F-G. Auditory temporal processes in normal-hearing individuals and in patients with auditory neuropathy. Clin Neurophysiol 2005;116:669–80.
Müller D, Winkler I, Roeber U, Schaffer S, Czigler I, Schröger E. Visual object representations can be formed outside the focus of voluntary attention: evidence from event-related brain potentials. J Cogn Neurosci 2010;22:1179–88.
Müller MM, Hillyard S. Concurrent recording of steady-state and transient event-related potentials as indices of visual-spatial selective attention. Clin Neurophysiol 2000;111:1544–52.
Mussgay L, Hertwig R. Signal-detection indexes in schizophrenics on a visual, auditory, and bimodal continuous performance test. Schizophr Res 1990;3:303–10.
Näätänen R, Paavilainen P, Rinne T, Alho K. The mismatch negativity (MMN) in basic research of central auditory processing: a review. Clin Neurophysiol 2007;118:2544–90.
Neelon MF, Williams J, Garell PC. The effects of auditory attention measured from human electrocorticograms. Clin Neurophysiol 2006;117:504–21.
Nestor PG, Faux SF, McCarley RW, Shenton ME, Sands SF. Measurement of visual sustained attention in schizophrenia using signal-detection analysis and a newly developed computerized CPT task. Schizophr Res 1990;3:329–32.
Novak GP, Ritter W, Vaughan HG, Wiznitzer ML. Differentiation of negative event-related potentials in an auditory discrimination task. Electroenceph Clin Neurophysiol 1990;75:255–75.
Oglesbee E, Kewley-Port D. Estimating vowel formant discrimination thresholds using a single-interval classification task. J Acoust Soc Am 2009;125:2323–35.
Paavilainen P, Simola J, Jaramillo M, Näätänen R, Winkler I. Preattentive extraction of abstract feature conjunctions from auditory stimulation as reflected by the mismatch negativity (MMN). Psychophysiology 2001;38:359–65.
Paavilainen P, Arajärvi P, Takegata R. Preattentive detection of nonsalient contingencies between auditory features. Neuroreport 2007;18:159–63.
Pihko E, Kujala T, Mickos A, Antell H, Alku P, Byring R, et al. Magnetic fields evoked by speech sounds in preschool children. Clin Neurophysiol 2005;116:112–9.
Rimmele JM, Schröger E, Bendixen A. Age-related changes in the use of regular patterns for auditory scene analysis. Hear Res 2012;289:98–107.
Sandmann P, Kegel A, Eichele T, Dillier N, Lai W, Bendixen A, et al. Neurophysiological evidence of impaired musical sound perception in cochlear-implant users. Clin Neurophysiol 2010;121:2070–82.
Saupe K, Koelsch S, Rübsamen R. Spatial selective attention in a complex auditory environment such as polyphonic music. J Acoust Soc Am 2010;127:472–80.
Scharinger M, Bendixen A, Trujillo-Barreto NJ, Obleser J. A sparse neural code for some speech sounds but not for others. PLoS One 2012;7:e40953.
Shafer VL, Ponton C, Datta H, Morr ML, Schwartz RG. Neurophysiological indices of attention to speech in children with specific language impairment. Clin Neurophysiol 2007;118:1230–43.
Snodgrass JG, Corwin J. Pragmatics of measuring recognition memory: applications to dementia and amnesia. J Exp Psychol Gen 1988;117:34–50.
Stanislaw H, Todorov N. Calculation of signal detection theory measures. Behav Res Methods Instrum Comput 1999;31:137–49.
Sussman E, Winkler I, Kreuzer J, Saher M, Näätänen R, Ritter W. Temporal integration: intentional sound discrimination does not modulate stimulus-driven processes in auditory event synthesis. Clin Neurophysiol 2002;113:1909–20.
Tyler CW, Chen C-C. Signal detection theory in the 2AFC paradigm: attention, channel uncertainty and probability summation. Vision Res 2000;40:3121–44.
van Zuijen TL, Simoens VL, Paavilainen P, Näätänen R, Tervaniemi M. Implicit, intuitive, and explicit knowledge of abstract regularities in a sound sequence: an event-related brain potential study. J Cogn Neurosci 2006;18:1292–303.
Winkler I, Czigler I, Jaramillo M, Paavilainen P, Näätänen R. Temporal constraints of auditory event synthesis: evidence from ERPs. Neuroreport 1998;9:495–9.
Woods DL, Alain C, Diaz R, Rhodes D, Ogawa KH. Location and frequency cues in auditory selective attention. J Exp Psychol Hum Percept Perform 2001;27:65–74.
Yabe H, Tervaniemi M, Reinikainen K, Näätänen R. Temporal window of integration revealed by MMN to sound omission. Neuroreport 1997;8:1971–4.
Yabe H, Tervaniemi M, Sinkkonen J, Huotilainen M, Ilmoniemi RJ, Näätänen R. Temporal window of integration of auditory information in the human brain. Psychophysiology 1998;35:615–9.
Zwislocki JJ. Temporal summation of loudness: an analysis. J Acoust Soc Am 1969;46:431–40.