Spike detection algorithm performance and methods of acquiring expert opinion


Clinical Neurophysiology 121 (2010) 1968–1971


Letters to the Editor

Response to "Spike detection algorithm performance and methods of acquiring expert opinion"

In response to Dr. Scott Wilson's recent letter, I apologize for the error in my recent review paper (Halford, 2009), and I agree with the correction Dr. Wilson suggests: when his statistical methods, described in detail in his 1996 paper (Wilson et al., 1996), are correctly applied to the data in his 1999 paper, the performance of the automated epileptiform transient (ET) detection algorithm is sensitivity = 0.899, specificity = 0.9963, and false positives per minute = 0.5772. In his 1996 and 1999 papers, the method for acquiring expert opinion on EEG involved expert interpreters marking ETs in a set of EEG recordings and providing detailed information about the morphology of each ET, including a "perception index" for each ET (an expert rating of the "spikeness" or epileptiform appearance of each ET on a scale of 1–4). I think that this statistical approach is the most sophisticated method published to date for measuring ET detection algorithm performance. The use of the "perception index" for each ET provides at least three significant advantages. First, as Dr. Wilson described, this approach does not require the selection of an arbitrary level of consensus among a panel of experts (such as defining a paroxysmal EEG event as an ET if "3 out of 5" experts agree that it is an ET). Second, it allows a more precise calculation of algorithm sensitivity because it weights the interpretation of low-perception spikes less heavily than that of high-perception spikes in the performance metrics, preventing the sensitivity value from being lowered excessively by an abundance of spikes rated by some experts as having a low perception index. Third, it increases the correlation between the experts, thereby increasing the reliability of the expert opinion measurement and decreasing the number of experts needed to reach consensus.

In my own ongoing EEG scoring project, I am attempting to implement Dr. Wilson's approach with two additional improvements. First, because Dr. Wilson's method does not provide a measure of true-negative ET detections, I am attempting to measure this. (In Dr. Wilson's 1996 and 1999 studies, the value of true-negative ET detections was arbitrarily set equal to the highest non-ictal spike rate found in the database of EEG recordings, 2.6 spikes/s, multiplied by the length of the EEG data analyzed.) I am attempting to measure true negatives by requesting that EEG experts score the EEG in two phases. In the first phase, all paroxysmal EEG activity will be marked (including artifacts, benign electrocortical activity, and ETs). In the second phase, each expert will categorize all of the paroxysmal activity marked by all of the experts as either artifact, benign electrocortical activity, or an ET. I believe a careful measurement of specificity, a system's capacity to recognize negative activity, is important, since the labeling of a normal recording as abnormal is the most common and clinically significant mistake in EEG interpretation (Benbadis, 2007). Although the rate of false positives per minute also reflects this, that rate is more strongly influenced by the difficulty of the EEG recording.
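For illustration only, the relationship among these performance measures can be sketched as follows. This is a minimal example in Python, not Dr. Wilson's or my own analysis code; the function name and the example counts are hypothetical.

```python
# Minimal illustrative sketch (not the authors' code): how sensitivity,
# specificity, and false positives per minute relate to detection counts when
# the number of true negatives is set, as described above, to the highest
# non-ictal spike rate in the database multiplied by the length of EEG analyzed.

def detector_performance(tp, fp, fn, eeg_minutes, max_event_rate_per_s=2.6):
    """tp, fp, fn: (possibly perception-weighted) detection counts.
    eeg_minutes: length of the EEG data analyzed, in minutes.
    max_event_rate_per_s: assumed highest non-ictal spike rate (spikes/s)."""
    tn = max_event_rate_per_s * eeg_minutes * 60.0  # arbitrary estimate of true negatives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fp_per_minute = fp / eeg_minutes
    return sensitivity, specificity, fp_per_minute

# Hypothetical counts, chosen only to illustrate the arithmetic.
sens, spec, fpm = detector_performance(tp=450, fp=58, fn=50, eeg_minutes=100.0)
print(f"sensitivity={sens:.3f}  specificity={spec:.4f}  FP/min={fpm:.3f}")
```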

The second change I am making is to request less information about each ET from the experts. Although this will produce less information about expert perception of ET morphology, it will, I hope, allow more expert opinion to be collected on a larger number of EEG recordings. Again, I apologize for the error in reporting Dr. Wilson's research results.

References

Benbadis SR. Errors in EEGs and the misdiagnosis of epilepsy: importance, causes, consequences, and proposed remedies. Epilepsy Behav 2007;11:257–62.

Halford JJ. Computerized epileptiform transient detection in the scalp electroencephalogram: obstacles to progress and the example of computerized ECG interpretation. Clin Neurophysiol 2009;120:1909–15.

Wilson SB, Harner RN, Duffy FH, Tharp BR, Nuwer MR, Sperling MR. Spike detection. I. Correlation and reliability of human experts. Electroencephalogr Clin Neurophysiol 1996;98:186–98.

Wilson SB, Turner CA, Emerson RG, Scheuer ML. Spike detection. II. Automatic, perception-based detection and clustering. Clin Neurophysiol 1999;110:404–11.

Jonathan J. Halford
Division of Adult Neurology, Department of Neurosciences, Medical University of South Carolina, 96 Jonathan Lucas St., Suite 307 CSB, Charleston, SC 29425, USA
Tel.: +1 843 792 3221; fax: +1 843 792 8626
E-mail address: [email protected]
Available online 6 May 2010


Spike detection algorithm performance and methods of acquiring expert opinion

I enjoyed the recent review paper on EEG spike detection by Dr. Jonathan Halford and believe it addresses an important topic. I would like to point out what I believe is an error in Table 2, where the accuracy of the Persyst spike detector is listed (Halford, 2009). To calculate the algorithm's sensitivity, the total number of spikes (1952) marked by any of the five human experts was used. This is in stark contrast both to the method we employed and to the usual consensus method, in which a simple or super majority of readers is required to "identify a spike".

Perhaps most important to the topic is the fact that our 1999 paper introduces mathematics for avoiding the need to define a consensus threshold (Wilson et al., 1999). Instead of requiring that a spike be identified by a fixed percentage of readers, we assign each spike a weight, defined as the percentage of readers that marked it. For example, a spike marked by three of five readers is assigned a "perception" weight of 0.6. We show that assigning weights yields more accurate determinations of sensitivity, specificity, and false positive rate. In contrast to using a simple majority, this method avoids counting as a false positive a spike marked by the algorithm that was marked by only two of the five human readers – certainly something we want to avoid. Additionally, we show that the convergence of the spike weights as more readers are added can be increased by asking the readers to rate each spike, differentiating archetypal spikes from those that are identified only by similarity to their neighbors. Indeed, many of the 1952 spikes marked in our study were categorized as "poor" (i.e., assigned a numeric value of 0.25) by a single expert who wanted to avoid penalizing the algorithm in case it was extremely sensitive. The weight for these spikes was 0.05, so they produced only a small effect when comparing readers.

In conclusion, I suggest using the sensitivity and specificity values taken from Table 1 in our 1999 paper. These can be compared directly to the sensitivity and specificity of other consensus methods. To see that this is the case, realize that our method essentially calculates the sensitivity (etc.) of the algorithm versus each single human reader and then takes the mean over the five readers. The application of these methods enhances results by providing a more robust "gold standard" dataset of expert opinion on EEG spikes.
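A simplified sketch of the perception weighting and the per-reader averaging described above is given below (Python, with assumed data structures; this is illustrative only and is not the implementation used in the cited papers).

```python
# Simplified sketch with assumed data structures (not the published implementation).
# Each candidate event stores the rating given by each reader who marked it
# (1.0 for a definite mark, lower values such as 0.25 for a "poor" mark) and
# whether the automated detector marked it.

def perception_weight(ratings, n_readers):
    """Mean rating over all readers; readers who did not mark the event contribute 0.
    Marked by 3 of 5 readers -> 0.6; rated 0.25 by one reader of 5 -> 0.05."""
    return sum(ratings.values()) / n_readers

def mean_per_reader_sensitivity(events, readers):
    """Sensitivity of the detector versus each single reader, averaged over readers."""
    per_reader = []
    for reader in readers:
        marked = [e for e in events if reader in e["ratings"]]
        if marked:
            hits = sum(1 for e in marked if e["detected"])
            per_reader.append(hits / len(marked))
    return sum(per_reader) / len(per_reader)

readers = ["A", "B", "C", "D", "E"]
events = [  # hypothetical example events
    {"ratings": {"A": 1.0, "B": 1.0, "C": 1.0}, "detected": True},
    {"ratings": {"D": 0.25}, "detected": False},
    {"ratings": {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0, "E": 1.0}, "detected": True},
]
for event in events:
    print(perception_weight(event["ratings"], len(readers)))
print(mean_per_reader_sensitivity(events, readers))
```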

References

Halford JJ. Computerized epileptiform transient detection in the scalp electroencephalogram: obstacles to progress and the example of computerized ECG interpretation. Clin Neurophysiol 2009;120:1909–15.

Wilson SB, Turner CA, Emerson RG, Scheuer ML. Spike detection. II. Automatic, perception-based detection and clustering. Clin Neurophysiol 1999;110:404–11.

Scott B. Wilson
Persyst Development Corporation, 3177 Clearwater Drive, Prescott, AZ 86305, USA
Tel.: +1 928 708 0705; fax: +1 928 771 1209
E-mail address: [email protected]
Available online 21 May 2010


How precisely can the regularity of spontaneous activity be recognized acoustically?

1. Introduction

The electromyographic proof of ongoing denervation and of inflammatory muscle disease relies on the presence of two types of characteristic electrical activity in the completely relaxed muscle. These potentials are termed fibrillation potentials and positive sharp waves (Denny-Brown and Pennybacker, 1938). However, the activity that indicates disease must be distinguished from activity that occurs physiologically in the region of the neuromuscular junction (endplate potentials and endplate noise, respectively). Also, the muscle might not be completely relaxed, and one might record voluntary activity. Clinical studies indicate that the recognition of spontaneous activity requires considerable skill and that false-positive classification of potentials is a frequent error (Kendall and Werner, 2006). Thus, criteria for the identification of spontaneous activity are required.

Denervation potentials have a typical shape and amplitude (Dumitru et al., 2002).


Positive sharp waves are biphasic, with an initial positive phase followed by a smaller and longer negative deflection; durations range from several milliseconds to 100 ms. Fibrillations are biphasic (positive–negative) or triphasic (positive–negative–positive) with a duration below 5 ms. The amplitudes of both types of discharges are below 1 mV. However, these characteristics are influenced by the position of the needle electrode relative to the muscle fibre that produces the potential.

Two more reliable characteristics that differentiate between physiological and pathological potentials are related to the temporal pattern of discharge. First, fibrillation potentials and PSWs can (but do not necessarily) occur at low discharge frequencies, whereas voluntary activity occurs at higher frequencies. Second, in contrast to voluntary activity and endplate activity, there is little variation of discharge frequency with time for pathological spontaneous activity (Conrad et al., 1972; Stoehr, 1977).

As a gold standard, a quantitative approach to regularity with appropriate indices has been suggested and evaluated (Conrad et al., 1972; Heckmann and Ludin, 1982; Schulte-Mattler et al., 2001). We chose "APCID" (average proportional consecutive interdischarge difference) as a measure of irregularity (Conrad et al., 1972). For this parameter the time series of the intervals between consecutive discharges is considered. Roughly, the average difference between consecutive intervals is calculated (the sign of the difference is ignored) and divided by the average interval. Thus, the parameter describes the variability of the time interval between consecutive discharges in units of the average interval. Large APCID values correspond to irregular discharge intervals, small values to more regular trains. For fibrillation potentials and PSWs, average APCID values of 6.5 have been reported, whereas voluntary activity is characterized by average APCID values of 173 (Schulte-Mattler et al., 2001). However, there is some overlap between the entities. An APCID value of 50 has been suggested as a cut-off for differentiating between regular and irregular discharges (Schulte-Mattler et al., 2001).

The present study deals with the reliability of subjective perception in the evaluation of the regularity of a given discharge train. A minimum APCID value is necessary for a human observer to detect irregularity in a train of discharges. This threshold can be quantified as the APCID value that is classified as regular or irregular with equal probability (0.5). Trains with larger APCID values will be classified as regular with lower probability (approaching 0 for large APCID values), and trains with smaller APCID values will be classified as regular with higher probability (approaching 1 for small APCID values). The threshold for the human observer is unknown. It is also unknown whether this threshold depends on discharge frequency. The aim of this study is to measure the threshold and to discuss possible implications for the identification of spontaneous activity.

2. Methods

The subjects that participated in the study were residents of the department of Neurology (10 male, 9 female). Three subjects had completed a minimum EMG training of 6 months; the remainder were EMG-naïve. One of the residents also has a degree as a pianist. The subjects denied any hearing impairment, but this was not formally established.
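Before describing the stimuli, the APCID measure introduced above can be made concrete with a short sketch (plain Python with NumPy; the function names, the jitter model, and the parameter values are illustrative assumptions and are not taken from the cited studies).

```python
import numpy as np

def apcid(discharge_times):
    """Average proportional consecutive interdischarge difference: the mean
    absolute difference between consecutive inter-discharge intervals divided
    by the mean interval. It is expressed here as a percentage, which appears
    consistent with the reported values (6.5, 173) and the cut-off of 50."""
    intervals = np.diff(np.asarray(discharge_times, dtype=float))
    consecutive_differences = np.abs(np.diff(intervals))
    return 100.0 * consecutive_differences.mean() / intervals.mean()

def simulated_train(rate_hz, n_discharges, jitter_fraction, rng):
    """Discharge times at a given mean rate, with multiplicative jitter applied
    to each interval; jitter_fraction = 0 yields a perfectly regular train."""
    mean_interval = 1.0 / rate_hz
    intervals = mean_interval * (1.0 + jitter_fraction * rng.uniform(-1.0, 1.0, n_discharges))
    return np.cumsum(intervals)

rng = np.random.default_rng(seed=0)
nearly_regular = simulated_train(2.4, 20, 0.02, rng)   # illustrative, fibrillation-like
irregular = simulated_train(8.0, 20, 0.50, rng)        # illustrative, voluntary-like
print(f"APCID (nearly regular train): {apcid(nearly_regular):.1f}")
print(f"APCID (irregular train):      {apcid(irregular):.1f}")
```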
Audio files with simulated EMG recordings were generated as follows: a positive sharp wave (PSW) and a fibrillation potential (FIB), respectively, were taken from a patient recording. This signal was repeated 20 times at an average rate of 2.4 Hz and of 8 Hz, respectively. A discharge frequency of 2.4 Hz is at the lower end of the spectrum of discharge frequencies reported both for motor unit potentials on the one hand and for fibrillation potentials on the other (Schulte-Mattler et al., 2001). A discharge frequency