Spike detection II: automatic, perception-based detection and clustering


Clinical Neurophysiology 110 (1999) 404–411

Scott B. Wilson a,*, Christine A. Turner b, Ronald G. Emerson b, Mark L. Scheuer c

a Persyst Development Corporation, 316 Skyline Drive, Prescott, AZ 86303, USA
b Department of Neurology, Columbia University, Neurological Institute, 710 West 168 Street, New York, NY 10032, USA
c Epilepsy Laboratory, University of Pittsburgh, 811 Liliane Kaufman Building, 3471 Fifth Avenue, Pittsburgh, PA 15213, USA

Accepted 7 August 1998

Abstract

Objectives: We developed perception-based spike detection and clustering algorithms.

Methods: The detection algorithm employs a novel, multiple monotonic neural network (MMNN). It is tested on two short-duration EEG databases containing 2400 spikes from 50 epilepsy patients and 10 control subjects. Previous studies are compared for database difficulty and reliability and algorithm accuracy. Automatic grouping of spikes via hierarchical clustering (using topology and morphology) is visually compared with hand marked grouping on a single record.

Results: The MMNN algorithm is found to operate close to the ability of a human expert while alleviating problems related to overtraining. The hierarchical and hand marked spike groupings are found to be strikingly similar.

Conclusions: An automatic detection algorithm need not be as accurate as a human expert to be clinically useful. A user interface that allows the neurologist to quickly delete artifacts and determine whether there are multiple spike generators is sufficient.

© 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Electroencephalography; Spike detection; Computer; Neural network; Clustering

1. Introduction

We previously showed that the inter-reader correlation of human experts is increased when perception-based marking is utilized: small or ambiguous spikes are assigned low perception (near zero) values, average spikes are assigned medium (near one half) values, and exemplar spikes are assigned high (near one) values (Wilson et al., 1996). The probabilistic nature of spike marking allows the perception value assigned to a spike to describe the probability that it will be marked, e.g. a perception value of 0.76 suggests that 76 out of 100 expert readers will mark the spike. We also showed that spike databases created with perception markings are more reliable than those created with dichotomous (yes it is a spike; no it is not) markings, further emphasizing the need for perception-based markings.

We describe a multiple monotonic neural network (MMNN) spike detection algorithm that attempts to mimic the perception-based marking of human experts. The algorithm's accuracy is described for both the training and testing databases. By utilizing perception-based detection, the algorithm's sensitivity may be changed after a scan is complete, without rescanning, by focusing the review process on the subset of spikes above a certain perception value. The speed of review of automated detections is further increased with a novel hierarchical clustering interface, which groups similar spikes by topology and morphology and allows the immediate identification of multiple focal points and artifacts.

* Corresponding author. Tel.: +1-520/708-0705; fax: +1-520/771-1209. E-mail address: [email protected] (S.B. Wilson)
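The idea of changing sensitivity after the scan can be illustrated with a minimal sketch. It assumes each detection is stored with its perception value; the `Detection` record, its field names and the sample values are hypothetical and are not taken from the SpikeDetector software.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected spike; the field names are illustrative only."""
    time_s: float       # time of the spike vertex in seconds
    channel: str        # focal channel label
    perception: float   # detector output between 0 and 1


def review_subset(detections, min_perception):
    """Keep the detections at or above the chosen perception value."""
    return [d for d in detections if d.perception >= min_perception]


def random_subgroup(detections, size, seed=0):
    """Draw a random subgroup of at most `size` detections."""
    rng = random.Random(seed)
    return rng.sample(detections, min(size, len(detections)))


# Hypothetical scan output, produced once at the most sensitive setting
detections = [
    Detection(12.4, "F7", 0.91),
    Detection(30.1, "T3", 0.15),
    Detection(47.8, "F8", 0.55),
    Detection(61.0, "F7", 0.76),
]

# Two review settings applied to the same scan, with no rescanning
lenient = review_subset(detections, 0.1)
strict = review_subset(detections, 0.7)
```

Because the filter runs on stored perception values, the record is scanned only once and the reviewer can move the threshold freely.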

2. Methods

2.1. 'Gold standard' spike database

The first spike database used for this study was created by 5 epileptologists and is described elsewhere in detail (Wilson et al., 1996). A total of 1952 spikes were detected by one or more of the readers. Each reader assigned a spike they marked a perception value between zero and one. A spike not marked by a reader was assigned a perception of zero for that reader. Panel scores were created from the mean perception of the 5 readers. Methods from the theory of measurement error showed that the correlation of the panel scores and the true scores was 0.95, which is higher than that of any individual reader.
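As an illustration of how the panel scores are formed, the following sketch averages the perception values of 5 readers and treats an unmarked spike as zero. The spike identifiers and perception values are invented for the example.

```python
def panel_scores(markings, spike_ids):
    """Mean perception per spike across readers; an unmarked spike counts as 0.0."""
    n_readers = len(markings)
    return {
        s: sum(reader.get(s, 0.0) for reader in markings) / n_readers
        for s in spike_ids
    }


# One dict per reader: spike id -> perception value (hypothetical data)
markings = [
    {"s1": 1.0, "s2": 0.5},
    {"s1": 0.75},
    {"s1": 1.0, "s2": 0.25, "s3": 0.25},
    {"s1": 0.75},
    {"s1": 1.0, "s3": 0.5},
]

# The database contains every spike marked by one or more readers
spike_ids = sorted({s for reader in markings for s in reader})
scores = panel_scores(markings, spike_ids)
```

A spike marked by every reader with high values ("s1") receives a panel score near one, while a spike marked by only two readers ("s2", "s3") is pulled toward zero by the readers who did not mark it.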

1388-2457/99/$ - see front matter © 1999 Elsevier Science Ireland Ltd. All rights reserved. PII: S1388-2457(98)00023-6

S.B. Wilson et al. / Clinical Neurophysiology 110 (1999) 404–411

This 'gold standard' database was used to train the MMNN detection algorithm. The subject population included 40 epilepsy patients and 10 control subjects ranging in age from 6 months to 66 years (mean of 10.8 years). The population consisted of 25 females and 25 males. The trial durations ranged from 82 to 894 s (295 ± 192 s). The number of spikes marked by any reader was limited to 50 (per record), and the 'active' duration of the record ended with the first reader to reach this limit. This resulted in study durations that ranged from 14 to 702 s (171 ± 144 s). The subject states represented were hyperventilation, eyes-closed-alert, eyes-open-alert and drowsy/sleep.

The EEG data were recorded on a Nicolet BEAM I, S3000 computerized EEG. The data were initially filtered with a 3 dB, 1 Hz high-pass filter and a 3 dB, 300 Hz low-pass filter. To avoid aliasing, a 24 dB, 90 Hz low-pass filter was applied before digitizing at 256 samples/s. Twenty channels of recording were obtained from scalp electrodes placed according to the standard 10–20 system, referenced to combined ears (A1 + A2).

2.2. Neural network validation database

A second database was used to verify that the MMNN detection algorithm was not overtrained. It was created with dichotomous markings (before the 'gold standard' database) by one of the readers from the 'gold standard' panel. Four hundred and forty-eight spikes were marked in 15 short-term EEG trials from 10 epilepsy subjects. The trial durations ranged from 42 to 402 s (118 ± 84 s), and spikes were marked for the complete trial duration. The patient demographics, subject states, and recording methods were similar to those of the 'gold standard' database.

2.3. Software testing

The MMNN implementation tested is that used in the Persyst SpikeDetector version 3.0. For baseline validation, the 'rule-based' (Gotman, 1985) Telefactor SzAC version 3.3 was also used to test the records. Both packages were run with their default parameter settings. The SpikeDetector assigns each spike a perception value between zero and one. The SzAC does not assign perception values, so marked spikes were assigned a perception value of 1.0. The records and markings of the SzAC were sent to Telefactor to verify the results of the testing, and no changes were requested. Using the spike matching algorithm developed previously (Wilson et al., 1996), markings from the algorithms were matched with spikes from the two expert databases when they occurred within 200 ms of each other.

2.4. Statistical methods

The statistical methods used for this study have been previously described in detail (Wilson et al., 1996). The accuracy of the algorithms is quantified with the detection correlation coefficient, a form of the Pearson correlation coefficient that accounts for the implicit correlation on numerous non-spikes. The correlation uses the perception values assigned to the spike by the reader or algorithm. Extensions to the standard sensitivity and specificity definitions that allow spike perception to be modeled as a continuous value are also utilized.

2.5. Multi-neural network development

The MMNN development shares many similarities with the 'parameterized' feed-forward neural network described by Webber et al. (1990, 1993, 1994), so we will detail only the differences. Our approach determines the perception of the spike seen in each channel individually rather than the perception of a 4 channel bipolar chain. Perceivable spikes are grouped with those in other channels at the same time, and individual spike perceptions are updated via a neural network (NN) according to the presence or absence of other high perception spikes in the group. (This is similar to the visual reader's rule of seeing the spike in more than a single channel.) The focal channel of the spike is that with the largest perception. (While the usual definition of focal would use the largest spike, this definition often results in the selection of proximal non-spike activity, e.g. vertex waves or random alpha variations. These non-spike movements are visually ignored by the human expert.)

The large neural networks (e.g. 401 input nodes, 7 hidden nodes and a single output node) we originally employed became highly overtrained. They performed exceedingly well on the training set, but clear spikes were periodically missed when new records were tested. This is one of the major issues faced when employing NNs. They are exceedingly easy to create, but their decision making process is essentially hidden. As a result, we chose to employ many small NNs (e.g. 4–12 input nodes, one hidden node and a single output node) whose function we could understand. NNs with a single hidden node have a monotonic transfer function (an increase of an input value always results in an increase, or always results in a decrease, of the output value throughout the input value's range). The transfer function of these monotonic neural networks (MNNs) mimics the item-characteristic (S-curve) previously shown to model perceptual phenomena (Wilson et al., 1996).

Similar to Webber et al. (1994), we first employ a spike candidate function, which serves to rule out any event that is clearly not a spike. This is not strictly necessary, but it reduces the processing time. The test employs hard parameter cutoffs like the 'rule-based' algorithm, but the cutoffs are far from the range of normal spike values. Cutoffs, which are user configurable, include the following:

• the height of a spike must be at least half the average height of the background activity,
• the peak angle of a spike must be smaller than the average peak angle of the background activity,
• and the duration of the spike must be between 40 and 200 ms.
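The two ideas of this section, a permissive candidate gate and a network with a single hidden node, can be sketched as follows. This is a minimal illustration and not the SpikeDetector implementation: the function names, the sigmoid activations, the feature units and all weights are assumptions chosen for the example.

```python
import math


def is_candidate(height, background_height, peak_angle,
                 background_peak_angle, duration_ms):
    """Permissive gate built from the three cutoffs listed above."""
    return (
        height >= 0.5 * background_height
        and peak_angle < background_peak_angle
        and 40.0 <= duration_ms <= 200.0
    )


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mnn_output(inputs, weights, bias, out_weight, out_bias):
    """Network with one hidden node and one output node (sigmoids assumed)."""
    hidden = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
    return sigmoid(out_weight * hidden + out_bias)


# Hypothetical weights for a 3-input network
weights, bias, out_weight, out_bias = [1.2, -0.8, 0.5], -0.3, 4.0, -2.0


def sweep(index):
    """Vary one input from -3 to 3 while holding the others fixed."""
    base = [0.2, 0.4, 0.1]
    outputs = []
    for step in range(-30, 31):
        x = list(base)
        x[index] = step / 10.0
        outputs.append(mnn_output(x, weights, bias, out_weight, out_bias))
    return outputs


rising = sweep(0)    # weight 1.2 times output weight 4.0 is positive
falling = sweep(1)   # weight -0.8 times output weight 4.0 is negative
```

With a single hidden node, the direction of the response to each input is fixed by the sign of the product of its input weight and the output weight, so the output never reverses direction as that input grows. This is the property that makes such small networks easy to inspect.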

Once a candidate spike is found, it is presented to a number of MNNs, whose outputs are presented to a final MNN that determines the spike's overall perception. The MNNs were created to model the following questions:

• Does the spike stand out from the local background activity?
• Does the spike stand out from any local rhythmic activity?
• Does the spike stand out from any local EMG activity?
• Does the spike have a slow-wave?
• Are there spikes in other channels at the same time?
• Are there similar spikes occurring within the last 30 s?
• Finally, given the answers to all of the above, what is the overall perception of the spike?

2.6. Spike clustering

Hand markings are routinely grouped into similar sets of spikes at the Columbia University Department of Neurology in order to study propagation patterns and other features (Emerson et al., 1995). One of these groupings was selected because it exhibited multiple focal points and morphologies. The 10–20 scalp recording was 28 min in duration. The MMNN detection algorithm was used to automatically mark the spikes using default detection parameters and a 17 channel average reference. The spikes were hierarchically clustered using both topology (perception-based) and morphology (height, duration and tip angle).

Hierarchical clustering is often contrasted with another type of unsupervised clustering called K-means, which requires the number of groups to be selected a priori. Hierarchical clustering allows the number of groups to be selected after the fact by keeping a history of the object aggregation, which is visually modeled as a dendrogram. The clustering starts with each object in its own node (leaf node) at the bottom of the graph. The most similar pair of nodes is found and combined repeatedly until all objects are in a single node (the root node) at the top of the graph. Ordering the nodes at the bottom of the graph so that the lines showing the combination of nodes do not cross creates the dendrogram, which has a tree-like branching structure. The order in which the nodes are collapsed is a function of the similarity and linkage methods employed.

The similarity of two spikes is determined using the correlation of their perception values over the scanned channels and the normalized Euclidean distance of their morphology features. The total similarity is taken to be the product of these topology and morphology similarities. Ward's linkage method (Jain and Dubes, 1988) was used to update the similarities because it was found to offer better groupings than single and complete linkage.

The algorithm's groups were visually matched to those of the human reader by opening the hierarchical tree to a similar level of detail and deleting artifact groups.

3. Results

3.1. 'Gold standard' spike database

The MMNN detection algorithm marked a total of 1282 spikes, with perception values greater than or equal to 0.1, and matched 919 of those listed in the spike database. The rule-based detection algorithm marked 749 spikes and matched 293 of those listed in the database. Table 1 includes the detection correlation, sensitivity, selectivity and specificity of the two algorithms and a theoretical sixth EEG expert who has the same expertise as the 5 experts in the 'gold standard' panel. (See Wilson et al., 1996 for this calculation.) Testing that sensitivity + specificity > 1, we find that both the rule-based algorithm (χ² = 1314, P ≪ 0.001%) and the MMNN algorithm (χ² = 15827, P ≪ 0.001%) have a significant correlation with the expert panel. We also find that the MMNN algorithm offers significantly better sensitivity (χ² = 165, P ≪ 0.001%) and specificity (χ² = 409, P ≪ 0.001%) than the rule-based algorithm, and that there is no significant difference between the sensitivity (χ² = 3.07, P > 5%) and specificity (χ² = 1.92, P > 1%) of the MMNN and a human expert. Fig. 1 displays the percentage of true positive spikes marked by the algorithms as a function of the 'gold standard' perception value.

Table 1
Abilities of the rule-based detection algorithm, the MMNN detection algorithm and a theoretical sixth EEG expert when compared to the 'gold standard' spike database

                     Correlation   Sensitivity   Selectivity   Specificity
Rule-based           0.262         0.375         0.183         0.9719
MMNN                 0.849         0.899         0.801         0.9963
Expert (predicted)   0.870         0.937         0.808         0.9954

Fig. 1. Percentage of true positive spikes marked by the algorithms as a function of the expert perception value from the spike database. The first bin corresponds to spikes marked with perceptions greater than or equal to 0.1 and less than 0.2. The percentage calculated is the number of true positive algorithm spikes in the bin divided by the number of expert spikes in the bin.

3.2. Neural network spike database

The MMNN algorithm marked a total of 536 spikes, with perception values greater than or equal to 0.1, and matched 338 of those listed in the database with a correlation of 0.76. The rule-based algorithm marked 191 spikes and matched 117 of those listed in the database with a correlation of 0.43.

3.3. Spike clustering

The human reader marked 101 spikes that were grouped into two right fronto-temporal and two left fronto-temporal groups. The MMNN algorithm marked 264 spikes. Seventy-nine of these were deleted from 3 artifact groups, leaving 185 spikes. The hierarchical partition, shown in Fig. 2, also has two right fronto-temporal and two left fronto-temporal groups. This graph is a depiction of the top of the hierarchical dendrogram with the root node at the left. The dendrogram can be interactively opened to display more and more of the tree until nodes containing only a single spike are displayed. Since the distinction between the spikes normally becomes negligible before the single-spike nodes are reached, a partially opened tree can display groups of events corresponding to artifacts and/or distinct spike generators.

The iconic topographic plots display the average spike perception at each channel, where scalp negative channels are displayed with negated perception values. (Negating the scalp negative inflections allows focal points with opposite signs to be visually discerned.) The iconic perception histogram displays the number of spikes at each perception level, e.g. 0.9–1.0. The iconic time histogram displays the spike density (number of spikes per time bin) over the record duration. To the left of each node is the number of spikes in the group (e.g. 'n = 185') as well as the group dissimilarity (e.g. 'd = 21'). Dissimilarity decreases with the tree depth, and a group with only a single spike has a dissimilarity of 0.0. (It is similar to the standard deviation of the group.) Fig. 3a–d show the side by side average tracings for the 4 groups marked by the readers.
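The clustering procedure of Section 2.6 and the partially opened tree can be sketched in a few lines. Several parts of this sketch are assumptions rather than the published method: negative topology correlations are clipped to zero, morphology distance d is mapped to a similarity as 1/(1 + d), the normalization scales are invented, and average linkage is used as a simple stand-in for Ward's method. The sample spikes are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)


def similarity(s1, s2, scales):
    """Product of topology and morphology similarities (mappings assumed)."""
    topo = max(0.0, pearson(s1["perception"], s2["perception"]))
    dist = math.sqrt(sum(((x - y) / k) ** 2
                         for x, y, k in zip(s1["morph"], s2["morph"], scales)))
    return topo * (1.0 / (1.0 + dist))


def cluster(spikes, scales):
    """Merge the most similar pair repeatedly; return the merge history."""
    groups = {i: [i] for i in range(len(spikes))}
    history, next_id = [], len(spikes)

    def linkage(g1, g2):
        # Average linkage over member pairs (stand-in for Ward's method)
        pairs = [(i, j) for i in groups[g1] for j in groups[g2]]
        total = sum(similarity(spikes[i], spikes[j], scales) for i, j in pairs)
        return total / len(pairs)

    while len(groups) > 1:
        ids = sorted(groups)
        a, b = max(((p, q) for k, p in enumerate(ids) for q in ids[k + 1:]),
                   key=lambda pq: linkage(*pq))
        history.append((a, b, next_id))
        groups[next_id] = groups.pop(a) + groups.pop(b)
        next_id += 1
    return history


def cut(history, n_spikes, n_groups):
    """Replay the merges until only `n_groups` groups are left."""
    groups = {i: [i] for i in range(n_spikes)}
    for a, b, new in history:
        if len(groups) <= n_groups:
            break
        groups[new] = groups.pop(a) + groups.pop(b)
    return sorted(sorted(g) for g in groups.values())


# Perception per channel (F7, T3, F8, T4) and morphology
# (height, duration, tip angle); all values are hypothetical
spikes = [
    {"perception": [0.9, 0.7, 0.1, 0.0], "morph": [80.0, 70.0, 35.0]},
    {"perception": [0.8, 0.6, 0.0, 0.1], "morph": [85.0, 75.0, 38.0]},
    {"perception": [0.1, 0.0, 0.9, 0.8], "morph": [60.0, 90.0, 50.0]},
    {"perception": [0.0, 0.1, 0.7, 0.9], "morph": [65.0, 95.0, 52.0]},
]
scales = [20.0, 20.0, 10.0]  # assumed normalization per morphology feature

history = cluster(spikes, scales)
two_groups = cut(history, len(spikes), 2)
```

Because the full merge history is kept, the number of groups is chosen after clustering by replaying the history to the desired depth, which corresponds to opening the tree to a chosen level of detail.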

4. Discussion

4.1. Algorithm accuracy

This study demonstrates that both algorithms show significant statistical correlation with the expert databases and that the accuracy of the MMNN algorithm approaches that of a human expert.

The high correlation of the MMNN algorithm with the neural network validation database shows that the algorithm was not overtrained. As only a single reader marked these trials, database difficulty and reliability cannot be directly quantified. However, by comparison with the 'gold standard,' we can estimate that the inter-reader correlation for expert readers should fall near the middle of 0.68 (the average dichotomous-valued inter-reader correlation) and 0.79 (the average continuous-valued inter-reader correlation) because the human reader marked dichotomously in this study. The MMNN algorithm correlation of 0.76 is higher than expected, indicating that these trials are probably less difficult than those in the 'gold standard.' The better scores for the rule-based algorithm on the validation database also reflect this observation.

Although the MMNN algorithm displays an accuracy similar to that of a human expert in this study, over a year of clinical use at numerous institutions shows that it is susceptible to movement related artifacts, somewhat reducing its accuracy in practice. This is probably due to the lack of such artifacts in the 'gold standard' database, and a long-term EEG database is being collected to rectify this problem.

Use of short duration trials resulted in a significant bias against the rule-based algorithm. This algorithm does not mark any spikes within the first 30–60 s of the trial 'to set up the necessary EEG baseline' (SzAC User Manual). We were unaware of this feature when the study was designed, and it is not clear whether this feature is specific to this particular implementation of the algorithm. In an attempt to quantify this bias, we calculated the correlation of the rule-based algorithm with the 'gold standard' while ignoring spikes marked during the first 60 s. This resulted in a correlation increase from 0.26 to 0.35. However, this calculation is also biased since a handful of trials filled with exemplar spikes are removed from consideration because of their short study durations. Ignoring the spikes in the first 60 s in the validation database resulted in a correlation increase from 0.43 to 0.53.

Fig. 4a–d show 4 pages of EEG that represent the predominant marking characteristics of the two algorithms. Displayed are exemplar spikes marked by all readers, spikes marked by one algorithm and not the other, false positive events marked by one algorithm and not the other, and spikes missed by both algorithms. Spikes are marked with text noting the reader and the perception value, e.g. 'R1 P = 0.5' indicates that Reader 1 marked the spike with a perception of 0.5. The 5 human readers are R1–R5. The MMNN algorithm is R9, and the rule-based algorithm is R10. The left edge of the text is placed at the spike's vertex. Occasionally the MMNN assigns clear-cut, high voltage spikes a low perception value. This most often occurs with confounding background activity that is similar in size and speed to that of the spike, e.g. spike 'R9 P = 0.13' and channel F3 in Fig. 4c.


Fig. 2. Hierarchical clustering created using topology and morphology on the automatic marking of a single record. The rightmost topographic plots display the two left temporal and two right temporal groups.

Table 2
Four spike detection studies are compared. The difficulty parameter quantifies the difficulty of the EEG trial set as determined by the human experts. (A larger number means more difficult.) The reliability coefficient quantifies the veracity of the experts' consensus marking. (A larger number means more reliable.) The last four columns are the algorithms tested and their respective detection correlation with the experts' consensus marking

                          Database                                             Algorithm accuracy
Study                     Patients  Spikes  Experts  Difficulty  Reliability   Rule-based  Tampere  Webber  MMNN
Hostetler et al. (1992)   5         1393    6        >0.13       >0.87         <0.73       -        -       -
Pietila et al.            2         ?       2        <0.25       <0.86         0.29        0.32     -       -
Webber et al.             10        1349    1        ?           ?             -           -        0.74    -
'Gold standard'           50        1952    5        0.32        0.95          0.26–0.35   -        -       0.85
NN validation             10        448     1        <0.32       <0.74         0.43–0.53   -        -       0.76
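The difficulty and reliability entries for the 'gold standard' row can be checked from the inter-reader correlations quoted in Section 4.1 (0.68 dichotomous, 0.79 perception-based). Difficulty follows the definition given in Section 4.2. The reliability formula is not stated in this paper; the sketch below assumes the Spearman-Brown formula from measurement theory, which happens to reproduce the tabulated 0.95 for 5 readers.

```python
def difficulty(mean_dichotomous_r):
    """1.0 minus the average dichotomous inter-reader correlation."""
    return 1.0 - mean_dichotomous_r


def reliability(mean_perception_r, n_readers):
    """Spearman-Brown formula (an assumed form, not quoted from the paper)."""
    r, n = mean_perception_r, n_readers
    return n * r / (1.0 + (n - 1) * r)


gold_difficulty = difficulty(0.68)        # rounds to 0.32
gold_reliability = reliability(0.79, 5)   # rounds to 0.95
```

Under this assumed formula a single reader has a reliability equal to the inter-reader correlation itself, which is why adding readers raises the reliability of the panel score.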


Fig. 3. (a–d) Side-by-side average tracings of groups marked by human reader (left) and automatic hierarchical clustering (right) shown in Fig. 2.

Fig. 4. (a–d) EEG pages that represent the predominant detection characteristics of the two algorithms. Readers 1–5 are the human experts, Reader 9 is the MMNN algorithm and Reader 10 is the 'rule-based' algorithm. The left edge of the text is placed at the spike's vertex, and the marking includes the perception value assigned by the reader.


Fig. 5. Comparison of the voltage and perception topographs for an individual spike. The spike at F4, clearly displayed in the perception plot, is hidden in the voltage plot by the confounding background activity.

4.2. Comparison with previous studies

We compare our results to those of 3 recent studies (Hostetler et al., 1992; Pietila et al., 1994; Webber et al., 1994) for database difficulty, database reliability and algorithm accuracy in Table 2. (In some cases the reported values are estimates garnered from the figures.) Difficulty is quantified as 1.0 minus the average dichotomous inter-reader correlation and ranges from 0.0, the easiest, to 1.0, the most difficult. Reliability, a function of the perception-based inter-reader correlation and the number of readers (Wilson et al., 1996), ranges from 0.0, completely unreliable, to 1.0, completely reliable. Algorithm accuracy is quantified as the perception-based detection correlation and ranges from 0.0, the worst, to 1.0, the best.

The Hostetler and Pietila studies have a bias that works to the benefit of the human readers because their scores are compared with the consensus markings. The potential inflation of the human scores is 1.0 divided by the number of experts. We have attempted to account for this bias in the values reported. The Webber study employs a single human expert, so the difficulty and reliability could not be quantified. The algorithm was used in the selection of the spikes, so the reported algorithm accuracy is possibly biased. That the 'gold standard' database is both the most difficult and most reliable highlights the benefits of perception-based marking and multiple readers.

4.3. Algorithm usefulness

A great deal of emphasis has historically been placed on the accuracy of spike detection algorithms. However, an automatic marking need not be as accurate as a human expert's to be clinically useful. A user interface that allows the neurologist to quickly delete artifacts and determine whether there are multiple spike generators is sufficient. Two features that employ perception-based detections allow this.

Firstly, by using perception-based detections, the algorithm's sensitivity may be changed after a scan is complete (without rescanning) by focusing on the subset of spikes above a certain perception value. Similarly, spikes can be randomly selected to create a subgroup of any desired size. Thus, it is sensible to run the detection algorithm at its most sensitive setting and filter the resulting spikes if too many are found.

Secondly, hierarchical clustering allows different spike groups to be identified while the neurologist controls the level of detail. Artifacts tend to group together, often displaying poor spatial distribution, and can be immediately deleted. Multiple spike generators are immediately obvious, and rather than having to review thousands of individual marks, the process requires the comparison of perhaps 5–10 groups. (The individual and average tracings for a particular group are also readily available for further validation.) Using perception (recall that each channel is assigned a perception) rather than voltage in the topology comparisons used to create the hierarchical tree enhances correct group assignment, as shown in Fig. 5. The spike at F4, clearly displayed in the perception plot, is hidden in the voltage plot by the confounding background activity. (This type of error is removed when group averages are used, but they are not available before the groups are created.)

4.4. Spike clustering

While this portion of the study is exploratory, the visual comparison of the hand marked and algorithm groupings (Fig. 3a–d) is striking. That the relative numbers of spikes in the groups are different may be explained by the fact that the marking by the human reader was not rigorous, and often only 'enough' spikes are marked in a group to create a reasonable average. The authors are planning a larger study that will attempt to quantify the grouping abilities of both humans and the hierarchical clustering algorithm.

Acknowledgements

The first author has a 'commercial' interest in the Persyst SpikeDetector. We have attempted to give an unbiased comparison of the detection algorithms; however, this potential conflict of interest should be noted. This work was supported by a grant from the National Institute of Mental Health (MH46153).

References

Emerson RG, Turner CA, Pedley TA, Walczak TS, Forgione M. Propagation patterns of temporal spikes. Electroenceph clin Neurophysiol 1995;94:338–348.
Gotman J. Automatic recognition of interictal spikes. Electroenceph clin Neurophysiol 1985;37:93–114.
Hostetler WE, Doller HJ, Homan RW, et al. Assessment of a computer program to detect epileptiform spikes. Electroenceph clin Neurophysiol 1992;83:1–11.
Jain AK, Dubes RC. Algorithms for clustering data. New York: Prentice Hall, 1988.
Pietila T, Vapaakoski S, Nousiainen U, Varri A, Frey H, Hakkinen V, Neuvo Y, et al. Evaluation of a computerized system for recognition of epileptic activity during long-term EEG recording. Electroenceph clin Neurophysiol 1994;90(6):438–443.
Webber WRS, Wilson K, Lesser RP, Fisher RS, Eberhart RC, et al. On-line detection of epileptic spikes using a patient independent neural network. Epilepsia 1990;31:687.
Webber WRS, Litt B, Lesser RP, Fisher RS, Bankman I, et al. Automatic EEG spike detection: what should the computer imitate? Electroenceph clin Neurophysiol 1993;87(6):364–373.
Webber WRS, Litt B, Wilson K, Lesser RP, et al. Practical detection of epileptiform discharges in the EEG using an artificial neural network: a comparison of raw and parameterized EEG data. Electroenceph clin Neurophysiol 1994;91:194–204.
Wilson SB, Harner RN, Duffy FH, Tharp BR, Nuwer MR, Sperling MR, et al. Spike detection. I. Correlation and reliability of human experts. Electroenceph clin Neurophysiol 1996;98(3):186–198.