DOSED: A deep learning approach to detect multiple sleep micro-events in EEG signal


PII: S0165-0270(19)30101-3
DOI: https://doi.org/10.1016/j.jneumeth.2019.03.017
Reference: NSM 8273

To appear in: Journal of Neuroscience Methods
Received date: 7 December 2018
Revised date: 21 March 2019
Accepted date: 29 March 2019

Please cite this article as: S. Chambon, V. Thorey, P.J. Arnal, E. Mignot, A. Gramfort, DOSED: a deep learning approach to detect multiple sleep microevents in EEG signal, (2019), https://doi.org/10.1016/j.jneumeth.2019.03.017 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Graphical abstract: From the raw EEG signal, a convolutional neural network (CNN) extracts features at every potential event location (default events, e.g. start time, end time = (3 s, 4 s)) and performs joint detection and classification, predicting for each default event (1) an event probability (e.g. P = 0.9) and (2) adjusted start and end times (e.g. (3.5 s, 4.2 s)). Benchmark: detection of spindles on 3 datasets (SS2, SSC, WSC), 20-30 subjects per dataset, 1 full-night polysomnography recording (about 8 h) per subject; the method improves both precision and recall.


Highlights

1. A new deep learning method is proposed to detect sleep micro-events.
2. The influence of hyper-parameters is investigated and insights are given to tune them.
3. The approach outperforms state-of-the-art baselines on 3 datasets.
4. Multiple events such as spindles or K-complexes can be detected jointly.
5. Results are provided for the detection of spindles, K-complexes and arousals.


DOSED: a deep learning approach to detect multiple sleep micro-events in EEG signal

S. Chambon (a,b,c,1,*), V. Thorey (b,1), P. J. Arnal (b), E. Mignot (a), A. Gramfort (c,d,e,*)

(a) Center for Sleep Sciences and Medicine, Stanford University, Stanford, California, USA
(b) Research & Algorithms Team, Dreem, Paris, France
(c) LTCI, Télécom ParisTech, Université Paris-Saclay, Paris, France
(d) Inria, Université Paris-Saclay, Paris, France
(e) CEA Neurospin, Université Paris-Saclay, Paris, France

Abstract

Background: Electroencephalography (EEG) monitors brain activity during sleep and is used to identify sleep disorders. In sleep medicine, clinicians interpret raw EEG signals in so-called sleep stages, which are assigned by experts to every 30 s window of signal. For diagnosis, they also rely on shorter prototypical micro-architecture events which exhibit variable durations and shapes, such as spindles, K-complexes or arousals. Annotating such events is traditionally performed by a trained sleep expert, making the process time consuming, tedious and subject to inter-scorer variability. To automate this procedure, various methods have been developed, yet these are event-specific and rely on the extraction of hand-crafted features.

New method: We propose a novel deep learning architecture called Dreem One Shot Event Detector (DOSED). DOSED jointly predicts locations, durations and types of events in EEG time series. The proposed approach, applied here on sleep-related micro-architecture events, is inspired by object detectors developed for computer vision such as YOLO and SSD. It relies on a convolutional neural network that builds a feature representation from raw EEG signals, as well as two modules performing localization and classification respectively.

Results and comparison with other methods: The proposed approach is tested on 4 datasets and 3 types of events (spindles, K-complexes, arousals) and compared to the current state-of-the-art detection algorithms.

Conclusions: Results demonstrate the versatility of this new approach and improved performance compared to the current state-of-the-art detection methods.

Keywords: Deep learning, machine learning, EEG, event detection, sleep

1. Introduction

Sleep is a behavioral state associated with specific changes in physiology and brain activity patterns [1]. The most common and practical way to monitor brain activity during sleep is to use electroencephalography (EEG). EEG measures, hundreds of times per second, the electrical potentials at several locations over the scalp. Identifying in EEG signals micro-architectural events of variable duration, like sleep spindles and K-complexes (0.5 - 2 s duration) or arousals (~10 s duration), is of strong interest for sleep research [2]. Such events are typically used to determine sleep stages, which are scored by 30 s epochs. In other cases however, when one needs to go beyond sleep scoring, they must be counted and specifically annotated. This is notably the case when the aim is to understand sleep physiology [3, 4] or to study the pathophysiology of specific sleep or neuropsychiatric disorders [5, 6, 7]. The identification of micro-architectural events in the EEG is traditionally performed by trained sleep experts, also called scorers, who visually investigate the recorded signals over a night and annotate the relevant events with their respective start times and durations. This is a tedious, imprecise, costly and time consuming task. As a consequence, the annotations of events produced by two different sleep experts are likely to differ. Indeed, two experts might annotate different non-overlapping events, or might annotate the same event with different start and end times, introducing a large variability in event annotations. Yet, averaging the annotations over multiple experts can reduce this variability and improve the quality of annotations [3, 5].

* Corresponding authors. Email addresses: [email protected] (S. Chambon), [email protected] (V. Thorey), [email protected] (P. J. Arnal), [email protected] (E. Mignot), [email protected] (A. Gramfort).
1 These authors contributed equally.

Multiple automatic algorithms for the detection of micro-events in the sleep EEG, such as spindles or K-complexes, have been proposed in the literature. These typically rely on band-pass filtering (typically 11 - 16 Hz for spindles, 0.5 - 5 Hz for K-complexes), and the extraction of hand-crafted features. Four categories of algorithms

can be distinguished.

Methods from the first category extract the envelope of the filtered signal and threshold it [8, 9, 10, 11, 12, 13]. The thresholding level is then either fixed or tuned. The threshold can be applied to the rectified filtered signal [13], to the instantaneous amplitude obtained by Hilbert transform [12], to the root mean square of the filtered signal [11], or alternatively to the moving average of the rectified filtered signal [9]. In order to identify the start and end times of events, it was for example proposed to look at inflexion points of the envelope of the filtered signal [10]. This process is typically done after signal pre-processing to remove ocular artifacts and environmental noise such as the spurious signals due to electrical current (notch filtering around 50 Hz or 60 Hz). One limitation of this first category of methods is that they should be employed during specific sleep stages [13, 12, 11, 9, 8]. Therefore they require preliminary visual inspection of the data and/or preliminary manual sleep stage scoring.

The second type of approaches decomposes the EEG signals into an oscillatory component and a transient component prior to filtering and thresholding the resulting signals [14, 15, 16]. These methods can detect both spindles and K-complexes, provided some changes are made to hyper-parameters. Also they are not sleep-stage specific, which makes them more convenient to work with.

The third type of methods employs unsupervised learning techniques such as clustering. For the clustering step, Patti et al. [17] use a Gaussian Mixture Model (GMM). First the input signal is band-pass filtered in 3 frequency bands (10.5 - 16 Hz, 4 - 10 Hz, 20 - 40 Hz), then 2 features called sigma ratio and sigma index are extracted from sliding windows of 1 s: the sigma ratio is the ratio of energy in the spindle frequency band (10.5 - 16 Hz) during the window of interest over the energy in previous and following windows, while the sigma index is the ratio of power in the spindle band (10.5 - 16 Hz) over the sum of energy in the neighboring frequency bands (4 - 10 Hz and 20 - 40 Hz). The GMM is then applied on the extracted features to cluster samples into potential spindles versus non-spindles. This approach has the following advantages: it is unsupervised, hence does not rely on human annotations, and it is not sleep stage specific. Yet, as the technique is unsupervised, the algorithm may not discriminate the events of interest, performance cannot be easily quantified, and the setting of hyper-parameters cannot be automated through cross-validation.

The fourth type of methods are supervised machine learning approaches, which consist in training a classifier to predict whether a window of signal is an event of interest or not. For such a binary decision, classifiers such as a Support Vector Machine (SVM) classifier [18] or a Random Forest classifier [19] can be trained on manually extracted signal features, such as amplitude variance, number of peaks or zero crossings, etc. [18].

The methods mentioned above suffer from several limitations. First, they rely on pre-defined parameters, such as frequency bands for filtering, which may not be optimal for some recordings or subjects. Second, they are intrinsically event-specific. Third, the hyper-parameters of some methods, such as thresholds, are often selected using parts of the recording(s) used for evaluating the detection performances. This can introduce a positive and optimistic bias in the reported results [14, 15].

To address these limitations, possible solutions can be found in the computer vision literature, and more specifically in the state-of-the-art object detection literature, which relies on deep convolutional neural networks [20, 21, 22, 23, 24]. Such approaches learn a feature representation that is used by a prediction module that outputs both bounding boxes of detected objects as well as their classes. These approaches can predict objects of different categories at the same time. Indeed, the network outputs the localization of regions of interest together with a probability vector. As in a multiclass logistic regression model, the prediction contains the probability that the region belongs to each event category [23, 24, 21, 22]. Besides, they may make predictions from different feature maps of the underlying neural network [21, 22], allowing them to handle different resolutions and scales of objects. Translating such methods to detect micro-architectural events in EEG time series is of great interest as it would provide the community with a non-event-specific and non-sleep-stage-specific method that predicts locations, durations and types of any micro-event at the same time. Nonetheless, the translation from images to EEG time series is not straightforward for two reasons. First, when working on EEG recordings, one needs to process chunks of signals: processing entire nights of signal is not tractable. Second, in a recording, most of these chunks of signals do not contain any true event. This implies that a successful method has to predict not only the absence of any event in the majority of the signal, but also start times, durations and classes of events accurately when they occur. In machine learning terms, the method needs to cope with the problem of learning from imbalanced data.


Deep learning methods have already been proposed to process EEG signals in the context of brain computer interface applications [25, 26, 27, 28] or sleep [29, 30, 5, 31, 32, 33, 34] (see [35] for an extensive review). In the context of sleep, the existing approaches have explored the extraction of highly non-linear features thanks to convolutional and recurrent neural networks [34, 33, 29, 31]. These approaches have been developed to work both on large EEG datasets containing thousands of full night recordings [30, 5, 32], and on smaller datasets containing dozens or hundreds of 10-hour night recordings [34, 33, 29, 31]. Moreover, deep learning approaches have demonstrated the ability to learn, in a data-driven way, feature representations relevant to the task to perform [34, 33, 29]. However, to our knowledge, deep neural networks for processing EEG data during sleep are limited to the scope of classification tasks such as sleep stage classification.

In this paper we propose the Dreem One Shot Event Detector (DOSED), a deep learning architecture to detect any type of micro-event in multivariate EEG signals. The proposed approach builds on a convolutional neural network which extracts complex non-linear features from raw, non-preprocessed EEG time series. A localization module predicts centers and durations of potential events over the input signals while a classification module predicts their labels. The whole network architecture is trained end-to-end by back-propagation. In the following sections, we first detail this general approach and the associated training procedure. We then present a detailed and extensive benchmark comparing the proposed method with multiple state-of-the-art algorithms, on 3 event detection tasks and over 4 datasets. We also address technical questions pertaining to the influence of hyper-parameters on detection performances. This work extends a recent short communication [36].

2. Methods

Notation. We denote by ⟦n⟧ the set of integers {1, . . . , n} for n ∈ N. Let X = R^(C×T) be the set of EEG input samples, where C stands for the number of EEG channels and T for the number of time steps. Let L ∈ N be the number of different labels or types of events to be predicted. We denote by L = ⟦L⟧ the set of event labels, and by 0 the label associated to no event or background signal. An event e = (t^c, t^d, l) ∈ E = R^2 × (L ∪ {0}) is defined by a center location time t^c, a duration t^d and an event label (or type) l ∈ L ∪ {0}. A true event is an event with label in L detected by a human scorer or a group of human scorers, a.k.a. a consensus. A predicted event is an event with label in L detected by an algorithm or a group of algorithms.
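As a concrete reading of this notation, the short sketch below shows one possible in-memory representation of such events; the dataclass and its helper properties are illustrative assumptions, not part of the authors' implementation.

```python
# Minimal sketch of the (center, duration, label) event representation
# used throughout the paper (assumed encoding, illustrative names).
from dataclasses import dataclass

BACKGROUND = 0  # label 0 denotes "no event" / background signal


@dataclass
class Event:
    center: float    # t^c, center time in seconds
    duration: float  # t^d, duration in seconds
    label: int       # l in {1, ..., L}, or BACKGROUND

    @property
    def start(self) -> float:
        return self.center - self.duration / 2

    @property
    def end(self) -> float:
        return self.center + self.duration / 2


# e.g. a spindle (label 1) centered at 3.85 s and lasting 0.7 s:
spindle = Event(center=3.85, duration=0.7, label=1)
```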

2.1. Method overview

The general principle of DOSED is as follows. Let x ∈ X be an EEG sample, i.e. a short window of signal (~30 s duration) coming from a PSG recording (~8 h). First, the method is initialized with Nd default events d_i = (t^c_i, t^d_i), which are parameterized with a center time t^c_i and a duration t^d_i, and which are positioned over each input signal x; for example, 1 s default events every 0.5 s if this corresponds to a typical duration of the events to be detected, see Figure 1 - A. Note that centers, durations and overlaps of such default events can be adjusted depending on the type(s) of event(s) to be detected. Second, the network predicts a potential event associated to each default event: it predicts an adjusted center and an adjusted duration, as well as the probability for this event to have any label l ∈ L ∪ {0}, cf. Figure 1 - B. Finally, potential events whose highest-probability label l is in L, and whose probability is higher than a label-specific (cross-validated) threshold θ_l ∈ [0, 1], are selected. Then, non-maximum suppression is applied to remove overlapping events [22, 23, 37], cf. Figure 1 - C. The idea is to group predictions with the same label based on their IoU, and keep only the ones associated to the highest probability of observing a true event in each group.

Figure 1: Prediction procedure for DOSED. A: Nd default events d_i, i ∈ ⟦Nd⟧, are generated over the EEG sample. B: the network predicts potential events, i.e. adjusted centers and durations with respect to the default events' centers and durations, and potential event labels. C: non-maximum suppression is applied to merge overlapping potential events with label l different from 0. The network finally returns the center(s), duration(s) and label(s) of the remaining merged event(s).

2.2. Loss function

Supervised machine learning boils down to the minimization of a specific loss function that quantifies the prediction errors of the model on the training database. Here the default events are parameterized by a center and a duration, so the method predicts an adjusted center and an adjusted duration for any default event. It also returns the probabilities of each event type as well as the probability of belonging to the background signal (l = 0). To quantify the overlap in time between two events, a commonly used metric is the Jaccard index, a.k.a. Intersection over Union (IoU) [23]. For two time intervals, it is defined as the ratio of the size of their intersection to the size of their union. This metric takes values between zero and one. It is zero if the events do not overlap in time, and one if they perfectly overlap.
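To make the prediction pipeline concrete, the sketch below illustrates the three ingredients above on plain (center, duration) intervals: a grid of default events, the interval IoU, and a generic greedy non-maximum suppression. It is an illustration under assumed parameter values (1 s default events, 50% overlap, a 0.4 IoU threshold for merging), not the authors' implementation.

```python
# Illustrative sketch: default event grid, interval IoU, and greedy NMS.
import numpy as np


def default_events(window_duration, event_duration=1.0, overlap=0.5):
    """Return (center, duration) pairs tiling the window, e.g. 1 s events every 0.5 s."""
    step = event_duration * (1.0 - overlap)
    centers = np.arange(event_duration / 2,
                        window_duration - event_duration / 2 + 1e-9, step)
    return [(c, event_duration) for c in centers]


def iou(a, b):
    """Jaccard index between two (center, duration) intervals: 0 if disjoint, 1 if identical."""
    a0, a1 = a[0] - a[1] / 2, a[0] + a[1] / 2
    b0, b1 = b[0] - b[1] / 2, b[0] + b[1] / 2
    inter = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(events, probs, iou_threshold=0.4):
    """Within each group of overlapping predictions, keep only the most probable one."""
    order = np.argsort(probs)[::-1]  # most probable first
    kept = []
    for i in order:
        if all(iou(events[i], events[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [events[i] for i in kept]


# Example: a 20 s window tiled with 1 s default events and 50% overlap -> 39 candidates
print(len(default_events(20.0)))
```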

The goal is to learn a prediction function f̂ from X to Y, where y ∈ Y is a set of elements from E. Given Nd ∈ N default events generated over the input EEG sample x ∈ X, let D(x) = {d_i = (t^c_i, t^d_i), i ∈ ⟦Nd⟧} be the set of centers and durations of the Nd default events generated over x, and let E(x) = {e_j = (t^c_j, t^d_j, l_j) : j ∈ ⟦Ne⟧} be the list of the Ne true events annotated over the signal x.

At training time, we want to map the default events to the true events. Following SSD [22] we use per-prediction matching. First, bipartite matching is applied: each true event is matched with the default event presenting the highest IoU. The remaining default events are then matched to the ground truth event presenting the highest IoU, provided that IoU(d_i, e_j) > η ∈ [0, 1]. We denote by γ the function which returns, if it exists, the index j of the true event matching the default event d_i, and ∅ otherwise. The unmatched default events are assigned the label l = 0 associated to the absence of event.

Let e_j be a true event matching the default event d_i. d_i's center and duration are then encoded with

$$\varphi_{e_j} : \mathbb{R}^2 \to \mathbb{R}^2, \quad d_i = (t^c_i, t^d_i) \mapsto \left( \frac{t^c_j - t^c_i}{t^d_i},\; \log \frac{t^d_j}{t^d_i} \right)$$

[24]. This encoding function quantifies the relative variations in centers and durations between the default event d_i and the true event e_j.

Let f̂(x) ∈ Y be the prediction made by the model f̂ over the sample x. We define it as f̂(x) = {(t̂^c_i, t̂^d_i, l̂_i) ∈ E, i ∈ ⟦Nd⟧}. τ̂_i = (t̂^c_i, t̂^d_i) are the predicted coordinates of the encoded default event d_i and l̂_i is its predicted label. In practice, the model outputs the probability of each label l ∈ L ∪ {0} for the default event d_i, so l̂_i is replaced by a vector of probabilities π̂_i ∈ [0, 1]^(|L|+1). As it is a probability vector, we have Σ_{l ∈ L ∪ {0}} π̂_i^l = 1.

The loss between the true annotation E(x) and the model prediction f̂(x) over the signal x is a function ℓ : Y × Y → R_+ defined as ℓ(E(x), f̂(x)) = ℓ^+_norm + ℓ^-_norm, where

$$\ell^+ = \sum_{\substack{i \in [\![N_d]\!] \\ \gamma(i) \neq \emptyset}} \mathrm{L1}_{\mathrm{smooth}}\!\left( \varphi_{e_{\gamma(i)}}(d_i) - \hat{\tau}_i \right) - \log\!\left( \hat{\pi}_i^{\,l_{\gamma(i)}} \right) \qquad (1)$$

$$\ell^- = - \sum_{\substack{i \in [\![N_d]\!] \\ \gamma(i) = \emptyset}} \log\!\left( \hat{\pi}_i^{\,0} \right) \qquad (2)$$

and the normalized losses ℓ^+_norm and ℓ^-_norm are obtained by dividing ℓ^+ and ℓ^- by the number of terms involved in the sums (1) and (2) respectively. In (1), the sum considers the localization and classification losses for any default event d_i matching a true event e_γ(i). The L1_smooth loss applies coordinate-wise the real-valued function x ↦ (x²/2)·1_{|x|<1} + (|x| − 1/2)·1_{|x|≥1}, a.k.a. the Huber loss [24]. The purpose of (1) is to promote accurate localization and classification of default events matching a true event (l ∈ L). Equation (2) corresponds to the classification loss for default events which do not match any true event (l = 0).
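A minimal PyTorch sketch of how the matching result, the encoding φ and the two loss terms (1)-(2) can be combined is given below. Tensor layouts, variable names and the helper `dosed_loss` are assumptions made for illustration; this is not the released DOSED code.

```python
# Hedged sketch of the encoding and of the loss of Eqs. (1)-(2) with PyTorch.
# `loc` holds predicted (center, duration) offsets tau_i, `logits` the per-default-event
# class scores (background = class 0), `match_idx[i]` the matched true event index or -1.
import torch
import torch.nn.functional as F


def encode(defaults, matched_true):
    """phi_e(d): relative center shift and log duration ratio."""
    dc, dd = defaults[:, 0], defaults[:, 1]
    tc, td = matched_true[:, 0], matched_true[:, 1]
    return torch.stack([(tc - dc) / dd, torch.log(td / dd)], dim=1)


def dosed_loss(loc, logits, defaults, true_events, true_labels, match_idx):
    pos = match_idx >= 0
    neg = ~pos
    # Localization + classification loss on matched ("positive") default events, Eq. (1)
    target_loc = encode(defaults[pos], true_events[match_idx[pos]])
    loss_pos = F.smooth_l1_loss(loc[pos], target_loc, reduction="sum") \
        + F.cross_entropy(logits[pos], true_labels[match_idx[pos]], reduction="sum")
    # Classification loss on unmatched default events, pushed towards label 0, Eq. (2)
    background = torch.zeros(int(neg.sum()), dtype=torch.long, device=logits.device)
    loss_neg = F.cross_entropy(logits[neg], background, reduction="sum")
    # Normalize each term by its number of summands
    return loss_pos / pos.sum().clamp(min=1) + loss_neg / neg.sum().clamp(min=1)
```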

One particular difficulty for the proposed approach is that most of the default events do not match any true event. This is due to the facts that (1) EEG signals contain few events over long time periods, and that (2) the parameterization of default events implies using as many overlapping default events as necessary to maximize the number of true events matched over a window x. To address the issue of label imbalance between default events matching a true event and default events not matching any true event (l = 0), we employed a negative-mining step: during training only a fraction of the default events with label l = 0 were used. Let us remark that this negative mining step is omitted in (2) for simplicity. The process first boils down to sorting the default events which do not match any true event based on their classification errors. The events which exhibit the worst classification errors are then considered, and a fraction of them is finally selected to compute the loss (2). This fraction is set such that the ratio between default events matching a true event and default events matching no true event is equal to 1/3, while ensuring that a minimum number of 10 default events matching no true event is selected. This means that for a specific EEG sample, at most one third of the default events associated to this sample can be used as negative default events which do not match a true event. On the other hand, at least 10 default events shall be used as negatives. Such a strategy aims to cope with class imbalance and to improve the training of the approach compared to a random sampling strategy of negatives.
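One way to realize this hard negative mining step is sketched below; the mask-based selection and the function name are illustrative assumptions rather than the authors' code. Unmatched default events are ranked by their background-class cross-entropy and only the hardest ones are kept, respecting the 1/3 ratio and the minimum of 10 negatives.

```python
# Possible implementation of hard negative mining over the default events of one sample.
import torch
import torch.nn.functional as F


def select_hard_negatives(logits, match_idx, ratio=3, min_negatives=10):
    """Return a boolean mask over default events: all positives + the hardest negatives."""
    pos = match_idx >= 0
    neg = ~pos
    n_keep = max(int(ratio * pos.sum().item()), min_negatives)
    # Classification error of each default event w.r.t. the background label 0
    background = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    errors = F.cross_entropy(logits, background, reduction="none")
    errors = errors.masked_fill(pos, float("-inf"))  # positives are excluded from the ranking
    hardest = torch.topk(errors, k=min(n_keep, int(neg.sum().item()))).indices
    keep = pos.clone()
    keep[hardest] = True
    return keep
```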

Remark that the parameterization of default events is important to ensure a successful training process. In particular, the parameterization of the duration t^d_i of the default events is critical. Indeed, a bad choice for t^d_i might lead to a small number of matched true events during the training process, thus hurting the detection performance of the proposed approach. The influence of the default event durations is investigated and discussed in a specific experiment in Section 3.

To conclude, training the event detection method f̂ boils down to solving the following minimization problem:

$$\hat{f} \in \underset{f \in \mathcal{F}}{\arg\min}\; \mathbb{E}_{x \in \mathcal{X}}\left[ \ell\left( E(x), f(x) \right) \right] \qquad (3)$$

Note that the loss function ℓ is differentiable with respect to the parameters of the considered model f, so it becomes possible to learn a neural network f with (stochastic) gradient methods and back-propagation. Also note that, in practice, one minimizes the empirical version of (3) computed over a finite number of labeled training samples.

2.3. Model architecture

What remains to be defined is the form of the function f, i.e. the architecture of the network. We use for f a deep convolutional neural network that predicts at most Nd potential events with label different from 0, given a set of default events D(x) = {d_i = (t^c_i, t^d_i) : i ∈ ⟦Nd⟧}.

The architecture of the network is as follows. First, following the work in [36, 29], it starts with a spatial filtering of the input multivariate signals. This corresponds to a matrix multiplication, similar to Independent Component Analysis (ICA), that increases the signal-to-noise ratio [36, 29]. The multivariate nature of EEG data does carry precious information. This is for example exploited to cope with electrode removal or bad channels, and thus to improve the robustness of the algorithm. Yet, this can also be exploited to improve the performance of the algorithm. Indeed, exploiting and selecting the signals from multiple EEG sensors presents a great interest, and the EEG community has designed numerous methods to increase the signal-to-noise ratio (SNR) of an effect of interest from a full array of sensors. The most popular examples are spatial filtering methods such as PCA/ICA [38], Common Spatial Patterns [39], beamforming methods for source localization [40] or the xDAWN algorithm [41]. Exploiting the time series from several sensors can also involve a sensor selection step [42], which may be performed jointly with the spatial filtering operation [43]. Several deep learning methods have been proposed to train a model from several EEG time series [44, 45, 46, 28]. Some of these approaches use a first layer that performs a spatial filtering step explicitly [47, 48, 25, 49, 27, 26, 29] or implicitly [30, 5, 50]. Second, the network performs some temporal processing with convolutions to build a feature representation of the input signals, and finally it outputs the prediction of intervals and labels of potential events.

Let K, F, C, T ∈ N. The network can be written as the composition of three functions, f(x) = ψ(φT(φC(x))). The module φC : X → X stands for the spatial filtering operation, which performs C linear combinations of the C input time series and returns a new tensor x̃ ∈ X. This module is implemented by a 2D convolution layer with C kernels of size (C, 1) (space, time), followed by a transpose operation. When C = 1 this module becomes the identity function: φC = Id.

The module φT : X → R^(F×C×T̃) performs the temporal feature extraction. The module φT is composed of K blocks. Each block k is composed of a 2D convolution layer with batch normalization [51] and ReLU activation x ↦ max(x, 0) [52], followed by a temporal max-pooling. Block k first convolves the previous feature maps x_(k−1) with 4 × 2^k kernels of size (1, 3) (space, time), using a stride of 1. Zero padding is used to maintain the dimension of the tensor through the convolution layer. Then, the ReLU activation is applied. Finally, a temporal max-pooling operation with kernel of size (1, 2) and stride 2 is applied to divide the temporal dimension by 2. Each block k does not process the spatial dimension. Thus, the output of φT is a tensor of shape (F, C, T̃) with F = 4 × 2^K and T̃ = T / 2^K.

The module ψ : R^(F×C×T̃) → (R^2 × (L ∪ {0}))^Nd stands for the prediction part of the model: it predicts for each of the Nd default events its corresponding potential event, i.e. its adjusted center and duration in R^2 as well as the label of this potential event in L ∪ {0}. In practice it predicts a probability vector which encodes the probability of that event to belong to any of the classes in L ∪ {0}. The prediction of locations is implemented by a convolution layer with 2 × Nd kernels of dimension (C, T̃) and a linear activation. The prediction of classes is implemented with a 2D convolution layer with (L + 1) × Nd kernels of dimension (C, T̃). A softmax activation is applied on every L + 1 output feature maps to obtain a probability vector π_i for any potential event i. The general architecture is summarized in Table 1.

kernel

kernel

size

#

Conv. 2D

(C, 1)

Transpose Conv. 2D

Layer Type

output dim

activation

stride

C

(C, 1, T )

linear

1

-

-

(1, C, T )

-

-

(1, 3)

4 × 2k

(4 × 2k , C, T /2k−1 )

relu

1

k

k

Max P. 2D

(1, 2)

-

(4 × 2 , C, T /2 )

-

ψ - localization

Conv. 2D

(C, T˜)

2 × Nd

(2 × Nd , 1, 1)

linear

ψ - classification

Conv. 2D

(C, T˜)

(L + 1) × Nd

((L + 1) × Nd , 1, 1)

ip

k blocks, k ∈ JKK

t

Module

(1, 2)

cr

softmax

us

every (L + 1) kernels

Table 1: General architecture of the proposed approach. The network f (x) = ψ(φT (φC (x))) processes an input tensor of shape (1, C, T ) corresponding to C time series of T time steps. φC stands for the spatial processing module that performs spatial filtering. When C = 1, φC = Id. φT stands for the temporal processing module that builds the temporal feature representation and outputs a tensor

an

of shape (F, C, T˜) with F = 4 × 2K and T˜ = T /2K . ψ stands for the prediction module that predicts locations of potential events and their classes, i.e. it outputs two tensors: the localization tensor of shape (2 × Nd ) and the classification tensor of shape ((L + 1) × Nd ).

205

3.1. Datasets

d

3. Experiments

M

Note that the predictions exploit the full spatio-temporal data: they are based on convolutions of the whole output of φT

Ac ce pt e

The experiments were performed on 4 datasets described below. For each dataset, only full night PSG recordings of about 8 hours were used. Details on the datasets are summarized in Table 2. Montreal Archives of Sleep Studies dataset. The publicly available dataset Montreal Archives of Sleep Studies session 2 (MASS SS2) [53] was used. This dataset contains 19 polysomnography (PSG) recordings (11 females, 210

8 males, 23.6 ± 3.7 years old). These were sampled at 256 Hz. Spindles and K-complexes have been manually annotated by a first expert (E1) over 19 PSGs, and spindles have been annotated by a second expert (E2) over 15 PSGs using different guidelines [15]. All annotations have been performed over the EEG channel C3. In this work, only annotations produced by scorer E1 and data from channel C3 were used. Annotating spindles and K-complexes on signals from central channels is a common choice as these events are particularly salient and thus more easily

215

identifiable from centrals sensors. This decision to use E1 annotations was made since scorer E1 has annotated both spindles and K-complexes on all the records. The use of annotations from the scorer E2 was explored in a preliminary work [36]. Stanford Sleep Cohort dataset. The second dataset includes 26 PSGs from the patient-based Stanford Sleep Cohort (SSC), see [54] for a description of this dataset. These PSGs are from 26 subjects (9 females, 17 males, 52.2 ± 14.3 years old) and were sampled at 128 Hz. For each PSG recording, spindles have been scored by 5 different sleep experts through visual investigation of central channels. Experts were randomly selected from a pool of 9 scorers. For each PSG recording, annotations made by 5 experts were merged to build a consensus. To do so, annotations of each expert j were first encoded as binary vectors yj ∈ {0, 1}T containing as many time steps T as in the considered PSG recording. A time step i was considered as a spindle, yj [i] = 1, if it belonged to a spindle annotated by the expert j, otherwise it was considered as background signal and yj [i] = 0. The binary vectors corresponding to 6


the five annotations were then averaged to build a vector ȳ ∈ [0, 1]^T, taking values in {0, 0.2, 0.4, 0.6, 0.8, 1}. The consensus annotations were obtained by thresholding the vector ȳ to get a binary vector. Let κ ∈ [0, 1] stand for the desired level of consensus. The binary vector y_κ ∈ {0, 1}^T was built as follows:

$$y_\kappa[i] = \begin{cases} 1 & \text{if } \bar{y}[i] \geq \kappa \\ 0 & \text{otherwise} \end{cases} \qquad \forall i \in [\![T]\!]$$

Start times and durations for all events were then estimated from y_κ. Unless otherwise mentioned (cf. Figure 4 and Figure 5), we have used the consensus κ = 0.2 (union of scorers)
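The sketch below illustrates this consensus construction and the conversion of the resulting binary vector back into events; it is a simplified NumPy illustration (function names and the toy example are ours), not the pipeline used to produce the datasets.

```python
# Illustrative consensus of several scorers' binary annotation masks.
import numpy as np


def consensus_mask(expert_masks, kappa=0.2):
    """expert_masks: binary vectors y_j in {0,1}^T, one per scorer.
    Returns y_kappa[i] = 1 where the average annotation y_bar >= kappa."""
    y_bar = np.mean(expert_masks, axis=0)  # values in {0, 0.2, ..., 1.0} for 5 scorers
    return (y_bar >= kappa).astype(int)


def mask_to_events(y, fs):
    """Turn a binary vector into (start, duration) pairs in seconds."""
    padded = np.diff(np.concatenate(([0], y, [0])))
    starts, ends = np.where(padded == 1)[0], np.where(padded == -1)[0]
    return [(s / fs, (e - s) / fs) for s, e in zip(starts, ends)]


# Toy example with 5 scorers and kappa = 0.2 (union of scorers):
y = consensus_mask([np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])] * 2
                   + [np.array([0, 0, 0, 0])])
print(mask_to_events(y, fs=1))  # -> [(1.0, 2.0)]
```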

annotations from channel C4 (C3 if C4 not available), using the mastoid electrode as a reference. Wisconsin Sleep Cohort dataset. Third, 30 PSG recordings from 30 subjects of the Wisconsin Sleep Cohort (WSC), were considered [55] (16 females, 14 males, 65.2 ± 8.2 years old). The PSGs were sampled at 200 Hz. Similarly to SSC, each PSG was scored by 5 different sleep experts from the same pool of 9 possible scorers. All experts followed the same scoring guidelines. Unless otherwise mentioned, we used consensus κ = 0.2 (union of scorers) annotations

225

on signals from channel C4 (C3 if C4 not available), using as reference the mastoid electrode.

t

One reason for using a consensus vector on SSC and WSC is the quality of some annotations. Indeed some

ip

annotations are questionable as certain experts were only able to annotate start times and durations up to a 0.5 s

cr

precision.

MESA dataset. A fourth dataset, publicly available, was finally used: MESA [56]. 1000 PSG recordings from 1000 subjects (528 females, 472 males, 69.3 ± 9.0 years old) originally sampled at 256 Hz were used for the experiments.

us

230

The signals from the 3 EEG channels (Fz-Cz, Cz-Oz, C4), and the 2 EOG channels (EOG left and right) were considered in this work. MESA dataset was used to study the detection of arousals which have been annotated

an

based on visual investigation of C4 channel. 3.2. Evaluation methodology

Cross-validation. A 5 split cross-validation was used on SS2, SSC and WSC. This scheme provides a stable estima-

M

235

tion of performance (with low variance) while ensuring enough data are available at each split for training, validation and testing. This also reduces the computational burden compared to a leave-one-subject-out CV scheme.

d

On SS2, a split stands for 10 training, 5 validation and 4 testing PSG recordings. On SSC, a split corresponds

240

Ac ce pt e

to 15 training, 5 validation and 6 testing PSGs. For WSC, a split stands for 19 training, 5 validation and 6 testing PSGs. On MESA, a single split was performed, using 400 PSGs for training, 100 for validation and 500 for testing. Cross-validation details are summarized in Table 2.

The reported performances were averaged over the splits. We furthermore ensure that each method, on a given split, has access to the same recordings as the competitive approaches for training (resp. validation / resp. testing). This offers a fair comparison of the different approaches. 245

Metrics. By event metrics were used to quantify detector performances for detection and localization of events [3]. These metric rely on the following IoU criterion: for a given δ > 0, a predicted event was considered as a true positive if it exhibited an IoU ≥ δ with a true event, otherwise it was considered as a false positive. The numbers of positives and true positives were evaluated to compute precision, recall and F1 scores of detectors for different overlapping criterion δ ∈ {0.1, 0.2, . . . , 0.9}. When δ = 0.1 a predicted event which exhibits at least some small

250

overlap with a true event might be considered a true positive, and when δ = 0.9 only a predicted event which exhibits a high overlap with a true event is considered a true positive. Evaluation was performed for δ ∈ {0.1, 0.2, . . . , 0.9}, on entire PSGs, each being taken individually. Reported performances were averaged by values of δ over PSG recordings from the testing set. Compared methods. For spindles detection, 3 state-of-the-art alternative methods were compared: Parekh et al.

255

2017 [14], Lajnef et al. 2017 [15] and Lachner-Piza et al. 2018 [18]. One can consider Parekh et al. 2017 and Lajnef et al. 2017 as good representatives of state-of-the-art non-supervised detectors (actually signal processing like detectors). Lachner-Piza et al. 2018 is a representative of supervised detectors of spindles. Benchmarks relied

7

Page 9 of 24

on codes provided by the original authors2 . For K-complexes detection, Lajnef et al. 2017 [15] was considered as the baseline comparator. This algorithm was designed to detect negative peaks of K-complexes without detecting 260

precisely start times and end times of these events. The convention employed by the authors was therefore used: start time was predicted as 0.1 s before the negative peak and end time as 1.3 s after the negative peak. Hyper-parameters of Parekh et al. 2017, λ3 and threshold were searched over {10, 20, 30, 40, 50} × {0.5, 1, 1.5, 2, 2.5, 3}. Threshold parameter of Lajnef et al. 2017 was selected in {0, 25, . . . , 250} for spindles detection and in {−100, −95, −90, . . . , 0} for K-complexes detection. Hyper-parameters were selected by grid search on the training and validation PSG recordings at hand. The

265

selection was performed in order to maximize F1 scores. More precisely, for a given overlapping criterion δ ∈ {0.1, . . . , 0.9} of interest, F1-scores associated to every set of hyper-parameters were evaluated on validation PSG recordings. The set achieving the highest F1-score was selected to predict PSG metrics of the testing set, computed with respect to this specific δ. The process is repeated for every δ ∈ {0.1, 0.2, . . . , 0.9}, meaning that for each δ a 270

potentially different set of optimal hyper-parameters is selected. For Lacher-Piza et al. 2018, the code provided by the authors came as a stand-alone software allowing neither

t

retraining of the underlying detector nor hyper-parameter tuning. We therefore asked for Lachner-Piza et al. 2018 ’

ip

detector to perform well across unknown datasets without hyper-parameter selection, a more difficult task referred to as inter dataset generalization. This likely led to lower performances than the ones this approach could potentially deliver if training and hyper-parameters tuning had been possible.

cr

275

Proposed approach. For detection of spindles and K-complexes, the network was provided with 20 s EEG samples,

us

sampled at a sampling frequency Fs specific to each dataset. Furthermore, the feature extraction module φT was built with K = 8 blocks. Thus, x ∈ RC×T , with T = 20 × Fs . Only the case C = 1 was considered for spindles

280

an

and K-complexes. For arousals detection, the network was provided with 2 min samples, down-sampled to 128 Hz. For this task, if not mentioned, C = 1. For every task, a normalization was applied to each sample x: centering and standardization by dividing each centered signal by its standard deviation computed on the full recording. We

Duration (s)

256

20

SSC

128

20

WSC

MESA

200 128

20



120

C

# records

lr

(training / validation / testing)

5120

1

10−4

10 / 5 / 4

2560

1

10−3

15 / 5 / 6

1

10−3

19 / 5 / 5

Ac ce pt e

SS2

T

d

Fs (Hz)

M

summarize the training details of the proposed approach on each dataset in Table 2.

4000

15360

1-5

−3

10

400 / 100 / 500

Table 2: Datasets properties and training parameters we considered per dataset. Fs is the sampling frequency of the given dataset. Duration stands for the duration of the input windows used in the the proposed approach. T is the resulting number of time steps. C stands for the number of channels that was considered. lr stands for the learning rate used during stochastic gradient descent training. ∗:

Note that PSG recordings from MESA were down-sampled to 128 Hz, which is in practice performed at the level of batch-sampling

This approach was implemented using the PyTorch library [57]. The training and inference operations were run on a NVIDIA P6000 GPU. Training took about 20 minutes for WSC, SSC and SS2. Inference took about 3 285

seconds per recording. Minimizing (3) was achieved using a stochastic gradient descent, with a learning rate of lr = 10−4 on MASS (lr = 10−3 on SSC, WSC, MESA), a momentum µ = 0.9 and a batch size of 32. 200 training epochs were considered. During the training process, if not explicitly mentioned, the method is fed with balanced batches of EEG samples: each batch contains 50% of EEG samples containing at least a true event, and the 50% remaining contained EEG samples with no annotated event at all. In practice, samples were drawn at a random

290

time location in the record until they match the condition of having at least one event or not. When a true event was partially included in a sample, its label l was set to 0 if less than 50% of this event was part of that sample. Early stopping was used to stop the training process when no improvement was observed on the loss evaluated on the validation data over 10 consecutive epochs. Furthermore, a learning rate decrease procedure was used: when no 2 Lajnef

et al. 2017 : https://github.com/TarekLaj/SPINKY, Parekh et al. 2017 : https://github.com/aparek/mcsleep, Lachner-

Piza et al. 2018 : https://github.com/mossdet/Mossdet

8

Page 10 of 24

progress was observed on the validation loss after 5 consecutive epochs, the learning rate was divided by 2. Learning 295

rate decrease on plateau is commonly used for training neural networks [58]. At prediction time, consecutive EEG samples were sampled from entire PSG recordings, and the network predicted on each of these samples. Matching hyper-parameter η was fixed to η = 0.5. The hard negative mining strategy used at the level of a single EEG sample has to deal with the class imbalance between default events matching a true event and default events matching no true event (which are the most numerous). To address this issue we set the proportion of default

300

events containing a true event versus the defaults events containing no true event to 1/3. The minimum number of default events containing no true event to involve in the loss was fixed to 10. Selecting a minimal number of default events matching no true event ensures that the network learns something on each sample, even if it does not contain a true event. Non-maximum suppression was applied to merge potential events exhibiting at least an IoU ≥ 0.4. Default events for spindles and K-complexes detection were fixed as 1 s sliding windows with 75% overlap between

305

two consecutive default events, resulting in Nd = 80 default events. Default event hyper-parameters for arousals detection were investigated in a dedicated experiment. A potential event (tˆci , tˆdi , π ˆi ) was considered a positive event of label l if π ˆil ≥ θl , θl ∈ [0, 1] being a detection threshold specific to label l. This means that a predicted event was

t

considered as a positive event of label l ∈ L if the probability of this label π ˆil is higher than a certain cross-validated

310

ip

detection threshold θl .

The detection threshold θl for detecting events of label l ∈ L was selected by cross-validation for any overlapping

cr

criterion of interest δ ∈ {0.1, . . . , 0.9} following a process similar to the one used for the baselines. For every δ, hyper-parameter θl was selected by grid search over the validation data to maximize the F1 score. More precisely,

us

for a given δ, the network was used to predict on the validation PSG recordings with different detection thresholds θl ∈ [0, 1] leading to different precision, recall and F1 scores. The detection threshold achieving the highest F1 score 315

with respect to this criterion δ was used for performance evaluation on the testing set. The process was repeated

an

for any δ ∈ {0.1, . . . , 0.9}.

In summary, to use the proposed approach one needs to select the following parameters: (1) the default events parameters depending on the events to detect (a priori knowledge), (2) the learning rate by monitoring the training

320

M

and validation losses (experimental knowledge) (3) the detection threshold θl by cross-validation (experimental knowledge).

d

3.3. Results

Ac ce pt e

We now provide the results on the different detection tasks (spindles, K-complexes, arousals). Spindles detection. In this experiment, we compare the proposed approach with three alternative methods on the spindle detection task over 3 different datasets: SS2, SSC and WSC. We perform 3 intra-dataset benchmarks. We 325

report the detection performances of every approach obtained on each dataset in Figure 2. We also report some statistics about the datasets in Figure 3.

A first observation is that, on every dataset, the proposed approach outperforms the compared methods in terms of precision / recall at IoU = 0.3 and in terms of F1 score for any IoU. One can also see on SS2, that the detection performances are quite stable for any IoU in [0.1, 0.7]. 330

A second major observation is that the learning task appears more difficult on SSC and WSC datasets than on SS2. Indeed, performances reported on SS2 are higher than those reported on SSC and WSC. Also, the standard deviations obtained on SS2 are smaller than those obtained on SSC and WSC. Finally the losses obtained on SS2 are smaller than the ones obtained on SSC and WSC, see Figure 13 and Figure 14 in Appendix. This observation is made, although the statistics about the datasets are quite similar, especially regarding the quantity of training samples,

335

see Figure 3. Yet, the frequency analysis indicates that the spindles from SS2 exhibit a more salient frequency content in the 11 − 16 Hz frequency band compared to SSC / WSC, see Figure 3 - E. This might explain the lower performances obtained on SSC and WSC compared to the ones obtained on SS2. Another possible explanation is that the datasets are collected on different populations and annotated by different scorers and according to different annotation merging strategies. This is likely to impact the performances of the model.

340

Spindle detection and consensus level. In this experiment, we benchmark the proposed approach on the spindles detection task, on SSC and WSC datasets. Annotations built with 3 different consensus levels are considered:

9

Page 11 of 24

SSC

1.0

0.6

0.6

0.6

Precision

0.8

Precision

0.8

0.4

0.2

0.4

0.2

0.00.0

0.2

0.4

0.6

0.8

Recall

1.0

1.0

0.2

0.00.0

0.2

0.4

0.6

0.8

Recall

1.0

0.00.0

1.0

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

0.4

IoU

0.6

0.8

1.0

0.4

0.8

1.0

0.00.0

0.2

0.4

IoU

0.6

Lajnef et al. 2017

0.8

0.00.0

1.0

IoU

0.6

0.8

1.0

Lachner-Piza et al 2018

Proposed approach

cr

Parekh et al. 2017

0.2

0.6

Recall

F1

0.6

F1

0.8

F1

0.8

0.2

0.4

1.0

0.8

0.00.0

0.2

t

Precision

0.8

0.4

WSC

1.0

ip

SS2

1.0

Figure 2: Spindle detection: general benchmark on 3 different datasets: SS2, SSC and WSC. First row: averaged precision / recall

us

at IoU = 0.3. Second row: F1 score as a function of IoU. Standard deviation at each IoU are provided as errorbars. The proposed approach outperforms the 3 alternative methods on each metric (Precision, recall and F1 score).

SSC

Dataset

WSC

15000

SSC

WSC

1.5

0.10

0.5

0 SS2

SSC

Dataset

WSC

0.0 SS2

SS2 SSC WSC

0.15

1.0

5000

E

0.20

Duration (s)

10000

Dataset

D

2.0

Normalized spectrum

0 SS2

train validation

an

5000

C

20000

M

10000

B

# events for learning

# events - total

15000

1500 1250 1000 750 500 250 0 SS2

# events - per record

A

20000

0.05

SSC

Dataset

WSC

0.005

10

15

20

Frequency (Hz)

25

d

Figure 3: Statistics about spindles over the different datasets: the datasets contain comparable amounts of labeled events, yet the frequency content corresponding to spindles 11 − 16 Hz is more salient on SS2 than on SSC / WSC. A: total number of spindles. B:

Ac ce pt e

averaged number of spindles per PSG recording. C: averaged numbers of events used for training and validation. D: averaged durations of events. E: normalized spectrum of 2 s of signal centered on spindles. This is actually the spectrum averaged over all the spindles annotated on a dataset. The normalization is obtained by dividing the power of each frequency bin by the total power between [0, Fs /2]

κ ∈ {0.2, 0.4, 0.6}. We report in Figure 4 the statistics of annotated events depending on the consensus level considered. The obtained performances are reported in Figure 5.

5000

0 0.2 0.4 0.6 0.8 1.0 consensus level

20000 15000 10000 5000

C

20000

# events for validation

10000

B 1000 800 600 400 200 0 0.2 0.4 0.6 0.8 1.0 consensus level

# events for training

15000

A

# events - per record

# events - total

20000

0 0.2 0.4 0.6 training consensus level

D

15000 10000 5000 0 0.2 0.4 0.6 training consensus level

Figure 4: Statistics about spindles on SSC and WSC as functions of the consensus level: the consensus level influences greatly the number and the duration of the events annotated by the consensus of scorers. A: total number of spindles. B: averaged number of spindles per PSG recording. C: averaged numbers of events used for training. D: averaged numbers of events used for validation

One can first observe in Figure 4 that the numbers and the durations of events decrease as the consensus level 345

increases. This leads to fewer training and validation events for learning when κ increases. The changes induced by the consensus level on the annotated events statistics induce significant changes in the obtained performances, see Figure 5. Indeed, at constant testing level, the F1 score at IoU = 0.3 decreases when the training consensus level increases. This might be due to the fact that fewer training and validation samples are available. Note that,

10

Page 12 of 24

SSC

1.0

1.0 0.8

F1 (IoU = 0.3)

0.8 0.6

0.6

0.4

0.4

0.2

0.2

0.0

0.0

0.2 0.4 0.6 consensus level for training

WSC consensus level for testing 0.2 0.4 0.6

0.2 0.4 0.6 consensus level for training

Figure 5: Benchmark on spindles detection as a function of the consensus levels used for training and testing on WSC and SSC: the F1 score at IoU = 0.3 is greatly impacted by both the consensus level used for training and the consensus level used for testing

at constant training consensus level, the F1 score at IoU = 0.3 decreases when the testing consensus level increases on SSC. This might be explained by the fact that the events used for training are of poor quality compared to the

t

350

ip

events used for testing. This will be further commented in the discussion of this work.

Finally, on WSC, at constant training consensus level, the higher F1 score is always reached for a testing

cr

consensus level equal to the training one. This might be explained by the fact that the quality of annotated events

355

us

for training and testing are similar.

K-complexes detection. In this experiment, we perform a general benchmark on MASS SS2 dataset, on K-complexes detection and demonstrate that the proposed approach outperforms current state-of-the-art methods Lajnef et al.

an

2017 [15]. We report the obtained performances in Figure 6. We furthermore report some statistics about the K-complexes annotations over SS2 in Figure 7.

A

1.0

M

1.0 0.8

0.8

Ac ce pt e

0.4 0.2

0.00.0

Proposed approach Lajnef et al. 2017

0.6

F1

d

Precision

0.6

B

0.2

0.4

0.6

Recall

0.8

0.4 0.2

1.0

0.00.0

0.2

0.4

IoU

0.6

0.8

1.0

Figure 6: K-complexes detection: the proposed approach outperforms the alternative approach [15]. A: Precision / Recall at IoU = 0.3. B: F1 score as a function of IoU.

The proposed approach outperforms the approach from Lajnef et al. 2017 [15] in terms of Precision / Recall 360

at IoU = 0.3, and in terms of F1 score for any IoU. Furthermore, the proposed approach seems to predict start and end times quite precisely. Indeed, the F1 score as a function of IoU is rather stable for IoU ∈ [0.1, 0.7]. Note however, that Lajnef et al. 2017 is not able to predict accurately the start and end times of events and can only detect the negative peaks of K-complexes. Indeed, the start and end times are predicted empirically as occurring 0.1 s before and 1.3 s after the negative peaks of the K-complexes following the authors’ pipeline [15]. This penalizes

365

negatively this approach as it assumes that the duration of predicted events is always of 1.4 s whereas the duration of true events is around 0.8 s on average (see Figure 7). This may explain the drop of F1 score for IoU ≥ 0.3. Joint spindles and K-complexes detection. In this experiment, we perform a general benchmark on joint Kcomplexes and spindles detection and we demonstrate that detecting both events at the same time leads to detection performances similar to the ones obtained when performing each detection task separately. We report the results

370

in Figure 8. 11

Page 13 of 24

10000 5000 SS2

800

train validation

15000

600

10000

400 200 0

Dataset

C

20000

SS2

0

1.5 1.0

5000

Dataset

D

2.0

Duration (s)

# events - per record

# events - total

15000

0

B

1000

# events for learning

A

20000

0.5 0.0

SS2

Dataset

SS2

Dataset

Figure 7: Statistics about K-complexes on MASS SS2. A: total number of events. B: averaged number of events per PSG recording. C: averaged numbers of events used for training and validation. D: averaged duration of events.

A

1.0

B

1.0

0.6

0.6

0.2

0.2 0.2

0.4

0.6

Recall

0.8

1.0

0.00.0

0.2

0.4

IoU

0.6

0.8

1.0

us

0.00.0

ip

0.4

cr

0.4

t

F1

Precision

0.8

Training separately Training jointly spindles spindles K-complexes K-complexes 0.8

Figure 8: Joint spindles and K-complexes detection: learning to detect both events jointly or separately leads to the same performances.

an

Learning to detect both events jointly or separately leads to similar performances in terms of precision and recall at IoU = 0.3 and F1 score as a function of IoU.

M

Sampling. In this experiment, we investigate the influence of the batch sampling strategy. We vary the proportion of samples in a training batch which contains a true event from 0.1 to 1. and we quantify its influence on the F1 375

score at IoU = 0.3. In other words, setting the proportion to 0.5 means that half of the samples present in the

Ac ce pt e

1.0

d

batch contains a true event. We report the obtained performances on SS2 dataset in Figure 9.

F1 (IoU = 0.3)

0.8 0.6 0.4

Training separately spindles K-complexes

0.2

0.00.0

0.2 0.4 0.6 0.8 1.0 Proportion of samples containing a true event in training batches

Figure 9: Influence of batch sampling on detection performances: using batches only composed of samples containing a true event leads to lower performances than using batches composed of a mix of samples with a true events and ”empty samples”.

One can observe that the proportion of samples in a training batch, which contain a true event does not influence significantly the proposed approach’s performance on spindles detection. However, for K-complexes detection, a higher proportion of samples containing a true event in a batch leads to 380

a significantly lower detection performance. This might be due to the following facts. First, K-complexes are events with a low frequency content arising in N2 sleep stage [2] and similar low frequency content events also occur in N3 sleep stage under the form of grouped slow oscillations. Second, with high proportions of samples containing

12

Page 14 of 24

a true event the proposed approach is likely to never learn to differentiate a K-complex from a slow oscillation. The analysis of predictions with respect to sleep stages (not shown) agrees with this explanation. Indeed, a higher 385

proportion (in training batches) of samples with a true event leads to a higher number of predicted K-complexes in period of N3 sleep. However no K-complex has been annotated over these periods of N3. To prevent such an issue, a 0.5 proportion of samples containing a true event appears as a good compromise. The same observations apply for the proposed approach when it is trained to detect jointly spindles and Kcomplexes (not shown).

390

Learning curves: Do we have enough data?. In this experiment, we investigate the influence of the quantity of labeled events used for training on the performances of the proposed approach on SS2, SSC and WSC. We report the F1 scores at IoU = 0.3 when the number of training PSG recordings varies. The number of PSG recordings used for training is selected in {1, 2, 4, 6, 8, 10} for SS2, in {1, 2, 4, 6, 8, 10, 15} for SSC and in {1, 2, 4, 6, 8, 10, 15, 20} for WSC. The number of PSG recordings used for validation and testing remains fixed as described in the description of the datasets. The results are reported in Figure 10.

0.00

0.2 5 10 15 # training records

0.00

20

t

ip

# training events

0.4

0.2

14000 SS2 spindles 12000 K-complexes 10000 8000 6000 SSC spindles 4000 WSC 2000 spindles 00 20 5 10 15 20 # training records

cr

spindles K-complexes 5 10 15 20 # training records

D

0.6

0.4

Training separately

C

0.8

0.6

0.4

0.00

1.0

0.8

0.6

0.2

B

F1 (IoU = 0.3)

F1 (IoU = 0.3)

0.8

1.0

5 10 15 # training records

us

A

1.0

F1 (IoU = 0.3)

395

an

Figure 10: Learning curves on SS2, SSC and WSC: more training PSG recordings leads to higher detection performances especially on SSC and WSC. A: SS2 - training separately. B: SSC. C: WSC. D: numbers of training events as a function of numbers of training PSGs for each dataset.

M

The learning curves on SS2 demonstrate that learning from few PSG recordings leads to quite good performances. Indeed, the proposed approach exhibits a high F1 at IoU = 0.3 even when trained on events from a single PSG. On the other hand, the learning curves on SSC and WSC lead to slightly different observations: learning from just a couple of PSGs is not sufficient to get good performances. Indeed, the performance reached when using the maximum number of training PSGs available is much higher than the performance obtained when using a few training PSGs. This experiment first indicates that the spindle detection task is more complex on SSC and WSC than on SS2, which agrees with observations previously made. Second, it suggests that we might lack labeled data on SSC and WSC to obtain detection performances similar to the ones obtained on SS2.

Arousal detection. In this section, we demonstrate that the proposed approach can be used to detect other types of events, here arousals. We investigate the following technical questions: (1) the influence of the duration of default events, (2) the influence of the quantity of training data, and (3) the influence of the number of channels considered.

To do so, we first train the proposed approach while varying the duration td of default events from 2 s up to 40 s, keeping an overlap of 50% between two consecutive default events: default events are generated every td/2 s, ensuring that only 2 default events overlap. We report the obtained results in Figure 11 - A. We furthermore investigate the influence of the quantity of training data available, modulated by the quantity of spatial information that is processed, in Figure 11 - B. We additionally report some statistics about the events annotated over MESA in Figure 12.

Three observations can be made from the reported results. First, scaling the duration of default events to the averaged duration of true events leads to the best performances, see Figure 11 - A. This strategy shall be considered as the preliminary step when investigating the detection of new types of events with the proposed approach. Second, Figure 11 - B demonstrates that the more data the better, yet for this task and with the chosen network architecture, the metric increases at a limited pace after 100 training PSG recordings. Finally, using more channels leads to a significant boost in detection performances and may also be used to balance a potential lack of training PSGs, as shown in Figure 11. This positive impact of the number of channels was also reported in [29].
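The sketch below illustrates how a grid of default events with a given duration and a 50% overlap can be generated over an analysis window; the function name and the window-based formulation are assumptions made for illustration.

```python
def default_event_grid(window_duration, default_duration, overlap=0.5):
    """Generate (start, end) times of default events tiling a window.

    Consecutive default events are shifted by default_duration * (1 - overlap),
    so with overlap=0.5 default events are generated every half duration and
    only two default events overlap at any time, as described for arousals.
    """
    step = default_duration * (1.0 - overlap)
    grid, start = [], 0.0
    while start + default_duration <= window_duration:
        grid.append((start, start + default_duration))
        start += step
    return grid

# e.g. 10 s default events over a 90 s window -> one default event every 5 s
# print(default_event_grid(90, 10))
```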



Figure 11: Application of the proposed approach to arousal detection: the parameterization of default events appears as important as the quantity of training data or the quantity of spatial information. A: F1 score at IoU = 0.3 as a function of default event duration. B: F1 score at IoU = 0.3 as a function of the quantity of training PSG recordings and the number of channels.

Figure 12: Statistics about arousals in the 1000 PSG recordings from MESA used for this detection task. A: total number of annotated events in the 1000 PSGs. B: average number of events per record. C: distribution of event durations.

4. Discussion

This work reports on the successful use of a novel approach to simultaneously detect multiple microarchitectural events in EEG signals. It is based on object recognition approaches developed for computer vision and is inspired by the SSD [22] and YOLO [23] approaches. The model builds by back-propagation a feature representation of the data relevant for the task to perform and learns how to predict centers, durations and classes of events of interest. The approach is versatile enough to detect any type of event provided some labeled samples are available. A major advantage of the proposed approach compared to state-of-the-art baselines is that it can detect multiple events at multiple time scales simultaneously. It could therefore easily be extended to detect other types of events in other modalities or concurrent signals (for example sleep disordered breathing events on breathing channels or

Periodic Leg Movements during sleep on leg EMGs), provided labeled events are available. This work also raises interesting technical issues that are essential to a successful application of this algorithm, such as the importance of including a heterogeneous set of clinical PSGs of different origins, having sufficient data available for training, the importance of correct labeling and of defining an optimal duration for default events.

Impact of clinical population and labeling errors. The importance of datasets and of proper labeling is best illustrated by the significant gap in performances observed between the SS2 and SSC / WSC datasets in detecting spindles using all considered methods (see also Figure 14 in Appendix). One factor that may explain the differences in performance is related to the saliency of the frequency content inside the 11 - 16 Hz band. Indeed, the power spectrum in this band is much higher for SS2 than for SSC / WSC, see Figure 3 - E. The frequency difference might be due to population differences between the SS2 and SSC / WSC datasets. Indeed, subjects in SSC / WSC are older, and it is known that age affects spindle frequency, duration, density and amplitude [4]. These age dependent changes may make spindle detection harder for experts and for algorithms. It also illustrates the need to test performance for any new method on multiple datasets. Another likely factor that affects performance across datasets is the quality of expert annotations. Indeed, some scorers of the SSC and WSC data were unable to precisely mark start and end times of events (up to a

0.5 s precision) due to PSG viewer limitations. Annotations can depend on guidelines given to scorers, their interpretations, and other technical issues such as the viewer used [3]. Taking the union of scorers' annotations allowed us to partially cope with this problem, but not entirely. Indeed, there are no strong guarantees that a consensus strategy will improve annotation quality. Besides, a higher consensus level leads to fewer annotated events, making it harder to train using the proposed approach (see Figure 4). One possible strategy to improve on this issue could be to use a larger number Ns of scorers, as previously performed [3], and to select only the K best scorers depending on their overall performances compared to a consensus of the Ns - 1 other scorers.

Quantity of available data for training. Deep learning methods are known to require a large quantity of labeled data for training. Yet, there are now a number of works that have successfully trained deep nets on both large EEG datasets formed by thousands of full night recordings [30, 5, 32, 59], and on smaller datasets containing dozens or

hundreds of 10-hour night recordings [25, 27, 29, 31]. It is thus important to investigate how the quantity of data available for training impacts the performance of an approach, which was explored in this paper by varying either the quantity of labeled data used for training or the quantity of spatial information available.

More data can mean different things. It can mean more PSG recordings or more channels, for example with more modalities than just EEG. To address the question of the impact of the number of PSGs, we computed learning curves which present the prediction performance as a function of the number of training samples (Figure 10).

Surprisingly, this analysis shows that training the proposed approach to detect spindles or K-complexes on just 1 PSG already leads to decent performances on SS2. This result is likely explained by the sampling strategy used in the training process.

Figures 10 - B / C also demonstrate that the proposed approach benefits significantly more from the availability of additional PSGs for SSC and WSC. This suggests that the learning task is more difficult on these two datasets.

This can be explained, as already discussed above, by the clinical nature of the population and of the annotations present in these two datasets. A similar comment applies to the detection of arousals, as the F1 score at IoU = 0.3 increases as a function of the number of training PSG recordings and increases at a slower pace beyond 100 training PSGs in Figure 11 - B.

Regarding the quantity of spatial information, using multiple EEG and EOG channels for arousal detection delivers a significant gain in performance. This demonstrates that increasing the number of channels can compensate for a potential lack of training PSG recordings (see Figure 11 - B). Investigating the joint processing of multiple modalities, such as electro-myography, breathing, or pulse-oxymetry signals, remains to be done for event detection, although one can expect an increase in performance [29, 30, 32]. In particular, such a perspective might prove successful to detect specific types of events occurring across multiple modalities, like sleep apnea.

The use of a deep learning approach implies that a certain quantity of labeled data has to be available for training, as demonstrated above. Yet, deep learning also offers some specific solutions to face the limitation of labeled data, such as transfer learning, which avoids training a network from scratch and limits the dependency on labeled data [60]. Transfer learning has already been explored in the case of sleep stage classification

to improve the performance of a predictor on an already known subject [32, 61]. Exploring such a transfer learning strategy could be beneficial for our approach and lead to performance gains.

Tractability of the proposed approach. The training procedure is often tractable only if one has access to dedicated hardware, such as a graphics processing unit (GPU). However, so far, training deep networks to process EEG time series involves less computational effort than in the context of computer vision. This can be explained by the fact that EEG time series are mostly 1D data. As a consequence, the hardware required to train a deep learning model on EEG is still quite affordable.

Impact of default event duration and joint detection of multiple events with variable durations. The parameterization of default events is a crucial step of the proposed approach. While the parameterization of default events for spindle and K-complex detection was quite straightforward because these events exhibit a 1 s averaged duration,

experiments on arousal detection show that the default duration and overlap factor must be carefully selected. Indeed, a bad parameterization compromises training, as some true events are never matched during training and are thus never used to train the network localization and prediction modules. This reduces the effective number of samples considered for training and makes recognition of such events more difficult. The selection of a good default duration is a crucial first step when investigating the use of the proposed approach for a new type of event. A good heuristic

though is to set the duration of default events to the averaged duration of events in the training set. Of note, detecting multiple events jointly is possible with the proposed model, as demonstrated by Figure 8. However, it shall be stressed that in this experiment we explored the joint detection of spindles and K-complexes, features which exhibit about the same durations. To detect events of different durations, e.g. spindles and arousals, one would need to parameterize default events at multiple scales, which was not investigated in this work.
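As an illustration of why the default duration matters, the sketch below implements an IoU-based assignment of true events to default events together with the average-duration heuristic mentioned above. The IoU threshold of 0.5 and the function names are assumed values for illustration, not necessarily those used in this work.

```python
def _iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_true_to_defaults(true_events, default_events, iou_thresh=0.5):
    """Assign each true event to the default events overlapping it enough.

    Returns a dict: default event index -> matched true event index.
    Default events without a match act as background during training; a true
    event matching no default event is effectively lost for training, which is
    why default durations must be on the scale of the true event durations.
    """
    matches = {}
    for j, t in enumerate(true_events):
        for i, d in enumerate(default_events):
            if _iou(d, t) >= iou_thresh:
                matches[i] = j
    return matches

def heuristic_default_duration(true_events):
    """Heuristic from the text: use the averaged duration of training events."""
    durations = [end - start for start, end in true_events]
    return sum(durations) / len(durations)
```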


The influence of the model architecture and of balanced sampling on performance. The proposed approach exhibits a VGG-like architecture [62], where for each convolution block the number of feature maps is multiplied by 2 while the temporal resolution is divided by 2. Such a choice makes it possible to extract relevant features for the considered tasks, although more complex architectures remain to be investigated. The proposed approach exhibits much higher performances than our earlier work [36]. This stems from two major changes.
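A minimal PyTorch sketch of the kind of VGG-like 1D feature extractor described here, where each block doubles the number of feature maps and halves the temporal resolution. The kernel size, number of blocks and base width are assumptions, not the exact architecture of this work.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel_size=5):
    """One block: convolution + batch norm + ReLU, then halve the temporal axis."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(2),
    )

class VGGLikeFeatureExtractor(nn.Module):
    """Feature maps are doubled and temporal resolution halved at each block."""
    def __init__(self, n_channels=1, base=8, n_blocks=4):
        super().__init__()
        blocks, in_ch = [], n_channels
        for k in range(n_blocks):
            out_ch = base * (2 ** k)
            blocks.append(conv_block(in_ch, out_ch))
            in_ch = out_ch
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):          # x: (batch, channels, time)
        return self.blocks(x)      # (batch, base * 2**(n_blocks-1), time / 2**n_blocks)

# feats = VGGLikeFeatureExtractor(n_channels=1)(torch.randn(2, 1, 2560))
```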


The first major change is related to the network architecture. The implementation of the prediction modules in this work differs from the one embraced in our previous work [36], as we have opted for predicting locations and classes of default events from the whole feature maps returned by φT. This choice had three positive impacts: (1) it boosted the detection performances of K-complexes, (2) it allowed for a simpler parameterization of default events, making them independent of the size of the feature maps returned by φT and allowing to freely choose the number, durations and overlaps of the default events to use, and (3) it allowed to predict a wide range of event durations, from 1 s spindles to 10 s arousals, with the same architecture, provided one took care of the parameterization of default events as mentioned above. Our previous approach predicted potential events by processing 3 consecutive time steps of the feature maps returned by φT, which may not give access to a sufficiently large temporal context for the prediction of long events like arousals.
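The sketch below illustrates the idea of predicting locations and classes of all default events from the whole feature map rather than from a few consecutive time steps. The dense layers and the (center, duration) parameterization are illustrative assumptions, not the exact prediction module of this work.

```python
import torch
import torch.nn as nn

class EventPredictor(nn.Module):
    """Predict localization and class scores for every default event
    from the whole feature map returned by the feature extractor."""
    def __init__(self, feat_channels, feat_length, n_default_events, n_classes):
        super().__init__()
        in_features = feat_channels * feat_length
        # localization head: (center offset, duration offset) per default event
        self.loc = nn.Linear(in_features, n_default_events * 2)
        # classification head: n_classes event types + 1 background class
        self.clf = nn.Linear(in_features, n_default_events * (n_classes + 1))
        self.n_default_events = n_default_events
        self.n_classes = n_classes

    def forward(self, feats):                      # feats: (batch, C, T')
        flat = feats.flatten(start_dim=1)
        loc = self.loc(flat).view(-1, self.n_default_events, 2)
        clf = self.clf(flat).view(-1, self.n_default_events, self.n_classes + 1)
        return loc, clf

# loc, clf = EventPredictor(64, 160, n_default_events=20, n_classes=2)(feats)
```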

The second change is the use of a balanced sampling strategy during training. In the current approach, training batches are composed of 50% of samples containing no event of interest. This led to a significant increase in performance for the detection of K-complexes (see Figure 9). It also contributed to a reduction of false positives predicted in N3 sleep, which resulted in a boost in F1 score.

Comparison with YOLO and SSD and Perspectives. Our algorithm is inspired by both the SSD [22] and YOLO [23] approaches. Similarities between our work and these approaches include the fact that we employ an SSD loss function, which is a combination of a smooth L1 loss and a classification loss. Furthermore, the same matching strategy as SSD is used during training. Finally, the prediction module ψ makes predictions from the whole feature map returned by the feature extractor φT (φC), which is similar to YOLO. To that end, the proposed approach relies on a grid of default events independent of the temporal size of this feature map, which allows for more flexibility in the definition of default events.
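A sketch of such a combined loss is given below, assuming the localization targets have already been encoded for the default events matched to true events and that label 0 denotes the background class; the weighting and any hard-negative mining are omitted, so this is an illustration rather than the exact loss of this work.

```python
import torch
import torch.nn.functional as F

def detection_loss(loc_pred, clf_pred, loc_target, clf_target, alpha=1.0):
    """SSD-style loss: smooth L1 on matched default events + cross-entropy.

    loc_pred   : (batch, n_defaults, 2) predicted (center, duration) offsets
    clf_pred   : (batch, n_defaults, n_classes + 1) class scores (0 = background)
    loc_target : (batch, n_defaults, 2) encoded targets for matched defaults
    clf_target : (batch, n_defaults) integer class labels, 0 for unmatched defaults
    """
    positive = clf_target > 0                         # matched default events
    if positive.any():
        loc_loss = F.smooth_l1_loss(loc_pred[positive], loc_target[positive])
    else:
        loc_loss = loc_pred.sum() * 0.0               # batch without any true event
    clf_loss = F.cross_entropy(clf_pred.flatten(0, 1), clf_target.flatten())
    return clf_loss + alpha * loc_loss
```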

Our proposed approach however also differs from SSD and YOLO because it operates on large objects, such as a 10 h PSG containing mostly no events. The approach was thus further developed to process chunks of signals and trained both on samples containing true events and on samples containing no event of interest (see the balanced sampling strategy discussed above).

The proposed approach could be extended in several directions. First, the model currently predicts from a single level of feature maps. Exploiting several levels of feature maps with multiple temporal resolutions, associated with several scales of default events, as performed by SSD [22] or by more recent approaches such as Feature Pyramid Networks [20] or RetinaNet [21], could be considered. This would likely boost detection performances by enabling the detection of events which exhibit a wide range of durations (from 1 s to possibly several minutes), this being especially useful in the context of full PSG automatic scoring of both macro and microarchitectural events. Second, exploiting more complex prediction modules, combining several convolution layers and non-linearities, could enhance detection performances [21, 20]. Third, minimizing a different loss function, like the Focal Loss [21], should also be considered as a direction of investigation. Finally, events of interest such as spindles might exhibit specific temporal dynamics, with the likelihood of a spindle occurring in a sample being related to the occurrence of spindles in a previous sample. Integrating a temporal context using a recurrent neural network, as performed for EEG processing [28] or for sleep stage classification [63, 33, 31, 5, 64], might enhance detection performance for some

events. In all cases, however, the proposed approach has the considerable advantage of simultaneous multi-event detection, a crucial feature that should make it easier to build additional event detection methods on the same architecture.


Acknowledgments

This work was supported in part by the French Association Nationale de la Recherche et de la Technologie (ANRT) under Grant 2015/1005.

References

[1] T. Porkka-Heiskanen, K. M. Zitting, H. K. Wigren, Sleep, its regulation and possible mechanisms of sleep disturbances, Acta Physiologica 208 (4) (2013) 311–328.
[2] C. Iber, S. Ancoli-Israel, A. Chesson, S. F. Quan, The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications, American Academy of Sleep Medicine, 2007.
[3] S. C. Warby, S. L. Wendt, P. Welinder, E. G. S. Munk, O. Carrillo, H. B. D. Sorensen, P. Jennum, P. E. Peppard, P. Perona, E. Mignot, Sleep spindle detection: crowdsourcing and evaluating performance of experts, non-experts, and automated methods, Nature Methods 11 (4) (2014) 385–392.
[4] S. M. Purcell, D. S. Manoach, C. Demanuele, B. E. Cade, S. Mariani, R. Cox, G. Panagiotaropoulou, R. Saxena, J. Q. Pan, J. W. Smoller, S. Redline, R. Stickgold, Characterizing sleep spindles in 11,630 individuals from the National Sleep Research Resource, Nature Communications 8 (2017) 15930.
[5] J. B. Stephansen, A. N. Olesen, M. Olsen, A. Ambati, E. B. Leary, H. E. Moore, O. Carrillo, L. Lin, F. Han, H. Yan, Y. L. Sun, Y. Dauvilliers, S. Scholz, L. Barateau, B. Hogl, A. Stefani, S. C. Hong, T. W. Kim, F. Pizza, G. Plazzi, S. Vandi, E. Antelmi, D. Perrin, S. T. Kuna, P. K. Schweitzer, C. Kushida, P. E. Peppard, H. B. D. Sorensen, P. Jennum, E. Mignot, Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy, Nature Communications 9 (1) (2018) 5229.
[6] D. S. Manoach, J. Q. Pan, S. M. Purcell, R. Stickgold, Reduced sleep spindles in schizophrenia: A treatable endophenotype that links risk genes to impaired cognition?, Biological Psychiatry 80 (8) (2016) 599–608.
[7] E. S. Musiek, D. D. Xiong, D. M. Holtzman, Sleep, circadian rhythms, and the pathogenesis of Alzheimer Disease, Exp Mol Med 47 (3).
[8] L. B. Ray, S. Sockeel, M. Soon, A. Bore, A. Myhr, B. Stojanoski, R. Cusack, A. M. Owen, J. Doyon, S. M. Fogel, Expert and crowd-sourced validation of an individualized sleep spindle detection method employing complex demodulation and individualized normalization, Frontiers in Human Neuroscience 9 (2015) 507.
[9] E. J. Wamsley, M. A. Tucker, A. K. Shinn, K. E. Ono, S. K. McKinley, A. V. Ely, D. C. Goff, R. Stickgold, D. S. Manoach, Reduced sleep spindles and spindle coherence in schizophrenia: Mechanisms of impaired memory consolidation?, Biological Psychiatry 71 (2) (2012) 154–161.
[10] S. L. Wendt, J. A. E. Christensen, J. Kempfner, H. L. Leonthin, P. Jennum, H. B. D. Sorensen, Validation of a novel automatic sleep spindle detector with high performance during sleep in middle aged subjects, in: Proc. IEEE EMBC, 2012, pp. 4250–4253.
[11] M. Mölle, T. O. Bergmann, L. Marshall, J. Born, Fast and Slow Spindles during the Sleep Slow Oscillation: Disparate Coalescence and Engagement in Memory Processing, Sleep 34 (10) (2011) 1411–1421.
[12] Y. Nir, R. Staba, T. Andrillon, V. Vyazovskiy, C. Cirelli, I. Fried, G. Tononi, Regional Slow Waves and Spindles in Human Sleep, Neuron 70 (1) (2011) 153–169.
[13] F. Ferrarelli, R. Huber, M. J. Peterson, M. Massimini, M. Murphy, B. A. Riedner, A. Watson, P. Bria, G. Tononi, Reduced sleep spindle activity in schizophrenia patients, Am J Psychiatry 164 (3) (2007) 483–492.
[14] A. Parekh, I. W. Selesnick, R. S. Osorio, A. W. Varga, D. M. Rapoport, I. Ayappa, Multichannel sleep spindle detection using sparse low-rank optimization, J. Neurosci. Methods 288 (2017) 1–16.
[15] T. Lajnef, C. O'Reilly, E. Combrisson, S. Chaibi, J.-B. Eichenlaub, P. M. Ruby, P.-E. Aguera, M. Samet, A. Kachouri, S. Frenette, J. Carrier, K. Jerbi, Meet Spinky: An Open-Source Spindle and K-Complex Detection Toolbox Validated on the Open-Access Montreal Archive of Sleep Studies (MASS), Frontiers in Neuroinformatics 11 (2017) 15.
[16] A. Parekh, I. W. Selesnick, D. M. Rapoport, I. Ayappa, Detection of K-complexes and sleep spindles (DETOKS) using sparse optimization, J. Neurosci. Methods 251 (2015) 37–46.
[17] C. R. Patti, T. Penzel, D. Cvetkovic, Sleep spindle detection using multivariate Gaussian mixture models, Journal of Sleep Research 27 (4) (2017) e12614.
[18] D. Lachner-Piza, N. Epitashvili, A. Schulze-Bonhage, T. Stieglitz, J. Jacobs, M. Dümpelmann, A single channel sleep-spindle detector based on multivariate classification of EEG epochs: MUSSDET, J. Neurosci. Methods 297 (2018) 31–43.
[19] C. R. Patti, S. S. Shahrbabaki, C. Dissanayaka, D. Cvetkovic, Application of random forest classifier for automatic sleep spindle detection, in: 2015 IEEE Biomedical Circuits and Systems Conference (BioCAS), 2015, pp. 1–4.
[20] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proc. CVPR, 2017.
[21] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, arXiv:1708.02002.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, A. C. Berg, SSD: Single shot multibox detector, arXiv:1512.02325.
[23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, in: Proc. CVPR, 2016, pp. 779–788.
[24] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: Proc. NIPS, 2015, pp. 91–99.
[25] S. Stober, D. J. Cameron, J. A. Grahn, Using Convolutional Neural Networks to Recognize Rhythm Stimuli from Electroencephalography Recordings, in: Proc. NIPS, 2014, pp. 1449–1457.
[26] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, B. J. Lance, EEGNet: A Compact Convolutional Network for EEG-based Brain-Computer Interfaces, arXiv:1611.08024.
[27] S. Stober, A. Sternin, A. M. Owen, J. A. Grahn, Deep Feature Learning for EEG Recordings, arXiv:1511.04306.
[28] P. Bashivan, I. Rish, M. Yeasin, N. Codella, Learning Representations from EEG with Deep Recurrent-Convolutional Neural Networks, in: ICLR, 2016.
[29] S. Chambon, M. N. Galtier, P. J. Arnal, G. Wainrib, A. Gramfort, A Deep Learning Architecture for Temporal Sleep Stage Classification Using Multivariate and Multimodal Time Series, IEEE Trans Neural Syst Rehabil Eng 26 (4) (2018) 758–769.
[30] A. N. Olesen, P. Jennum, P. Peppard, E. Mignot, H. B. D. Sorensen, Deep residual networks for automatic sleep stage classification of raw polysomnographic waveforms, in: Proc. EMBC, 2018.
[31] H. Phan, F. Andreotti, N. Cooray, O. Y. Chén, M. De Vos, Automatic Sleep Stage Classification Using Single-Channel EEG: Learning Sequential Features with Attention-Based Recurrent Neural Networks, in: Int. Conf. of the IEEE Engineering in Medicine and Biology Society, 2018.
[32] F. Andreotti, H. Phan, N. Cooray, C. Lo, M. T. M. Hu, M. De Vos, Multichannel Sleep Stage Classification and Transfer Learning using Convolutional Neural Networks, in: Proc. EMBC, 2018.
[33] A. Supratak, H. Dong, C. Wu, Y. Guo, DeepSleepNet: a model for automatic sleep stage scoring based on raw single-channel EEG, IEEE Trans Neural Syst Rehabil Eng.
[34] O. Tsinalis, P. M. Matthews, Y. Guo, S. Zafeiriou, Automatic Sleep Stage Scoring with Single-Channel EEG Using Convolutional Neural Networks, arXiv:1610.01683.
[35] Y. Roy, H. J. Banville, I. Albuquerque, A. Gramfort, T. H. Falk, J. Faubert, Deep learning-based electroencephalography analysis: a systematic review, arXiv:1901.05498.
[36] S. Chambon, V. Thorey, P. J. Arnal, E. Mignot, A. Gramfort, A Deep Learning Architecture To Detect Events In EEG Signals During Sleep, in: IEEE International Workshop on Machine Learning for Signal Processing, 2018.
[37] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645. doi:10.1109/TPAMI.2009.167.
[38] L. C. Parra, C. D. Spence, A. D. Gerson, P. Sajda, Recipes for the Linear Analysis of EEG, NeuroImage 28 (2) (2005) 326–341.
[39] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, K. R. Muller, Optimizing spatial filters for robust EEG single-trial analysis, IEEE Signal Processing Magazine 25 (1) (2008) 41–56.
[40] B. D. V. Veen, W. V. Drongelen, M. Yuchtman, A. Suzuki, Localization of Brain Electrical Activity via Linearly Constrained Minimum Variance Spatial Filtering, IEEE Transactions on Biomedical Engineering 44 (9) (1997) 867–880.
[41] B. Rivet, A. Souloumiac, V. Attina, G. Gibert, xDAWN algorithm to enhance evoked potentials: Application to brain-computer interface, IEEE Transactions on Biomedical Engineering 56 (8) (2009) 2035–2043.
[42] A. Rakotomamonjy, V. Guigue, BCI Competition III: Dataset II - Ensemble of SVMs for BCI P300 Speller, IEEE Transactions on Biomedical Engineering 55 (3) (2008) 1147–1154.
[43] B. Rivet, H. Cecotti, E. Maby, J. Mattout, Impact of spatial filters during sensor selection in a visual P300 brain-computer interface, Brain Topography 25 (1) (2012) 55–63.
[44] P. Mirowski, D. Madhavan, Y. LeCun, R. Kuzniecky, Classification of Patterns of EEG Synchronization for Seizure Prediction, Clinical Neurophysiology 120 (11) (2009) 1927–1940.
[45] D. F. Wulsin, J. R. Gupta, R. Mani, J. A. Blanco, B. Litt, Modeling electroencephalography waveforms with semi-supervised deep belief nets: fast classification and anomaly measurement, Journal of Neural Engineering 8 (3).
[46] W. Zheng, J. Zhu, Y. Peng, B. Lu, EEG-based emotion classification using deep belief networks, in: Proc. IEEE ICME, 2014, pp. 1–6.
[47] H. Cecotti, A. Gräser, Convolutional Neural Network with embedded Fourier Transform for EEG Classification, in: International Conference on Pattern Recognition, 2008, pp. 1–4.
[48] H. Cecotti, A. Gräser, Convolutional Neural Networks for P300 Detection with Application to Brain-Computer Interfaces, IEEE Trans. Pattern Anal. Mach. Intell. 33 (3) (2011) 433–445.
[49] R. Manor, A. B. Geva, Convolutional Neural Network for Multi-Category Rapid Serial Visual Presentation BCI, Frontiers in Computational Neuroscience 9 (2015) 146.
[50] S. Biswal, J. Kulas, H. Sun, B. Goparaju, M. B. Westover, M. T. Bianchi, J. Sun, SLEEPNET: automated sleep staging system via deep learning, arXiv:1707.08262.
[51] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proc. ICML, 2015, pp. 448–456.
[52] V. Nair, G. E. Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, in: Proc. ICML, 2010, pp. 807–814.
[53] C. O'Reilly, N. Gosselin, J. Carrier, T. Nielsen, Montreal Archive of Sleep Studies: an open-access resource for instrument benchmarking and exploratory research, Journal of Sleep Research 23 (6) 628–635.
[54] O. Andlauer, H. Moore, L. Jouhier, C. Drake, P. E. Peppard, F. Han, S.-C. Hong, F. Poli, G. Plazzi, R. O'Hara, E. Haffen, T. Roth, T. Young, E. Mignot, Nocturnal Rapid Eye Movement Sleep Latency for Identifying Patients With Narcolepsy/Hypocretin Deficiency, JAMA Neurology 70 (7) (2013) 891–902.
[55] T. Young, L. Finn, P. E. Peppard, M. Szklo-Coxe, D. Austin, F. J. Nieto, R. Stubbs, K. M. Hla, Sleep Disordered Breathing and Mortality: Eighteen-Year Follow-up of the Wisconsin Sleep Cohort, Sleep 31 (8) (2008) 1071–1078.
[56] D. A. Dean, A. L. Goldberger, R. Mueller, M. Kim, M. Rueschman, D. Mobley, S. S. Sahoo, C. P. Jayapandian, L. Cui, M. G. Morrical, S. Surovec, G. Q. Zhang, S. Redline, Scaling up scientific discovery in sleep medicine: The National Sleep Research Resource, Sleep 39 (5) (2016) 1151–1164.
[57] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, in: NIPS Workshop, 2017.
[58] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
[59] A. Sors, S. Bonnet, S. Mirek, L. Vercueil, J.-F. Payen, A convolutional neural network for sleep stage scoring from raw single-channel EEG, Biomedical Signal Processing and Control 42 (2018) 107–114.
[60] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger (Eds.), Proc. NIPS, 2014, pp. 3320–3328.
[61] K. Mikkelsen, M. de Vos, Personalizing deep learning models for automatic sleep staging, arXiv:1801.02645.
[62] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556.
[63] H. Dong, A. Supratak, W. Pan, C. Wu, P. M. Matthews, Y. Guo, Mixed Neural Network Approach for Temporal Sleep Stage Classification, arXiv:1610.06421.
[64] H. Phan, F. Andreotti, N. Cooray, O. Y. Chen, M. De Vos, SeqSleepNet: End-to-End Hierarchical Recurrent Neural Network for Sequence-to-Sequence Automatic Sleep Staging, arXiv:1809.10932.
[65] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, Analysis of representations for domain adaptation, in: NIPS, 2006, pp. 137–144.
[66] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. Vaughan, A theory of learning from different domains, Machine Learning 79 (2010) 151–175.
[67] S. Chambon, M. N. Galtier, A. Gramfort, Domain adaptation with optimal transport improves EEG sleep stage classifiers, in: PRNI, 2018, pp. 1–4.

5. Appendix

5.1. Benchmark spindles - learning losses

In this paragraph, we investigate further the performances obtained by the proposed approach on the spindle detection task over the 3 considered datasets: SS2, SSC and WSC. We demonstrate that the spindle detection task is easier for the proposed approach on SS2 compared to SSC and WSC. We report in Figure 13 the F1 scores at IoU = 0.3 and the standard deviations in scores obtained by each method. In Figure 14, we present the classification and the localization losses (on the training and the validation sets) during learning for the first split over each dataset.


Figure 13: Averaged F1 scores at IoU = 0.3 and standard deviations: on SSC and WSC the proposed approach exhibits lower detection performances and larger standard deviations.

Figure 14: Classification and localization losses for spindles detection, on training and validation sets, over the first split of cross-validation for each dataset: the losses agree with the assumption that the spindle detection task is easier on SS2 than on SSC / WSC.

The detection scores reached by the proposed approach exhibit large differences between the 3 considered datasets: the averaged scores are higher on SS2 than on SSC and WSC, and the standard deviations are smaller. The fact that the state-of-the-art baselines also suffer some decrease in their detection performances suggests that the spindle detection task is more difficult on SSC and WSC than on SS2. The classification and localization losses during training agree with this assumption. Indeed, the classification and localization losses exhibited by the proposed approach on SS2 reach much lower asymptotic values than on SSC or WSC. This indicates that the events are much easier to identify and localize on SS2 than on SSC or WSC.


5.2. Generalization to new data

In this experiment, we evaluate the proposed approach and the baselines in the context of an inter-dataset benchmark. This enables us to quantify the generalization performances of the proposed approach on new datasets. For this benchmark, all recordings were resampled to 256 Hz. For the cross-validation, the recordings of two

datasets are used for training and validation (80% of recordings for training, 20% of recordings for validation). All the recordings from the remaining dataset are used for testing.
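A minimal sketch of this inter-dataset protocol is given below, assuming record identifiers are available per dataset; all names are placeholders rather than the exact implementation used here.

```python
import random

def inter_dataset_split(datasets, test_name, val_fraction=0.2, seed=0):
    """Hold out one full dataset for testing; split the others 80/20.

    datasets  : dict mapping dataset name -> list of PSG record identifiers
    test_name : dataset used only for testing (e.g. "WSC")
    """
    rng = random.Random(seed)
    pool = [rec for name, recs in datasets.items() if name != test_name for rec in recs]
    rng.shuffle(pool)
    n_val = int(round(val_fraction * len(pool)))
    return {
        "train": pool[n_val:],
        "val": pool[:n_val],
        "test": list(datasets[test_name]),
    }

# splits = inter_dataset_split({"SS2": ss2, "SSC": ssc, "WSC": wsc}, test_name="WSC")
```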


Figure 15: Spindle detection: inter-dataset benchmark on 3 different datasets when 2 are used for training / validation and the remaining one is used for testing. First row: averaged precision / recall at IoU = 0.3. Second row: F1 score as a function of IoU. The proposed approach exhibits lower performances than in the intra-dataset benchmarks but outperforms the alternative methods on 2 out of the 3 considered datasets.

The inter-dataset benchmark first demonstrates that the proposed approach outperforms the considered baselines on 2 out of the 3 considered tasks, i.e., when SS2 and SSC are used for testing. However, it is outperformed by Lachner-Piza et al. 2018 and exhibits detection performances similar to the other baselines on WSC.

Second, the proposed approach exhibits lower detection performances than those obtained in the intra-dataset benchmarks, especially on SS2 and WSC.

Indeed, the lower performances might be explained by a bad choice of the detection threshold. The precision and recall scores on SS2 and WSC show that the precision is too high on WSC and the recall too high on SS2. A way to validate this assumption could be to use some validation recordings from the testing dataset to

investigate whether this leads to higher performances. Reduced performances on new datasets are a well-known problem in machine learning. It is often referred to as the domain discrepancy [65, 66] that exists between the training data SSC - SS2 (resp. SSC - WSC) and the testing data WSC (resp. SS2). This statistical effect can be induced by a change of population, a change of EEG recording devices or even a change of labeling guidelines. Specific domain adaptation strategies shall be investigated to address this generalization issue [67].
