Real-time EEG classification via coresets for BCI applications

Eitan Netzer¹, Alex Frid²,*, Dan Feldman¹

¹ Robotics and Big Data Lab, Computer Science Department, University of Haifa, Haifa, Israel
² Laboratory of Clinical Neurophysiology, Faculty of Medicine, Technion IIT, Haifa, Israel

Keywords: Machine learning; Coreset; Data structures; On-line learning; Electroencephalogram (EEG); Brain–computer interface (BCI)

Abstract

A brain–computer interface (BCI) based on the motor imagery (MI) paradigm translates a subject's motor intention into a control signal by classifying the electroencephalogram (EEG) signals of different tasks. However, most existing systems either (i) use a high-quality algorithm to train the data off-line and run only the classification in real-time, since the off-line algorithm is too slow, or (ii) use low-quality heuristics that are sufficiently fast for real-time training but introduce relatively large classification error. In this work, we propose a novel processing pipeline that allows real-time and parallel learning of EEG signals using high-quality but potentially inefficient algorithms. This is done by forging a link between BCI and coresets, a technique that originated in computational geometry for handling streaming data via data summarization. We suggest an algorithm that maintains the representation of such coresets tailored to handle the EEG signal, which enables (i) real-time and continuous computation of the common spatial pattern (CSP) feature extraction method on a coreset representation of the signal (instead of the signal itself), (ii) improvement of the CSP algorithm's efficiency, with provable guarantees, by applying the CSP algorithm on the coreset, and (iii) real-time addition of new data trials (EEG data windows) to the coreset. For simplicity, we focus on the classic CSP algorithm; nevertheless, we expect that our coreset will be extended to other algorithms in future papers. In the experimental results, we show that our system can indeed learn EEG signals in real-time, for example, in a 64-channel setup with hundreds of time samples per second.

1. Introduction

Brain–computer interfaces (BCIs) translate brain signals into control signals without using the test subject's actual movements or peripheral nerves. BCIs based on electroencephalogram (EEG) recordings have many advantages, such as short time constraints, fewer environmental limits, and relatively inexpensive equipment requirements. However, EEGs introduce a great deal of noise and necessitate handling large amounts of data in real-time. In addition, these systems usually require time-consuming training phases. In recent years, many techniques have been developed for EEG MI-based BCI systems that achieve high classification accuracy. For example, the average accuracy of classifying imaginary left and right hand movements has in some cases reached more than 90% (Ramoser et al., 2000; Dornhege et al., 2004). Many of these techniques are based on common spatial patterns (CSP) signal decomposition (Ang et al., 2008; Selim et al., 2018; Dornhege et al., 2006; Pfurtscheller et al., 1999; Fukunaga, 2013). Nevertheless, these

systems still have very limited use in real-life applications. This is partly because the algorithms used in these systems focus on analyzing multichanneled and densely sampled EEG signals, which requires relatively expensive equipment due to the need for vast amounts of processing power and memory. An example of such a computational bottleneck in these systems is the CSP algorithm (Blankertz et al., 2008). In essence, it is a ''batch processing algorithm'': whenever a new sample is introduced, the algorithm needs to be re-trained in order to find the updated spatial filters. In this work, we present a method based on a coreset representation (Agarwal et al., 2004) of the EEG signal that can be executed prior to CSP signal decomposition (see the visualization in Fig. 1), and can thus reduce both computational cost and memory consumption without sacrificing classification accuracy. This in turn will allow for the use of cheaper, low-powered hardware in BCI devices.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103455. ∗ Corresponding author. E-mail address: [email protected] (A. Frid).

https://doi.org/10.1016/j.engappai.2019.103455
Received 6 December 2018; Received in revised form 5 July 2019; Accepted 23 December 2019
© 2020 Elsevier Ltd. All rights reserved.

1.1. Background

The common spatial pattern method was first used in EEG analysis to extract abnormal components from clinical EEGs (Koles, 1991), and in later stages, it was adopted for use in BCI applications. In essence, this method weights each electrode according to its importance in the discrimination task and suppresses noise in individual channels by using correlations between neighboring electrodes.

Let $X_1 \in \mathbb{R}^{d \times t_1}$ and $X_2 \in \mathbb{R}^{d \times t_2}$ be multivariate signals of degree $d$, where $d$ is the number of electrodes or sensors and $t_i$ is the number of time samples. For example, in MI, $X_1$ and $X_2$ may represent the signal associated with a subject imagining the movement of his left or right hand. CSP determines the component $w$ that maximizes the ratio of variance between $X_1$ and $X_2$ (Dornhege et al., 2006), over every $w \in \mathbb{R}^d$ for which $\|w X_2\|^2$ is non-zero:

$$w \in \arg\max_{w} \frac{\|w X_1\|^2}{\|w X_2\|^2}. \tag{1}$$

In order to solve this problem (Legendre and Fortin, 1989), the CSP algorithm first computes the covariance matrices $R_i = \frac{X_i X_i^T}{t_i}$ for $i = 1, 2$, and then performs simultaneous diagonalization of both matrices ($R_2^{-1} R_1$) using generalized eigenvalue decomposition. Since $X_i$ is of degree $d$, $R_i$ is of full degree and invertible. Let $U$ be an eigenvector matrix and $D$ a diagonal matrix of eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_d\}$ in decreasing order, such that $U^{-1} R_1 U = D$ and $U^{-1} R_2 U = I_d$, where $I_d$ is the identity matrix. Then $R_2^{-1} R_1 = U U^{-1} U D U^{-1} = U D U^{-1}$, hence its equivalence to the eigendecomposition of $R_2^{-1} R_1$, and $w^T$ corresponds to the first column of $U$. A more detailed description is provided in Algorithm 1 below.

Data: $X_1 \in \mathbb{R}^{d \times t_1}$ and $X_2 \in \mathbb{R}^{d \times t_2}$, multivariate signals, where $d$ is the number of electrodes or sensors and $t_i$ is the number of time samples from class $i = 1, 2$.
Result: component $w^T$ that maximizes the ratio between $X_1$ and $X_2$, as in Eq. (1).
Function CSP($X_1$, $X_2$):
1. $R_1 \leftarrow \frac{X_1 X_1^T}{t_1}$, $R_2 \leftarrow \frac{X_2 X_2^T}{t_2}$
2. $A \leftarrow R_2^{-1} R_1$
3. compute $D, U \leftarrow$ the eigendecomposition of $A$
4. $w^T \leftarrow U_1$ (the first column of $U$)
5. return $w$
Algorithm 1: Computation of the Common Spatial Pattern (CSP)
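To make the procedure concrete, here is a minimal NumPy sketch of Algorithm 1. This is our own illustrative rendering, not code from the paper; the function name `csp` and the use of `numpy.linalg.eig` are our choices.

```python
import numpy as np

def csp(X1, X2):
    """Batch CSP (Algorithm 1): return the spatial filter w that
    maximizes ||w X1||^2 / ||w X2||^2, as in Eq. (1)."""
    R1 = X1 @ X1.T / X1.shape[1]        # step 1: per-class covariance matrices
    R2 = X2 @ X2.T / X2.shape[1]
    A = np.linalg.inv(R2) @ R1          # step 2: A = R2^{-1} R1
    evals, U = np.linalg.eig(A)         # step 3: eigendecomposition of A
    first = int(np.argmax(evals.real))  # column of U with the largest eigenvalue
    return U[:, first].real             # step 4: first column of (sorted) U

# Example: two random 64-channel signals with different variance profiles
rng = np.random.default_rng(0)
X1 = rng.standard_normal((64, 1000))
X2 = 2.0 * rng.standard_normal((64, 1200))
w = csp(X1, X2)
```

Note that every call re-reads all $t_1 + t_2$ samples to form the covariances, which is exactly the batch-processing cost discussed next.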

The main drawback of using this algorithm in real-time applications lies in its time and space complexity. Indeed, computing the covariance matrices (step 1 in Algorithm 1) takes $O(d^2(t_1 + t_2))$ time, followed by inversion of a $d \times d$ matrix, which takes $O(d^3)$ time, and then finding the eigenvalues and eigenvectors, which takes $O(d^2)$. The total time complexity is thus $O(d^2(t_1 + t_2 + d))$, and the required space (memory) complexity is $O(d(t_1 + t_2))$, where typically $d \ll t_i$ (due to the EEG's high sampling rate and its continuous operation). This dependency on the number of time samples eliminates the possibility of using CSP in real-time streaming. Our coreset-based algorithm has a fixed time and space complexity of $O(d^2)$, allowing real-time streaming applications for newly added samples. We summarize these differences in Table 1.

Table 1: Complexity bound, coreset versus traditional CSP.

| Algorithm | Time | Space |
| --- | --- | --- |
| Traditional CSP | $O(d^2(t_1 + t_2 + d))$ | $O(d(t_1 + t_2))$ |
| Coreset CSP | $O(d^2)$ | $O(d^2)$ |

Past attempts have been made to reduce this computational cost. For example, Zhao et al. (2008) proposed an incremental way to update the spatial filters when new samples are introduced to the algorithm, i.e., the feature extraction process is performed in an on-line fashion. Ross et al. (2008) proposed a new incremental learning algorithm for principal component analysis (PCA) with a forgetting factor, which is essentially an adaptation of the algorithm presented in Levey and Lindenbaum (2000).

More generally, suppose $X \in \mathbb{R}^{d \times t}$ represents a set of sample points, where $d$ is the number of features and $t$ is the number of samples, and let $q \in Q$ be a query from a set of queries, or family of models, whose fitting cost $f(X, q)$ we wish to optimize, as in Eq. (1). The methods above have one or all of the following disadvantages:

1. Running time. The running time required to minimize $f(X, q)$ may be impractical. A possible solution is to use faster heuristics with no provable approximations, but the cost may be a weaker classifier. In the context of EEGs, in many applications the signal from the brain is received in real-time, and the model must be updated in a fraction of a second.

2. Memory. Memory management issues arise for large signals that either cannot fit into memory (RAM), or can fit into memory but are too large to be processed by the optimization algorithm. In the context of EEGs, the memory (signal length) increases because of the high number of channels, the sampling frequency, and possibly the number of users.

3. On-line learning. Data points received on the fly from the signals are classified but are not used to update the model (i.e., learning) to improve classification over time. In the context of EEGs, we may get feedback from the user or the real world regarding the last classification, and we may wish to use the new information for the next samples.

4. Parallel computation. Even if the algorithm is efficient, when using distributed data it may not be clear how to run it over a few parallel computation threads in order to reduce its running time and take advantage of the computation power available with modern multi-core CPUs or GPUs. In the context of EEGs, the input is also parallel when it is received from either a few users or a few BCI channels of the same user.

5. Distributed computation. Even if the algorithm supports parallel computation, it may not support distributed computation, where the input signal itself is partitioned between different machines (cloud, devices) or threads, as in GPUs, that have no shared memory and between which communicating data might be expensive. In the context of EEGs, each user may be connected to a different computation device, but we aim for a single classifier. Similarly, when the signal is streamed to a cloud of machines, we need parallel computation with little shared memory via network communication, which is relatively expensive and slow.

6. Dynamic data. Even on-line or parallel algorithms are typically unable to handle deletion of samples from the signals. In the context of EEGs, this is the case when we use the sliding window model. Here, we want the classifier to represent only samples from the last $t$ seconds. That is, every time a new sample point arrives, the sample received $t$ seconds ago should be deleted. The classifier is then the one that minimizes $f(X, q)$, where $X$ is only the set of remaining samples.

7. Handling variations and constraints. In practice and in industry, we usually have constraints, business rules and expert rules that are based on very specific scenarios, signals, laws, users or applications. For example, we may want to minimize $f(X, q) + \|q\|$ instead of $f(X, q)$, where $\|q\|$ is a regularization term used to obtain a simpler classifier, or we may want to minimize $f(X, q)$ under some specific constraints $q \in Q' \subset Q$.


Fig. 1. CSP weight representation using Coresets. The leaves of the tree represent CSP algorithm results (i.e., weights) computed on an individual sample, a window or a single EEG trial. The vertices of the tree represent a ‘‘coreset CSP’’ computed on all samples up to that point (a unification of coresets in the proposed algorithm as opposed to recomputation of the CSP algorithm).

In this work, we present an alternative approach for optimizing CSP algorithm learning and its variants by using a coreset representation of the data. This approach satisfies the aforementioned requirements, i.e., it allows on-line learning, constant memory consumption, computational efficiency, and parallel and distributed computation, while allowing not only addition but also deletion of data.

The rest of the paper is organized as follows: in Section 2, we present an introduction to coresets and provide a formulation of coresets for EEG. In Section 3, we present the proposed algorithm, its analysis and theorems. In Section 4, we compare the practical performance of the coreset-based CSP algorithm with the traditional algorithm on a well-known EEG dataset. Finally, in Section 5, we summarize the work and our conclusions.

2. Related work: Coresets for EEG real-time processing

The term coreset was coined by Agarwal, Har-Peled and Varadarajan in Agarwal et al. (2004). Initially, coresets improved the running time of many open problems in computational geometry (e.g., Agarwal and Procopiuc, 2000; Agarwal et al., 2002; Har-Peled, 2004; Feldman et al., 2007); see surveys in Agarwal et al. (2005), Czumaj and Sohler (2007) and Phillips (2016). Later, coresets were designed for obtaining the first PTAS or LTAS (polynomial/linear time approximation schemes) for more classic and graphic problems in theoretical computer science (Frahling and Sohler, 2005; Czumaj et al., 2005; Frahling et al., 2008; Buriol et al., 2007), and have appeared more recently under the name ''composable coresets'' (Indyk et al., 2014; Mirrokni and Zadimoghaddam, 2015; Aghamolaei et al., 2015). Coresets are usually used when we want to approximate large data using a simple model with relatively few parameters; they are used less in real-time systems like the ones in this paper. In particular, in projective clustering (Feldman et al., 2013b; Agarwal and Procopiuc, 2000; Har-Peled and Varadarajan, 2002; Agarwal and Mustafa, 2004), the model is a set of $k$ points, lines or subspaces with an appropriate fitting cost. This is also a common setting in machine learning (Feldman et al., 2016, 2011; Tsang et al., 2005; Lucic et al., 2016, 2015; Huggins et al., 2016). More applied research has been suggested, e.g., by Rus et al. (Feldman et al., 2016; Sung et al., 2012; Feldman et al., 2013; Rosman et al., 2014), Krause (Feldman et al., 2011; Lucic et al., 2016), Smola (Reddi et al., 2015) and Sochen (Feigin et al., 2011; Feldman et al., 2013a; Alexandroni et al., 2016) in image processing with applications for medicine. Improved techniques for using coresets for distributed data and low communication on the cloud, with both theoretical guarantees and experimental results, were recently suggested in data mining conferences, including Feldman et al. (2015) and Liberty (2013). Classical optimization techniques such as Frank–Wolfe (Clarkson et al., 2010) and semi-definite programming (Cohen et al., 2016) appear to produce deterministic and smaller types of coresets. In numerical linear algebra, coresets were suggested for matrix approximations (Drineas et al., 2006a,b; Dasgupta et al., 2008) using random projections called sketches. The first coresets for signal processing, with applications for GPS or video data, were suggested in Feldman et al. (2013), Rosman et al. (2014) and Sung et al. (2012). The first results for probabilistic databases have appeared recently (Munteanu et al., 2014; Huang et al., 2016).

In this work, we show that the coreset paradigm can improve the computations described in the paragraphs below, including provable guarantees of complexity in terms of training/inference time and memory, with applications to EEG. In particular, we demonstrate how coresets can be used to train a classifier in real-time for EEG signals. More details on coresets and the theoretical proofs for the computation models below can be found, e.g., in Har-Peled and Mazumdar (2004), Agarwal et al. (2005), Feldman and Langberg (2011) and Feldman et al. (2015).

As in the previous section, consider the problem of minimizing $f(X, q)$ over $q \in Q$, where $Q$ is a set of queries and $f$ is a function $f(X, q) \to \mathbb{R}^+$. In this paper, a coreset for this optimization problem, as in the coreset for CSP, is another set $C$ such that $f(C, q) = f(X, q)$ for every $q \in Q$. The fact that the coreset approximates the original data in the above sense is not sufficient to handle the computation models in the previous sections. What we need for these is a composable coreset construction. This means that the coreset construction satisfies two properties. First, the union of two coresets is a coreset. That is, coresets are mergeable in the sense that if $f(C_a, q)$ approximates $f(X_a, q)$ and $f(C_b, q)$ approximates $f(X_b, q)$, then $f(C_a \cup C_b, q)$ approximates $f(X_a \cup X_b, q)$ for every $q \in Q$. The second property is that we can compress a pair of coresets to obtain another coreset. Formally, we can compute a coreset $C'$ such that $f(C', q)$ approximates $f(C_a \cup C_b, q)$ for every $q \in Q$. Using these construction properties, we can build a coreset tree (see Fig. 1 for an example).


Running time. Let $s \geq 1$ be an integer such that, if the input set $X$ to the coreset construction algorithm is of size $|X| \leq s$, then the resulting coreset is of cardinality $|C| \leq s/2$. Assume that this construction takes $g(s)$ time. As shown in Fig. 1, we can now merge-and-reduce a dataset $X$ of size $n$ recursively to obtain its coreset under the above models. That is, we partition the input signal into subsets of samples, each of size $s/2$ (the leaves of the binary tree). In the next level of the tree, we take every union of coresets in a pair of leaves (which consists of $s$ points) and reduce it back to $s/2$ points. This takes $(2n/s) \cdot g(s)$ time overall for the merge-and-reduce steps, which is linear in $n$, even if our coreset construction takes, say, $g(n) = n^{10}$ time for an input of $|X| = n$ points.

Streaming data. Streaming data can be handled in a similar way, where the leaves arrive on the fly. Every set in a pair of leaves is reduced to a single coreset, and the pair of leaves is then deleted. Hence, there is no more than one pair of leaves at any time during the streaming. Similarly, whenever there are two inner nodes in a level of the tree, we reduce them to a single node in the next level up. At any given moment, there is at most a single coreset in each of the $O(\log n)$ levels of the tree for the $n$ points seen so far in the stream.
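The bucketing mechanics of this streaming merge-and-reduce tree can be sketched as follows. This is our own minimal illustration, assuming a generic `reduce_fn` that compresses any matrix of columns back down to a bounded number of columns (e.g., the SVD-based reduction used for EEG in Section 3); none of the names below come from the paper.

```python
import numpy as np

class StreamTree:
    """Merge-and-reduce over a stream: at most one coreset is kept per
    level, so memory is O(s log n) for the n samples seen so far."""
    def __init__(self, s, reduce_fn):
        self.s = s                # leaf size
        self.reduce_fn = reduce_fn
        self.levels = {}          # level -> the single pending coreset
        self.buffer = []          # raw samples waiting to form a leaf

    def add(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.s:
            self._carry(0, np.column_stack(self.buffer))
            self.buffer = []

    def _carry(self, level, coreset):
        # Like binary addition: two coresets on the same level are
        # merged, reduced, and promoted to the next level up.
        while level in self.levels:
            union = np.column_stack([self.levels.pop(level), coreset])
            coreset = self.reduce_fn(union)
            level += 1
        self.levels[level] = coreset

    def coreset(self):
        parts = list(self.levels.values())
        if self.buffer:
            parts.append(np.column_stack(self.buffer))
        return self.reduce_fn(np.column_stack(parts)) if parts else None

# Placeholder reduction: keep the scaled left singular vectors (U * S)
def svd_reduce(M):
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U * S

tree = StreamTree(s=8, reduce_fn=svd_reduce)
rng = np.random.default_rng(1)
for _ in range(1000):
    tree.add(rng.standard_normal(4))   # a 4-channel sample stream
C = tree.coreset()                     # compact summary of all 1000 samples
```

Because each sample participates in only $O(\log n)$ merge-and-reduce steps over the whole stream, the amortized update cost stays small even when `reduce_fn` itself is expensive.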

Distributed data. When the data is both streamed and distributed, say, to $M = 2$ machines, we assume that every second point is sent to the second machine, and the rest of the (odd) points are sent to the first machine. This can be done directly by the users, or from a main server. Each machine then independently computes the merge-and-reduce tree of its points, as explained in the previous paragraph. The speed of computation and streaming then grows linearly with $M$. This is known as ''embarrassingly parallel'' independent computation. When a coreset for the complete data is needed, a main server can collect the coreset of each tree on each machine. This requires communication with a main server; however, since each machine sends only the coreset of its data, only $O(s)$ bits are sent in parallel from the $M$ machines.

Dynamic computations. To support deletion of input points (as in the sliding window model above), we need to store the complete merge-and-reduce tree as a balanced 2–3 tree whose leaves are the input points (a single point for each leaf). Here, every inner node, which is the root of a sub-tree, contains the coreset for the leaves in this sub-tree. When a point (leaf) is deleted, we only need to update its $O(\log n)$ ancestors in the tree with their corresponding coresets. Recomputing these coresets takes $g(s) \cdot O(\log n)$ time per point insertion/deletion, which is only logarithmic in $n$, the number of points seen so far.

Real-time training as feedback. Our coreset allows real-time training of the model. Brain-related signals, such as EEGs, are generated by a user. Using the system's reaction to an update (a.k.a. inference) not only allows the system to ''learn'' the human participant, but also allows the user to learn the system. The user can thus aim his thoughts so as to best control the system, shortening and updating the training phase.

3. Proposed algorithm

A full description of the algorithm flow is provided in this section, along with its graphical representation (see Fig. 2). As can be seen in Fig. 2, the input signal is streamed out, either from the database (i.e., a real-time simulation) or from the EEG headset. After the data is acquired, the EEG signal undergoes a preprocessing stage, during which (i) the signal is checked for various artifacts (such as eye blinks and loosened electrodes) and (ii) the signal is band-passed to frequencies containing MI information. In the next step, a coreset is fitted to the EEG data, which leads to a more compact representation of the signal. The CSP algorithm is applied to this compact representation to find the discriminative spatial filters. Finally, the last step of the algorithm is the classification of the MI task type (i.e., left or right hand movement). This step is performed using the Linear Discriminant Analysis (LDA) algorithm.

Fig. 2. Proposed system diagram. The additional process of using coresets to compress the EEG data is indicated by the red box.

Each of these steps is described in detail below. The EEG signal $X_{i,t}$ at time $t$ of each class $i \in \{0, 1\}$ is represented using a coreset in the following way. For each new time sample (a processing window or an experimental trial), the coreset compresses the signal so that its size is bounded by the number of electrode leads (i.e., sensors). When a new time sample enters the system, the current signal is represented by a $d \times (d + 1)$ matrix. In order to compress it back to $d \times d$, an SVD matrix decomposition is first applied, resulting in $U, S, V = \mathrm{svd}(X_{i,t})$, where $S$ is a diagonal matrix of singular values and $U, V$ are matrices whose columns are the singular vectors. Let $Y = US$; then $\|X\|^2 = \|YV\|^2$, and $(YV)(YV)^T = YVV^TY^T = (US)(US)^T$, hence $\|X\|^2 = \|U \cdot S\|^2$. We then update the coreset representation of the signal to be $C_{i,t} = U \cdot S$, a $d \times d$ matrix, and repeat this procedure for each incoming sample (or processing window). The time complexity of adding a sample to the coreset is $O(d^3)$ and the space complexity is $O(d^2)$, where $d$ is typically very small. See Algorithm 2 for more details.

Data: $C$, an empty matrix or a coreset representation $C \in \mathbb{R}^{d \times d}$; and $x \in \mathbb{R}^d$, a multivariate time sample of a signal, where $d$ is the number of electrodes or sensors.
Result: An updated, per-sample coreset representation of the input signal.
Function EEGSignalCoreset($C$, $x$):
1. $C \leftarrow C \| x$ (append $x$ as a new column of $C$)
2. if the number of columns of $C$ exceeds $d$:
3.   $U, S, V = \mathrm{svd}(C)$
4.   $C \leftarrow U \cdot S$
5. Return $C$
Algorithm 2: EEG signal coreset construction
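In NumPy, Algorithm 2 is only a few lines. The sketch below is our own rendering (the function name is ours), using `numpy.linalg.svd` for the decomposition:

```python
import numpy as np

def eeg_signal_coreset(C, x):
    """Algorithm 2: append sample x as a new column of C and, once C
    has more than d columns, compress it back to d x d via C <- U S."""
    d = x.shape[0]
    C = x.reshape(d, 1) if C is None else np.column_stack([C, x])
    if C.shape[1] > d:
        U, S, _ = np.linalg.svd(C, full_matrices=False)
        C = U * S   # d x d; dropping the orthogonal V preserves ||w C|| = ||w X||
    return C

# Example: stream 500 samples of a 16-channel signal into a 16 x 16 coreset
rng = np.random.default_rng(2)
C = None
for _ in range(500):
    C = eeg_signal_coreset(C, rng.standard_normal(16))
```

Since the rows of $V$ are orthonormal, discarding $V$ changes nothing that CSP can observe: the invariant $CC^T = XX^T$ holds after every update, so every quadratic form $wCC^Tw^T$ equals $wXX^Tw^T$.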

Additionally, we show that it is possible to concatenate a window of samples to a coreset.

Lemma 1. For every $w \in \mathbb{R}^d$:
$$\left\|w \begin{bmatrix} U \cdot S & x_{n+1} \end{bmatrix}\right\|^2 = \left\|w \begin{bmatrix} U \cdot S & x_{n+1} \end{bmatrix} \begin{bmatrix} V & 0 \\ 0 & 1 \end{bmatrix}\right\|^2 = \left\|w \begin{bmatrix} U \cdot S \cdot V & x_{n+1} \end{bmatrix}\right\|^2 = \left\|w \begin{bmatrix} X_{1,2,\ldots,n} & x_{n+1} \end{bmatrix}\right\|^2,$$
where $[X_{1,2,\ldots,n} \;\, x_{n+1}]$ is the concatenation of the matrix $X_{1,2,\ldots,n} \in \mathbb{R}^{d \times n}$ of samples $1$ to $n$ with the vector $x_{n+1}$ (which represents sample $n+1$), $S \in \mathbb{R}^{d \times d}$ is the diagonal matrix of singular values, $V$ is the matrix of right singular vectors (so that $X_{1,2,\ldots,n} = U \cdot S \cdot V$), and $[U \cdot S \;\, x_{n+1}]$ is the concatenation of samples $1$ to $n$ after SVD compression with the vector $x_{n+1}$ from sample $n+1$.
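As a quick sanity check (our own illustration, not from the paper), the identity in Lemma 1 can be verified numerically: compressing the first $n$ samples to $U \cdot S$ and then appending a new sample preserves $\|w[\cdot]\|^2$ for any $w$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 8, 200
X = rng.standard_normal((d, n))          # samples 1..n
x_new = rng.standard_normal((d, 1))      # sample n+1
U, S, _ = np.linalg.svd(X, full_matrices=False)
for _ in range(5):
    w = rng.standard_normal(d)
    lhs = np.linalg.norm(w @ np.column_stack([U * S, x_new])) ** 2
    rhs = np.linalg.norm(w @ np.column_stack([X, x_new])) ** 2
    assert np.isclose(lhs, rhs)          # Lemma 1: the norms agree for every w
```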

3.1. Common spatial patterns and coresets

The coreset signals $C_{i,t}$ of both classes $i = 1, 2$ are used, at every new sample, to maximize the following equation:

$$w = \arg\max_{w} \frac{\|w C_{1,t}\|^2}{\|w C_{2,t}\|^2}, \tag{2}$$

where $w$ is equivalent to that of Eq. (1) for the real-time process. Using the coreset signal, we are able to compute the CSP component much faster and using fewer samples. Since $S_i$ is a diagonal matrix and $V_i$ is a real unitary matrix,

$$R_i = (U_i S_i V_i)(U_i S_i V_i)^T = U_i S_i V_i V_i^T S_i^T U_i^T = U_i S_i^2 U_i^T,$$

so the covariance matrix is $R_1 = U_1 \cdot S_1^2 \cdot U_1^T$, where $S_1^2$ only needs to be computed on the main diagonal, since $S_1$ is a diagonal matrix, and $R_2^{-1} = \left(U_2 \cdot S_2^2 \cdot U_2^T\right)^{-1} = U_2 \cdot S_2^{-2} \cdot U_2^T$. For $S_2^{-2}$, we again simply compute on the main diagonal, since $S_2$ is a diagonal matrix; and since $U_2$ is a unitary and real matrix, $U_2^{-1} = U_2^T$. The problem is thus reduced to computing

$$R_2^{-1} R_1 = U_2 S_2^{-2} U_2^T \, U_1 S_1^2 U_1^T.$$

The complexity bound of adding a sample is determined by Algorithm 2. The time complexity of calculating the covariance matrix and inverting it is $O(d^2)$, because $S_i$ is a diagonal matrix and $U_i$ is a unitary and real matrix. Finding the eigenvalues and eigenvectors takes $O(d^3)$ time, resulting in a time complexity of $O(d^3)$ and a space complexity of $O(d^2)$. See Algorithm 3 (classification scheme) for additional details.

Fig. 1, presented earlier in this paper, graphically depicts the advantage of using coresets for CSP computation. Each leaf in the graph represents a spatial filter (i.e., ''$w$'') computed by the CSP algorithm. When an additional trial (or sample) is presented to the system, the traditional CSP computation requires re-computation over all of the previous EEG trials along with the new one, whereas the ''coreset-based CSP'' only requires the unification of two coresets (i.e., two leaves in the tree, as seen in Fig. 1). Additionally, this coreset representation allows parallel computation and the removal of unwanted or bad trials, or even a group of trials, from the CSP filters without recomputing the CSP algorithm on the remaining data.

Data: $C_1 \in \mathbb{R}^{d \times d}$, a coreset of $X_1$; $C_2 \in \mathbb{R}^{d \times d}$, a coreset of $X_2$; $x \in \mathbb{R}^d$, a multivariate time sample of a signal, where $d$ is the number of electrodes or sensors; and $y \in \{0, 1\}$, the label of the current sample.
Result: component $w^T$ that maximizes the ratio between $X_1$ and $X_2$; see Eq. (2).
Function ComputeW($C_1$, $C_2$, $x$, $y$):
if $y == 1$ then
  $C_1 \leftarrow$ EEGSignalCoreset($C_1$, $x$)
else
  $C_2 \leftarrow$ EEGSignalCoreset($C_2$, $x$)
end
$w$ = CSP($C_1$, $C_2$)
Return $w$
Algorithm 3: CSP on the coreset representation
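Reusing the `csp` and `eeg_signal_coreset` sketches defined above, Algorithm 3 becomes a short dispatch function; this is again our own illustrative rendering. Note that normalizing the covariances by the number of coreset columns instead of $t_i$ inside `csp` only rescales the eigenvalues of $R_2^{-1} R_1$, not the maximizing eigenvector $w$.

```python
import numpy as np

def compute_w(C1, C2, x, y):
    """Algorithm 3: add the labeled sample x to the coreset of its
    class, then run batch CSP on the two d x d coresets (Eq. (2))."""
    if y == 1:
        C1 = eeg_signal_coreset(C1, x)
    else:
        C2 = eeg_signal_coreset(C2, x)
    return csp(C1, C2), C1, C2

# Example: a labeled 16-channel stream (coresets pre-seeded for both classes)
rng = np.random.default_rng(4)
C1, C2 = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
for _ in range(100):
    x, y = rng.standard_normal(16), int(rng.integers(0, 2))
    w, C1, C2 = compute_w(C1, C2, x, y)   # refreshed filter after every sample
```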

3.2. Classification scheme

A common classifier for BCI and MI tasks is Linear Discriminant Analysis (LDA), a.k.a. Fisher's discriminant analysis method (Pfurtscheller et al., 2000). Choosing LDA to handle BCI tasks has several advantages over using what are sometimes considered more powerful algorithms. In this section, we lay out our main considerations for selecting the LDA algorithm as the final decision maker in the proposed processing pipeline. As previously stated, our aim is to develop a real-time application that will work on devices with limited computational power and memory. Thus, the selected classifier must have (i) the ability to generalize from a small amount of data during training, and (ii) relatively short training and testing times. These two criteria rule out most state-of-the-art algorithms, including:

1. Linear and non-linear Support Vector Machines (SVM) (Cortes and Vapnik, 1995). This algorithm requires solving a constrained quadratic problem during the training stage, which can be accomplished either using quadratic optimization methods (for a relatively small number of parameters) or using Lagrange multipliers (for a large number of parameters). Both methods require large amounts of memory and processing power. In addition, an exhaustive parameter search is required during the training stage.

2. Artificial Neural Networks (ANN), i.e., ''shallow networks'', and Deep Neural Networks (DNN) (LeCun et al., 2015). Both of these algorithms require either a sophisticated feature extraction process (as is the case with ''shallow networks'') or a significant number of training samples in order to learn features from the data. Additionally, both algorithms use an iterative training process to converge to a solution. This places significant computational and memory demands on the system, making the training process problematic to apply in real-time, especially on low-resource devices.

3. Decision Trees (DT) (Quinlan, 1986). Although training decision trees is relatively straightforward due to its greedy fashion, a large dataset is still required in order to learn the distribution of the data. Additionally, a small change in the training data can cause a big change in the resulting decision tree. These factors make this method less suitable for on-line learning.

4. Other variants of discriminant analysis methods, such as Quadratic Discriminant Analysis (QDA) and Regularized Discriminant Analysis (RDA) (Friedman, 1989). Generally, discriminant analysis classification methods rely on the idea that different classes generate data based on different (Gaussian) probabilities. While LDA assumes the same distribution for both classes, when the assumption of a common covariance matrix is not satisfied, an individual covariance matrix for each group is usually used. This leads to the so-called quadratic discriminant analysis (QDA), as the discriminating boundaries are quadratic curves. There is also an intermediate method between LDA and QDA, which is a regularized version of discriminant analysis (RDA) proposed by Friedman (1989). Since the classifier is personalized for each individual, it is reasonable to assume that the same distribution parameters can be used for both classes and that there is no need for regularization.


LDA is a generalization of the ''classical'' Fisher's linear discriminant, frequently used in statistical pattern recognition for finding a linear combination of features that separates classes (Fisher, 1936; Martínez and Kak, 2001). Let $x$ be a feature vector and $y$ a known class label. LDA assumes that the conditional probability density functions $p(x \mid y = 1)$ and $p(x \mid y = 2)$ are normally distributed with mean and covariance parameters $(\mu_1, \Sigma)$ and $(\mu_2, \Sigma)$, respectively, where $\Sigma$ is Hermitian and invertible. It predicts using the Bayes optimal solution with a threshold $T$ on the log-likelihood ratio, deciding for the second class when $\Sigma^{-1}(\mu_2 - \mu_1) \cdot x > C$, where $C$ is a constant such that $C = \frac{1}{2}\left(T - \mu_1^T \Sigma^{-1} \mu_1 + \mu_2^T \Sigma^{-1} \mu_2\right)$. The prediction thus relies on the dot product $w x > C$, where $w = \Sigma^{-1}(\mu_2 - \mu_1)$.
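Concretely, training reduces to estimating $(\mu_1, \mu_2, \Sigma)$ and prediction to a single dot product. The sketch below is our own minimal rendering, assuming a pooled covariance estimate for the shared $\Sigma$ and an equal-prior threshold ($T = 0$); the function names are ours.

```python
import numpy as np

def lda_fit(F1, F2):
    """Fit binary LDA on feature matrices (rows = trials), assuming
    both classes share the covariance matrix Sigma."""
    mu1, mu2 = F1.mean(axis=0), F2.mean(axis=0)
    # Pooled covariance estimate for the shared Sigma
    Sigma = (np.cov(F1, rowvar=False) * (len(F1) - 1) +
             np.cov(F2, rowvar=False) * (len(F2) - 1)) / (len(F1) + len(F2) - 2)
    w = np.linalg.solve(Sigma, mu2 - mu1)   # w = Sigma^{-1} (mu2 - mu1)
    c = w @ (mu1 + mu2) / 2                 # threshold C for equal priors (T = 0)
    return w, c

def lda_predict(w, c, F):
    return (F @ w > c).astype(int) + 1      # class 1 or class 2

# Example on synthetic 4-dimensional CSP features
rng = np.random.default_rng(5)
F1 = rng.standard_normal((40, 4)) + 1.0
F2 = rng.standard_normal((40, 4)) - 1.0
w, c = lda_fit(F1, F2)
print(lda_predict(w, c, F1[:3]), lda_predict(w, c, F2[:3]))
```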

4. Results

Hardware: a four-core i7 laptop with 8 GB RAM.

Input data: For our system evaluation, we used the motor imagery right/left hands task from the ''EEG Motor Movement/Imagery dataset'' that was created and contributed to Physionet (Goldberger et al., 2000) by the developers of the BCI2000 instrumentation system (Schalk et al., 2004). The dataset was recorded using 64 electrodes in a 10–10 system arrangement, with a sampling frequency of 160 Hz. The dataset includes 109 participants (107 of whom performed the selected task), with about 44 trials (depending on the artifact rejection process applied) for each MI task.

Pre-processing: During this step, several sub-routines are applied, as depicted in Fig. 2 (see the second stage). First, an artifact rejection process is applied in order to (i) remove eye blinks and (ii) detect loosened/noisy electrodes. Second, the signal is band-passed to 0.5–8 Hz in order to focus on the delta and theta frequencies, which are known to be related to sensorimotor activity (Cruikshank et al., 2012).

Evaluation: The goal of the experiment was to compare the time, memory consumption and accuracy of traditional CSP versus coreset-based CSP, computed at each time sample. To evaluate our method, we compare the results with a traditional CSP-based MI-BCI using the following criteria:

1. First, we show that the $w^T$ component in both methods reaches the same solution.
2. Next, a visualization of the four best CSP components is compared.
3. We then compare the classification accuracy.
4. Finally, we present the time and memory allocation used by both methods.

In order to show that the $w^T$ component reaches the same solution, we calculated the ratio described by Eq. (3). Essentially, this is the ratio between two CSP results (i.e., the result using traditional CSP and the result using coreset-based CSP):

$$\frac{\|w X_1\|^2}{\|w X_2\|^2} \Big/ \frac{\|\tilde{w} X_1\|^2}{\|\tilde{w} X_2\|^2}, \tag{3}$$

where $\tilde{w}$ is calculated using the ''coreset-based CSP'' and $w$ using the traditional (batch) CSP algorithm.

Measurements: The aforementioned ratio (Eq. (3)) was calculated per sample in order to simulate real-time data acquisition conditions. The blue line in Fig. 4(a) indicates a stable ratio value of 1 (i.e., no error).

Fig. 3. Four best CSP components for discrimination between left and right imaginary hand movements. Top: coreset CSP. Bottom: traditional CSP.

In Fig. 3, we visualize the four largest components computed by each method. The top row shows the coreset-based CSP and the bottom row shows the traditional CSP, with the highest component located on the left. It can be seen that the selected CSP weights are the same.

In order to compare the classification results, we cross-validated the dataset using 5-fold cross-validation, in which 9 trials were selected for the training set. The rest, approximately 35 trials (depending on the specific participant), were used as the testing set. Table 2 summarizes the classification results, averaged across all participants.

Fig. 4. (a) The error ratio between the traditional and coreset-based CSP algorithms according to Eq. (3); the ratio is constant and equal to 1. (b) The time required for computation as a function of the number of samples: traditional CSP in blue, coreset CSP in red. (c) The memory requirements of both algorithms: traditional CSP in blue, coreset CSP in red.


Table 2: Tables (a) and (b) present the confusion matrices of the traditional CSP-based and coreset-based algorithms, respectively. Table (c) summarizes the average classification results of both algorithms.

The confusion matrices in Tables 2(a) and 2(b) show both type I and type II errors. The rows of the matrices indicate the actual results and the columns the results predicted by the algorithm. The confusion matrix in (a) presents the classification results based on the traditional CSP-based algorithm. The confusion matrix in (b) presents the classification results computed from the EEG data approximated using a coreset (i.e., the coreset-based CSP method). Table 2(c) compares the overall average classification of both algorithms. It can be seen that the classification result is not damaged by the coreset approximation.

Fig. 4(b) presents the computation time and Fig. 4(c) the memory consumption of both methods. It can be seen that the coreset-based algorithm is superior in terms of memory allocation from the outset and maintains the same level over time (as opposed to the traditional CSP-based BCI algorithm). In addition, it can be seen that the coreset-based algorithm is computationally efficient and maintains the same computation time regardless of the number of samples from the signal that are used.

5. Conclusions

We show that coresets can indeed be used to learn EEG signals in real-time and with dynamic data by applying existing algorithms to them. Our theoretical and experimental results demonstrate that this can be done with 64 channels and hundreds of time samples per second, without compromising the accuracy of the system. The most significant added value of the proposed coreset-based CSP is its ability to enable on-line, incremental updates of the training data without straining computational resources. This is achieved through the introduction of a coreset-based compact representation of the feature extraction process (i.e., the computation of CSP), which allows (i) parallel computation, (ii) addition of new data and (iii) removal (in real-time) of bad trials or outliers from the system. Although this is not the only method capable of incrementally updating the CSP features, as mentioned in the introduction, it is the only one that satisfies all three of these advantages.

A real-time EEG system based on cheap hardware is valuable for immediate interactive systems such as those used in neurofeedback settings. In the neurofeedback context, the immediate response of a real-time EEG system can allow subjects to learn the system's behavior and ''teach'' themselves how to control their thoughts to improve the system's output. These capabilities can shorten the training period of EEG tasks and result in better, more adaptive and personalized systems. Additionally, we show that by using a coreset approximation of the EEG signal, cheaper (i.e., equipped with less memory and computational power) or low-powered hardware can be used for training and running BCI systems.

References

Agarwal, P.K., Har-Peled, S., Varadarajan, K.R., 2004. Approximating extent measures of points. J. ACM 51 (4), 606–635.
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R., 2005. Geometric approximations via coresets. Combin. Comput. Geom. - MSRI Publ. 52, 1–30.
Agarwal, P.K., Mustafa, N.H., 2004. k-means projective clustering. In: Proc. 23rd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (PODS), pp. 155–165.
Agarwal, P.K., Procopiuc, C.M., 2000. Approximation algorithms for projective clustering. In: Proc. 11th Annu. ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 538–547.
Agarwal, P.K., Procopiuc, C.M., Varadarajan, K.R., 2002. Approximation algorithms for k-line center. In: Proc. 10th Annu. European Symp. on Algorithms (ESA). Lecture Notes in Computer Science, vol. 2461, Springer, pp. 54–63.
Aghamolaei, S., Farhadi, M., Zarrabi-Zadeh, H., 2015. Diversity maximization via composable coresets. In: Proceedings of the 27th Canadian Conference on Computational Geometry.
Alexandroni, G., Moreno, G.Z., Sochen, N., Greenspan, H., 2016. Coresets versus clustering: comparison of methods for redundancy reduction in very large white matter fiber sets. In: SPIE Medical Imaging. International Society for Optics and Photonics, p. 97840A.
Ang, K.K., Chin, Z.Y., Zhang, H., Guan, C., 2008. Filter bank common spatial pattern (FBCSP) in brain-computer interface. In: IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence). IEEE, pp. 2390–2397.
Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., Muller, K.-R., 2008. Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Process. Mag. 25 (1), 41–56.
Buriol, L.S., Frahling, G., Leonardi, S., Sohler, C., 2007. Estimating clustering indexes in data streams. In: Proc. 15th Annu. European Symp. on Algorithms (ESA). Lecture Notes in Computer Science, vol. 4698, Springer, pp. 618–632.
Clarkson, K.L., 2010. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms (TALG) 6 (4), 63.
Cohen, M.B., Lee, Y.T., Miller, G., Pachocki, J., Sidford, A., 2016. Geometric median in nearly linear time. In: Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing. ACM, pp. 9–21.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297.
Cruikshank, L.C., Singhal, A., Hueppelsheuser, M., Caplan, J.B., 2012. Theta oscillations reflect a putative neural mechanism for human sensorimotor integration. J. Neurophysiol. 107 (1), 65–77.
Czumaj, A., Ergün, F., Fortnow, L., Magen, A., Newman, I., Rubinfeld, R., Sohler, C., 2005. Approximating the weight of the Euclidean minimum spanning tree in sublinear time. SIAM J. Comput. 35 (1), 91–109.
Czumaj, A., Sohler, C., 2007. Sublinear-time approximation algorithms for clustering via random sampling. Rand. Struct. Algorithms (RSA) 30 (1–2), 226–256.
Dasgupta, A., Drineas, P., Harb, B., Kumar, R., Mahoney, M.W., 2008. Sampling algorithms and coresets for lp-regression. In: Proc. 19th Annu. ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 932–941.
Dornhege, G., Blankertz, B., Curio, G., Muller, K.-R., 2004. Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms. IEEE Trans. Biomed. Eng. 51 (6), 993–1002.

Dornhege, G., Blankertz, B., Krauledat, M., Losch, F., Curio, G., Müller, K.-R., 2006. Optimizing spatio-temporal filters for improving brain-computer interfacing. In: Advances in Neural Information Processing Systems, pp. 315–322.
Drineas, P., Kannan, R., Mahoney, M.W., 2006a. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM J. Comput. 36 (1), 132–157.
Drineas, P., Mahoney, M.W., Muthukrishnan, S., 2006b. Sampling algorithms for l2 regression and applications. In: SODA.
Feigin, M., Feldman, D., Sochen, N., 2011. From high definition image to low space optimization. In: International Conference on Scale Space and Variational Methods in Computer Vision. Springer, pp. 459–470.
Feldman, D., Faulkner, M., Krause, A., 2011. Scalable training of mixture models via coresets. In: Advances in Neural Information Processing Systems (NIPS), pp. 2142–2150.
Feldman, D., Feigin, M., Sochen, N., 2013a. Learning big (image) data via coresets for dictionaries. J. Math. Imaging Vis. 46 (3), 276–291.
Feldman, D., Langberg, M., 2011. A unified framework for approximating and clustering data. In: Proc. ACM Symp. on Theory of Computing (STOC). See http://arxiv.org/abs/1106.1379 for the full version.
Feldman, D., Monemizadeh, M., Sohler, C., 2007. A PTAS for k-means clustering based on weak coresets. In: Proc. 23rd ACM Symp. on Computational Geometry (SoCG).
Feldman, D., Schmidt, M., Sohler, C., 2013b. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, pp. 1434–1453.
Feldman, D., Sugaya, A., Sung, C., Rus, D., 2013. iDiary: from GPS signals to a text-searchable diary. In: Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems. ACM, p. 6.
Feldman, D., Tassa, T., 2015. More constraints, smaller coresets: constrained matrix approximation of sparse big data. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'15). ACM, pp. 249–258.
Feldman, D., Volkov, M., Rus, D., 2016. Dimensionality reduction of massive sparse datasets using coresets. In: Advances in Neural Information Processing Systems (NIPS).
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Hum. Genet. 7 (2), 179–188.
Frahling, G., Indyk, P., Sohler, C., 2008. Sampling in dynamic data streams and applications. Internat. J. Comput. Geom. Appl. 18 (1/2), 3–28.
Frahling, G., Sohler, C., 2005. Coresets in dynamic geometric data streams. In: Proc. 37th Annu. ACM Symp. on Theory of Computing (STOC), pp. 209–217.
Friedman, J.H., 1989. Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 (405), 165–175.
Fukunaga, K., 2013. Introduction to Statistical Pattern Recognition. Academic Press.
Goldberger, A.L., Amaral, L.A., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E., 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), e215–e220.
Har-Peled, S., 2004. Clustering motion. Discrete Comput. Geom. 31 (4), 545–565.
Har-Peled, S., Mazumdar, S., 2004. On coresets for k-means and k-median clustering. In: STOC.
Har-Peled, S., Varadarajan, K.R., 2002. Projective clustering in high dimensions using coresets. In: Proc. 18th ACM Symp. on Computational Geometry (SoCG), pp. 312–318.
Huang, L., Li, J., Phillips, J.M., Wang, H., 2016. Epsilon-kernel coresets for stochastic points. In: 24th Annual European Symposium on Algorithms (ESA 2016). LIPIcs, vol. 57, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 50:1–50:18.
Huggins, J., Campbell, T., Broderick, T., 2016. Coresets for scalable Bayesian logistic regression. In: Advances in Neural Information Processing Systems (NIPS), pp. 4080–4088.
Indyk, P., Mahabadi, S., Mahdian, M., Mirrokni, V.S., 2014. Composable core-sets for diversity and coverage maximization. In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, pp. 100–108.
Koles, Z.J., 1991. The quantitative extraction and topographic mapping of the abnormal components in the clinical EEG. Electroencephalogr. Clin. Neurophysiol. 79 (6), 440–447.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.
Legendre, P., Fortin, M.J., 1989. Spatial pattern and ecological analysis. Vegetatio 80 (2), 107–138.
Levey, A., Lindenbaum, M., 2000. Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Trans. Image Process. 9 (8), 1371–1374.
Liberty, E., 2013. Simple and deterministic matrix sketching. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 581–588.
Lucic, M., Bachem, O., Krause, A., 2016. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1–9.
Lucic, M., Ohannessian, M.I., Karbasi, A., Krause, A., 2015. Tradeoffs for space, time, data and risk in unsupervised learning. In: AISTATS.
Martínez, A.M., Kak, A.C., 2001. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 23 (2), 228–233.
Mirrokni, V., Zadimoghaddam, M., 2015. Randomized composable core-sets for distributed submodular maximization. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM, pp. 153–162.
Munteanu, A., Sohler, C., Feldman, D., 2014. Smallest enclosing ball for probabilistic data. In: Proceedings of the Thirtieth Annual Symposium on Computational Geometry. ACM, p. 214.
Pfurtscheller, G., Guger, C., Ramoser, H., 1999. EEG-based brain-computer interface using subject-specific spatial filters. In: Engineering Applications of Bio-Inspired Artificial Neural Networks, pp. 248–254.
Pfurtscheller, G., Neuper, C., Guger, C., Harkam, W., Ramoser, H., Schlogl, A., Obermaier, B., Pregenzer, M., 2000. Current trends in Graz brain-computer interface (BCI) research. IEEE Trans. Rehabil. Eng. 8 (2), 216–219.
Phillips, J.M., 2016. Coresets and sketches. arXiv preprint arXiv:1601.00617.
Quinlan, J.R., 1986. Induction of decision trees. Mach. Learn. 1 (1), 81–106.
Ramoser, H., Muller-Gerking, J., Pfurtscheller, G., 2000. Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans. Rehabil. Eng. 8 (4), 441–446.
Reddi, S.J., Póczos, B., Smola, A., 2015. Communication efficient coresets for empirical loss minimization. In: Conference on Uncertainty in Artificial Intelligence (UAI).
Rosman, G., Volkov, M., Feldman, D., Fisher III, J.W., Rus, D., 2014. Coresets for k-segmentation of streaming data. In: Advances in Neural Information Processing Systems (NIPS), pp. 559–567.
Ross, D.A., Lim, J., Lin, R.-S., Yang, M.-H., 2008. Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77 (1–3), 125–141.
Schalk, G., McFarland, D.J., Hinterberger, T., Birbaumer, N., Wolpaw, J.R., 2004. BCI2000: a general-purpose brain-computer interface (BCI) system. IEEE Trans. Biomed. Eng. 51 (6), 1034–1043.
Selim, S., Tantawi, M.M., Shedeed, H.A., Badr, A., 2018. A CSP\AM-BA-SVM approach for motor imagery BCI system. IEEE Access 6, 49192–49208.
Sung, C., Feldman, D., Rus, D., 2012. Trajectory clustering for motion prediction. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, pp. 1547–1552.
Tsang, I.W., Kwok, J.T., Cheung, P.-M., 2005. Core vector machines: Fast SVM training on very large data sets. J. Mach. Learn. Res. 6, 363–392.
Zhao, Q., Zhang, L., Cichocki, A., Li, J., 2008. Incremental common spatial pattern algorithm for BCI. In: IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence). IEEE, pp. 2656–2659.