Classification in high-dimensional spectral data: Accuracy vs. interpretability vs. model size

Andreas Backhaus, Udo Seiffert
Fraunhofer Institute for Factory Operation and Automation IFF, Sandtorstr. 22, 39106 Magdeburg, Germany

Neurocomputing 131 (2014) 15–22

Article history: Received 30 October 2012; received in revised form 4 August 2013; accepted 23 September 2013; available online 23 November 2013.

Abstract

In classification for data mining tasks, various aspects of accuracy, and often also of model size, have typically been considered so far; the aspect of interpretability is only beginning to gain general attention. This paper evaluates all three of these aspects within the context of several computational-intelligence-based paradigms for the classification of high-dimensional spectral data acquired by hyperspectral imaging and Raman spectroscopy. The focus is on state-of-the-art paradigms from a number of different concepts, such as prototype-based, kernel-based, and support-vector-based approaches. Since the application point of view is emphasized, three real-world datasets form the basis of the presented study.

Keywords: Learning Vector Quantization (LVQ); Radial Basis Function (RBF) networks; Support Vector Machines (SVM); Supervised Neural Gas (SNG); Hyperspectral imaging; Raman spectroscopy

1. Introduction

Hyperspectral imaging, a recent extension of traditional non-invasive spectroscopic analysis techniques (e.g. NIR spectroscopy), has paved the way to obtaining the biochemical constitution of inspected solid materials with the additional advantage of a two-dimensional spatial resolution [1,2]. For the examination of liquid samples, Raman spectroscopy has been shown to be a viable tool to gather information without extensive sample preparation [3]. Often the direct relationship between spectral information and biochemical target values or material category is not known in a closed mathematical form. In this case a machine learning approach is used to acquire an analysis model from reference data, a paradigm often referred to as 'soft sensor'; sensor data analysis becomes a pattern recognition task. Regarding pattern recognition and data mining in the acquired spectral data, computational-intelligence-based methods still provide powerful tools to cope with this kind of high-dimensional and complex data (see Fig. 1). From the computational intelligence point of view, recent developments in hyperspectral camera technology, with increasingly high resolution in both the spectral and the spatial domain, have led to high-dimensional input spaces and a large number of training vectors. Both aspects further motivate and demand computational-intelligence-based algorithms.


Besides unsupervised visualization and clustering, typically used to get a (first) graphical representation of the acquired spectral data, classification and multivariate regression are often required by the underlying application. Here, correspondingly labeled data are necessary. Since suitable wet-lab analyses to provide continuously valued reference data are typically expensive, frequently categorical labels are provided instead; this leads to a classification task. Industrial applications in product quality control and sorting also demand on-line classification at a low system cost. This classification task therefore has, in general, three sometimes conflicting objectives to address. The first objective is a classification model of high accuracy. The second objective is a classification model that is as small as possible, for quick calculation. A third objective is the restriction to the necessary information/features of the examined objects for the classification task at hand; in spectral data processing this means the restriction to the necessary spectral bands. This not only speeds up calculation but also leads to less expensive spectral sensor systems. Classification models therefore need to offer a certain degree of interpretability. Relevance profiles, for example, can indicate the importance of the used input variables, in this case the acquired spectral bands. Additionally, classification models should require little or no expert interference to tune model parameters, which could lead to biased, non-optimal decisions by the user. Keeping these requirements in mind, a number of computational intelligence paradigms appear to be particularly suitable. Among them are prototype-based neural networks, such as the Generalized Learning Vector Quantization (GLVQ) family [4], Supervised Neural Gas [5], and RBF (Radial Basis Function) networks [6], as well as Multilayer Perceptron (MLP) networks [7,8] and Support Vector Machines (SVM) [9]. These five approaches span the scope of the presented paper.


Fig. 1. Example of hyperspectral imaging cube: (A) Reflectance properties of objects can be recorded with spatial resolution; (B) each spatial pixel contains a spectral signature (normalized relative reflectance per spectral channel) that carries information about the chemical composition of the observed material.

The suitability of these different approaches for classifying data from the hyperspectral imaging domain as well as Raman spectral data is evaluated in terms of several theoretical considerations as well as practical aspects. In order to derive practically relevant information from this study, several real-world datasets are used.

2. Material and methods

2.1. Green coffee spectra

Quality control of coffee products, from basic green coffee to the finished roasted coffee, by hyperspectral imaging offers the means for a non-invasive, on-line, and automated screening method to control large product quantities [10,11]. For example, green coffee has to be inspected for the Robusta or Arabica varieties, since Arabica-based coffee is sold at a different price than Robusta-based coffee. The spatial resolution of hyperspectral imaging makes it the ideal tool for loose material sorting, especially in cases where information from color, shape, or texture is not sufficient for differentiation.

For the hyperspectral image acquisition, coffee beans of four different green coffee varieties (two varieties of Arabica and two varieties of Robusta) and a standard optical PTFE (polytetrafluoroethylene) calibration pad were positioned on a translation table, one class at a time. Hyperspectral images were recorded using a HySpex SWIR-320m-e line camera (Norsk Elektro Optikk A/S). Spectra are from the short-wave infrared (SWIR) range between 970 nm and 2500 nm at 6 nm resolution, yielding a 256-dimensional spectral vector per pixel. The camera line has a spatial resolution of 320 px and can be recorded with a maximum frame rate of 100 fps. Radiometric calibration was performed using the vendor's software package. Spectra are normalized to a vector length of one.

Coffee beans were segmented from the background via Neural Gas (NG) clustering [12]. We used five prototype vectors, each representing a cluster with a receptive field determined by the smallest Euclidean distance from a data sample v to the prototypes. The prototype spectra w are randomly initialized and updated by minimizing the following energy function [13]:

$$E(V, W) = \frac{1}{C(\gamma, K)} \sum_{v \in V} \sum_{w \in W} h_\gamma(r, v, W)\, d(v, w), \qquad (1)$$

with d being the Euclidean distance and where

$$h_\gamma(r, v, W) = \exp\left(-\frac{k_r(v, W)}{\gamma}\right) \qquad (2)$$

denotes the degree of neighborhood cooperation. The function k_r(v, W) gives the number of prototypes that have an equal or smaller distance to the input spectrum than prototype w_r, and C(γ, K) is a normalization constant depending on the neighborhood range γ and the cardinality K of W. Minimization was achieved with the freely available Matlab 'minFunc' optimization toolbox (http://www.di.ens.fr/~mschmidt/Software/minFunc.html) using the non-linear conjugate gradient approach with automatic step size. The cluster representing coffee was chosen through manual inspection, and all spectra in this cluster formed the respective coffee class. Fig. 2 depicts the clustering/segmentation process. The dataset contains the four green coffee varieties, forming a 4-class problem with 2000 spectra per class. Fig. 3A shows average spectra for the four green coffee classes.

Fig. 2. Segmentation of hyperspectral images using Neural Gas clustering: (A) the hyperspectral image as gray-value-coded image of the spectral band with highest contrast; (B) the labeled image from Neural Gas with five prototypes; (C) the prototype spectra acquired from minimizing the Neural Gas objective function; (D) pixels of the cluster representing coffee beans are used for the segmentation mask.
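To make the segmentation step concrete, the following NumPy sketch evaluates the energy of Eq. (1) with the neighborhood function of Eq. (2) and its batch gradient with respect to the prototypes; the ranks k_r are treated as locally constant, as usual for Neural Gas. Function and variable names are ours, and the paper minimizes this energy with minFunc's conjugate gradient rather than the plain gradient descent shown here.

```python
import numpy as np

def ng_energy_and_grad(W, V, gamma):
    """Energy E(V, W) of Eq. (1) and its gradient w.r.t. the prototypes
    W (K x D) for data V (N x D); ranks are treated as locally constant."""
    diff = V[:, None, :] - W[None, :, :]          # (N, K, D)
    D2 = (diff ** 2).sum(-1)                      # squared Euclidean distances
    ranks = D2.argsort(1).argsort(1)              # k_r(v, W): 0 = closest prototype
    H = np.exp(-ranks / gamma)                    # h_gamma of Eq. (2)
    C = H.sum()                                   # equals N * sum_r exp(-r / gamma)
    E = (H * D2).sum() / C
    grad = -2.0 * np.einsum('nk,nkd->kd', H, diff) / C   # d/dw of h * ||v - w||^2
    return E, grad

# Toy usage with hypothetical data: five prototypes, plain gradient descent.
rng = np.random.default_rng(0)
V = rng.random((200, 16))
W = V[rng.choice(len(V), 5, replace=False)].copy()
for _ in range(100):
    E, g = ng_energy_and_grad(W, V, gamma=1.0)
    W -= 0.5 * g
```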

2.2. Scotch whisky spectra

The automated, on-line assessment of high-priced liquor products is essential for standardization and quality monitoring in liquor production as well as for potential fraud detection. An ideal sensor should be compact for mobile applications and require no special sample preparation while measuring sample quality instantaneously. In [3] an optofluidic chip was presented that uses Raman spectroscopy to acquire a Raman spectrum of the fluid sample; the procedure to acquire the Raman spectra from whisky samples is shown in detail in [3].

In Raman spectroscopy a sample is illuminated with a laser beam. The laser light interacts with molecular vibrations, phonons, or other excitations in the system, resulting in the energy of the laser photons being shifted up or down. The shift in energy gives information about the vibrational modes in the system. Raman spectroscopy is commonly used in chemistry, since vibrational information is specific to the chemical bonds and symmetry of molecules; it therefore provides a fingerprint by which molecules can be identified.

Whisky samples of 20 μl were directly loaded into the microfluidic chip without any preparation. After Raman acquisition, any remaining liquid at the sample inlet was wiped off and 40 μl of deionized water rinsed the system. Raman excitation was performed with 200 mW of laser power at a wavelength of 785 nm. Six commercially available Scotch whisky brands and their variants were used to build the dataset. All available data were labeled according to their distillery of origin, resulting in a 6-class problem. For each class, 400 Raman spectra were available. Each dataset was scaled so that the maximum across spectral bands was one. Fig. 3B shows average spectra for three whisky classes with standard deviation.

2.3. Crisp bread spectra

Hyperspectral imaging offers the possibility to examine the spatial distribution and degree of coverage of food ingredients on the product surface in order to check it against the design reference. As an example we consider the segmentation problem of sunflower stones on crisp bread. Here a classifier of very high accuracy is needed to label each pixel with its respective class; subsequent processing steps could remove some classification errors but have to become more sophisticated the lower the classification accuracy. Crisp bread samples containing sunflower stones were placed along with a standard optical PTFE calibration pad on a translation table. Hyperspectral images were recorded using the same HySpex SWIR-320m-e line camera as for the coffee dataset. Dough, sunflower stones, and background material were manually marked and form a 3-class problem with 616 spectra per class. Radiometric calibration and spectra preprocessing were identical to the coffee dataset. Fig. 3C shows the mean spectra per class.

3. Theory


For machine learning, five different classification models are considered: a Radial Basis Function network with relevance learning (rRBF) [6,2], Generalized Relevance Learning Vector Quantization (GRLVQ) [4], Supervised Relevance Neural Gas (SRNG) [5], a Multilayer Perceptron (MLP) network, and a Support Vector Machine (SVM) [9]. rRBF, SRNG, and GRLVQ networks are similar in that they process the input data in a layer of prototypical data points. While the RBF generates activations according to the similarity with prototypes, which are accumulated in a second layer for the network output, the GRLVQ and SRNG directly assign classes to prototypical data points. Prototypes usually represent central positions in a data cloud. In contrast, Support Vector Machines store support vectors, i.e. representative data points at the margin between data clouds. The Support Vector Machine implementation used, the ν-SVM variant [14] from the freely available libSVM package (www.csie.ntu.edu.tw/~cjlin/libsvm/), takes up a variable number of support vectors.

In order to compute the distance between a spectral data point v and a prototype w_r in the rRBF, SRNG, and GRLVQ, we used the weighted Euclidean distance metric

$$d(v, w_r, \lambda) = \sum_i \lambda_i (v_i - w_{ir})^2, \qquad (3)$$


where λ_i is the relevance factor per spectral band, which is adapted during the learning process to form the relevance profile. The rRBF, SRNG, and GRLVQ learning approach is essentially an energy minimization problem. In the standard learning scheme, stochastic gradient descent with step sizes manually set for the different parameters is used. In order to avoid a manually chosen step size, we used the non-linear conjugate gradient approach with automatic step size from the optimization toolbox. For this purpose we had to provide the objective/energy function along with its first derivatives with respect to the optimization parameters. Normally, update rules based on the stochastic gradient approach are formulated for these methods [4,5,2]. We decided to give the full set of partial derivatives instead in this paper, so that readers can implement the methods with an optimization package of their choosing.
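A minimal sketch of the weighted distance of Eq. (3), under our own naming; the relevance profile lam is simply a non-negative weight per spectral band that the learning procedures below adapt alongside the prototypes.

```python
import numpy as np

def weighted_sq_dist(v, w, lam):
    """Weighted squared Euclidean distance d(v, w, lambda) of Eq. (3)."""
    return float(np.sum(lam * (v - w) ** 2))

# With a uniform profile this reduces to a scaled ordinary squared distance.
v, w = np.array([0.2, 0.5, 0.1]), np.array([0.1, 0.4, 0.3])
print(weighted_sq_dist(v, w, np.full(3, 1.0 / 3)))
```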

Fig. 3. Datasets and relevance profiles: (A)–(C) show the mean spectra per class for the different datasets used in this study (green coffee: Indian Robusta, Peru Arabica, Medelin Arabica, Mexico Robusta; whisky: Ardbeg, Edradour, Macallan; crisp bread: sunflower stone, dough, background); (D)–(L) depict the relevance profiles acquired with the rRBF, SRNG, and GRLVQ for the different classification tasks. GRLVQ and SRNG clearly acquired highly sparse profiles, while the rRBF profiles are flat or less specific.

A completely different concept is behind MLP networks. Here, each node of the hidden and output layer(s) represents a piece of the borderline between data clusters. For highly curved and rugged cluster shapes, this concept typically offers a close approximation of the borderlines with a rather low number of required nodes.

3.1. Radial basis function network with relevance

For the rRBF the objective function is the accumulated quadratic error between the network output y and target value t across network outputs and data samples v_j:

$$E(V, W, \lambda) = \frac{1}{2} \sum_j \sum_k \{y_k(v_j) - t_{jk}\}^2 \qquad (4)$$

with $y_k(v) = \sum_r u_{rk}\, \phi(d(v, w_r, \lambda))$ and $\phi(x) = \exp(-x / 2s^2)$. The partial derivatives are as follows:

$$\frac{\partial E}{\partial w_{ir}} = \sum_j \sum_k \{y_k(v_j) - t_{jk}\}\, u_{rk}\, \phi(d(v_j, w_r, \lambda))\, \frac{\lambda_i (v_{ji} - w_{ir})}{s_r^2} \qquad (5)$$

$$\frac{\partial E}{\partial s_r} = \sum_j \sum_k \{y_k(v_j) - t_{jk}\}\, u_{rk}\, \phi(d(v_j, w_r, \lambda))\, \frac{\sum_i \lambda_i (v_{ji} - w_{ir})^2}{s_r^3} \qquad (6)$$

$$\frac{\partial E}{\partial \lambda_i} = -\sum_j \sum_k \{y_k(v_j) - t_{jk}\} \sum_r u_{rk}\, \phi(d(v_j, w_r, \lambda))\, \frac{(v_{ji} - w_{ir})^2}{2 s_r^2}. \qquad (7)$$

The output weights u_rk are obtained by the direct update $U = \Phi^\dagger T$, where † denotes the pseudo-inverse [15]. For the classification task a 1-out-of-N coding scheme for the target vector was used.
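The following NumPy sketch evaluates the rRBF energy of Eq. (4) and the batch gradients of Eqs. (5)-(7) in vectorized form. Shapes and names are our own convention, not the authors' code; the paper feeds these quantities to minFunc instead of using them in a hand-written loop.

```python
import numpy as np

def rrbf_energy_and_grads(V, T, W, s, lam, U):
    """V (N,D) inputs, T (N,K) 1-out-of-N targets, W (R,D) prototypes,
    s (R,) widths, lam (D,) relevances, U (R,K) output weights."""
    diff = V[:, None, :] - W[None, :, :]                 # (N, R, D)
    d = (lam * diff ** 2).sum(-1)                        # Eq. (3), (N, R)
    Phi = np.exp(-d / (2.0 * s ** 2))                    # phi(d), (N, R)
    Y = Phi @ U                                          # network outputs y_k(v)
    E = 0.5 * ((Y - T) ** 2).sum()                       # Eq. (4)
    G = ((Y - T) @ U.T) * Phi                            # sum_k (y - t) u_rk phi, (N, R)
    gW = np.einsum('nr,nrd->rd', G, diff) * lam / (s ** 2)[:, None]   # Eq. (5)
    gs = (G * d).sum(0) / s ** 3                                      # Eq. (6)
    glam = -0.5 * np.einsum('nr,nrd->d', G / s ** 2, diff ** 2)       # Eq. (7)
    return E, gW, gs, glam
```

The output weights can then be refreshed in closed form from the pseudo-inverse update described above, e.g. `U = np.linalg.pinv(Phi) @ T`.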

3.2. Generalized relevance learning vector quantization

For the GRLVQ the objective function accumulates, over all data points, the relative difference between the shortest distance d_r+ to a prototype representing the data point's class and the shortest distance d_r− to a prototype representing any other class [4]:

$$E(V, W, \lambda) = \sum_{v \in V} \Phi\left(\frac{d_r^+ - d_r^-}{d_r^+ + d_r^-}\right). \qquad (8)$$

The partial derivatives are as follows:

$$\frac{\partial E}{\partial w_{ir}^+} = \xi^+\, \frac{\partial d(\cdot)}{\partial w_{ir}^+} \qquad (9)$$

$$\frac{\partial E}{\partial w_{ir}^-} = -\xi^-\, \frac{\partial d(\cdot)}{\partial w_{ir}^-} \qquad (10)$$

$$\frac{\partial E}{\partial \lambda_i} = \xi^+\, \frac{\partial d(\cdot)}{\partial \lambda_i} - \xi^-\, \frac{\partial d(\cdot)}{\partial \lambda_i} \qquad (11)$$

with

$$\xi^+ = \frac{2 d_r^-}{(d_r^+ + d_r^-)^2} \qquad (12)$$

$$\xi^- = \frac{2 d_r^+}{(d_r^+ + d_r^-)^2}. \qquad (13)$$

All partial derivatives not belonging to the winning prototype of the same class, w_r+, or of any other class, w_r−, are set to zero. The derivatives are accumulated over all data points (batch learning).
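As an illustration, the sketch below accumulates the batch cost of Eq. (8) (with Φ taken as the identity) and the winner-only gradients of Eqs. (9)-(11) using the ξ terms of Eqs. (12) and (13). The vectorization and the names are ours, not the authors' implementation.

```python
import numpy as np

def grlvq_cost_and_grads(V, y, W, c, lam):
    """V (N,D) data, y (N,) labels, W (R,D) prototypes,
    c (R,) prototype labels, lam (D,) relevance profile."""
    d = (lam * (V[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # Eq. (3), (N, R)
    same = y[:, None] == c[None, :]
    dp, rp = np.where(same, d, np.inf).min(1), np.where(same, d, np.inf).argmin(1)
    dm, rm = np.where(~same, d, np.inf).min(1), np.where(~same, d, np.inf).argmin(1)
    E = ((dp - dm) / (dp + dm)).sum()                          # Eq. (8), Phi = identity
    xi_p = 2.0 * dm / (dp + dm) ** 2                           # Eq. (12)
    xi_m = 2.0 * dp / (dp + dm) ** 2                           # Eq. (13)
    gW, glam = np.zeros_like(W), np.zeros_like(lam)
    for n in range(len(V)):                                    # winners only, batch accumulation
        ddp, ddm = V[n] - W[rp[n]], V[n] - W[rm[n]]
        gW[rp[n]] += xi_p[n] * (-2.0 * lam * ddp)              # Eq. (9)
        gW[rm[n]] -= xi_m[n] * (-2.0 * lam * ddm)              # Eq. (10)
        glam += xi_p[n] * ddp ** 2 - xi_m[n] * ddm ** 2        # Eq. (11)
    return E, gW, glam

# Toy call: 60 random 'spectra' in 3 classes, two prototypes per class.
rng = np.random.default_rng(1)
V, y = rng.random((60, 8)), rng.integers(0, 3, 60)
W, c = rng.random((6, 8)), np.repeat(np.arange(3), 2)
E, gW, glam = grlvq_cost_and_grads(V, y, W, c, np.full(8, 1.0 / 8))
```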

3.3. Supervised neural gas

The SRNG [5] is a supervised version of the well-known Neural Gas clustering algorithm [12]. As in the GRLVQ, a number of prototype vectors with pre-assigned class labels are distributed in the input space while minimizing the energy function

$$E(V, W, \lambda) = \frac{1}{C} \sum_{v \in V} \sum_{w_r \in W_c} h_\gamma(r, v, W_c)\, \Phi\left(\frac{d_r^+ - d_r^-}{d_r^+ + d_r^-}\right), \qquad (14)$$

where h_γ(r, v, W_c) denotes the degree of neighborhood cooperation, analogous to Eq. (2), among all prototypes W_c representing the respective spectral vector's class. The partial derivatives are as follows:

$$\frac{\partial E}{\partial w_{ir}^+} = \xi_r^+\, \frac{h_\gamma(r, v, W_c)}{C}\, \frac{\partial d(\cdot)}{\partial w_{ir}^+}, \qquad (15)$$

$$\frac{\partial E}{\partial w_{ir}^-} = -\sum_{w_r \in W_c} \xi_r^-\, \frac{h_\gamma(r, v, W_c)}{C}\, \frac{\partial d(\cdot)}{\partial w_{ir}^-}, \qquad (16)$$

$$\frac{\partial E}{\partial \lambda_i} = \sum_{w_r \in W_c} \frac{h_\gamma(r, v, W_c)}{C} \left(\xi_r^+\, \frac{\partial d(\cdot)}{\partial \lambda_i} - \xi_r^-\, \frac{\partial d(\cdot)}{\partial \lambda_i}\right).$$

The terms ξ_r^+ and ξ_r^− are according to Eqs. (12) and (13). All partial derivatives not updated are set to zero; the derivatives are accumulated over all data points (batch learning).

Energy function minimization for the rRBF, GRLVQ, and SRNG was achieved with the freely available Matlab 'minFunc' optimization toolbox (http://www.di.ens.fr/~mschmidt/Software/minFunc.html) using the non-linear conjugate gradient approach with automatic step size, thus eliminating the need for a manually set learning rate. The cumulative energy function and derivatives across the training set (batch learning) were provided to the software package using the formulas above.
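The only ingredient SRNG adds on top of the GRLVQ gradients is the rank-based neighborhood weighting over the correct-class prototypes. A minimal sketch, with the normalization constant C left to the caller as in Eq. (14):

```python
import numpy as np

def srng_neighborhood(d_same, gamma):
    """h_gamma(r, v, W_c) of Eqs. (2)/(14): rank the correct-class
    prototypes by their distance to v, then weight exponentially.
    d_same holds d(v, w_r, lambda) for all w_r in W_c."""
    ranks = d_same.argsort().argsort()    # k_r(v, W_c): 0 = closest
    return np.exp(-ranks / gamma)

print(srng_neighborhood(np.array([0.4, 0.1, 0.9]), gamma=2.0))
```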



3.4. Multilayer perceptron with conjugate gradient descent

The MLP network in this paper contains one hidden layer and one output layer. The output activation on presentation of an input is calculated by

$$y_k = f\left(\sum_{j=0}^{M} w_{kj}^{\mathrm{out}}\, g\left(\sum_{i=0}^{d} w_{ji}^{\mathrm{hidden}} v_i\right)\right), \qquad (17)$$

where f and g are the output functions of the output and hidden layers, respectively. The weights w_{k0} and w_{j0} contain the neurons' biases; consequently, their input values are set to one. The network is optimized so that it minimizes the following energy function, identical to that of the RBF network:

$$E(V, W) = \frac{1}{2} \sum_j \sum_k \{y_k(v_j) - t_{jk}\}^2. \qquad (18)$$

The optimization is performed with a conjugate gradient descent with momentum term. Weights are updated at time t with

$$\Delta w_{kj}^{\mathrm{out}}(t) = \alpha\, \Delta w_{kj}^{\mathrm{out}}(t-1) - \varepsilon (1-\alpha)\, \frac{\partial E(V, W)}{\partial w_{kj}^{\mathrm{out}}},$$

$$\Delta w_{ji}^{\mathrm{hidden}}(t) = \alpha\, \Delta w_{ji}^{\mathrm{hidden}}(t-1) - \varepsilon (1-\alpha)\, \frac{\partial E(V, W)}{\partial w_{ji}^{\mathrm{hidden}}},$$

where ε is the learning rate and Δw_{kj}^{out}(t−1) and Δw_{ji}^{hidden}(t−1) are the previous weight updates; α is the momentum term. The network output is coded according to a 1-out-of-N coding scheme. For the machine learning, the MLP of the Matlab Neural Network Toolbox was used with its standard settings for learning rate and momentum factor.
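The weight update with momentum is easy to state in code. Below is a sketch of a single step under our own naming; the paper relied on the Matlab Neural Network Toolbox rather than such a hand-rolled update.

```python
import numpy as np

def momentum_step(w, grad, prev_dw, eps=0.01, alpha=0.9):
    """One update: dw(t) = alpha * dw(t-1) - eps * (1 - alpha) * dE/dw."""
    dw = alpha * prev_dw - eps * (1.0 - alpha) * grad
    return w + dw, dw
```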

3.5. ν-Support vector classifier

The ν-SVC was introduced in [14] and realizes a soft-margin variant of the optimal hyperplane classifier, storing a number of support vectors closest to the hyperplane separating two classes. The introduced parameter ν provides bounds on the number of data samples that are support vectors and that lie on the wrong side of the hyperplane [16]. The ν-SVC implementation for a two-class problem solves the following optimization (maximization) problem:

$$E(\alpha) = -\frac{1}{2} \sum_{l,m} \alpha_l \alpha_m y_l y_m\, k(v_l, v_m) \qquad (19)$$

subject to

$$0 \le \alpha_l \le \frac{1}{N}, \qquad (20)$$

$$\sum_l \alpha_l y_l = 0, \qquad (21)$$

$$\sum_l \alpha_l \ge \nu, \qquad (22)$$

where y_l ∈ {−1, +1} is the decision for either class. One free parameter is the choice of the kernel k. In this paper we use the linear kernel k(v_l, v_m) = (v_l · v_m). We considered the RBF and polynomial kernels available in the libSVM implementation as well, but they produced classification models of significantly lower test accuracy. The optimization results in a decision function of the form

$$f(v) = \operatorname{sgn}\left(\sum_l \alpha_l y_l\, k(v, v_l) + b\right). \qquad (23)$$

Data samples with α_l > 0 are stored as support vectors; the model size, i.e. the number of support vectors, is therefore determined in the learning process. The parameter ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, i.e. the smallest expectable model size.
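Since scikit-learn's NuSVC is built on the same libSVM solver, the model-size behavior discussed above can be inspected in a few lines. The data below are random stand-ins, not the paper's spectra.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X = rng.random((400, 256))                 # hypothetical spectra, 256 bands
y = rng.integers(0, 4, size=400)           # hypothetical 4-class labels

clf = NuSVC(nu=0.01, kernel='linear').fit(X, y)
print(len(clf.support_))                   # model size = number of support vectors
```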

3.6. Machine learning setup

The rRBF, SRNG, or GRLVQ model is trained until the step size falls below a threshold, with a maximum number of allowed iterations and function evaluations. For the classification, the number of prototypes per class is varied in the SRNG and GRLVQ. To ensure similar model sizes, the number of rRBF prototypes is always the number of classes times the number of prototypes per class in the GRLVQ and SRNG. The model sizes were chosen empirically, with upper and lower bounds where accuracy declined or saturated, respectively, i.e. where any further increase in model size would not justify the accuracy gain.


Table 1. Test accuracies for the three classification tasks. Results are averaged across a 10-fold cross-validation with the standard deviation in brackets. The model size denotes the total number of prototypes (rRBF, GRLVQ, and SRNG), neurons in the hidden layer (MLP), or support vectors (SVM).

| Method | Green coffee (4 classes) Acc. | Size | Crispbread (3 classes) Acc. | Size | Whiskey (6 classes) Acc. | Size |
|---|---|---|---|---|---|---|
| rRBF | 0.433 (0.012) | 4 | 0.964 (0.018) | 3 | 0.782 (0.033) | 6 |
| | 0.901 (0.081) | 8 | 0.971 (0.011) | 6 | 0.930 (0.020) | 12 |
| | 0.962 (0.020) | 16 | 0.990 (0.006) | 12 | 0.958 (0.018) | 24 |
| | 0.978 (0.012) | 24 | 0.990 (0.009) | 18 | 0.956 (0.013) | 36 |
| | 0.986 (0.005) | 32 | 0.988 (0.009) | 24 | 0.959 (0.014) | 48 |
| | 0.984 (0.010) | 40 | 0.989 (0.010) | 30 | 0.959 (0.011) | 60 |
| RBF | 0.951 (0.022) | 32 | 0.989 (0.004) | 18 | 0.944 (0.013) | 60 |
| GRLVQ | 0.780 (0.032) | 4 | 0.957 (0.013) | 3 | 0.803 (0.031) | 6 |
| | 0.870 (0.017) | 8 | 0.983 (0.008) | 6 | 0.864 (0.010) | 12 |
| | 0.868 (0.016) | 16 | 0.986 (0.013) | 12 | 0.910 (0.012) | 24 |
| | 0.894 (0.014) | 24 | 0.989 (0.010) | 18 | 0.924 (0.015) | 36 |
| | 0.900 (0.011) | 32 | 0.987 (0.009) | 24 | 0.927 (0.015) | 48 |
| | 0.902 (0.011) | 40 | 0.987 (0.008) | 30 | 0.930 (0.013) | 60 |
| GLVQ | 0.843 (0.014) | 40 | 0.974 (0.014) | 18 | 0.909 (0.019) | 60 |
| SRNG | 0.799 (0.017) | 4 | 0.956 (0.008) | 3 | 0.801 (0.031) | 6 |
| | 0.834 (0.023) | 8 | 0.979 (0.009) | 6 | 0.855 (0.021) | 12 |
| | 0.888 (0.011) | 16 | 0.978 (0.007) | 12 | 0.905 (0.014) | 24 |
| | 0.905 (0.010) | 24 | 0.981 (0.006) | 18 | 0.932 (0.016) | 36 |
| | 0.913 (0.011) | 32 | 0.978 (0.008) | 24 | 0.930 (0.015) | 48 |
| | 0.909 (0.014) | 40 | 0.981 (0.013) | 30 | 0.944 (0.015) | 60 |
| SNG | 0.729 (0.050) | 32 | 0.977 (0.012) | 18 | 0.918 (0.018) | 60 |
| MLP | 0.570 (0.086) | 4 | 0.948 (0.019) | 3 | 0.498 (0.068) | 6 |
| | 0.648 (0.142) | 8 | 0.954 (0.012) | 6 | 0.642 (0.058) | 12 |
| | 0.744 (0.048) | 16 | 0.969 (0.012) | 12 | 0.660 (0.056) | 24 |
| | 0.731 (0.051) | 24 | 0.977 (0.011) | 18 | 0.707 (0.040) | 36 |
| | 0.863 (0.036) | 32 | 0.977 (0.012) | 24 | 0.730 (0.042) | 48 |
| | 0.869 (0.043) | 40 | 0.973 (0.013) | 30 | 0.741 (0.046) | 60 |
| ν-SVM (ν = 0.001) | 0.658 (0.195) | 154.4 (10.2) | 0.967 (0.022) | 62.2 (6.7) | 0.854 (0.029) | 93.4 (5.8) |
| ν-SVM (ν = 0.002) | 0.859 (0.167) | 210.1 (5.1) | 0.983 (0.014) | 67.4 (4.0) | 0.894 (0.016) | 141.3 (9.5) |
| ν-SVM (ν = 0.003) | 0.942 (0.007) | 233.2 (12.3) | 0.981 (0.012) | 74.3 (4.9) | 0.916 (0.017) | 174.2 (8.2) |
| ν-SVM (ν = 0.004) | 0.960 (0.008) | 255.5 (8.1) | 0.983 (0.010) | 74.5 (4.4) | 0.932 (0.012) | 197.9 (8.8) |
| ν-SVM (ν = 0.005) | 0.971 (0.007) | 249.3 (12.0) | 0.984 (0.008) | 76.1 (4.9) | 0.934 (0.018) | 209.1 (4.8) |
| ν-SVM (ν = 0.01) | 0.985 (0.005) | 332.4 (8.9) | 0.990 (0.008) | 80.4 (3.9) | 0.955 (0.016) | 243.9 (5.3) |
| ν-SVM (ν = 0.02) | 0.993 (0.003) | 452.9 (5.9) | 0.989 (0.006) | 80.9 (4.7) | 0.963 (0.011) | 318.1 (7.4) |
| ν-SVM (ν = 0.3) | 0.986 (0.004) | 3758.7 (10.6) | 0.936 (0.011) | 730.5 (2.8) | 0.913 (0.034) | 1452.9 (14.8) |

Before the training with the respective method was started, prototypes were pre-trained using the Neural Gas algorithm. In the rRBF all training data are used to adapt all prototypes, while in the SRNG and GRLVQ prototypes are trained only on the training data of their respective class. Especially for the SRNG and GRLVQ with a large number of prototypes, this proved to increase model performance significantly in comparison to random initialization or initialization at central class data positions. Data were divided into training and test sets according to a 10-fold cross-validation; classification accuracy was averaged and the standard deviation was computed. In the ν-SVC, the parameter ν was altered systematically, and parameter settings around the peak performance are shown in the results. The upper bound of the parameter was chosen empirically where accuracy declined or saturated; the lower bound was chosen at a significant decline of prediction accuracy. The performance peaked at some ν setting. Model size was determined by the number of support vectors, averaged across cross-validation folds. Learning was performed with the standard functions from the libSVM software package. The MLP contained one hidden layer with a hyperbolic tangent sigmoid transfer function and an output layer with a linear transfer function. The number of hidden units was set identical to the number of prototype vectors in the rRBF model. The MLP was trained using the conjugate gradient approach with momentum term from the Matlab Neural Network Toolbox.
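The evaluation protocol amounts to a standard 10-fold cross-validation loop. A sketch with scikit-learn follows, where make_model stands in for any of the classifiers above; the stratified split is our assumption, as the paper only states 10-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, make_model, folds=10, seed=0):
    """Mean accuracy and standard deviation across folds, as in Table 1."""
    accs = []
    for tr, te in StratifiedKFold(folds, shuffle=True, random_state=seed).split(X, y):
        model = make_model().fit(X[tr], y[tr])
        accs.append(float((model.predict(X[te]) == y[te]).mean()))
    return np.mean(accs), np.std(accs)
```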

4. Results

Table 1 shows the test accuracy values for all three classification tasks for all methods and model sizes. The methods generally reached high accuracy values in all classification tasks, showing that all three problems can be solved by machine learning on spectral data. In order to achieve a high classification accuracy, the ν-SVC took up a very high number of support vectors, up to 5–10 times the number of prototype spectra at comparable model accuracy. This poses a significant obstacle for the implementation of real-time classification methods. In comparison, SRNG, GRLVQ, and rRBF reached a similar level of accuracy with a significantly smaller number of prototypes, i.e. smaller model sizes. For the evidently challenging tasks of classifying green coffee varieties and whiskey distilleries, the rRBF also showed better performance at smaller model sizes than the SRNG and GRLVQ, with the exception of the case of one prototype per class. However, the RBF contains a second layer, which contributes additional computational steps as well as the calculation of the exponential function, while SRNG and GRLVQ only need to compare a data vector with all prototype vectors. Still, a ν-SVC is much easier to handle in terms of model training; optimization methods with automatic step sizes in the GRLVQ, SRNG, and rRBF, however, decrease the complexity for the user significantly. For comparison, the performance of an RBF, LVQ, and SNG without a trained relevance profile is also given. In general, their classification accuracy is smaller compared to the respective models with relevance profiles.


GRLVQ

A 1.0 S

A

A 1.0

S

feature selection class-typical data points direct decision boundary

I

-SVM

MLP A 1.0

1.0

S

feature selection class-typical data points direct decision boundary

I

SRNG A

1.0

21

feature delection class-typical data points direct decision boundary

feature delection class-typical data points direct decision boundary

I

S

S

I

feature delection class-typical data points direct decision boundary

I

Accuracy - Slimness - Interpretablity

Fig. 4. Radar charts for quantitative and qualitative evaluation of the three considered aspects of machine learning based classifiers: A – accuracy, S – slimness, and I – interpretability. The latter aspect cannot be directly measured in a quantitative way, the depicted charts show the basic properties of the considered classifiers in terms of interpretation capability. The larger the covered triangular area the better the overall performance. A shift of the triangular areas indicates a pronounced strength towards the corresponding aspect.

Especially on the challenging green coffee dataset, the SNG and LVQ benefit greatly from the feature filtering by the relevance profile. In Fig. 3D–L the relevance profiles of the rRBF, GRLVQ, and SRNG are depicted. These offer some interpretability of the classification model and of the information on which it bases its decision, which is lacking in the Support Vector Machine and Multilayer Perceptron approaches. It is very obvious that the relevance learning in the GRLVQ and SRNG leads to much more pronounced and sparse relevance profiles. This ability is especially useful to reduce the number of spectral bands, or to narrow down the spectral range that is important for the classification task, in order to reduce system cost significantly. The relevance profiles of the rRBF, on the other hand, are flat or rather unspecific and not as clear as the SRNG and GRLVQ profiles. This might explain the ability of the rRBF to classify data with a smaller model size using the full range of spectral information. In further work, classification accuracy and model sizes of methods with and without relevance learning should be compared. The MLP showed relatively good performance on the easier crisp bread classification task, while showing average performance on both other tasks. Additionally, the MLP showed a higher variance in classification accuracy in the 10-fold cross-validation; all other approaches showed a relatively small variation in performance.

5. Discussion

The practical application of machine learning methods in spectral data processing offers an efficient approach to generate classification models without deeper insight into the chemical and physical processes underlying a spectral signature. This approximation ability and the flexibility of machine learning methods open the way to inspection systems for multiple applications based on pattern-producing sensor hardware, so-called 'soft sensors'. However, the creation of classification models from reference data is still an expert task and presents a hard-to-estimate work package in terms of development costs in soft-sensor system development. A large number of methods with different design choices, such as pre-processing, similarity measure, or kernel choice, have to be evaluated and tested to find the best model. Ideally, machine learning works as a black box, where the user only needs to provide the input and output data and is provided with the best model, which can also be updated as new data become available. Projects like the web-based machine learning service Google Prediction API (https://developers.google.com/prediction/) might show the future of applied machine learning against this background.

Automatic model creation and evaluation needs quantitative performance measures. Heavy emphasis is put on the comparison of classification accuracy when comparing machine learning methods; looking at model slimness and, increasingly, interpretability is still underdeveloped. In Fig. 4 a radar chart visualization containing three dimensions is suggested, to be used by the practitioner to quickly judge model performance in general. Accuracy is a measure bounded by 0.0 and 1.0; the correct classification rate is used here, but other measures such as the kappa score or F-score could be considered as separate axes. The slimness factor could, in a simple case, be the reciprocal difference in model size from the smallest model possible (for example an LVQ with one neuron per class). This size has to be adjusted by the number of operations each model element requires, including secondary operations in multilayer networks such as an MLP or an RBF. Finding measures that allow model complexity/slimness comparison should be a focus of future work. In the radar plot, slimness is depicted intuitively. The interpretability measure is depicted here as a qualitative point score; different possibilities to interpret the model could be formulated. We suggest the ability of the model to select features from the input pattern, the ability to provide class-typical data points, and information about the decision boundary directly encoded in the model parameters. The last property is interesting in the sense that, in order to visualize decision boundaries, either the model parameters can be used directly, or a dataset and further computation are needed, the latter therefore being 'indirect'. This list could be extended with other aspects. Methods score points for fulfilling interpretation aspects, which are then taken together as the overall interpretability measure.
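A radar chart of this kind is straightforward to produce. A matplotlib sketch follows; the axis values are made up for demonstration, not taken from the study.

```python
import numpy as np
import matplotlib.pyplot as plt

def radar(scores, labels=('A', 'S', 'I'), title=''):
    """Three-axis radar chart in the spirit of Fig. 4; scores in [0, 1]."""
    angles = np.linspace(0.0, 2.0 * np.pi, len(labels), endpoint=False)
    vals = np.concatenate([scores, scores[:1]])     # close the polygon
    angs = np.concatenate([angles, angles[:1]])
    ax = plt.subplot(polar=True)
    ax.plot(angs, vals)
    ax.fill(angs, vals, alpha=0.3)
    ax.set_xticks(angles)
    ax.set_xticklabels(labels)
    ax.set_ylim(0.0, 1.0)
    ax.set_title(title)
    plt.show()

radar(np.array([0.9, 0.7, 0.6]), title='illustrative model')
```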


The rRBF is capable of selecting features but does not provide crisp class-typical data points. However, the output weights could be interpreted as to how strongly the activation of one prototype vector activates the respective class output unit. In some cases this could lead to single class-typical data points being found, which is highly dependent on the current dataset. Since prototypes are learned, no direct encoding of the decision boundary can be found in the model. In the GRLVQ and SRNG, the relevance weighting likewise gives the method the capability to select input features. Here, prototypes are directly labeled with a class and can therefore be analyzed in any case as class-typical data points. The challenge is to define 'class-typical' if more than one prototype per class is used; however, the relative placement of those prototypes could give hints about the extension of classes in the feature space and the non-linear shape of a data cloud. The prototypes do not directly encode the decision boundary. The MLP holds neither comprehensive information on which features are selected nor on class-typical data points; its neurons, however, directly encode the decision boundaries, which are combined with each consecutive layer. The SVM has no feature selection information. Data points from the boundary are stored; therefore they cannot be judged as class-typical. The stored support vectors and the kernel enable the direct acquisition of the decision boundary. This scoring so far depicts the general capability of the models to offer interpretation. For a closer weighting, quantitative interpretability measures are needed that judge the fulfillment on a specific dataset. For methods using a relevance profile (rRBF, GRLVQ, SRNG), such measures have been given by the authors in [17]. There, a number of information-theoretic measures are suggested to judge a relevance profile according to its non-flatness and its capability to select class-dependent and non-redundant features; results for the datasets used in this study can be found in that publication.

In summary, this study showed that an SVM can become impractical for real-time implementation due to the high number of support vectors, whereas GRLVQ, SRNG, and rRBF offer smaller model sizes. The concept of relevance learning offers a level of interpretability that is very useful in the area of spectral data processing. System costs are still too high for a number of practical applications; reducing the amount of spectral information needed for a particular task will help reduce costs as well as increase processing speed. Both SRNG and GRLVQ showed a very good capability to classify spectral data with a minimal amount of information, while the rRBF did not produce sparse relevance profiles. For applications where the full spectral range is available, the rRBF might prove advantageous in terms of accuracy at small model sizes.

6. Conclusion

This paper attempts to complement traditional measures of machine-learning-based classifiers, such as classification accuracy and model slimness, with a third aspect: interpretability. While the first two aspects can be quantitatively described, interpretability is rather task-specific and depends on the user's perspective. The availability of relevance profiles, the existence of class-typical prototypes within the original physical domain of the input data, and the direct encoding of decision boundaries in the model are three examples of enhanced interpretability discussed for the used classification methods. However, in order to come to a more comprehensive and self-contained consideration of interpretability, formalized and generally accepted measures need to be developed. Apart from this future work, this paper demonstrates performance properties, along with an application-derived assessment of interpretability, for a number of common machine-learning-based classifiers by means of three challenging real-world datasets, and suggests a visualization scheme that lets a practitioner quickly compare model performance.

Acknowledgments The authors want to thank Barbara Hammer and Thomas Villmann for motivating this study under practical considerations. We would like to thank the company Röstfein Magdeburg as well as Thomas Villmann and Marika Kästner for providing coffee samples. Furthermore, the authors are grateful to Praveen Ashok, Bavishna Praveen, and Kishan Dholakia for providing us with extensive datasets of Scotch Whisky samples.

References

[1] U. Seiffert, F. Bollenbeck, Clustering of hyperspectral image signatures using neural gas, Mach. Learn. Rep. 4 (2010) 49–59.
[2] A. Backhaus, F. Bollenbeck, U. Seiffert, Robust classification of the nutrition state in crop plants by hyperspectral imaging and artificial neural networks, in: Proceedings of the 3rd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, Lisbon, Portugal, 2011.
[3] P.C. Ashok, B.B. Praveen, K. Dholakia, Near infrared spectroscopic analysis of single malt Scotch whisky on an optofluidic chip, Opt. Express 19 (2011) 22982–22992.
[4] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Netw. 15 (2002) 1059–1068.
[5] B. Hammer, M. Strickert, T. Villmann, Supervised neural gas with general similarity measure, Neural Process. Lett. 21 (2005) 21–44.
[6] J. Moody, C.J. Darken, Fast learning in networks of locally tuned processing units, Neural Comput. 1 (1989) 281–294.
[7] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev. 65 (1958) 386–408.
[8] D. Rumelhart, G. Hinton, R. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland, et al. (Eds.), Parallel Distributed Processing, MIT Press, Cambridge, MA, USA, 1986, pp. 318–362.
[9] V. Vapnik, A. Chervonenkis, Theory of Pattern Recognition, Nauka, 1974.
[10] A. Fiore, R. Romaniello, G. Peri, C. Severini, Quality assessment of roasted coffee blends by hyperspectral image analysis, in: Proceedings of the 22nd International Conference on Coffee Science, Campinas, Brazil, 2008.
[11] A. Backhaus, F. Bollenbeck, U. Seiffert, High-throughput quality control of coffee varieties and blends by artificial neural networks from hyperspectral imaging, in: F. Travaglia, M. Bordiga, J. Coïsson, M. Locatelli, V. Fogliano, M. Arlorio (Eds.), Proceedings of the 1st International Congress on Cocoa, Coffee and Tea (CoCoTea), Novara, Italy, vol. 1, 2011, pp. 88–92.
[12] T.M. Martinetz, K.J. Schulten, A neural-gas network learns topologies, in: T. Kohonen, K. Mäkisara, O. Simula, J. Kangas (Eds.), Artificial Neural Networks, North-Holland, Amsterdam, 1991, pp. 397–402.
[13] T. Martinetz, S. Berkovich, K. Schulten, 'Neural-gas' network for vector quantization and its application to time-series prediction, IEEE Trans. Neural Netw. 4 (1993) 558–569.
[14] B. Schölkopf, A. Smola, R. Williamson, P. Bartlett, New support vector algorithms, Neural Comput. 12 (2000) 1207–1245.
[15] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[16] P.-H. Chen, C.-J. Lin, B. Schölkopf, A tutorial on ν-support vector machines, Appl. Stoch. Models Bus. Ind. 21 (2005) 111–136.
[17] A. Backhaus, U. Seiffert, Quantitative measurements of model interpretability for the analysis of spectral data, in: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 2013, pp. 18–25.

Andreas Backhaus has been a research associate in the Biosystems Engineering competence field at the Fraunhofer Institute for Factory Operation and Automation IFF Magdeburg, Germany, since 2009. After completing his PhD in 2007 he was granted a Marie Curie Fellowship to work as a post-doc at the University of Sheffield until 2009. His research interests focus on applied computational intelligence and machine learning, spectral image and data processing, and pattern recognition in general.

Udo Seiffert studied Cybernetics at the University of Magdeburg, Germany. After completing his PhD on multidimensional Self-Organizing Maps for image analysis, he spent his post-doc phase at the University of South Australia in Adelaide. He then became a Junior Research Group Leader at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Germany, working on the recognition and modelling of spatiotemporal development patterns in crop plants. Since 2008 he has been Head of the Biosystems Engineering Department at the Fraunhofer Institute (IFF) Magdeburg and Adjunct Professor for Neural Systems at the University of Magdeburg. His current research interests comprise computational intelligence, image analysis, and hyperspectral imaging against the background of applications in biology, medicine, and engineering.