Pattern recognition of volcanic tremor data on Mt. Etna (Italy) with KKAnalysis—A software program for unsupervised classification

Pattern recognition of volcanic tremor data on Mt. Etna (Italy) with KKAnalysis—A software program for unsupervised classification

Computers & Geosciences 37 (2011) 953–961 Contents lists available at ScienceDirect Computers & Geosciences journal homepage: www.elsevier.com/locat...

2MB Sizes 0 Downloads 42 Views

Computers & Geosciences 37 (2011) 953–961

Contents lists available at ScienceDirect

Computers & Geosciences journal homepage: www.elsevier.com/locate/cageo

Pattern recognition of volcanic tremor data on Mt. Etna (Italy) with KKAnalysis—A software program for unsupervised classification$ A. Messina a,b,n, H. Langer b a b

Istituto Nazionale di Geofisica e Vulcanologia–Sezione Roma 2, Via di Vigna Murata 605, 00143 Rome, Italy Istituto Nazionale di Geofisica e Vulcanologia–Sezione di Catania, Piazza Roma 2, 95123 Catania, Italy

a r t i c l e i n f o

abstract

Article history: Received 25 September 2009 Received in revised form 25 March 2011 Accepted 27 March 2011 Available online 9 April 2011

Continuous seismic monitoring plays a key role in the surveillance of the Mt. Etna volcano. Besides earthquakes, which often herald eruptive episodes, the persistent background signal, known as volcanic tremor, provides important information on the volcano status. Changes in the regimes of activity are usually concurrent with variations in tremor characteristics. As continuous recording leads rapidly to the accumulation of large amounts of data, parameter extraction and automated processing become crucial. We propose techniques of unsupervised classification and present a software, named KKAnalysis, developed for this purpose. Essentials of KKAnalysis are demonstrated on tremor data recorded on Mt. Etna during various states of volcanic activity encountered in 2007 and 2008. KKAnalysis is based on MATLAB and combines various unsupervised pattern recognition techniques, in particular self-organizing maps (SOM) and cluster analysis. An early software version was successfully applied to seismic signals recorded on Mt. Etna during the eruption in 2001. Since each situation may require different configurations, we designed KKAnalysis with a specific GUI allowing users to easily modify parameters. All results are given graphically, in screen plots and metafiles (MATLAB and TIF format), as well as in alphanumeric form. The synoptic visualization of results from SOM and cluster analysis facilitates an immediate inspection. The potential of this representation is demonstrated by focusing on data recorded during a flank eruption on May 13, 2008. Changes of tremor characteristics can be clearly identified at a very early stage, well before enhanced volcanic activity becomes visible in the time series. At the same time, data reduction to less than 1% of the original amount is achieved, which facilitates interpretation and storage of the essential information. Running the program in a typical configuration requires computing time less than 1 min, allowing an on-line application for early warning purposes at INGV–Sezione di Catania. & 2011 Elsevier Ltd. All rights reserved.

Keywords: Self-organizing map Cluster analysis K-means Fuzzy C-means Volcano seismology Volcano monitoring

1. Introduction Mt. Etna, Europe’s largest active volcano, is one of the best known and monitored volcanoes worldwide. Its eruptive events have severe socioeconomic consequences, for the menace of lava flows invading inhabited areas, and also for the impact on the infrastructure of the hinterland, such as the International Airport of Catania. All these issues demand round-the-clock monitoring, which encompasses a variety of observations, such as surveillance by video-cameras, satellite imagery, the measurement of ground deformation, gravity, and magnetic data as well as the gas flux. Among this information, seismological observation plays a key role. Digital seismic data recorded on the volcano and adjacent

$

Code available from: http://earthref.org/cgi-bin/er.cgi?s=erda.cgi?n=974. Corresponding author at: Istituto Nazionale di Geofisica e Vulcanologia–Sezione Roma 2, Via di Vigna Murata 605, 00143 Rome, Italy. Tel.: þ39 0957165818; fax: þ39 095435801. E-mail addresses: alfi[email protected] (A. Messina), [email protected] (H. Langer). n

0098-3004/$ - see front matter & 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.cageo.2011.03.015

areas have been continuously acquired and processed since 1995. This continuous seismic monitoring leads to the accumulation of cumbersome amounts of data. A modern three-component seismic station produces ca. 100 MB of data per day. Apart from fracturing events, which often herald important volcanic activity, the monitoring of the seismic background signal – commonly referred to as volcanic tremor – is decisive for volcanic surveillance, particularly when field surveys are unsafe and/or visual observations are hindered by bad weather. Changes in the state of volcanic activity are reflected in the amplitude and spectral content of volcanic tremor (see, e.g., Falsaperla et al., 2005; Cannata et al., 2008, and references therein). Hence, the analysis of the characteristics of volcanic tremor passes from a merely monoparametric vision of the data to a multivariate one, which implies modern concepts of multivariate data analysis. Automatic classification has been applied successfully to seismic signals recorded on volcanoes. Falsaperla et al. (1996) were among the first to apply supervised classification – where the classifier learns from given examples – using multilayer perceptrons to classify explosion quakes recorded on Stromboli. Again

954

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

using Stromboli data, Langer and Falsaperla (1996) used cluster analysis in an unsupervised classification scheme (see Section 2) for identifying activity regimes on the basis of spectral characteristics of volcanic tremor. Esposito et al. (2008) applied selforganizing maps (SOM hereafter; see Section 2.1) to so-called Very Long Period events. Further examples for automatic classification of seismic signals and related problems – in particular, the extraction of feature vectors from time series and the a posteriori evaluation of the results – can be found in Langer and Falsaperla (2003), Langer et al. (2006), Scarpetta et al. (2005). Langer et al. (2009) compared supervised and unsupervised classification schemes, applying them to patterns derived from volcanic tremor recorded during the eruption in July–August 2001. They applied support vector machines (SVM) and multilayer perceptrons for classification with supervision, and SOM and crisp clustering for unsupervised classification. Supervised classification confirmed the a priori distinction of the regimes of volcanic activity, i.e., preeruptive, eruptive, posteruptive, and lava fountains; unsupervised classification, on the other hand, brought out further internal structures in the dataset, which were not so evident a priori. These encouraging results led us to develop a specific, easy-to-use software package for unsupervised classification and open these techniques to a wider public. Unsupervised classification is flexible as it adapts to new patterns in a straightforward manner. It therefore discloses interesting possibilities for online and continuous processing (Section 7). Unsupervised classification schemes are highly user defined and application driven, so the user should be able to ‘‘play’’ with various options and parameter settings. Our software, named KKAnalysis, is designed with respect to these needs. It was developed under MATLAB1 and combines various methods. KKAnalysis allows customizing configurations and keeps track of all sessions creating log files, where time, date, and basic elements of the configuration are reported for each session (see the Extended Documentation downloadable from EarthRef Digital Archive2). The results are given both in alphanumeric output files as well as in graphical metafiles and screen plots. Visualization is a critical issue in multivariate data analysis. In this work we are particularly interested in creating time series representing characteristics of multivariate data. We demonstrate how the colors of the nodes of the SOM and class membership vectors inferred from fuzzy clustering can be effectively applied for monitoring the development of volcanic tremor characteristics. This kind of synoptic view is especially useful when the number of patterns is large and when transitional ranges between clusters are of interest. 2. Unsupervised classification with KKAnalysis In unsupervised classification the classifier identifies objects belonging to a class on the basis of a measure of dissimilarity rather than exploiting a priori information furnished by an expert. KKAnalysis offers four different types of classifiers. Three of them can be referred to as cluster analysis in a strict sense; the fourth follows the concept of SOM. The part of KKAnalysis related to SOM is based on the MATLAB SOM Toolbox 23 (Vesanto et al., 2000). 2.1. Self-organizing maps SOM – also named Kohonen maps after Teuvo Kohonen (see Kohonen, 2001), who introduced them – consist of a number of nodes. Each node represents a number of patterns, in a certain 1

The MathWorks MATLAB R2007b. http://www.mathworks.com/. EarthRef—The website for Earth Science reference data and models. http:// www.earthref.org. 3 SOM Toolbox for MATLAB. http://www.cis.hut.fi/projects/somtoolbox. 2

sense forming a small cluster. The number of nodes must be specified a priori. Typically this choice depends on the number of patterns (Vesanto et al., 2000, see Section 5.2). Each node is a vector containing the same number of features (called weights) as the input patterns. During the training process, the node weights are adjusted, minimizing the sum of the differences between original data and their representing nodes, i.e. qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Dij ¼ ðWi Vj ÞT ðWi Vj Þ ð1Þ where Vj are the normalized input feature vector and Wi are the weights of the nodes. A core step is the identification of the closest node to the actual input vector, i.e., the best matching unit (BMU) for the j-th pattern. Once the BMU is identified, its weights are gradually adjusted according to the so-called learning rate l. Besides, neighboring nodes lying within a certain radius of influence are considered as well. A second parameter j describes the dependence of the upgrade of a node on the distance D to the BMU. j(D,t) is maximum for the BMU, whereas nodes outside the radius of influence are not upgraded at all. Both parameters l and j decrease with increasing training time t. The upgrade of weights follows the following relationship: Wi ðt þ1Þ ¼ Wi ðtÞ þ jðD,tÞlðtÞðWi ðtÞVj ðtÞÞ:

ð2Þ

At the end of the training phase, the nodes are mapped to a low dimensional – here 2D – representation space applying principal component analysis (PCA). The representation space, i.e., the map, is spanned by the two principal axes z1 and z2, corresponding to the two most significant principal components. The topological relationship of the original data is still maintained; i.e., patterns represented by neighboring nodes in the SOM are truly close to each other also in the original data space. SOM become particularly instructive when the weights are represented with a RGB color code. For this purpose we exploit the PCA noted above. z1 and z2 are normalized to a range of [0y.1]. A third value z3 is obtained either as 1 z1 or 1  z2. Then, we assign the principal colors to the zi, for instance, red to z1, green to z2, and blue to z3. Each node receives a mixture of the colors according to its original weight, respectively to the projection of the weights in the representation space. We characterize a pattern by the color of the BMU to which it belongs. Patterns belonging to neighboring BMUs have similar colors and are close to each other in the original data space. 2.2. Clustering All cluster analyses used in KKAnalysis follow the nonhierarchical partitioning strategy of clustering, where the number of desired clusters is chosen a priori. In crisp clustering each pattern is presumed to belong exclusively to one group. Classic k-means cluster analysis is one of the options available in KKAnalysis. The measure for the heterogeneity of the k-th cluster is expressed as the sum of the squared distances of the patterns in the cluster from its centroid vector. In the paper by Langer et al. (2009); see ¨ also Spath (1985), a specific metric, called adaptive determinant criterion and resembling the Mahalanobis distance, was used to account for possible correlations among the features. Clustering with the adaptive determinant criterion represents a second option available in KKAnalysis. In crisp clustering, abrupt or more smooth changes of pattern characteristics cannot be distinguished. A way around this problem is fuzzy clustering, here fuzzy C-means, which is the third option for cluster analysis realized in KKAnalysis. In fuzzy clustering each pattern may belong to a certain degree to all possible classes. Consequently, the class membership of a pattern is given by a vector. The partition is described by an M  K matrix of class

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

955

membership values, with M being the number of patterns in the entire dataset and K the number of clusters. The optimum partition is determined minimizing the so-called objective function J ¼ Sk Jk

ð3Þ

Jk ¼ Sk mqik d2ik

ð4Þ

where mik is the membership value of pattern i with respect to cluster k, dik its Euclidean distance from the cluster centroid, and q a weighting exponent (here set to q¼2). For more details on fuzzy clustering and the other options, see KKAnalysis Extended Documentation and papers by Brouwer (2009), D’Urso and Maharaj (2009), Tokushige et al. (2007), or Er and Parthasarathi (2005) describing also alternative approaches to the one proposed here.

3. Data The dataset presented here was recorded on Mt. Etna during 2007 and 2008. It was chosen for the high quality of data and stability of continuous data acquisition. At the same time it represents many different stages of volcanic activity: quiescence, lava fountains, a lava effusion during a flank eruption, and all the transitional stages. Six episodes in 2007 are related to lava fountains occurring on 29 March, 11 April, 29 April, 6–7 May, 4–5 September, and 23–24 November, 2007. These fountains reached heights of several hundreds of meters. They were visible in the whole metropolitan area of Catania and caused major concerns for air traffic (see photographic material and videos on the INGV–Sezione di Catania website4). Unlike flank eruptions, lava fountains rise from the summit craters and are rather shortlived. The durations of the events here range from 45 min (29 March 2007) to over 10 h for the other events (Behncke, 2009, personal communication). Each of the corresponding blocks of seismic signals covers 2 days of continuously recorded and processed data (i.e., 28–29 March, 10–11 April, 28–29 April, 6–7 May, 4–5 September, and 23–24 November, 2007). A further data block spans from May 9, 2008, 0:00 GMT to May 15, 2008, 24:00 GMT, with a lava fountain event on May 10, 2008. A lava effusion started on May 13, 2008, and continued, in a mild form, until July 2009. In total, the dataset covers 19 days representing continuously recorded and processed data without any a priori selection by a human operator. Data were recorded with a Nanometrics Trillium seismometer having a corner period of 40 s. We focus on the vertical component of the station ESPC, which belongs to the permanent seismic network of INGV and is situated in the summit area of the volcano (Fig. 1). The dynamic range of the digitizer is 24 bit and the sampling rate is 100 Hz. This station was chosen for the high quality of data and the reliability of acquisition.

4. Preprocessing In KKAnalysis, the input dataset forms a table with rows and columns. A row corresponds to a single pattern or feature vector and columns represent its components. Tremor data are originally given as continuous time series, recorded by Nanometrics instruments, as aforementioned (Fig. 2). Once acquired, they are stored in a Network Attached Storage (NAS) system, becoming available for processing. Time series data are not suitable for use as feature vectors. We therefore developed suitable preprocessing tools in 4 http://www.ct.ingv.it/index.php?option=com_content&view=article&id= 115&Itemid=324.

Fig. 1. Sketch map of the summit area of Mt. Etna, summit seismic stations, and the theater of eruptive events in 2007 and 2008. The effusion of the lava flow shown in red started on May 13, 2008. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Diagram of the seismic data flow. Data are acquired by seismic stations, then transmitted to the central acquisition system, and stored in a NAS system. A further preprocessing phase is necessary to produce a suitable dataset for KKAnalysis.

order to calculate spectrograms following a gliding window scheme. In this scheme amplitude spectra are calculated for short time windows (short time Fourier transform, STFT), which are shifted in discrete time steps over the whole time series. We chose a length of 1024 points for the STFT and a time step of 500 points (5 s). For every 5 min of the time series, we obtained a block of 60 vectors made up by spectral amplitudes measured in equally spaced frequency intervals. Finally, both averages and 10% percentiles of spectral amplitudes were calculated over each of these ensembles, ending up with one pattern every 5 min. Using low percentiles instead of averages we partially removed transients (such as wind gusts and earthquakes), which are considered as disturbance of tremor data (Di Grazia et al., 2006). In order to limit the dimension of the feature vector, frequency bins, representing the spectral content in frequency intervals of 0.29 Hz, were formed. Finally, patterns, given by feature vectors with 62 components (see Table 1, Fig. 3), were written in an ASCII file, representing the input dataset for KKAnalysis.

956

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

Table 1 Structure of the dataset used in KKAnalysis. Time (s)

0.15 Hz (av)

0.34 Hz (av)

¨

0.15 Hz (10%)

0.34 Hz (10%)

¨

Index (int)

¨ 1200 1500 1800

¨  4.55479  4.50399  4.53464

¨  4.309  4.31217  4.28991

¨  4.78099  4.66573  4.69704

¨  4.43856  4.46139  4.45617

¨

¨

¨

˙

¨ ¨ ¨ ¨ ¨

¨ 4 5 6

¨

¨ ¨ ¨ ¨ ¨

¨

In the sample session columns 64–125 (10% percentiles) are used. Column 126 is a sequence of integers. In Figs. 8–11 these indices govern the position of patterns along the abscissa.

Fig. 4. KKAnalysis ‘‘Welcome window’’. It represents the main window of the software. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Some example patterns. The feature vector components correspond to the power spectral density measured in frequency bands with a width of ca. 0.29 Hz. ‘1’, ‘2’, and ‘3’ are the cluster IDs assigned to patterns during clustering. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

5. Running KKAnalysis KKAnalysis was tested under Microsoft Windows XP, Vista, and 7 as well as under GNU/Linux Ubuntu 9.04. A reasonable performance was achieved with a Pentium 4 processor (singlecore) with a 3 GHz clock frequency and 1 GB RAM. The storage of the software requires a free disk space of ca. 1 MB. 5.1. KKAnalysis installation All files concerning KKAnalysis can be downloaded from EarthRef Digital Archive, among these the Extended Documentation noted earlier. KKAnalysis requires the MATLAB Component Runtime (MCR). Its installation is not necessary when MATLAB (R2007b, v7.5 for Windows or R2008b, v7.7 for Linux) is already installed. Otherwise, MCR must be installed using MCRInstaller. Both versions, ‘‘.exe’’ for Windows and ‘‘.bin’’ for Linux, can be downloaded from the archive. 5.2. Sample session As differences between Windows and Linux versions are minimal, we refer only to the Windows application. On starting, the user is prompted with a Welcome window (Fig. 4). Using the Browse button in the Input File frame, we load a file ‘‘SAMPLE.txt’’ residing in the directory reported in the Input File-Path field (D:\KKAinput\SAMPLE.txt). This file was created during the steps described in Section 4 (Preprocessing). KKAnalysis prompts basic characteristics of the data structure of the chosen file. Our ‘‘SAMPLE.txt’’ consists of a total of 126 columns and 5465 rows. We skip the first 63 columns (a column with time

Fig. 5. KKAnalysis ‘‘Settings window’’. It allows users to configure the software parameters. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

steps and 62 ones with average spectral values) and use the 10% percentiles selecting columns from 64 to 125. We choose to classify all patterns (from 1 to 5465). The X Axis option is selected, i.e., we use the values of column 126 as abscissa. In doing so we can govern the spacing on the abscissa (see also Section 6). Here we graphically separate blocks of patterns, which do not form a continuous time sequence. Seven episodes of volcanic activity in 2007 and 2008 are considered. At the end of a data block belonging to an episode, the index in column 126 jumps 25 units. Thus indices range from 1 to 5615, and 6n25¼150 spaces are void. By default, KKAnalysis plots the results of all data selected for analysis. Nevertheless, the user can zoom on a part of the patterns selecting, in the Show subframe, the IDs of the first and last pattern to be shown. Using the last column for the creation of the abscissa, the IDs visualized in the Show fields reference to those found in that column (1–5615). All files related to a session – alphanumeric and figure files – are written to the directory given in Output Dir-Path field, which can be specified using the Browse button. Among these files are the log files and the result files, containing weights of the SOM nodes, their color codes, and parameters related to the class membership found during the cluster analysis. Parameters governing KKAnalysis operational characteristics can be accessed with the Settings button (bottom of Fig. 4), which opens the Settings window (Fig. 5). On the left-hand side, the user selects parameters concerning the SOM. Here the Variance method for normalization of input data is used, initialization of SOM units is carried out using the Linear option and the weighting function

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

for the distance measure is Gaussian. The size of the map depends on the number of patterns. Following Vesanto et al. (2000), the number of nodes of a Normal size map corresponds approximately to fivefold the square root of the number of patterns, here 360 SOM nodes. For the configuration of the nodes on the map we select the Hexagonal option. The use of the Long training mode is possible for the short computing time (less than 2 min). Options in the field Color Code allow specifying how colors of the SOM units are mixed. This has no effect on the classification itself. In our example the default option (RGB1) is adopted. The right-hand side of the Settings window displays the three options for cluster analysis, i.e., K-means (Euclidean distance), CA (adaptive determinant criterion), and fuzzy C-means, the option chosen here. The number of clusters must be specified a priori by the user. Commonly one tries to find a reasonable compromise between the desire to have as few clusters as possible and the greatest number of patterns belonging to a cluster with a large membership. Here we choose three clusters and find over 78% of patterns having a prevailing cluster membership of over 0.7 (see also Section 6). The Davies–Bouldin Index (DBI, Davies and Bouldin, 1979) – a criterion for the choice of the optimum number of clusters – applies only to K-means clustering so it is not available in our sample session. Under the Clustering-Algorithm subframe, KKAnalysis provides a panel Data Source. Clustering can be performed using the original patterns or the feature vectors of the SOM nodes. Here we use the original patterns. We use the full dimensionality of the feature vectors (62); i.e., we do not select the option PCA Projection before Clustering, which allows reducing the dimensionality of the pattern by performing a principal component analysis. Exploiting the options of the Figures frame, we choose to save all figures as MATLAB files. Buttons Load, Save As, Ok, and Cancel allow us to manage configurations. The user can load (Load button) an already existing configuration file, modify the displayed parameters, and save

957

them (Save As button) in a new ‘‘.kfg’’ file. Ok confirms the chosen parameters and writes them to the internal configuration file. Cancel abandons the editing of the parameters, returning to the latest saved configuration. 6. Representation of results We show three types of graphical representations provided by KKAnalysis, which well explain its main features. On the right-hand side of the Welcome window (Fig. 4) we check off three options. (i) Map Color Code and PCA-Proj. (Fig. 6) shows basic SOM properties. For its definition, KKAnalysis internally performs a 2D principal component analysis on the covariance matrix of the data vectors. The frame (a) in the figure shows the position of each node in this new system of axes. Our map consists of 360 nodes. The ratio length/width of the map shown in the frame (b), ca. 4.5, corresponds approximately to the ratio of the square root of the two largest eigenvalues of the covariance matrix. Our map with 40  9 nodes matches this in good approximation. The numbers given for each node, in the frame (b), correspond to the number of patterns for which a node was identified as BMU. The blue-green node on the lower left was identified as BMU for 31 patterns. ‘‘Looser’’ nodes, which were never BMUs, are assigned to a 0. The frame (c) schematically shows the dispersion of the patterns represented by a node. The larger the BMU symbol, the less the dispersion of the patterns associated with this unit. Looser nodes are simply marked by a dot. The color coding is the same in all three frames. It follows the concept described in Section 2.1. (ii) Clustering Info (Fig. 7) reports overall statistics about clustering. In our example, using fuzzy clustering, 2844 patterns belong ‘‘preferably’’ to cluster ‘1’, 1752 to cluster ‘2’, whereas the smallest cluster ‘3’ has 869 patterns (Fig. 7a). By convention KKAnalysis always assigns ‘1’ to the cluster with the largest

Fig. 6. SOM representations: (a) 2D projection of weights of the SOM in a system of axes spanned by the two major principal components. Note that principal components are formally dimensionless. (b) 2D map of our 360 SOM nodes, representing neighborhood relations schematically. Numbers inside the hexagons report the number of patterns for which the nodes are ‘‘best matching units’’ (BMU). (c) Representation of homogeneity of the BMU. The larger the symbol the less the dispersion of the patterns belonging to a BMU. Looser nodes (never being BMUs) are represented as dots. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

958

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

Fig. 7. Overall clustering statistics. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Synopsis of fuzzy clustering and SOM. Labels on the abscissa report the pattern IDs corresponding to the integer values in the 126th column of the input file. Colors of triangles indicate to which of the BMUs displayed in Fig. 6 the pattern belongs. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

number of patterns, ‘2’ to the one with the second largest number, and so on. The pie chart (Fig. 7b) reports a statistics of maximum membership degrees encountered during the classification of the patterns. 0.9–1.0 means crisp or close to crisp membership; o0.5 is rather fuzzy. Here 30% of patterns have a class membership 40.9, for another 48% it ranges between 0.7 and 0.9, and 3% of pattern assignments are truly fuzzy (degreeo0.5). (iii) Plot of Clusters and Kohonen ColorMap (Fig. 8) shows a synoptic representation of clustering and SOM results. Each triangle and

bar represents a single pattern. The position of the triangles on the abscissa is controlled by the values of 126th column of the dataset. We divided the data graphically into seven blocks. Six blocks correspond to the episodes in 2007, each of which lasted 2 days. The last block (from 3600 on) represents data continuously collected from 9 to 15 May 2008. The position of each triangle on the ordinate reports the cluster membership, on the base of the largest component of the membership vector. The dark gray bar fills the light gray area,

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

according to the degree of prevailing membership. For instance, the triangles around 5000 on the abscissa have a position on the ordinate corresponding to ‘3’. The coverage of the light gray bars by the dark ones is close to 100%, corresponding to an almost crisp membership of these patterns to cluster ‘3’. The complete representation of fuzzy class membership vectors is shown in the color bar at the bottom of the graph. Each color corresponds to a cluster (here ‘1’, ‘2’, and ‘3’), whereas the degree of membership to a certain cluster is expressed by the height the colored field occupies. It is possible to visualize specific time windows using the Show frame in the Welcome window (Fig. 4). With values of ‘3600’ as From row and ‘5615’ as To row, we obtain a synoptic plot only for this range (Fig. 9). Nonetheless, KKAnalysis performs the analysis on the dataset defined by the values specified in From and To fields of Rows and Columns frames. SOM colors may be difficult to distinguish even though the corresponding patterns are found at clearly distant positions on the map. Looking at the panel (a) of Fig. 6 we find deep purple patterns at a position around (5;2) and dark blue ones at (  5;2). These chromatic differences may easily escape the eye of a reader, in particular when many patterns must be plotted in dense time sequences (e.g., Fig. 8). The synoptic visualization of SOM colors and clustering results facilitates the recognition of differences between the patterns. In Fig. 10 we focus on May 13, 2008. Frame (a), showing the SOM colors and the fuzzy class membership values, was obtained by editing the MATLAB version (.fig file) of Fig. 8. The time series shown in frame (b) reports logarithmic RMS (root mean square) values obtained over 5 min. Average amplitudes are given in blue, and the 10% percentile in red. The latter was adopted in order to discard transients such as earthquakes, similar to the procedure described in Section 4. In Fig. 10 we note a slight change of SOM colors and fuzzy membership from 04:00 (here and hereafter UT) on, when triangle colors start to show green tones. From 06:00, the membership degree for cluster ‘2’ increases rapidly, becoming prevalent from ca. 6:30 on. No change can be seen in the average of RMS amplitudes, and

959

changes in the 10% percentile curve are still uncertain. From 07:00 on, SOM colors and fuzzy class membership clearly indicate a new regime, well distinguishable from the one before 06:00. At the same time, we also note a slight increase of the 10% percentile amplitudes, whereas average RMS amplitudes do not change significantly. Both amplitude curves reach clearly higher values only from 08:00 on, i.e., when SOM colors gradually change from green to pink, and class membership to cluster ‘3’ becomes increasingly important. From ca. 09:15 on the colors of the triangles show a strong contribution of red tones, coinciding with the maximum of radiated seismic energy and the beginning of the lava effusion. Class membership values show a prevalence of cluster ‘3’, nonetheless they reveal a smoother development until pattern 5000, ca. 17:00, where the class membership to cluster ‘3’ reaches its maximum. Recall that the maximum of class membership values is reached when a pattern is close to the centroid position of a cluster, rather than for some extreme, marginal positions. Such positions turn out from the SOM colors, which are expected to show a high degree of saturation of one of the RGB colors. Here this happens at 09:15 when tremor radiation reached its maximum. From 21:00 on, class membership to cluster ‘3’ decreases as signal characteristics become increasingly similar to a cluster ‘2’ type pattern.

7. Conclusions Volcanic tremor has proven to be key for monitoring Mt. Etna providing important information about its status in periods when the theater of activity is not accessible for adverse environmental conditions. A 24/7 data acquisition comes with the burden of the accumulation of large data masses, which makes parameter extraction and automatic data processing necessary. Classification has been proposed in various contributions. Langer et al. (2009) applied supervised and unsupervised techniques to tremor data recorded in 2001. They showed that regimes of volcanic activity are clearly mirrored in the characteristics of this signal. Their encouraging results motivated the authors to develop an easy-to-use software package for the unsupervised classification of numerical

Fig. 9. Zooming with KKAnalysis using the ‘‘Show’’ frame in the Welcome window. Labels on the abscissa report the pattern IDs corresponding to the integer values in the 126th column of the input file. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

960

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

Fig. 10. KKAnalysis typical representation of results (above) together with RMS logarithmic values (below) on May 13, 2008. Average RMS amplitudes are given by a blue line; values corresponding to the 10% percentile are in red. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 11. Example for a real-time monitoring application. Labels on the abscissa report the pattern IDs corresponding to the integer values in the 126th column of the input file. The reference pool of patterns is given by six episodes of 2007 and the data block from 9 to 15 May 2008. ‘‘New patterns’’ were recorded on March 9, 2009. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

patterns, called KKAnalysis. Unsupervised classification is highly user defined and application driven, as every problem requires different methods and configurations. KKAnalysis has been developed at INGV–Sezione di Catania to match these needs. It provides a toolbox of unsupervised classification techniques, ready to be used by a broad public. KKAnalysis can be applied to any kind of pattern provided its features are given by numerical values. For instance, a recent application on geochemical analyses of Mt. Etna eruptive products was presented by

Corsaro et al. (2010). A key feature of the program resides in the synoptic representation of classification results (see Figs. 8–11). Time-dependent behavior of patterns is of particular interest, and the graphical output has been designed with regard to this aspect. In our approach, changes in signal characteristics become evident well before they can be recognized in conventional monitoring, e.g., considering the RMS amplitudes. Here, the program was applied on data continuously recorded at INGV-CT, suitably preprocessed. No a priori inspection of the

A. Messina, H. Langer / Computers & Geosciences 37 (2011) 953–961

Table 2a Data compression achieved with KKAnalysis. Time series (binary)

Spectrogram (ASCII)

ASCII output

Matlab figures

4650 MB

6 MB

0.6 MB

0.3 MB

The original time series considered here are relative to a single station channel (vertical component) and cover a total of 19 days. Each sample requires 4 bytes. The ASCII output accounts for the classification results, the weights of the nodes, and the log-file. We further considered the three figures chosen during the sample session, saved with the *.fig option.

961

the paper. We are grateful to the Director of our institute, Domenico Patane , for permitting the implementation of the on-line classification system in the operative control center of INGV-CT and our colleagues Danilo Reitano, Marcello D’Agostino, and Salvatore Mangiagli, whose collaboration was essential to reach this goal. We also thank Steve Conway for his help concerning linguistic aspects.

Appendix A. Supplementary materials Supplementary materials associated with this article can be found in the on-line version at doi:10.1016/j.cageo.2011.03.015.

Table 2b Runtime performances of KKAnalysis using the various options for map size and training duration.

References Small (18  5 nodes)

Normal (40  9 nodes)

Short Time (s) QE (dl) TE (%)

10 1.79 3.5

14 1.79 3.5

29 1.30 7.7

Normal Time (s) QE (dl) TE (%)

10 1.72 4.5

14 1.41 6.0

58 1.13 3.5

Long Time (s) QE (dl) TE (%)

12 1.66 3.3

20 1.34 4.0

176 1.11 3.3

Training/map size

Big (80  18 nodes)

The quantization error QE is the sum of the Dij in Eq. (1). The topological error TE gives the percentage of nodes which are close to each other on the map but not in the original data space (see Vesanto et al., 2000, and KKAnalysis Extended Documentation).

data was carried out by a human operator. As summarized in Table 2a, a drastic reduction of data amount is achieved, facilitating handling and storage of the information. Since the computing time for a typical run, using all the ca. 5500 patterns (288 per day), is short (see Table 2b), on-line application of KKAnalysis together with the preprocessing steps in the context of volcano monitoring has been implemented in the operative center of INGV-CT. In this context, the dataset described in this paper is used as a reference pool of data. New patterns are added to a daily ring buffer, discarding the eldest patterns. In our specific configuration the daily ring buffer stores 288 patterns, corresponding to an update every 5 min. After each update the ring buffer data are reprocessed together with the reference dataset. In Fig. 11, we show an example of the results obtained for this scheme. Our reference pool of data (ca. 5500 patterns) is depicted together with a set of new data covering a day of seismic data (the 288 patterns from 6000 on). As the new data in the ring buffer are processed together with the data of the reference pool, they can be related to previous unrests in a straightforward way. Immediate information on the current volcano activity regime is thus available.

Acknowledgments We thank two anonymous reviewers for their careful reading of the manuscript and their suggestions, which helped improve

Behncke, B., 2009. Personal communication. Brouwer, R.K., 2009. A method of relational fuzzy clustering based on producing feature vectors using FastMap. Information Sciences 179 (20), 3561–3582. Cannata, A., Catania, A., Alparone, S., 2008. Volcanic tremor at Mt. Etna: inferences on magma dynamics during effusive and explosive activity. Journal of Volcanology and Geothermal Research 178, 19–31. doi:10.1016/j.jvolgeores. 2007.11.027. Corsaro, R.A., Falsaperla, S., Langer, H., 2010. Geochemical patterns classification of recent Mt. Etna volcanic products based on a synopsis of Kohonen maps and fuzzy clustering. Geophysical Research Abstracts 12 EGU2010-10416-1. Davies, D.L., Bouldin, D.W., 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Learning 1 (2), 224–227. Di Grazia, G., Falsaperla, S., Langer, H., 2006. Volcanic tremor location during the 2004 Mt. Etna lava effusion. Geophysical Research Letters 33, L04304. doi:10.1029/2005GL025177. D’Urso, P., Maharaj, E.A., 2009. Autocorrelation-based fuzzy clustering of time series. Fuzzy Sets and Systems 160 (24), 3565–3589. Er, M.J., Parthasarathi, R., 2005. A novel self-organizing neural fuzzy network for automatic generation of fuzzy inference systems, Advances in Neural Networks. Springer, Berlin 434–439. Esposito, A.M., Giudicepietro, F., D’Auria, L., Scarpetta, S., Martini, M., Coltelli, M., Marinaro, M., 2008. Unsupervised neural analysis of very long period events at Stromboli volcano using the self-organizing maps. Bulletin of the Seismological Society of America 98, 2449–2459. doi:10.1785/0120070110. Falsaperla, S., Alparone, S., D’Amico, S., Di Grazia, G., Ferrari, F., Langer, H., Sgroi, T., Spampinato, S., 2005. Volcanic tremor at Mt. Etna, Italy, preceding and accompanying the eruption of July–August, 2001. Pure and Applied Geophysics 162, 2111–2132. doi:10.1007/s00024-005-2710-y. Falsaperla, S., Graziani, S., Nunnari, G., Spampinato, S., 1996. Automatic classification of volcanic earthquakes by using multi-layered neural networks. Natural Hazards 13, 205–228. Kohonen, T., 2001. Self Organizing Maps, 3rd ed. Springer, Berlin, Germany, 501 pp. Langer, H., Falsaperla, S., 1996. Long-term observation of volcanic tremor on Stromboli volcano (Italy): a synopsis. Pure and Applied Geophysics 147, 57–82. Langer, H., Falsaperla, S., 2003. Seismic monitoring at Stromboli volcano (Italy): a case study for data reduction and parameter extraction. Journal of Volcanology and Geothermal Research 128, 233–245. Langer, H., Falsaperla, S., Masotti, M., Campanini, R., Spampinato, S., Messina, A., 2009. Synopsis of supervised and unsupervised pattern classification techniques applied to volcanic tremor data at Mt. Etna, Italy. Geophysical Journal International 178, 1132–1144. doi:10.1111/j.1365-246X.2009.04179.x. Langer, H., Falsaperla, S., Powell, T., Thompson, G., 2006. Automatic classification and a-posteriori analysis of seismic event identification at Soufriere Hills volcano, Montserrat. Journal of Volcanology and Geothermal Research 153 (1–2), 357–369. Scarpetta, S., Giudicepietro, F., Ezin, E.C., Petrosino, S., Del Pezzo, E., Martini, M., Marinaro, M., 2005. Automatic classification of seismic signals at Mt. Vesuvius volcano, Italy, using neural networks. Bulletin of the Seismological Society of America 95, 185–196. doi:10.1785/0120030075. ¨ Spath, H., 1985. Cluster Dissection and Analysis. Horwood, Chichester, 226 pp. Tokushige, S., Yadohisa, H., Inada, K., 2007. Crisp and fuzzy k-means clustering algorithms for multivariate functional data. Computational Statistics 22 (1), 1–16. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J., 2000. SOM Toolbox for Matlab 5. Report A57, Helsinki University of Technology, Helsinki, Finland.