Accepted Manuscript

Efficiency Investigation from Shallow to Deep Neural Network Techniques in Human Activity Recognition

Jozsef Suto, Stefan Oniga

PII: S1389-0417(18)30053-6
DOI: https://doi.org/10.1016/j.cogsys.2018.11.009
Reference: COGSYS 787
To appear in: Cognitive Systems Research
Received Date: 8 February 2018
Revised Date: 2 August 2018
Accepted Date: 22 November 2018
Please cite this article as: Suto, J., Oniga, S., Efficiency Investigation from Shallow to Deep Neural Network Techniques in Human Activity Recognition, Cognitive Systems Research (2018), doi: https://doi.org/10.1016/j.cogsys.2018.11.009
Efficiency Investigation from Shallow to Deep Neural Network Techniques in Human Activity Recognition

Jozsef Suto a,*, Stefan Oniga a

a Department of Informatics Systems and Networks, Faculty of Informatics, University of Debrecen, Kassai street 26, 4028 Debrecen, Hungary.
* Corresponding author. Email: [email protected]. Tel.: +36 52 512 900/75016. ORCID ID: 0000-0003-3155-5159
Emails: [email protected], [email protected]
Abstract: In recent years, several researchers have measured different recognition rates with different artificial neural network (ANN) techniques on public data sets for the human activity recognition (HAR) problem. However, an overall investigation does not exist in the literature, and the efficiency of complex and deep ANNs over shallow networks is not clear. The purpose of this paper is to investigate the recognition rate and time requirement of different kinds of ANN approaches in HAR. This work examines the performance of shallow ANN architectures with different hyper-parameters, ANN ensembles, binary ANN classifier groups, and convolutional neural networks on two public databases. Although the popularity of binary classifiers, classifier ensembles, and deep learning has been increasing significantly, this study shows that shallow ANNs with appropriate hyper-parameters, combined with extracted features, can reach a similar or higher recognition rate in less time than other artificial neural network methods in HAR. With a well-tuned ANN we outperformed all previous results on two public databases. Consequently, instead of the more complex ANN techniques, a simple ANN with two or three layers can be an appropriate choice for activity recognition.
Keywords: Artificial neural networks, binary classifiers, convolutional networks, ensembles, feature extraction, human activity recognition.
1. Introduction
Today, miniaturized sensor technology has accelerated the usage of different kinds of small sensors in various research fields, such as ambient assisted living (Oniga and Suto 2016). It also has a huge impact on human activity recognition (HAR) research, where scientists try to recognise the physical activities of people from wearable sensors' signals. Several studies have shown that physically active people who live a healthier lifestyle have lower rates of disease (Physical Activity Basis 2017). The trend toward a sedentary lifestyle is one of the main causes of several dangerous health problems, for example, obesity and cardiovascular diseases (Godfrey et al. 2008). Monitoring and recognizing the daily activities of a person can help in the evaluation and treatment of his/her health status. In addition, the rapidly growing elderly population also greatly influences the development of HAR systems and other health care services. In recent years, researchers have tried different information acquisition techniques for HAR. The two major approaches are based on computer vision and wearable sensor networks. Due to the limitations and disadvantages of vision-based data capture (privacy issues, background change, lighting conditions, special environment, etc.), wearable sensors have received more attention. Even though many data capture devices exist, activity classification is not an easy task. Beyond data acquisition, an efficient classification model is also necessary for data interpretation. Different types of noise and incomplete training data sets frequently prevent correct recognition. In order for the recognition to be efficient, researchers have applied stable and robust machine learning techniques that can handle noisy data. Previous studies have shown that feed-forward artificial neural networks (ANN) can be a suitable classifier for HAR (Yang et al. 2008; Khan et al. 2010; Oniga and Suto 2014).
However, the ANN construction depends on many hyper-parameters which strongly influence the performance. Earlier articles did not specify the hyper-parameters or used only a partial grid search (Yang et al. 2008; Khan et al. 2010; Chernbumroong et al. 2013; Oniga and Suto 2015; Kilinc et al. 2015; Zebin et al. 2017). Probably their authors did not pay enough attention to hyper-parameter search. In addition, to the best of our knowledge, more complex ANN models such as ANN ensembles and binary ANN classifiers have been poorly tested in HAR. HAR surveys (Ayu et al. 2012; Lara and Labrador 2013; Gao et al. 2014) do not cover those classifiers, although they are regularly used in other machine learning problems such as financial prediction and medical image classification (Zhou, Jiang, Jang and Chen 2002; Tsai and Wu 2008). In the last decade, deep learning has generated some great breakthroughs in different machine learning problems. The feature extraction properties of the new layers have become attractive for the HAR community because a separate feature extraction stage is not necessary in this case. The new (deeper) methods have already outperformed some earlier shallow learning algorithms in natural language processing, document analysis, and object classification (Simard et al. 2003; Collobert et al. 2011). This motivated some researchers to use convolutional neural networks (CNN) on accelerometer and gyroscope signals in HAR. They claimed that the convolutional layers in CNNs can replace the hand-crafted feature extraction step of shallow methods and that CNNs have stronger generalization ability compared to shallow techniques (Jiang and Yin 2015; Sheng et al. 2016; Zeng et al. 2017). Since CNNs do not use conventional feature extraction, their usage is more convenient (Zheng et al. 2014). However, the effectiveness of CNNs against shallow ANNs is not clear in HAR. The articles of Yang et al. (2015) and Gjoreski et al. (2016) reflect this uncertainty well.
In both articles, the authors compared the efficiency of CNNs to other shallow techniques like k-nearest neighbour or support vector machine, and they measured very small differences. For the above reasons, the goal of this work is to examine the efficiency of different feed-forward ANN architectures with distinct parameters, ANN ensembles, binary ANN classifier groups, and CNN architectures in HAR. In this study the test framework was written in Java (without any acceleration) and contains a self-developed machine learning library which was designed for real-time applications (Suto et al. 2017). Our purpose was to examine the ANN models on different public databases which have been collected with different data acquisition sensors, provide different data representations, consist of similar activities, and for which enough earlier works exist. Based on these considerations, two datasets have been used. One of them is the wearable action recognition dataset (WARD). WARD is a benchmark dataset which was acquired by Yang et al. in 2009 for HAR research. They used 5 data acquisition devices for the database construction, which were attached to different parts of the subject's body. Each device included a 3-axis accelerometer and a 2-axis gyroscope. A detailed description of the database and the measurement conditions
can be found on the official website 1. The other dataset was downloaded from the well-known UCI repository 2. This database was created with one data acquisition device, a smartphone with a built-in 3-axis accelerometer and a 3-axis gyroscope. The database contains both raw data and extracted features. In this paper we refer to this database as UCI_DB. The work of Anguita et al. (2013) contains additional information about this dataset.
2. Methods

2.1 Data segmentation and feature extraction
Generally, the raw data which come from the inertial sensors can be seen as a discretized data stream. During the training and classification stages, this data stream is separated into pieces, or in other words, into windows. The small data set from a window is the input of the feature extraction algorithms, where the goal is to find general descriptors which characterize the complete data set in the window. Extracted features are preferable to normalized raw data because they are independent of the pattern location inside a window (Suto and Oniga 2017). An appropriate feature set can considerably increase the classifier's efficiency and makes classification models simpler. In previous articles, different authors extracted different features from the frequency and time domains because the literature does not suggest a generally accepted feature group (Lara and Labrador 2013). Suto et al. (2017) tried to collect all relevant features from the HAR literature. Following their work, we extracted 14 features from the WARD database, listed below. In the formulas, F denotes the number of frequency components, T refers to the window size, Q_j is the jth quartile of an ordered time series, i refers to the axis of a sensor, a_i(t) is the tth element of a time series, while A_i(f) is the fth frequency component.
Mean

M_i = \frac{1}{T}\sum_{t=1}^{T} a_i(t)    (1)

Mean absolute deviation

MAD_i = \frac{1}{T}\sum_{t=1}^{T} \left| a_i(t) - M_i \right|    (2)

Root mean square

RMS_i = \sqrt{ \frac{1}{T}\sum_{t=1}^{T} a_i(t)^2 }    (3)

75th percentile

PE_i = Q_3    (4)

Interquartile range

IQR_i = Q_3 - Q_1    (5)

Variance

V_i = \frac{1}{T}\sum_{t=1}^{T} \left( a_i(t) - M_i \right)^2    (6)

Kurtosis

KS_i = \frac{ \frac{1}{T}\sum_{t=1}^{T} \left( a_i(t) - M_i \right)^4 }{ \left( \frac{1}{T}\sum_{t=1}^{T} \left( a_i(t) - M_i \right)^2 \right)^2 }    (7)

Signal magnitude area

SMA = \frac{1}{T}\sum_{t=1}^{T} \left( \left| a_x(t) \right| + \left| a_y(t) \right| + \left| a_z(t) \right| \right)    (8)

Difference between min and max values

D_i = \max_t a_i(t) - \min_t a_i(t)    (9)

Correlation between axes

CORR_{i,j} = \frac{ \sum_{t=1}^{T} \left( a_i(t) - M_i \right)\left( a_j(t) - M_j \right) }{ \sqrt{ \sum_{t=1}^{T} \left( a_i(t) - M_i \right)^2 \sum_{t=1}^{T} \left( a_j(t) - M_j \right)^2 } }    (10)

Spectral energy

SE_i = \sum_{f=1}^{F} \left| A_i(f) \right|^2    (11)

Normalized power spectrum

p_f = \frac{ \left| A_i(f) \right|^2 }{ \sum_{j=1}^{F} \left| A_i(j) \right|^2 }    (12)

Spectral entropy

E_i = - \sum_{f=1}^{F} p_f \log_2 p_f    (13)

Spectral centroid

SC_i = \frac{ \sum_{f=1}^{F} f \left| A_i(f) \right| }{ \sum_{j=1}^{F} \left| A_i(j) \right| }    (14)

Principal frequency

PF_i = \arg\max_{f > 0} \left| A_i(f) \right|    (15)

1 https://people.eecs.berkeley.edu/~yang/software/WAR/
2 http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
The extracted features were normalized with (16), where X_norm and X_raw are the normalized and raw data matrices while μ and σ are the features' mean and standard deviation, respectively. Normalization makes the features equally important. After these steps, the windows which cover the same time interval on the sensor axes (5 sensor nodes, each with one 3-axis accelerometer and one 2-axis gyroscope) were represented by 330 normalized features (some features merge information from several axes). The UCI_DB consists of similar features in normalized form, so its data does not require any pre-processing.
X_{norm}(i,j) = \frac{ X_{raw}(i,j) - \mu_j }{ \sigma_j }    (16)
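To make the feature list concrete, the following sketch computes features (1)-(15) and the normalization (16) for one window with NumPy. It is an illustrative reimplementation, not the authors' Java library; the function names (`extract_features`, `sma`, `corr`, `znorm`) and the use of the one-sided FFT magnitude for A_i(f) are our assumptions.

```python
import numpy as np

def extract_features(w):
    """Per-axis features (1)-(15) for one window w (1-D array of length T)."""
    M = w.mean()                                   # (1) mean
    mad = np.abs(w - M).mean()                     # (2) mean absolute deviation
    rms = np.sqrt((w ** 2).mean())                 # (3) root mean square
    q1, q3 = np.percentile(w, [25, 75])
    pe, iqr = q3, q3 - q1                          # (4)-(5) 75th percentile, IQR
    v = ((w - M) ** 2).mean()                      # (6) variance
    ks = ((w - M) ** 4).mean() / v ** 2            # (7) kurtosis
    d = w.max() - w.min()                          # (9) min-max difference
    A = np.abs(np.fft.rfft(w))[1:]                 # magnitude spectrum, f >= 1
    se = (A ** 2).sum()                            # (11) spectral energy
    p = A ** 2 / (A ** 2).sum()                    # (12) normalized power
    ent = -(p * np.log2(p + 1e-12)).sum()          # (13) spectral entropy
    f = np.arange(1, len(A) + 1)
    sc = (f * A).sum() / A.sum()                   # (14) spectral centroid
    pf = float(f[np.argmax(A)])                    # (15) principal frequency
    return np.array([M, mad, rms, pe, iqr, v, ks, d, se, ent, sc, pf])

def sma(x, y, z):                                  # (8) signal magnitude area
    return (np.abs(x) + np.abs(y) + np.abs(z)).mean()

def corr(a, b):                                    # (10) correlation between axes
    return np.corrcoef(a, b)[0, 1]

def znorm(X):                                      # (16) per-feature standardization
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

In a full pipeline, `extract_features` would be applied to every axis of every window, and the cross-axis features (8) and (10) appended before normalizing the resulting feature matrix with `znorm`.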
2.2 Shallow neural network design
ANN theory offers much advice on network design and parameter setup, but in many cases the ANN construction is application dependent. Despite the shallow architecture of an ANN, the number of adjustable parameters is high. Therefore, this study compared three shallow ANN strategies with different design approaches, following the works of Hagan et al. (2014) and Nielsen (2015). For illustration, Fig. 1 shows the general structure of a shallow, two-layer artificial neural network. All three networks have the following common properties:
- learning algorithm: gradient descent with momentum
- stop condition: no improvement in 10 epochs
- batch size: 10
- momentum: 0.15
- epoch limit: 1000
- two layers (one hidden and one output)
- initial biases: 0
- initial weights: drawn according to (17), where W^l is the weight matrix and κ is the number of inputs of a neuron on the lth layer
- learning decay: exponential as in (18), where α_0 is the initial learning rate, φ is the decay factor, and ε is the epoch counter
According to the current state of the art, the quadratic cost (C1), cross-entropy (C2), and log-likelihood (C3) cost (error) functions have been examined. In each case the cost was regularized with an L2 term according to (19), (20), and (21), where N and M are the number of samples and output neurons, respectively, y_j is the real output, a_j is the output neuron's activation, λ indicates the regularization strength, and ω refers to the weights. The three costs have been used with the activation functions (22)-(25) on the layers in different compositions. In the first network (ANN1) the cost was quadratic with tangent (σ1) and linear (σ2) activation functions on the first and second layers, respectively. In the second network (ANN2) the cost function was cross-entropy in combination with tangent and sigmoid (σ3) activation functions. Finally, the third network (ANN3) used the log-likelihood error function with tangent and soft-max (σ4) activations.

W^l \sim N\!\left( 0, \frac{1}{\sqrt{\kappa}} \right)    (17)

\alpha = \alpha_0 e^{-\varphi \varepsilon}    (18)

C_1 = \frac{1}{2N} \sum_{n} \sum_{j}^{M} \left( y_j - a_j^L \right)^2 + \frac{\lambda}{2N} \sum_{\omega} \omega^2    (19)

C_2 = -\frac{1}{N} \sum_{n} \sum_{j}^{M} \left[ y_j \ln a_j^L + (1 - y_j) \ln\!\left( 1 - a_j^L \right) \right] + \frac{\lambda}{2N} \sum_{\omega} \omega^2    (20)

C_3 = -\frac{1}{N} \sum_{n} \ln a_y^L + \frac{\lambda}{2N} \sum_{\omega} \omega^2    (21)

\sigma_1(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}    (22)

\sigma_2(z) = z    (23)

\sigma_3(z) = \frac{1}{1 + e^{-z}}    (24)

\sigma_4(z_i) = \frac{e^{z_i}}{\sum_{j}^{M} e^{z_j}}    (25)
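The ANN1 design above can be sketched compactly in NumPy: initialization (17), exponential learning-rate decay (18), the quadratic cost (19) with its L2 term, tangent (22) hidden and linear (23) output layers, and gradient descent with momentum. This is an illustrative sketch, not the authors' Java framework; the class name is ours and, for brevity, it updates per sample rather than with the mini-batches of 10 used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # (17): weights drawn from N(0, 1/sqrt(kappa)); biases start at 0
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_out, n_in)), np.zeros(n_out)

class ANN1:
    """Two-layer net: tanh hidden layer, linear output, quadratic cost (19)."""
    def __init__(self, n_in, n_hid, n_out, alpha0, phi, lam):
        self.W1, self.b1 = init_layer(n_in, n_hid)
        self.W2, self.b2 = init_layer(n_hid, n_out)
        self.v = [np.zeros_like(p) for p in (self.W1, self.b1, self.W2, self.b2)]
        self.alpha0, self.phi, self.lam = alpha0, phi, lam

    def forward(self, x):
        self.h = np.tanh(self.W1 @ x + self.b1)          # sigma_1 (22)
        return self.W2 @ self.h + self.b2                # sigma_2 (23), linear

    def train_step(self, x, y, epoch, N, mu=0.15):
        a = self.forward(x)
        alpha = self.alpha0 * np.exp(-self.phi * epoch)  # decay (18)
        d2 = a - y                                       # quadratic-cost gradient
        d1 = (self.W2.T @ d2) * (1.0 - self.h ** 2)      # backprop through tanh
        grads = [np.outer(d1, x) + self.lam / N * self.W1, d1,
                 np.outer(d2, self.h) + self.lam / N * self.W2, d2]
        for i, (p, g) in enumerate(zip((self.W1, self.b1, self.W2, self.b2), grads)):
            self.v[i] = mu * self.v[i] - alpha * g       # momentum update
            p += self.v[i]
```

ANN2 and ANN3 would differ only in the output activation and the cost gradient; with cross-entropy plus sigmoid or log-likelihood plus soft-max, the output-layer error term conveniently keeps the same `a - y` form.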
Finding the right hyper-parameters of an ANN for a specific purpose requires several tests. In this paper we used random hyper-parameter search because the work of Bergstra and Bengio (2012) showed that random search is preferable to grid search. The most important parameters are the learning rate decay factor (φ), the initial learning rate (α0), the regularization strength (λ), and the number of hidden neurons (h). In this study, the search for the first three parameters took place on a base-10 exponential scale where the exponent derives from a uniform distribution as in (26). In the early tests the number of hidden neurons was drawn from a simple uniform distribution U(30, 120). However, our experience showed that more than 40 neurons on the hidden layer do not increase the recognition rate significantly. Therefore, during the tests we used 40 hidden neurons permanently.

\alpha_0, \varphi, \lambda \sim 10^{U(-4,-1)}    (26)

2.3 Convolutional neural network design
Convolutional neural networks (CNNs) are an outstanding part of the deep learning field. A good introduction to CNNs and deep learning can be found in the paper of Arel et al. (2010). A CNN is a complex multi-layer neural network where the base network has been extended with new types of layers. Since it was designed particularly for processing two-dimensional data such as images, it takes the spatial relationship between data points into account. One of the main advantages of CNNs is robustness: CNNs resist a wide range of data alterations such as noise, scale, rotation, varying lighting conditions, etc. Already in the late 1990s, LeCun et al. (1998) successfully applied CNNs to handwritten digit classification on the MNIST database. Later, several authors and companies used CNNs mainly for object recognition and categorization (e.g. the ImageNet challenge) (Krizhevsky et al. 2012; Zeiler and Fergus 2014; Simonyan and Zisserman 2015; Szegedy et al. 2015). In those articles the authors developed rather deep CNNs (namely GoogLeNet, AlexNet, ZF Net, VGGNet) with several convolutional, pooling, and fully connected layers. Essentially, a CNN merges a multi-layer neural network with digital filtering, where the input data is convolved with a set of small filters. From another viewpoint, the filtering process plays a hierarchical feature extractor role where the layer volumes are feature maps. In this case the components of the filters can be seen as weights whose magnitudes are adjusted in the training process. Typically a subsampling (pooling) layer follows a convolutional layer, which reduces the dimensionality. This layer pair can be repeated an arbitrary number of times, and both layers can consist of several volumes with the same size but different filters. The number of volumes determines the depth (third dimension) of a layer. On a volume, each neuron shares the same filter and bias.
Fig. 1 General structure of a shallow, two-layer artificial neural network
More formally, for the (i, j)th hidden neuron on the lth layer's vth volume, the output is given by (27), where the filter size is N×M, V indicates the number of volumes on the previous layer, b is the bias, and ω refers to the weights. Typically, the activation function on the convolutional layer is the rectified linear unit (ReLu) (28); see for example Krizhevsky et al. (2012) and Szegedy (2016). Furthermore, on the pooling layer more subsampling possibilities exist. In the literature, 2x2 max-pooling is the most popular, which selects the maximum activation inside a 2x2 region on the convolutional layer (Zeiler and Fergus 2014). Finally, at the end of the network the volumes of the last pooling or convolutional layer give the input of a multi-layer feed-forward network that produces the final output. In this study we used the ReLu activation function (σ5) on the convolutional layers and the soft-max activation function (σ4) with cross-entropy loss (C2) on the final layer, as in the papers of Krizhevsky et al. (2012) and Zeiler and Fergus (2014). The concrete structure of the used CNN models will be described in section 3.5.

a_{i,j}^{l,v} = \sum_{p=1}^{V} \sum_{n=1}^{N} \sum_{m=1}^{M} \omega_{p}^{l,v}(n,m)\, a_{i+n,\,j+m}^{l-1,\,p} + b^{l,v}    (27)

\sigma_5(z) = \max(0, z)    (28)
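The convolution (27), the ReLu activation (28), and 2x2 max-pooling can be sketched as follows. This is a minimal illustration, assuming a "valid" convolution with stride 1 and one scalar bias per volume; the function names are ours.

```python
import numpy as np

def conv2d(a_prev, w, b):
    """(27) for one output volume: a_prev has shape (V, H, W), the filter
    stack w has shape (V, N, M), b is a scalar bias; stride 1, no padding."""
    V, H, W = a_prev.shape
    _, N, M = w.shape
    out = np.zeros((H - N + 1, W - M + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sum over all previous volumes and the NxM filter window
            out[i, j] = np.sum(a_prev[:, i:i+N, j:j+M] * w) + b
    return np.maximum(0.0, out)                    # ReLu activation (28)

def max_pool(a, size=2):
    """2x2 max-pooling with stride 2: keep the largest activation per region."""
    H, W = a.shape
    H2, W2 = H // size, W // size
    return a[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))
```

A full CNN layer would call `conv2d` once per output volume (each with its own filter stack and bias) and then pool every resulting feature map.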
3. Results

3.1 Investigation of shallow neural networks
In the first test, the extracted features were the inputs of all previously defined ANNs, and their pseudo-randomly generated hyper-parameters were similar. In all three cases, the random generator seeds at network initialization and at hyper-parameter search were the same constants. Thereby we tried to determine the performance of the three ANNs under similar conditions. Fig. 2 and Fig. 3 show the measured recognition rates on both databases, where the black, blue, and green curves refer to ANN1, ANN2, and ANN3, respectively. The trials on the x-axis of the figures refer to the outcomes of training processes with different hyper-parameters. As the figures show, the recognition rate deviation between the networks is low. Although the convergence of ANN1 gets stuck sometimes, this structure produced the highest recognition rates on both databases. Moreover, the time requirements of the ANNs also do not show significant differences, as Fig. 4 and Fig. 5 illustrate (the colour meaning is the same as on Fig. 2). In this work, each trial has been performed on a laptop which contains an i5 2.3 GHz processor and 8 GB DDR4 memory. According to Fig. 2 - Fig. 5, each ANN structure can be an appropriate choice because they have similar performance. Therefore, only ANN1 and its extended or slightly modified architectures have been used in the rest of the paper.
Fig. 2 Measured recognition rates on UCI_DB with ANN1, ANN2 and ANN3
Fig. 3 Measured recognition rates on WARD with ANN1, ANN2 and ANN3
Fig. 4 Time requirement of ANNs on UCI_DB with longest runtimes. The time unit is second
Fig. 5 Time requirement of ANNs on WARD with longest runtimes. The time unit is second
3.2 Extended shallow architecture
In practice, the usage of more than two layers is uncommon. The reason is the learning slow-down, or in other words, the vanishing gradient problem (Glorot and Bengio 2010). It means that the earlier layers learn much more slowly than the later layers. Obviously, the time cost of such an extended ANN is slightly higher than in the case of simple ANNs. However, Ciresan et al. (2010) demonstrated that the power of a general neural network can be increased with more layers, while Oniga and Suto (2016) measured recognition rate growth from the involvement of an additional hidden layer in HAR. Therefore we also examined the effect of network extension on the ANN1 model. An additional hidden layer was added to the network with tangent (σ1) activation function and 20 neurons. Moreover, in this investigation we increased the epoch limit from 1000 to 1300 due to the learning slow-down problem. The recognition rates of the extended architecture with random hyper-parameters can be seen on Fig. 6. On the figure, the black and blue curves indicate the recognition rates on WARD and UCI_DB, respectively.
Fig. 6 Recognition rates of the extended ANN1 and its best results on both databases
3.3 Neural network ensembles
Artificial neural network ensembles combine a finite number of ANNs where the networks are trained for a common classification task. In classification problems, N networks are trained with the same training data set in order to approximate a common target function whose outputs are a set of class labels. The idea behind ensembles supposes that each network has a generalization error on different parts of the input (feature) space. Thus the probability of false decisions in the joint output may become smaller than in the output of an individual network. In other words, an ensemble combines several imperfect hypotheses in order to generate a better one. Ensembles can improve generalization with this strategy in many problems. The approach has already been successfully applied in numerous research fields such as image classification, character recognition, financial prediction, medical image processing, and decision making (Hansen et al. 1992; Giacinto and Roli 2001; Tsai and Wu 2008; Das et al. 2009). This motivated us to test some ANN ensembles in HAR. In general, an ANN ensemble is constructed in two steps: designing and training the individual networks, and combining their predictions according to a certain rule. For the design and training of the networks, several possible solutions exist (Zhao et al. 2005). We used the Bagging approach, which is one of the simplest techniques. It creates N training sets from the original data set by sampling with replacement and trains a classifier on each of them. In the validation and test phases, the ANNs give predictions and their combination is the final output of the ensemble. In this work, the prediction combination rule was majority voting, where the prediction which receives the most votes wins. Unfortunately, a problem occurs when several class labels receive the same number of votes. In this case, the choice depends on the activation strength of the networks.
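The two ensemble steps above (bagging, then majority voting with activation-strength tie-breaking) can be sketched independently of the member networks. This is an illustrative sketch; the function names and tie-breaking details are our interpretation of the text.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def bagging_sets(X, y, n_models):
    """Bagging: yield n_models training sets drawn with replacement from (X, y)."""
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        yield X[idx], y[idx]

def majority_vote(predictions, activations):
    """predictions: one class label per member network; activations: each
    member's output strength, used only to break ties between classes."""
    counts = Counter(predictions)
    top = counts.most_common(1)[0][1]
    tied = [c for c, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    # tie: pick the tied class whose voting network was most confident
    return max(tied, key=lambda c: max(a for p, a in zip(predictions, activations)
                                       if p == c))
```

Each set produced by `bagging_sets` would train one ANN1 member; at test time the members' class predictions and activations feed `majority_vote`.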
Although ensembles produce better accuracy than their members in many cases, constructing them is not an easy job. Several authors used complex evolutionary algorithms for ensemble construction (Zhou et al. 2002; Yao and Islam 2008). Obviously, those techniques further increase the
time requirement and complexity of the classification model. Therefore, in this investigation we merged ANNs without any additional ensemble designer algorithm. Four ensembles (E3-E6) have been tested with 3, 4, 5, and 6 ANNs on both databases. Each network in an ensemble had the ANN1 architecture with different random weights and the best hyper-parameters from the above random searches: α0 = 0.000612, φ = 0.00002, λ = 0.0024 on WARD, and α0 = 0.000465, φ = 0.0000618, λ = 0.0371 on UCI_DB. The final results and the individual rates of the networks in the ensembles can be seen in Table 1. TABLE 1 RECOGNITION RATES OF 4 ENSEMBLES (E3-E6) WITH 3-6 ANNS
        WARD                          UCI_DB
ANN     E3     E4     E5     E6       E3     E4     E5     E6
1       0.991  0.993  0.986  0.992    0.961  0.969  0.963  0.967
2       0.990  0.986  0.989  0.993    0.967  0.966  0.967  0.959
3       0.985  0.992  0.992  0.992    0.966  0.966  0.955  0.960
4       -      0.992  0.989  0.990    -      0.965  0.963  0.962
5       -      -      0.987  0.986    -      -      0.964  0.962
6       -      -      -      0.989    -      -      -      0.963
Ensem.  0.993  0.996  0.991  0.995    0.968  0.969  0.966  0.965

3.4 Decomposition into binary classification
Classification problems involving multiple classes, such as HAR, can be decomposed into several binary classification tasks that can be solved with binary classifiers. In the literature the most widely used binary classifier is the support vector machine, but an ANN is also applicable for this task (Aly 2005; Ou and Murphey 2007). As the name indicates, this approach divides the original training data set into two-class subsets; in other words, it follows the "divide and conquer" strategy. Theoretically, the decision boundary of a binary classifier is simpler than that of a multi-class classifier, and it is easier to create a classifier which distinguishes only two classes. In the literature, different data decomposition techniques can be found. The most common methods are "one-versus-one" (OVO) and "one-versus-all" (OVA). More information about OVO and OVA decompositions can be found in the work of Galar et al. (2011). This investigation focuses only on OVA because Rifkin and Klautau (2004) showed that this simple technique can produce performance as good as other, more complicated methods if the classifiers are tuned well. It is the simplest decomposition approach: it transforms a complex problem with K classes into K binary problems, where each classifier separates a given class from the other K - 1. This approach requires the number of binary classifiers to be equal to the number of classes, where the kth classifier is trained with a positive label if the sample belongs to class k and a negative label if the sample comes from the other K - 1 classes. It can be seen as a special version of an ensemble where each classifier learns only one class and rejects the others. So each binary classifier tries to separate its class from the others and gives a positive output for patterns from that class and a negative output for all other examples. Finally, the outputs are combined to obtain the final decision.
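The OVA relabeling and the MAX-style aggregation can be sketched as follows. This is an illustrative sketch; the function names are ours, and each classifier is represented abstractly as a callable returning a confidence score.

```python
import numpy as np

def ova_labels(y, k):
    """OVA relabeling: +1 for samples of class k, -1 for the other K-1 classes."""
    return np.where(y == k, 1.0, -1.0)

def ova_predict(classifiers, x):
    """MAX strategy: collect one confidence score per binary classifier
    and return the class whose classifier is most strongly positive."""
    scores = np.array([clf(x) for clf in classifiers])   # the score vector
    return int(np.argmax(scores))
```

With K classes, `ova_labels(y, k)` produces the training targets of the kth binary ANN, and `ova_predict` resolves the case where zero or several classifiers answer positively by taking the largest activation.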
In the validation and test phases, an unknown sample is presented to all binary classifiers, and the classifier which gives a positive output indicates the final output class. Sometimes the positive output is not unique. Fortunately, several solutions exist for this problem, such as the maximum confidence strategy (MAX) and dynamically ordered OVA (Lorena et al. 2008). In this study we applied the MAX technique, which uses the confidence of the classifiers for the final class prediction. It defines a score vector that consists of the outputs of the OVA classifiers, where the ith element is the confidence of the ith class. This score vector is utilized in the aggregation stage, where the classifier with the largest positive activation determines the final output. According to the number of classes in the two databases, we defined 6 binary ANN1 classifiers for the UCI_DB and 13 classifiers for WARD, with two output neurons and the same hyper-parameters as in the previous subsection. The results after 4 attempts (training and test phases) on both databases can be seen in Table 2. TABLE 2 RECOGNITION RATES WITH BINARY CLASSIFIERS
                   WARD                          UCI_DB
Trials             1      2      3      4        1      2      3      4
Recognition rate   0.989  0.986  0.988  0.987    0.953  0.946  0.949  0.951

3.5 Investigation of convolutional neural networks
In the literature, some articles already use CNNs for sensor data classification in the HAR problem. For example, Sheng et al. (2016) applied two convolutional and pooling layers with depths of 128 and 256 and two fully connected layers with 512 and 13 neurons. In the work of Yang et al. (2015), the CNN consists of two convolutional and pooling layers with 50 and 40 feature maps, a custom unification layer, and two fully connected layers with 400 and 18 neurons. Jiang and Yin (2015) tried different constructions, and their best architecture similarly has two convolutional and pooling layers with 5 and 10 feature maps and two fully connected layers. Hammerla et al. (2016) tested different architectures with 1-3 convolutional and pooling layers with 3-9 feature maps, but they did not describe the best architecture. The most detailed description of the hyper-parameter settings of CNNs can be found in the article of Ronao and Cho (2016). Unlike the previous articles, in that study the CNN input was a one-dimensional sensor signal, so the convolutional layers performed one-dimensional convolution. The authors found that after three convolutional layers the performance decreases. Moreover, they claimed that beyond 130 feature maps the performance does not increase. According to the works described above and our own experience, we investigated two CNNs (CNN1 and CNN2) with different layer depths and numbers of neurons on the first fully connected layer. The structures of CNN1 and CNN2 can be seen on Fig. 7 and Fig. 8, respectively. The input of the networks was the normalized raw sensor data in two-dimensional form, where the rows are windows which cover the same time period of the data acquisition on the sensors' axes. In both cases, the filter sizes on the convolutional and pooling layers were 2x2, with a stride of one sample on the convolutional layers and two samples on the pooling layers.
Fig. 7 Our first CNN architecture with one convolutional, one pooling, and two fully connected layers. The figure shows the different depth sizes and neurons on the first fully connected layer that have been used in the investigation
Fig. 8 Our second CNN architecture with two convolutional, pooling, and fully connected layers. The depth sizes and number of neurons on the first fully connected layer are illustrated as on Fig. 7
Beyond the architecture construction, CNNs require the same hyper-parameters as ANNs, such as learning rate, learning decay, etc. Finding the optimal hyper-parameters of a CNN is still an open question, and it is a long process with the previously applied random hyper-parameter search. So at first we applied the best learning rate and decay factor from the above random search in both CNNs, with a higher constant regularization strength (λ = 0.9). In CNNs, regularization is more important than in ANNs because, with the growth of the network, overfitting becomes a more significant problem (Krizhevsky et al. 2012). Tables 3 and 4 contain the average recognition rates and time requirements (in seconds) of three trials with the two CNNs on both databases. In Table 3 the architecture column indicates the depth of the convolutional layer and the number of neurons on the first fully connected layer, while in Table 4 it refers to the depths of the two convolutional layers and the number of neurons on the first fully connected layer. TABLE 3 RECOGNITION RATES WITH CNN1
                 WARD                   UCI_DB
Architecture     Rec. rate  Time (s)    Rec. rate  Time (s)
10-40            0.964      1213        0.846      1615
20-40            0.970      2210        0.853      4071
30-100           0.974      6464        0.864      6972
50-200           0.975      14224       0.883      19985
50-512           0.977      45327       0.898      70671
100-1000         0.977      163026      0.899      286509

TABLE 4 RECOGNITION RATES WITH CNN2

                 WARD                   UCI_DB
Architecture     Rec. rate  Time (s)    Rec. rate  Time (s)
20-30-100        0.956      13010       0.888      16142
20-50-100        0.956      17358       0.895      19394
30-50-200        0.957      24697       0.906      36747
30-50-512        0.972      43183       0.924      62380
50-100-512       0.976      59268       0.928      90921
80-120-1000      0.975      258068      0.942      350677

4. Discussion
At the beginning of the investigation, the random hyper-parameter search clearly illustrated the importance of the hyper-parameter setup of an ANN. With different hyper-parameter combinations, an ANN reaches different recognition rates. For example, the recognition rate difference of ANN1 between the best and worst hyper-parameter combinations was approximately 60% (99.2% - 39.3%) on WARD and 80% (96.7% - 16.8%) on UCI_DB. Unlike the hyper-parameters, the network architecture did not affect the ANN performance considerably, because ANN1, ANN2, and ANN3 produced similar accuracy within a similar time interval. The hyper-parameters were also important in the extended ANN1 architecture. This examination brought 99.5% and 97.3% recognition rates on WARD and UCI_DB, respectively. The additional hidden layer improved the efficiency, but the training time also increased slightly; the average training time growth was approximately 200 seconds. In this case, the accuracy difference between the best and worst trials was approximately 63% (99.5% - 36.8%) on WARD and 44% (97.3% - 53.3%) on UCI_DB. The results of the ensembles lead to some important conclusions. The ensemble outperformed its individual networks in some cases, but in other cases an individual network reached a better result than the ensemble (e.g. E5). Moreover, the results show that the involvement of more networks does not ensure performance improvement. This was similarly noticed previously by Zhou et al. (2002). The key to efficient ensemble construction is individual members that produce uncorrelated outputs (Yao and Islam 2008). If this condition is not met, a batch of ANNs does not guarantee performance improvement over a simple network. The recognition rates of the binary classifiers in Table 2 are slightly lower than those of the earlier ANN approaches. Theoretically, the decision boundaries should be simpler in binary classifiers, thus we thought that other random
12
hyper-parameters may improve the accuracy. Therefore a random parameter search also has been performed with binary classifiers on the UCI_DB. The result can be seen on Fig. 9. As the figure shows, after parameter search the yield of binary classifier group was 97.2% accuracy which is 0.5% higher than in the case of ANNs. However we have to consider the high time requirement of the training process. For instance the training time of the best ANN on UCI_DB was approximately 1053 seconds while the binary classifier’s training required more than 4556 seconds.
Fig. 9 Recognition rates of the binary classifier group with random hyper-parameters on UCI_DB
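As a concrete illustration of this kind of random hyper-parameter search over a binary classifier group, the following sketch wraps a small multilayer perceptron in a one-vs-rest scheme using scikit-learn. The toy data, parameter ranges, and network sizes are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch: random hyper-parameter search over a binary
# (one-vs-rest) classifier group, in the spirit of the search described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier

# Toy features standing in for the extracted UCI_DB features (6 activities).
X, y = make_classification(n_samples=300, n_features=20, n_classes=6,
                           n_informative=12, random_state=0)

group = OneVsRestClassifier(MLPClassifier(max_iter=200, random_state=0))
search = RandomizedSearchCV(
    group,
    {   # assumed search ranges; one binary network is trained per activity
        "estimator__hidden_layer_sizes": [(50,), (100,), (200,)],
        "estimator__activation": ["relu", "logistic", "tanh"],
        "estimator__learning_rate_init": [0.1, 0.01, 0.001],
    },
    n_iter=3, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Each sampled combination trains six binary networks (one per class), which is one reason such a search costs several times more than tuning a single multi-class ANN.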
Finally, the CNN1 and CNN2 networks defined above have been tested in various structures on both databases. Surprisingly, CNN1 with the 50-512 architecture reached higher accuracy than CNN2 on WARD. On UCI_DB, the performance of both CNNs increased steadily with deeper structures, thus CNN2 was the winner there. Our highest recognition rates with CNNs were 97.7% on WARD and 94.2% on UCI_DB. Both results are lower than in the case of the shallow ANNs. In addition, the training time of a CNN is enormous in comparison to the ANNs. For instance, the training time of the most efficient ANN1 on UCI_DB was 1053 seconds, while it was about 350677 seconds in the case of CNN2 (80-120-1000). For the completeness of this study, we also performed a random hyper-parameter search with CNN1 (50-512) on WARD. The outcome can be seen in Fig. 10. Although this process required approximately 290 hours, it brought only a 0.3% improvement (highest recognition rate: 98%).
Fig. 10 Recognition rates of CNN1 (50-512) with random hyper-parameters on WARD
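For intuition about what such a network computes, here is a pure-NumPy sketch of the forward pass of a small CNN in the spirit of the CNN1 (50-512) structure: 50 convolutional filters followed by a 512-unit dense layer. The window shape, kernel size, pooling, and class count are illustrative assumptions.

```python
import numpy as np

def conv1d(x, kernels):
    # x: (channels, time); kernels: (n_filters, channels, k)
    n_f, _, k = kernels.shape
    t_out = x.shape[1] - k + 1
    out = np.zeros((n_f, t_out))
    for f in range(n_f):
        for t in range(t_out):
            out[f, t] = np.sum(x[:, t:t + k] * kernels[f])
    return np.maximum(out, 0)            # ReLU activation

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 128))        # one normalized raw sensor window
k1 = rng.standard_normal((50, 3, 5)) * 0.1   # 50 convolutional filters
h = conv1d(x, k1)                        # (50, 124) feature maps
h = h.max(axis=1)                        # global max pooling -> (50,)
W1 = rng.standard_normal((512, 50)) * 0.1    # 512-unit dense layer
W2 = rng.standard_normal((6, 512)) * 0.1     # 6 activity classes
scores = W2 @ np.maximum(W1 @ h, 0)      # raw class scores (logits)
print(scores.shape)
```

Training all of these filter and weight matrices end-to-end on raw windows is what makes CNN training so much more expensive than a shallow ANN fed with pre-extracted features.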
As mentioned earlier, several researchers have already used the WARD and UCI_DB datasets in their work. We collected the most relevant results on both databases, independently of the machine learning technique used; they can be seen in Table 5. The comparison between our measurements (with shallow, complex, and deep ANN models) and the results in Table 5 indicates that the usage of complex or deep neural network approaches is unnecessary when the available dataset is relatively small, as in the two databases that served as our data sources in this study. The shallow neural network with two layers outperformed the CNNs and ensembles, while its extended version was better than the binary classifier group. In addition, the training time of the shallow techniques was significantly shorter than that of the CNNs, ensembles, and binary classifiers. These results show that a shallow ANN with extracted features and appropriate hyper-parameters can be the right choice for HAR when the available dataset is relatively small (a few thousand instances).
TABLE 5 PREVIOUS RESULTS ON THE WARD AND UCI_DB DATABASES
Databases   References                    Machine learning method         Recognition rate
WARD        Yang et al. (2009)            distributed sparsity            93.50%
WARD        Sheng et al. (2016)           convolutional network           95.92%
WARD        Pinardi and Bisiani (2010)    lexical approach                97.80%
WARD        Oniga and Suto (2015)         artificial neural network       98.10%
WARD        Su et al. (2016)              support vector machine          98.50%
HAPT        Reiss et al. (2013)           conf-AdaBoost                   94.33%
HAPT        Jiang and Yin (2015)          convolutional network           95.18%
HAPT        Ronao and Cho (2016)          convolutional network           95.75%
HAPT        Anguita et al. (2013)         support vector machine          96.00%
HAPT        Kastner et al. (2013)         learning vector quantization    96.23%
HAPT        Paredes et al. (2013)         OVO support vector machine      96.40%
HAPT        Ignatov (2017)                convolutional network           97.63%

5. Conclusion
This study tried to find the best neural network approach for HAR. At the beginning of the investigation we described the most common hyper-parameters, their intervals, and their most common combinations. Based on these, three shallow ANNs have been defined and tested with extracted features and random hyper-parameters. The results demonstrated that the hyper-parameters are more important than the network architecture. We measured a 99.2% recognition rate on WARD and 96.7% accuracy on UCI_DB with the most efficient hyper-parameter combinations. Thereafter, an extended ANN architecture with two hidden layers has been tested, where the hyper-parameters came from the same random search as before. The involvement of the second hidden layer increased the recognition rate, but the training time was slightly longer than before. Later, we examined several ANN ensembles and binary ANN classifier groups. In some cases the ensemble slightly outperformed its members, but in other cases its members produced higher accuracy. Thus an ensemble does not guarantee performance improvement, and it requires much more time than a simple ANN. The group of binary classifiers also could not outperform the shallow ANNs. In the last test, two CNNs with different depths have been defined and examined. Thanks to their convolutional layer(s), CNNs do not require feature extraction like ANNs; therefore their input was the normalized, raw sensor data in two-dimensional form. On WARD the simpler CNN1 network reached the best result, 97.7%, while on UCI_DB the deepest CNN2 structure was the most efficient, with 94.2% accuracy. However, both values are smaller than the highest recognition rates with ANNs. If we compare the complexity of AlexNet or GoogLeNet from the object recognition problem with the CNNs used in previous HAR articles, we can see that the HAR problem does not require CNNs as deep as object recognition does.
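The ensemble behaviour recapped above can be illustrated with a short majority-voting sketch; the member count, architecture, and toy data are assumptions for demonstration. Because the members here differ only in their random initialization, their outputs tend to be correlated, which is exactly the situation in which voting buys little over the best single network.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Toy features standing in for the extracted HAR features (4 activities).
X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=10, random_state=2)

# Five shallow networks that differ only in random initialization.
members = [MLPClassifier(hidden_layer_sizes=(50,), max_iter=300,
                         random_state=s).fit(X, y) for s in range(5)]

votes = np.stack([m.predict(X) for m in members])   # (5, n_samples)
# Majority vote: the most frequent class label in each column wins.
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())
```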
Actually, the main difference between an ANN and a CNN is the feature extraction capability. A CNN replaces the static feature extraction step of the shallow techniques with convolutional layers, but this approach works well only if the available dataset is abundant enough. Moreover, the static features have been extracted from both the time and frequency domains, while the CNNs focus only on the time domain signal; this can be another possible reason for the efficiency of the time-frequency features against convolution. An extended dataset would improve the performance of a CNN, but this is also true for every machine learning method. However, data acquisition in HAR is a time-consuming and inconvenient process, especially for the elderly population. In addition, the data is generally individual, because the postures and movement patterns of a person can differ from those of others. Finally, we tried to collect all relevant works on both databases and compared them with our measurements. According to our measurements and the previous results, the final conclusion of this study is that the usage of shallow ANNs with two or three layers, extracted features, and appropriate hyper-parameters is a better choice in the HAR problem than more complex and/or deeper ANN approaches when the available dataset is small.
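As an illustration of the static time- and frequency-domain features contrasted with convolution above, the sketch below computes a few typical descriptors from one accelerometer window; the exact feature list is an assumption standing in for the paper's feature set.

```python
import numpy as np

def extract_features(window):
    """window: (time,) one axis of a normalized accelerometer signal."""
    spectrum = np.abs(np.fft.rfft(window))
    return np.array([
        window.mean(),                  # time domain: mean
        window.std(),                   # time domain: standard deviation
        np.abs(np.diff(window)).sum(),  # time domain: waveform length
        spectrum.argmax(),              # frequency domain: dominant bin
        (spectrum ** 2).sum(),          # frequency domain: spectral energy
    ])

rng = np.random.default_rng(0)
feats = extract_features(rng.standard_normal(128))
print(feats.shape)
```

Feature vectors like this, computed once per window, are what let a shallow ANN train in seconds instead of the hours a CNN needs on raw windows.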
ACKNOWLEDGEMENT This work was supported by the EFOP-3.6.3-VEKOP-16-2017-00002 project. The project was supported by the European Union, co-financed by the European Social Fund.
REFERENCES
[1] Aly M. (2005). Survey on multiclass classification methods. http://www.vision.caltech.edu/malaa/publications/aly05multiclass.pdf. Accessed 20 November 2017.
[2] Ayu A. M., Ismail A. S., Matin A. F. A., Mantoro T. (2012). A comparison study of classifier algorithms for mobile phone's accelerometer based activity recognition. Procedia Engineering, 41, 224-229.
[3] Anguita D., Ghio A., Oneto L., Parra X., Reyes-Ortiz J. L. (2013). A public domain dataset for human activity recognition using smartphones. In 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 437-442), Bruges.
[4] Arel I., Rose D. C., Karnowski T. P. (2010). Deep machine learning – a new frontier in artificial intelligence research. IEEE Comput Intell M, 5, 13-18.
[5] Bergstra J., Bengio Y. (2012). Random search for hyper-parameter optimization. J Mach Learn Res, 13, 281-305.
[6] Chernbumroong S., Cang S., Atkins A., Yu H. (2013). Elderly activities recognition and classification for applications in assisted living. Expert Syst Appl, 40, 1662-1674.
[7] Ciresan D. C., Meier U., Gambardella L. M., Schmidhuber J. (2010). Deep, big, simple neural nets for handwritten digit recognition. Neural Comput, 22, 3207-3220.
[8] Collobert R., Weston J., Bottou L., Karlen M., Kavukcuoglu K., Kuksa P. (2011). Natural language processing (almost) from scratch. J Mach Learn Res, 12, 2493-2537.
[9] Das R., Turkoglu I., Sengur A. (2009). Effective diagnosis of heart disease through neural networks ensembles. Expert Syst Appl, 36, 7675-7680.
[10] Galar M., Fernandez A., Barrenechea E., Bustince H., Herrera F. (2011). An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes. Pattern Recogn, 44, 1761-1776.
[11] Gao L., Bourke A. K., Nelson J. (2014). Evaluation of accelerometer based multi-sensor versus single-sensor activity recognition systems. Med Eng Phys, 36, 779-785.
[12] Giacinto G., Roli F. (2001). Design of effective neural network ensembles for image classification purposes. Image Vision Comput, 19, 699-707.
[13] Gjoreski H., Bizjak J., Gjoreski M., Gams M. (2016). Comparing deep and classical machine learning methods for human activity recognition using wrist accelerometer. In 25th International Joint Conference on Artificial Intelligence (pp. 1-7), New York.
[14] Glorot X., Bengio Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In 13th International Conference on Artificial Intelligence and Statistics (pp. 249-256), Sardinia.
[15] Godfrey A., Conway R., Meagher D., Olaighin G. (2008). Direct measurement of human movement by accelerometry. Med Eng Phys, 30, 1364-1386.
[16] Hagan M. T., Demuth H. B., Beale M. H., Jesus O. D. (2014). Neural network design (2nd ed.). eBook. http://hagan.okstate.edu/NNDesign.pdf. Accessed 20 November 2017.
[17] Hammerla N. Y., Halloran S., Plötz T. (2016). Deep, convolutional, and recurrent models for human activity recognition using wearables. In 25th International Joint Conference on Artificial Intelligence (pp. 1533-1540), New York.
[18] Hansen L. K., Liisberg C., Salamon P. (1992). Ensemble methods for handwritten digit recognition. In IEEE Workshop on Neural Networks for Signal Processing II (pp. 333-342), Helsingoer.
[19] Ignatov A. (2017). Real-time human activity recognition from accelerometer data using convolutional neural networks. Appl Soft Comput. https://doi.org/10.1016/j.asoc.2017.09.027.
[20] Jiang W., Yin Z. (2015). Human activity recognition using wearable sensors by deep convolutional neural networks. In 23rd ACM International Conference on Multimedia (pp. 1307-1310), Brisbane.
[21] Kastner M., Strickert M., Villmann T. (2013). A sparse kernelized matrix learning vector quantization model for human activity recognition. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 449-454), Bruges.
[22] Khan A. M., Lee Y. K., Lee S. Y., Kim T. S. (2010). A triaxial accelerometer-based physical-activity recognition via augmented-signal features and a hierarchical recognizer. IEEE T Inf Technol B, 14, 1166-1172.
[23] Kilinc O., Dalzell A., Uluturk I., Uysal I. (2015). Inertial based recognition of daily activities with ANNs and spectrotemporal features. In IEEE 14th International Conference on Machine Learning and Applications (pp. 733-738), Miami.
[24] Krizhevsky A., Sutskever I., Hinton G. E. (2012). ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (pp. 1-9), Nevada.
[25] Lara O. D., Labrador M. A. (2013). A survey on human activity recognition using wearable sensors. IEEE Commun Surv Tut, 15, 1192-1209.
[26] LeCun Y., Bottou L., Bengio Y., Haffner P. (1998). Gradient-based learning applied to document recognition. P IEEE, 86, 2278-2324.
[27] Lorena A. C., Carvalho A., Gama J. (2008). A review on the combination of binary classifiers in multiclass problems. Artif Intell Rev, 30, 19-37.
[28] Nielsen M. A. (2015). Neural networks and deep learning. Determination Press, eBook. http://neuralnetworksanddeeplearning.com/. Accessed 20 November 2017.
[29] Ou G., Murphey Y. M. (2007). Multi-class pattern classification using neural networks. Pattern Recogn, 40, 4-18.
[30] Oniga S., Suto J. (2014). Human activity recognition using neural networks. In 15th International Carpathian Control Conference (pp. 759-762), Velke Karlovice.
[31] Oniga S., Suto J. (2015). Optimal recognition method of human activities using artificial neural networks. Meas Sci Rev, 15, 323-327.
[32] Oniga S., Suto J. (2016). Activity recognition in adaptive assistive systems using artificial neural networks. Elektron Elektrotech, 22, 68-72.
[33] Paredes B. R., Aung H., Berthouze N. B. (2013). One-vs-one classifier ensemble with majority voting for activity recognition. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 443-448), Bruges.
[34] Physical Activity Basics. Center for Disease Control and Prevention. (2017). https://www.cdc.gov/physicalactivity/basics/pa-health/index.htm. Accessed 21 November 2017.
[35] Pinardi S., Bisiani R. (2010). Movement recognition with intelligent multisensory analysis, a lexical approach. In 6th International Conference on Intelligent Environments (pp. 170-177), Kuala Lumpur.
[36] Reiss A., Hendeby G., Stricker D. (2013). A competitive approach for human activity recognition on smartphones. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 455-460), Bruges.
[37] Rifkin R., Klautau A. (2004). In defense of one-vs-all classification. J Mach Learn Res, 5, 101-141.
[38] Ronao C. A., Cho S. B. (2016). Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst Appl, 59, 235-244.
[39] Sheng M., Jiang J., Su B., Tang Q., Yahya A. A., Wang G. (2016). Short-time activity recognition with wearable sensors using convolutional neural networks. In 15th ACM SIGGRAPH Conference on Virtual-Reality Continuum and Its Applications in Industry (pp. 413-416), Zhuhai.
[40] Simard P. Y., Steinkraus D., Platt J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. In 7th International Conference on Document Analysis and Recognition (pp. 958-962), Washington.
[41] Simonyan K., Zisserman A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (pp. 1-14), San Diego.
[42] Su B., Tang Q., Wang G., Sheng M. (2016). The recognition of human daily actions with wearable motion sensor systems. Transactions on Edutainment XII (pp. 68-77), Springer, Berlin Heidelberg.
[43] Suto J., Oniga S. (2017). Efficiency investigation of artificial neural networks in human activity recognition. J Ambient Intell Human Comput. https://doi.org/10.1007/s12652-017-0513-5.
[44] Suto J., Oniga S., Lung C., Orha I. (2017). Recognition rate difference between real-time and offline human activity recognition. In International Conference on Internet of Things for the Global Community (pp. 103-109), Funchal.
[45] Suto J., Oniga S., Pop-Sitar P. (2017). Feature analysis to human activity recognition. Int J Comp Commun, 12, 116-130.
[46] Yang J. Y., Wang J. S., Chen Y. P. (2008). Using acceleration measurements for activity recognition: an effective learning algorithm for constructing neural classifiers. Pattern Recogn Lett, 29, 2213-2220.
[47] Szegedy C., Ioffe S., Vanhoucke V., Alemi A. (2016). Inception-v4, Inception-ResNet and the impact of residual connections on learning. https://arxiv.org/pdf/1602.07261v2.pdf. Accessed 30 October 2017.
[48] Szegedy C., Liu W., Jia Y., Sermanet P., Reed S., Anguelov D., Erhan D., Vanhoucke V., Rabinovich A. (2015). Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9), Boston.
[49] Tsai C. F., Wu J. W. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Syst Appl, 34, 2639-2649.
[50] Yang A. Y., Jafari R., Sastry S. S., Bajcsy R. (2009). Distributed recognition of human actions using wearable motion sensor networks. J Amb Intel Smart En, 1, 103-115.
[51] Yang J. B., Nguyen M. N., San P. P., Li X. L., Krishnaswamy S. (2015). Deep convolutional neural networks on multichannel time series for human activity recognition. In 24th International Joint Conference on Artificial Intelligence (pp. 3995-4001), Buenos Aires.
[52] Yao X., Islam M. N. (2008). Evolving artificial neural network ensembles. IEEE Comput Intell M, 3, 31-42.
[53] Zebin T., Scully J. P., Ozanyan B. K. (2017). Inertial sensor based modelling of human activity classes: feature extraction and multi-sensor data fusion using machine learning algorithms. Lecture Notes of the Institute for Computer Science, Social Informatics and Telecommunication Engineering, 181, 306-314.
[54] Zeiler M. D., Fergus R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833), Zurich.
[55] Zeng M., Yu T., Wang X., Nguyen T. L., Mengshoel O. J., Lane I. (2017). Semi-supervised convolutional neural networks for human activity recognition. In 2017 IEEE International Conference on Big Data (pp. 522-529), Boston.
[56] Zhao Y., Gao J., Yang X. (2005). A survey of neural network ensembles. In International Conference on Neural Networks and Brain (pp. 438-442), Beijing.
[57] Zeng M., Nguyen L. T., Yu B., Mengshoel O. J., Zhu J., Wu P., Zhang J. (2014). Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services (pp. 197-205), Austin.
[58] Zhou Z. H., Jiang Y., Yang Y. B., Chen S. F. (2002). Lung cancer cell identification based on artificial neural network ensembles. Artif Intell Med, 24, 25-36.
[59] Zhou Z. H., Wu J., Tang W. (2002). Ensembling neural networks: many could be better than all. Artif Intell, 137, 239-263.