Subject independent human action recognition using spatio-depth information and meta-cognitive RBF network

R. Venkatesh Babu (a,*), R. Savitha (b), S. Suresh (b), Bhuvnesh Agarwal (a)

(a) Video Analytics Lab, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India
(b) School of Computer Engineering, Nanyang Technological University, Singapore

* Corresponding author. Tel.: +91 80 22932900. E-mail address: [email protected] (R. Venkatesh Babu).
Article history: Received 13 June 2012; received in revised form 22 May 2013; accepted 10 July 2013.

Keywords: Action recognition; 3-D optical flow; Kinect depth sensor; Projection based learning; Meta-cognition and self-regulated learning.

Abstract

In this paper, we present a machine learning approach for subject independent human action recognition using a depth camera, emphasizing the importance of depth in the recognition of actions. The proposed approach uses the flow information in all three dimensions to classify an action. In our approach, we obtain the 2-D optical flow and use it along with the depth image to obtain the depth flow (Z motion vectors). The obtained flow captures the dynamics of the actions in space–time. Feature vectors are obtained by averaging the 3-D motion over a grid laid over the silhouette in a hierarchical fashion. These hierarchical fine-to-coarse windows capture the motion dynamics of the object at various scales. The extracted features are used to train a Meta-cognitive Radial Basis Function Network (McRBFN) that uses a Projection Based Learning (PBL) algorithm, referred to as PBL-McRBFN henceforth. PBL-McRBFN begins with zero hidden neurons and builds the network based on the best human learning strategy, namely, self-regulated learning in a meta-cognitive environment. When a sample is used for learning, PBL-McRBFN uses the sample overlapping conditions and a projection based learning algorithm to estimate the parameters of the network. The performance of PBL-McRBFN is compared to that of Support Vector Machine (SVM) and Extreme Learning Machine (ELM) classifiers with representation of every person and action in the training and testing datasets. The performance study shows that PBL-McRBFN outperforms these classifiers in recognizing actions in 3-D. Further, a subject-independent study is conducted using a leave-one-subject-out strategy and its generalization performance is tested. It is observed from the subject-independent study that McRBFN is capable of generalizing actions accurately. The performance of the proposed approach is benchmarked with the Video Analytics Lab (VAL) dataset and the Berkeley Multimodal Human Action Database (MHAD).
1. Introduction

In recent years, recognition of human actions has been a major concern in computer vision due to its immense applications in autonomous video surveillance, video retrieval and human–computer interaction. These applications require methods for recognizing human actions and gestures in various scenarios. Given a number of pre-defined actions, the action recognition problem can be stated as that of classifying a new action into one of these pre-existing actions. Many action recognition approaches utilize the human silhouette for extracting various features. In one of the initial works by Bobick and Davis (2001), the extracted silhouettes are used to construct binary motion energy image (MEI) and motion history image (MHI) templates for representing actions. Yamato et al. (1992) used grid-based silhouette mesh features to form a compact codebook of observations for representing actions
using hidden Markov models. Extracting silhouettes from real-life videos is challenging and prone to noise. Noisy silhouettes are handled by phase correlation (Ogale et al., 2005), or by constructing space–time volumes over silhouette images (Gorelick et al., 2007; Yilmaz and Shah, 2008). Optical flow is another major technique used for recognizing actions. Efros et al. (2003) calculate optical flow in person-centered images in order to model the relative motion among different locations of the object. Babu et al. (2002) utilized the readily available motion vectors from the compressed video stream for recognizing actions. Ali and Shah (2010) derive 11 kinematic features, such as divergence, vorticity, symmetry and gradient tensor features, from the optical flow, and principal component analysis is applied to determine the dominant kinematic modes. The actions are classified using multiple instance learning, in which each action video is represented by a bag of kinematic modes. Poppe (2010) and Weinland et al. (2011) have presented detailed surveys on human action recognition. Most of the action recognition algorithms are benchmarked using one of the following publicly available datasets: (i) KTH (Schuldt et al., 2004), (ii) Weizmann (Gorelick et al., 2007) and
(iii) IXMAS (Weinland et al., 2006). The KTH and Weizmann datasets are captured from a single camera, while the IXMAS dataset is created using 5 calibrated and synchronized cameras to capture actions in multiple views. Due to the lack of depth information, the actions were performed parallel to the image plane in order to capture the dynamics with less ambiguity. Hence, the action recognition algorithms developed on these datasets are mostly view dependent. Since 3-D data acquisition requires a special stereo camera setup or an expensive image capturing device, 3-D optical flow based techniques are rarely used in computer vision applications. Holte et al. (2010) used an expensive SwissRanger SR4000 camera to capture an RGB–Depth database for action recognition; they used 3-D optical flow for recognizing 4 different actions. A stereo camera has been used for detecting humans in Nakada et al. (2008). However, with the advancement of camera technology, we are now able to capture depth images that provide information in the third dimension, with which we can represent and recognize actions more accurately, at an affordable cost.

In this paper, we propose a method for human action recognition using spatio-depth information. The advantage of this approach is that multiple views are not required to capture the motion in all directions. Thus, using depth information, we are able to recognize actions that are difficult to recognize in the spatial domain alone. The 2-D optical flow combined with the depth flow gives us the complete information of motion in an action. The region of interest (silhouette) is extracted easily using the depth image. The 2-D optical flow is calculated on the gray scale images masked by the silhouette. We then utilize this 2-D optical flow information along with the depth images provided by the depth camera to compute the depth flow for the region of interest. Since the actions are performed over time, the 3-D optical flow between 2 consecutive frames is not sufficient to capture the information about an action. Hence, the 3-D optical flow is accumulated over N frames so that it contains enough details about an action. We use a hierarchical division of the silhouette region to find the average motion of an action at different scales. These average motion vectors are used as features for representing the actions. The proposed algorithm has been evaluated on our Video Analytics Lab (VAL) dataset and the publicly available MHAD dataset, both of which were captured using a Kinect sensor with RGB and depth information. The proposed approach can be easily adapted to various applications such as gesture recognition, emotion recognition and gait recognition.

These feature vectors are then used to classify the actions using a Projection Based Learning (PBL) algorithm of a Meta-cognitive Radial Basis Function Network (McRBFN). McRBFN emulates the Nelson and Narens model of human meta-cognition (Nelson and Narens, 1980), and has 2 components, namely, a cognitive component and a meta-cognitive component. A radial basis function network with a Gaussian activation function at the hidden layer is the cognitive component of McRBFN, and a self-regulatory learning mechanism is its meta-cognitive component. McRBFN begins with zero hidden neurons, and adds and prunes neurons until an optimum network structure is obtained.
The self-regulatory learning mechanism of McRBFN uses the best human learning strategy, namely, self-regulated learning (Wenden, 1998; Rivers, 2001; Isaacson and Fujita, 2006), to decide what-to-learn, when-to-learn and how-to-learn in a meta-cognitive framework. Based on its decision, samples are either deleted (sample deletion strategy), used in the learning process (sample learn strategy) or reserved for future use (sample reserve strategy). The sample deletion, sample learn and sample reserve strategies address the what-to-learn, how-to-learn and when-to-learn components of meta-cognition, respectively. Thus, the meta-cognitive component continuously assesses the knowledge of the cognitive component, identifies when new knowledge is required and controls the
learning ability of the cognitive component. Therefore, the network that is finally built is compact, is a more accurate representation of the training data, and is not over-trained. During the sample learn strategy, McRBFN either adds a neuron or updates the parameters of the existing neurons. While adding a neuron, the input/hidden layer parameters of the network are fixed based on the sample overlapping conditions, and the optimal output weights are estimated using a projection based learning algorithm. The problem of estimating the optimal output weights is formulated as a linear programming problem, which is then converted to a system of linear equations and solved by the projection based learning algorithm. While solving the system of linear equations, PBL estimates the output weights corresponding to the minimum energy point of the hinge-loss error function. On the other hand, when a sample is used to update the existing network parameters, a recursive least squares algorithm is used (Chong and Zak, 2001). The McRBFN using the PBL to address the how-to-learn component of meta-cognition will hereafter be referred to as the "Projection Based Learning algorithm of a Meta-cognitive Radial Basis Function Network (PBL-McRBFN)".

The performance of PBL-McRBFN in recognizing actions is evaluated by a 10-fold cross validation study and a subject-independent recognition study. First, a 10-fold cross validation study is conducted with 8 subjects and 8 actions. During this study, it is ensured that there is representation of all the subjects and all the actions in both the training and testing datasets. The datasets thus generated are used to study the action recognition performance of PBL-McRBFN in comparison with Support Vector Machine (SVM) and Extreme Learning Machine (ELM) classifiers. The results show the superior action recognition performance of PBL-McRBFN. The person-independent action recognition performance of the classifiers is studied by training the classifiers using 7 subjects and testing their generalization ability using the actions performed by the remaining subject. The results of this study show that PBL-McRBFN is able to generalize actions, independent of the representation of the subject in the training dataset. The performances of the classifiers are also studied statistically using a one-way ANOVA test, which indicates the superior performance of PBL-McRBFN. The performances of the classifiers are further verified on the Berkeley Multi-modal Human Action Database (MHAD) (Ofli et al., 2013).

The paper is organized as follows: Section 2 presents the overview of the proposed action recognition model using 3-D optical flow features. In Section 3, the data with 3-D optical flow features is described and the performance of PBL-McRBFN is studied in comparison with other classifiers from the literature. Finally, Section 4 summarizes this study on subject-independent human action recognition using 3-D optical flow features.
2. System overview

The overview of the proposed approach with the VAL database is illustrated in Fig. 1. First, the RGB and depth video feeds are calibrated to map the pixel locations between the two frames. The calibrated depth image is represented as 8-bit data by appropriately scaling the depth range. The normalized depth image is used for extracting the silhouette. The 2-D optical flow is extracted for the silhouette region using gray scale images. The 3-D optical flow is obtained using the estimated 2-D optical flow and the corresponding normalized depth images. Finally, 3-D optical flow based features are extracted from hierarchically arranged spatio-temporal windows for representing the actions. These extracted features are then used to train a meta-cognitive radial basis function network using a projection based learning algorithm. We explain each component of Fig. 1 in detail in the following sections.
2.1. Calibration of Kinect

The depth camera in the Kinect has a smaller field of view than the RGB camera. Hence, the obtained depth image is slightly magnified and translated with respect to the RGB image. In order to use both the RGB and depth images simultaneously, we have to map the pixels between the two images. For calibrating the Kinect, we use a fixed parallelogram. The coordinates of the corners of the parallelogram are obtained for both the depth and RGB images. A warp matrix obtained from these corresponding coordinates is used to map the depth image onto the corresponding RGB image.
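As an illustration, such a warp can be estimated from the four corner correspondences with OpenCV. The sketch below is a minimal example under the assumption that the corner coordinates have already been measured in both images; the specific coordinate values are placeholders, not measurements from the paper.

```python
import cv2
import numpy as np

# Corner coordinates of the calibration parallelogram, measured once in the
# depth image and in the RGB image (placeholder values, not from the paper).
corners_depth = np.float32([[102, 84], [518, 90], [510, 402], [96, 396]])
corners_rgb   = np.float32([[120, 100], [530, 108], [522, 420], [114, 412]])

# Warp (perspective) matrix that maps depth-image pixels onto RGB pixels.
warp = cv2.getPerspectiveTransform(corners_depth, corners_rgb)

def align_depth_to_rgb(depth, rgb_shape):
    """Warp a depth frame into the RGB camera's pixel grid."""
    h, w = rgb_shape[:2]
    return cv2.warpPerspective(depth, warp, (w, h),
                               flags=cv2.INTER_NEAREST)  # nearest keeps depth values intact
```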
2.2. Depth normalization

The depth information provided by the Kinect is represented by 11 bits, but processing all the images at 11 bits is computationally expensive. Hence, we have to scale them down to a lower bit depth,
which results in losing finer depth information in the region of interest. To overcome both problems, we find 2 flexible threshold values between which all the actions can be described completely for all users. The depth information of the far background, or of the region very close to the camera, does not contribute to recognizing the actions. We then scale down the depth values within this range to an 8-bit number, which preserves the finer details of depth in the desired region. Fig. 2 illustrates this normalization process.

2.3. Silhouette extraction using depth image

Detection and elimination of the background using only the 2-dimensional (RGB) image is difficult and inefficient. However, the background can be easily identified and removed with the help of the depth image. We make use of the fact that the subject is always at a particular distance from the background pixels. We find a suitable depth threshold value for all the subjects, above which we classify all pixels as background. Thus, we easily get the depth silhouette of the subject:

$$D(i, j, t) = \begin{cases} D'(i, j, t) & \text{if } D'(i, j, t) \le \zeta \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where i, j denote the row and column positions of the pixel in the image, t is the time stamp of the temporal frames, $\zeta$ is the background threshold depth value, $D'$ is the depth image and D is the depth silhouette of the subject:

$$G(i, j, t) = \begin{cases} 1 & \text{if } D(i, j, t) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where (i, j) denotes the row and column of the pixel location, t is the time stamp of the temporal frames, the mask G is the binary silhouette image and D is the silhouette of the depth image. Fig. 3 shows the extracted binary (G) and depth (D) silhouettes for a frame. Then, we use this binary mask (G) to extract the silhouette of the corresponding RGB image:

$$I'(i, j, t) = \begin{cases} I(i, j, t) & \text{if } G(i, j, t) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where I is the RGB image and I' is the silhouette of the RGB image. Executing the aforementioned simple steps, we can easily remove the background and extract only the region of interest for further analysis.
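A minimal NumPy sketch of the normalization and masking steps of Eqs. (1)-(3) is given below. The near and far depth thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

NEAR, FAR = 800, 3500   # illustrative 11-bit depth thresholds (not from the paper)

def normalize_depth(depth11):
    """Clip the 11-bit depth map to the region of interest and rescale to 8 bits."""
    d = np.clip(depth11.astype(np.float32), NEAR, FAR)
    return ((d - NEAR) / (FAR - NEAR) * 255.0).astype(np.uint8)

def extract_silhouettes(depth, rgb, zeta):
    """Eqs. (1)-(3): depth silhouette D, binary mask G and RGB silhouette I'."""
    D = np.where(depth <= zeta, depth, 0)      # Eq. (1): keep pixels nearer than zeta
    G = (D > 0).astype(np.uint8)               # Eq. (2): binary silhouette mask
    I_sil = rgb * G[..., None]                 # Eq. (3): mask applied to each RGB channel
    return D, G, I_sil
```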
Fig. 1. System overview.
2.4. 3-D optical flow estimation

This section provides details of the 3-D optical flow estimation. First, the 2-D optical flow between consecutive images is obtained for the silhouette region. The depth flow is then obtained from the 2-D optical flow and the corresponding depth images.
Fig. 2. (a) Original depth image. (b) Normalized depth image.
Fig. 3. (a) Binary silhouette. (b) Depth silhouette.
2.4.1. 2-D optical flow computation

Calculation of 2-D optical flow is a well known problem in computer vision. There are many algorithms to compute the optical flow, of which we use the pyramidal Lucas–Kanade algorithm (Lucas and Kanade, 1981) due to its speed and robustness. We obtain the gray scale image from the RGB image and then compute the 2-D optical flow by applying the pyramidal Lucas–Kanade algorithm on the silhouette of the gray scale images for every action video.

2.4.2. Depth flow computation

The depth image provided by the depth camera enables us to calculate the motion vectors in the Z direction as well. The depth motion vectors can be easily computed using the 2-D optical flow and the depth images. The Z motion vectors can be obtained by subtracting the depth values of the same point on the subject in 2 consecutive depth frames. Let us consider a point in one of the temporal depth frames. To get the new location of the point in the next depth frame, we add the 2-D optical flow of that point to its present location (XY coordinates) in the current frame. This gives the new location of that point in the next depth image:

$$x_n = x_o + M_y \qquad (4)$$

$$y_n = y_o + M_x \qquad (5)$$

where $(x_n, y_n)$ is the location of the new point in the second depth frame and $(x_o, y_o)$ is the location of the point under consideration in the first depth frame. $M_x$ and $M_y$ are the motion vectors along the horizontal and vertical directions, respectively, for the current (first) frame. The precision of optical flow vectors depends on various factors such as surface texture, occlusion, and covering and uncovering of image regions. At locations where the optical flow vectors are not correct, or where the IR depth sensor is blocked, subtracting the depth images directly may not give correct depth motion vectors. To tackle this problem, a local neighborhood in the second depth frame around the estimated point is considered. We make a basic practical assumption that the depth of a particular pixel does not vary arbitrarily in its immediate neighborhood, unless it is an edge pixel. The non-zero pixels in this window are averaged to get the average depth value of that part of the subject. We then subtract the depth value of the pixel under consideration from this average value. This provides a reliable depth (Z direction) motion for that pixel:

$$M_z(x_o, y_o) = D^n_{avg}(x_n, y_n) - D^{(n-1)}(x_o, y_o) \qquad (6)$$

where $M_z$ is the motion along depth (Z direction) for the current frame, $D^n_{avg}$ is the average depth value in the immediate small neighborhood of the pixel in the second depth image and $D^{(n-1)}$ is the current depth image under consideration.
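The sketch below strings these steps together for one pair of frames. It is an illustrative approximation: it uses OpenCV's dense Farnebäck flow as a stand-in for the pyramidal Lucas–Kanade flow used in the paper, and a 5x5 neighborhood for the depth averaging (the window size is an assumption, not specified above).

```python
import cv2
import numpy as np

def depth_flow(gray_prev, gray_next, depth_prev, depth_next, mask, win=2):
    """Return (Mx, My, Mz) on the silhouette pixels of the current frame."""
    # Dense 2-D flow; the paper uses pyramidal Lucas-Kanade, Farneback is a stand-in.
    flow = cv2.calcOpticalFlowFarneback(gray_prev, gray_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    Mx, My = flow[..., 0], flow[..., 1]
    Mz = np.zeros_like(depth_prev, dtype=np.float32)

    ys, xs = np.nonzero(mask)                       # silhouette pixels only
    H, W = depth_prev.shape
    for y, x in zip(ys, xs):
        # Displace the pixel by its 2-D flow (Eqs. (4)-(5); here x is the column, y the row).
        xn = int(round(x + Mx[y, x]))
        yn = int(round(y + My[y, x]))
        if not (0 <= xn < W and 0 <= yn < H):
            continue
        # Average the non-zero depths in a small neighborhood of the displaced location.
        patch = depth_next[max(0, yn - win):yn + win + 1,
                           max(0, xn - win):xn + win + 1]
        nz = patch[patch > 0]
        if nz.size:
            Mz[y, x] = nz.mean() - depth_prev[y, x]  # Eq. (6)
    return Mx, My, Mz
```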
2.5. Feature extraction

For the extraction of the features, we follow the approach proposed in Babu and Suresh (2011). First, a minimum bounding rectangle that captures the complete motion of an action is obtained. This bounding box is adaptive, depends on the current silhouette image, and is obtained by accumulating the motion vectors over the sequence of N frames. This gives a tight bounding box for each action segment. The bounding box is then hierarchically divided into 54 windows placed symmetrically with respect to the subject's center: it is divided into 6 × 6 windows, 3 × 3 windows, 2 × 2 windows, 2 × 1 windows, 1 × 2 windows and finally a 1 × 1 window of equal size. We then compute the average motion of each window by averaging the non-zero motion vectors of all 3 dimensions in each window. Hence, we get the average motion of each window inside the bounding box in all 3 dimensions, giving a feature vector of length 162 (54 × 3) for every frame, where the first 54 features represent the average x motion vectors, the next 54 features the average y motion vectors and the final 54 features the average z motion vectors at the different hierarchical levels. The average motion of a single frame does not contain enough information to represent an action. Hence, we use the average motion of N frames to obtain the feature vectors, which contains a good amount of information about the dynamics of an action. We sum the motion over 8 frames with an overlap of 4 frames to get the feature vectors for an action. Thus, the dataset can be represented by $\{(u^1, c^1), \ldots, (u^t, c^t), \ldots, (u^N, c^N)\}$, where $u^t \in \mathbb{R}^m = [u^t_1 \ldots u^t_m]$, $m = 162$, are the features along the 3 directions used to represent actions and $c^t \in [1, \ldots, A]$ refers to one of the A actions. In the next section, we present a brief description of the McRBFN classifier that is used to map the features to their corresponding actions.
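Before moving on to the classifier, a minimal sketch of the hierarchical window averaging described above is given below, assuming the accumulated per-pixel motion components Mx, My, Mz have already been cropped to the bounding box; the grid layout follows the 36 + 9 + 4 + 2 + 2 + 1 = 54-window division.

```python
import numpy as np

GRIDS = [(6, 6), (3, 3), (2, 2), (2, 1), (1, 2), (1, 1)]   # 36+9+4+2+2+1 = 54 windows

def window_means(M):
    """Average the non-zero values of one motion component over the 54 windows."""
    H, W = M.shape
    feats = []
    for rows, cols in GRIDS:
        for r in range(rows):
            for c in range(cols):
                block = M[r * H // rows:(r + 1) * H // rows,
                          c * W // cols:(c + 1) * W // cols]
                nz = block[block != 0]
                feats.append(nz.mean() if nz.size else 0.0)
    return np.asarray(feats)                                  # 54 values

def action_feature(Mx, My, Mz):
    """162-dimensional feature: 54 x-motion, 54 y-motion and 54 z-motion averages."""
    return np.concatenate([window_means(Mx), window_means(My), window_means(Mz)])
```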
2.6. Meta-cognitive radial basis function network

Let the dataset generated using the procedure described in Section 2.5 be given by $\{(u^1, c^1), \ldots, (u^t, c^t), \ldots, (u^N, c^N)\}$, where $u^t \in \mathbb{R}^m = [u^t_1, \ldots, u^t_m]^T$ are the m-dimensional input features and $c^t \in [1, \ldots, C]$ are the corresponding action class labels. The coded class labels for the action classes are given by

$$o^t_l = \begin{cases} 1 & \text{if } c^t = l \\ -1 & \text{otherwise} \end{cases}, \qquad l = 1, \ldots, C \qquad (7)$$

The objective of the neural network learning algorithm is to estimate the functional relationship between the action features and their corresponding coded class labels as accurately as possible. In this paper, we use the PBL-McRBFN developed in Babu et al. (2012). Analogous to the model of human meta-cognition proposed by Nelson and Narens (1980), McRBFN has 2 components, namely, a cognitive component and a meta-cognitive component, as shown in Fig. 4. We briefly discuss these components and the projection based fast learning algorithm of McRBFN in this section. For complete details, one may refer to Babu et al. (2012).

2.6.1. Cognitive component

A single hidden layer radial basis function network with a Gaussian activation function at its hidden layer is the cognitive component of McRBFN. The neurons in the input and output layers of the RBF network are linear. Without loss of generality, let us assume that the RBF network has K neurons after t-1 samples. The neurons in the hidden layer of McRBFN use the Gaussian activation function, and the response of the j-th hidden neuron for the t-th sample ($h^t_j$) is given by

$$h^t_j = \exp\left(-\frac{\|u^t - c_j\|^2}{2\sigma_j^2}\right) \qquad (8)$$

where $c_j \in \mathbb{R}^m$ is the center of the j-th hidden neuron and $\sigma_j \in \mathbb{R}$ is the Gaussian width of the j-th hidden neuron. The neurons in the output layer of the radial basis function network obtain the weighted sum of the hidden layer responses. Thus, the response of the output neurons is the output of the network, and the response of the l-th output neuron for the t-th sample ($\hat{o}^t_l$) is given by

$$\hat{o}^t_l = \sum_{j=1}^{K} w_{lj} h^t_j \qquad (9)$$

where $w_{lj}$ is the output weight connecting the j-th hidden neuron to the l-th output neuron. The action class label of the t-th sample can be obtained from this output as

$$\hat{c}^t = \arg\max_{l = 1, 2, \ldots, C} \hat{o}^t_l \qquad (10)$$

Since the hinge loss error function has been shown to estimate the posterior probability more accurately than the mean-square error function in solving classification problems (Zhang, 2004; Suresh et al., 2008), PBL-McRBFN also uses the hinge loss error function. The hinge loss error of the t-th sample is given by

$$e^t_j = \begin{cases} 0 & \text{if } o^t_j \hat{o}^t_j > 1 \\ o^t_j - \hat{o}^t_j & \text{otherwise} \end{cases}, \qquad j = 1, 2, \ldots, n \qquad (11)$$

The maximum absolute hinge error ($E^t$) is given by

$$E^t = \max_{j \in 1, 2, \ldots, n} |e^t_j| \qquad (12)$$

Projection based learning algorithm: The projection based learning algorithm works on the principle of minimization of an energy function and finds the network output parameters for which the energy function is minimum, i.e., the network achieves the minimum of the energy function. The considered energy function is the sum of squared errors at the McRBFN output neurons:

$$J_i = \sum_{j=1}^{n} (o^i_j - \hat{o}^i_j)^2, \qquad i = 1, \ldots, N \qquad (13)$$

For t training samples, the overall energy function is defined as

$$J(W) = \frac{1}{2} \sum_{i=1}^{t} J_i = \frac{1}{2} \sum_{i=1}^{t} \sum_{j=1}^{n} (o^i_j - \hat{o}^i_j)^2 \qquad (14)$$
Fig. 4. Schematic diagram of McRBFN classifier.
By substituting the predicted output $\hat{o}^i_j$ from Eq. (9) in Eq. (14), the energy function reduces to

$$J(W) = \frac{1}{2} \sum_{i=1}^{t} \sum_{j=1}^{n} \left( o^i_j - \sum_{k=1}^{K} w_{kj} h^i_k \right)^2 \qquad (15)$$

where $h^i_k$ is the response of the k-th hidden neuron for the i-th training sample. The optimal output weights ($W^* \in \mathbb{R}^{K \times n}$) are estimated such that the total energy reaches its minimum:

$$W^* := \arg\min_{W \in \mathbb{R}^{K \times n}} J(W) \qquad (16)$$

Accordingly, the optimal output weights are estimated using (Babu and Suresh, 2013)

$$W^* = A^{-1} B \qquad (17)$$

where the projection matrix $A \in \mathbb{R}^{K \times K}$ is given by

$$a_{kp} = \sum_{i=1}^{t} h^i_k h^i_p, \qquad k = 1, \ldots, K, \; p = 1, \ldots, K \qquad (18)$$

and the output matrix $B \in \mathbb{R}^{K \times n}$ is

$$b_{pj} = \sum_{i=1}^{t} h^i_p o^i_j, \qquad p = 1, \ldots, K, \; j = 1, \ldots, n \qquad (19)$$
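As an illustration of Eqs. (17)-(19), the sketch below computes the hidden-layer responses and solves for the output weights in one shot; the centers, widths and the plain (unregularized) linear solve are assumptions of this minimal example, not details prescribed by the paper.

```python
import numpy as np

def hidden_responses(U, centers, widths):
    """Gaussian responses h_k^i of Eq. (8) for all samples U (t x m)."""
    # Squared distances between every sample and every center (t x K).
    d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * widths ** 2))

def pbl_output_weights(U, O, centers, widths):
    """Projection based learning: W* = A^{-1} B (Eqs. (17)-(19))."""
    H = hidden_responses(U, centers, widths)   # t x K
    A = H.T @ H                                # Eq. (18): K x K projection matrix
    B = H.T @ O                                # Eq. (19): K x n output matrix
    return np.linalg.solve(A, B)               # solves A W = B, i.e. W* = A^{-1} B
```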
Next, we describe the meta-cognitive component of McRBFN and briefly explain the projection based learning algorithm of McRBFN that has been developed using this hinge loss error function.

2.6.2. Meta-cognitive component

The meta-cognitive component has knowledge about the knowledge of the cognitive component and controls the learning of the cognitive component. It contains a dynamic model of the cognitive component and comprises a self-regulatory learning mechanism to decide what-to-learn, when-to-learn and how-to-learn. As mentioned earlier, the cognitive component of McRBFN begins with zero hidden neurons, and the meta-cognitive component adds, prunes or updates neurons in the cognitive component until an optimum network structure is obtained. A projection based fast learning algorithm is used to fix the parameters of the neurons. Based on the error and the distance of the sample from the existing neurons, the meta-cognitive component chooses one of the following strategies for each sample in the dataset:

Sample delete strategy: If the knowledge contained in a sample is similar to that already present in the network, delete the sample from the training dataset. This strategy uses the following criterion to address the what-to-learn component of meta-cognition:

$$\text{If } c^t = \hat{c}^t \text{ AND } E^t \le \beta_d, \text{ then delete the sample} \qquad (20)$$

where $\beta_d$ is the delete threshold fixed at a desired accuracy.

Sample learn strategy: This strategy decides how-to-learn the training sample. Depending on the novelty of the knowledge contained in the sample, either the neuron growth strategy or the parameter update strategy is chosen.

Neuron growth strategy: When a new training sample has novel knowledge and the estimated class label is different from the actual class label, a new hidden neuron is added to represent the knowledge contained in the sample. The neuron growth criterion is given by

$$\text{If } (\hat{c}^t \ne c^t \text{ OR } E^t \ge \beta_a) \text{ AND } \psi_c(u^t) \le \beta_c, \text{ then add a neuron} \qquad (21)$$

Here, $\psi_c$ is a measure of the class-wise significance (Babu and Suresh, 2013) and is defined as

$$\psi_c = \frac{1}{K^c} \sum_{k=1}^{K^c} h^t_k\left(u^t, \mu^c_k\right) \qquad (22)$$

where $K^c$ is the number of neurons associated with class c, $h^t_k$ is the hidden layer response as defined in Eq. (8) and $u^t$ is the input feature of the t-th sample. The threshold $\beta_c$ is the meta-cognitive knowledge measurement threshold and $\beta_a$ is the self-adaptive meta-cognitive neuron addition threshold. These thresholds select samples with significant knowledge for building the network, so that the other samples can be used to fine tune the network parameters. The neuron addition threshold is self-adapted according to

$$\beta_a := \delta \beta_a + (1 - \delta) E^t \qquad (23)$$
where $\delta$ is the slope that controls the rate of self-adaptation and is set close to 1.

A training sample that is used to add a neuron may overlap with neurons in other classes or may form a distinct cluster far away from the nearest neuron in the same class. These conditions might affect the classification performance of a classifier significantly. Hence, McRBFN measures the distance from the current sample to the nearest neuron in the inter and intra class while assigning the new neuron parameters. Thus, the parameters of a new hidden neuron are initialized based on the overlapping and distinct cluster criteria. The nearest hidden neuron in the intra class (nrS) and the nearest hidden neuron in the inter class (nrI) are defined as

$$nrS = \arg\min_{l = c, \forall k} \|u^t - \mu^l_k\|, \qquad nrI = \arg\min_{l \ne c, \forall k} \|u^t - \mu^l_k\| \qquad (24)$$

The Euclidean distances from the new training sample to nrS and nrI are given as

$$d_S = \|u^t - \mu^c_{nrS}\|, \qquad d_I = \|u^t - \mu^l_{nrI}\| \qquad (25)$$
Using the nearest neuron distances, we determine the center and width of the new neuron based on the overlapping/non-overlapping conditions defined in Babu and Suresh (2013), so as to avoid misclassification. When there is no overlap of the sample with any neuron in any class, the center and width of the new neuron are initialized as

$$\mu^c_{K+1} = u^t, \qquad \sigma^c_{K+1} = \kappa \sqrt{u^{tT} u^t} \qquad (26)$$

Then, the output weights are estimated using the projection based learning algorithm described below. The size of matrix A is increased from $K \times K$ to $(K+1) \times (K+1)$:

$$A_{(K+1)\times(K+1)} = \begin{bmatrix} A_{K \times K} + (h^t)^T h^t & (h^t)^T h^t_{K+1} \\ a_{K+1} & a_{K+1,K+1} \end{bmatrix} \qquad (27)$$

where $h^t = [h^t_1, h^t_2, \ldots, h^t_K]$ is the vector of the existing K hidden neuron responses for the t-th training sample. The row vector $a_{K+1} \in \mathbb{R}^{1 \times K}$ is assigned as

$$a_{K+1,p} = \sum_{i=1}^{t} h^i_{K+1} h^i_p, \qquad p = 1, \ldots, K \qquad (28)$$

and the value $a_{K+1,K+1} \in \mathbb{R}^{+}$ is assigned as

$$a_{K+1,K+1} = \sum_{i=1}^{t} h^i_{K+1} h^i_{K+1} \qquad (29)$$

The size of matrix B is increased from $K \times n$ to $(K+1) \times n$:

$$B_{(K+1)\times n} = \begin{bmatrix} B_{K \times n} \\ b_{K+1} \end{bmatrix} \qquad (30)$$

where the matrix $B \in \mathbb{R}^{K \times n}$ is updated as

$$B = B + (h^t)^T (o^t)^T \qquad (31)$$

and $b_{K+1} \in \mathbb{R}^{1 \times n}$ is a row vector assigned as

$$b_{K+1,j} = \sum_{i=1}^{t} h^i_{K+1} o^i_j, \qquad j = 1, \ldots, n \qquad (32)$$

The vector $h^t$ in Eqs. (27) and (31) contains very small values, since the t-th sample is added as a hidden neuron that is significantly different from the existing hidden neurons. After neglecting the $h^t$ vector in Eqs. (27) and (31), the output weights are finally estimated as

$$W_{(K+1) \times n} = \begin{bmatrix} W_K \\ w_{K+1} \end{bmatrix}, \qquad w_{K+1} = \frac{b_{K+1} - a_{K+1} W_K}{a_{K+1,K+1}} \qquad (33)$$

where $W_K$ is the output weight matrix for the K hidden neurons, and $w_{K+1}$ is the vector of output weights for the new hidden neuron. It must be noted that the first sample is used as the first neuron of the network.
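A sketch of this neuron growth step is given below. To stay on safe ground it recomputes the enlarged A and B from their definitions in Eqs. (18)-(19) and solves the full system of Eq. (17) directly, rather than applying the neglect-$h^t$ shortcut of Eq. (33); the bookkeeping of all past samples U_past and their coded labels O_past is an assumption of this illustration.

```python
import numpy as np

def gaussian_responses(U, centers, widths):
    """h_k^i of Eq. (8) for all samples in U (t x m) and all neurons."""
    d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * widths ** 2))

def grow_neuron(U_past, O_past, centers, widths, u_t, kappa=0.5):
    """Add a hidden neuron at the current sample and re-estimate all output weights.

    kappa is the width-scaling constant of Eq. (26); its value here is illustrative.
    """
    centers = np.vstack([centers, u_t])                      # mu_{K+1} = u^t (Eq. (26))
    widths = np.append(widths, kappa * np.sqrt(u_t @ u_t))   # sigma_{K+1}

    H = gaussian_responses(U_past, centers, widths)          # t x (K+1)
    A = H.T @ H                                              # (K+1) x (K+1), cf. Eqs. (27)-(29)
    B = H.T @ O_past                                         # (K+1) x n,     cf. Eqs. (30)-(32)
    W = np.linalg.solve(A, B)                                # new output weights, Eq. (17)
    return centers, widths, W
```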
Parameter update strategy: The current (t-th) training sample is used for updating the output weights of the cognitive component ($W_K = [w_1, w_2, \ldots, w_K]^T$) if the following criterion is satisfied:

$$c^t = \hat{c}^t \text{ AND } E^t \ge \beta_u \qquad (34)$$

where $\beta_u$ is the self-adaptive meta-cognitive parameter update threshold. $\beta_u$ is self-adapted based on the prediction error as

$$\beta_u := \delta \beta_u + (1 - \delta) E^t \qquad (35)$$

where $\delta$ is the slope that controls the rate of self-adaptation during the parameter update and is typically set close to 1. When a sample is used for updating the output weight parameters, the PBL algorithm updates the output weights as given below. The matrices $A \in \mathbb{R}^{K \times K}$ and $B \in \mathbb{R}^{K \times n}$ are updated as

$$A = A + (h^t)^T h^t \qquad (36)$$

$$B = B + (h^t)^T (o^t)^T \qquad (37)$$

and the output weights are updated as

$$W_K = W_K + A^{-1} (h^t)^T (e^t)^T \qquad (38)$$

Sample reserve strategy: If the t-th sample does not satisfy any of the above criteria, then the sample is pushed to the rear of the training sequence. Since McRBFN modifies the strategies based on current sample knowledge, these samples may be used at a later stage.

We summarize the PBL-McRBFN below:

1. For each new training sample input ($u^t$), compute the output of the cognitive component ($\hat{o}^t$) using Eqs. (8) and (9).
2. Estimate the predicted class label of the cognitive component ($\hat{c}^t$), the maximum hinge error ($E^t$) and the class-wise significance measure ($\psi_c$) for the new training sample ($u^t$) using Eqs. (10), (12) and (22).
3. The meta-cognitive component selects one of the following strategies based on the above computed measures:
   (a) Sample delete strategy: If $c^t = \hat{c}^t$ AND $E^t \le \beta_d$, then delete the sample from the training dataset without learning.
   (b) Neuron growth strategy: If ($\hat{c}^t \ne c^t$ OR $E^t \ge \beta_a$) AND $\psi_c(u^t) \le \beta_c$, then allocate a new hidden neuron in the cognitive component. The new hidden neuron's width and center parameters are determined based on the intra and inter class nearest neuron distances. The output weight parameters for all hidden neurons are estimated with the PBL algorithm using Eq. (33). Also, update the self-adaptive meta-cognitive addition threshold using Eq. (23).
   (c) Parameter update strategy: If $c^t = \hat{c}^t$ AND $E^t \ge \beta_u$, then update the cognitive component output weight parameters with the PBL algorithm using Eq. (38). Also, update the self-adaptive meta-cognitive update threshold using Eq. (35).
   (d) Sample reserve strategy: When the new sample does not satisfy the deletion, growth and update criteria, push the sample to the reserve to be used later for learning.
4. The cognitive component executes the selected strategy.
5. Continue steps 1–4 until there are no more samples in the training dataset.

3. Results and discussions

In this section, we evaluate the performance of PBL-McRBFN in recognizing actions using 3-dimensional features. Two different studies are conducted: a 10-fold cross-validation study and a subject-independent action recognition study. In both these studies, the performance of PBL-McRBFN is compared with that of an SVM classifier and an ELM classifier. In all the experiments, the optimal number of support vectors of the SVM is obtained by optimizing c and γ in LIBSVM, and the number of hidden neurons in the ELM is obtained by the constructive-destructive procedure described in Suresh et al. (2003). The following measures are used to compare the performances of these classifiers:

Average classification efficiency ($\eta_a$):

$$\eta_a = \frac{1}{n} \sum_{l=1}^{n} \frac{q_{ll}}{N_l} \times 100\% \qquad (39)$$

where $q_{ll}$ is the number of correctly classified samples of class l in the training/testing dataset and $N_l$ is the total number of samples of class l.

Overall classification efficiency ($\eta_o$):

$$\eta_o = \frac{\sum_{l=1}^{n} q_{ll}}{N} \times 100\% \qquad (40)$$

Geometric mean efficiency ($\eta_g$):

$$\eta_g = \sqrt[n]{\prod_{l=1}^{n} \frac{q_{ll}}{N_l}} \times 100\% \qquad (41)$$
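The three measures of Eqs. (39)-(41) can be computed from a per-class confusion matrix; a minimal sketch is given below, assuming rows are true classes and columns are predicted classes.

```python
import numpy as np

def efficiencies(conf):
    """Return (eta_o, eta_a, eta_g) in percent from an n x n confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    q_ll = np.diag(conf)               # correctly classified samples per class
    N_l = conf.sum(axis=1)             # samples per (true) class
    N = conf.sum()                     # total number of samples
    recall = q_ll / N_l                # per-class classification rate
    eta_o = 100.0 * q_ll.sum() / N                          # Eq. (40): overall efficiency
    eta_a = 100.0 * recall.mean()                           # Eq. (39): average efficiency
    eta_g = 100.0 * recall.prod() ** (1.0 / len(recall))    # Eq. (41): geometric mean
    return eta_o, eta_a, eta_g
```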
First, we describe the datasets used in the study. Next, we present the results of the 10-fold cross-validation study, and then evaluate the generalization ability of the classifiers through a subject-independent action recognition study. In these studies, the performances of the classifiers are compared using the performance measures defined in Eqs. (39)–(41). Next, we conduct a one-way ANOVA test (Japkowicz and Shah, 2011) to compare the performances of these classifiers in the 10-fold cross validation tests. The ANOVA measure compares the mean of each individual experimental condition and checks whether these means differ significantly from the aggregate mean across all conditions. If the F-score is greater than the F-statistic at the 95% confidence level, then the hypothesis of equality of means (i.e., that the classifiers perform similarly on all the datasets) is rejected. If the equality hypothesis is rejected in the one-way ANOVA test, then pair-wise post hoc tests should be conducted to find which classifier is significantly different from the others. In this paper, a parametric Dunnett test is used to conduct the pair-wise comparison using the PBL-McRBFN classifier as the control. Finally, to highlight the essence of the 3-D features, the performance of the best performing PBL-McRBFN classifier using the 3-D features is compared with its performance using 2-D features.

3.1. Dataset

The proposed approach is evaluated using 2 datasets, namely, the Video Analytics Lab (VAL) database¹ and the Berkeley Multimodal Human Action Database (MHAD) (Ofli et al., 2013).
¹ http://val.serc.iisc.ernet.in:8080/project/VAL_Depth_Database.zip
Fig. 5. Actions considered in VAL database. Rows 1–2 (left to right): bending, bowling, boxing and jumping; rows 3–4 (left to right): kicking, stretching, swimming and waving.
The VAL database, recorded using the Kinect in static surrounding conditions, has been generated by us. Both the depth and RGB images are recorded at an average rate of 30 frames per second. The depth images are available as 11-bit images, but stored as 16-bit images. The resolution of both the depth and RGB images is 640 × 480. The Kinect is placed at a fixed height from the floor so as to capture the subject's entire body. The subjects are asked to perform the given task freely in front of the Kinect. The VAL database consists of 8 actions, namely, swimming, bending, waving, kicking, bowling, jumping, boxing and stretching. Each action is performed by 8 subjects approximately 3 times. The number of frames varies depending upon the speed of the person. Fig. 5 shows snapshots of some of the actions from our database.

The MHAD database, which contains 11 actions performed by 12 subjects, is the other dataset used in the study. The 11 actions are: jumping, jumping jacks, bending, punching, waving 2 hands, waving one hand, clapping, throwing, sit down/stand up, sit down and stand up. The database was captured by 5 different systems: an optical motion capture system, 4 multi-view
stereo vision camera arrays, 2 Microsoft Kinect cameras, 6 wireless accelerometers and 4 microphones. In our experiments, we have used only the information obtained from a single Kinect camera for recognizing actions.

3.2. Performance study: 10-fold cross-validation study

In the 10-fold cross validation test, 10 trials of experiments are conducted. In each of these trials, 75% of the samples in each action of all the subjects are randomly selected for developing the classifier and the remaining 25% of the samples in each action are used for testing the classifier. This approach is referred to as the "10-fold cross validation study". In this section, we present the results of the 10-fold cross validation study for the VAL database and the MHAD.

3.2.1. VAL database

We present the results of the 10-fold cross validation study for the VAL database in Table 1.
Table 1
VAL database: performance results of 10-fold cross validation test.

Classifier    K        Training ηo     Training ηa     Training ηg     Testing ηo      Testing ηa      Testing ηg
SVM           1589.1*  96.9 ± 2        96.85 ± 0.4     96.8 ± 0.45     96.8 ± 0.54     96.4 ± 1.3      96.27 ± 1.37
ELM           90       93.1 ± 0.6      88.42 ± 1.37    87.06 ± 1.92    93.2 ± 0.5      88.97 ± 1.2     87.77 ± 1.83
PBL-McRBFN    117.1    99.92 ± 0.12    99.92 ± 0.12    99.89 ± 0.14    99.84 ± 0.22    99.79 ± 0.34    99.79 ± 0.35

* Support vectors.
Table 2
MHAD: performance results of 10-fold cross validation test.

Classifier    K       Training ηo     Testing ηo
SVM           98.6*   95.15 ± 1.24    87.2 ± 4.99
ELM           71.5    93.64 ± 1.5     78.79 ± 5.34
PBL-McRBFN    43.6    99.89 ± 0.32    91.82 ± 2.49

* Support vectors.

Fig. 6. Neuron history for one trial (number of neurons K versus sample instance).
Fig. 7. Sample deletion history for one trial (number of deleted samples versus sample instance).

It can be observed from Table 1 that the PBL-McRBFN classifier outperforms the ELM and SVM classifiers in recognizing actions using 3-dimensional features. It is at least 3% better than the SVM classifier, and at least 6% better than the ELM classifier, in recognizing the human actions. Figs. 6 and 7 give the neuron history and the sample deletion history for one trial. From Fig. 6, it can be seen that the meta-cognitive component adds neurons to PBL-McRBFN during the training process. Further, it can be seen from Fig. 7 that PBL-McRBFN deletes 34 samples whose knowledge is similar to that already acquired by the network. It can also be seen that the sample deletion is more pronounced towards the end of the training. Hence, it can be observed that PBL-McRBFN has approximated the knowledge dynamics in the training dataset efficiently.

The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the 10-fold cross validation study using 3-D features is 550.1177. This is greater than the F-statistic at the 95% confidence level ($F_{2,18,0.05}$ = 4.560), i.e., 550.1177 > 4.560. Hence, the equality hypothesis of the ANOVA test can be rejected at the 95% confidence level. The observed t values obtained from the Dunnett test by comparing against SVM and ELM are 15.1705 and 33.1307, respectively, while the critical t value is 2.40 ($t_{3,18,0.05}$). Thus, the observed t values are much greater than the critical t value and, hence, it can be inferred that the PBL-McRBFN classifier significantly outperforms the SVM and ELM classifiers.
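As an illustration of this statistical comparison, the one-way ANOVA step can be reproduced with SciPy from the per-trial overall efficiencies of the three classifiers; the arrays below are placeholders for the ten per-trial η_o values, not the actual numbers from the experiments.

```python
import numpy as np
from scipy import stats

# Ten per-trial overall efficiencies (eta_o) per classifier -- placeholder values.
eta_svm = np.array([96.8, 96.5, 97.1, 96.2, 96.9, 97.0, 96.4, 96.7, 97.2, 96.6])
eta_elm = np.array([93.2, 92.8, 93.5, 93.0, 93.6, 92.9, 93.3, 93.1, 93.4, 93.2])
eta_pbl = np.array([99.8, 99.9, 99.7, 99.9, 99.8, 99.8, 99.9, 99.7, 99.8, 99.9])

# One-way ANOVA across the three classifiers (equality-of-means hypothesis).
f_score, p_value = stats.f_oneway(eta_svm, eta_elm, eta_pbl)
print(f"F = {f_score:.2f}, p = {p_value:.4f}")
# If the hypothesis is rejected, a post hoc Dunnett comparison against the
# PBL-McRBFN control would follow (e.g., scipy.stats.dunnett in SciPy >= 1.11).
```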
3.2.2. MHAD

Table 2 presents the results of the 10-fold cross validation study for the MHAD. From the table, it can be observed that the PBL-McRBFN classifier outperforms the ELM and SVM classifiers in recognizing actions using 3-D features by at least 4% and 13%, respectively. The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the 10-fold cross validation study using 3-D features is 26.8546, which is greater than the F-statistic at the 95% confidence level ($F_{2,18,0.05}$ = 4.560), i.e., 26.8546 > 4.560. Hence, the equality hypothesis of the ANOVA test can be rejected at the 95% confidence level. The observed t values obtained from the Dunnett test by comparing against SVM and ELM are 2.4263 and 6.8557, respectively, while the critical t value is 2.40 ($t_{3,18,0.05}$). Thus, the observed t values are greater than the critical t value and, hence, it can be inferred that the PBL-McRBFN classifier significantly outperforms the SVM and ELM classifiers.

3.3. Performance study: subject-independent action recognition study

In the subject-independent action recognition study, the actions performed by all subjects except one are used to develop the classifiers, and the generalization ability of the classifiers is tested using the actions performed by the untrained subject.

3.3.1. VAL database

Table 3 presents the testing efficiencies of the 3 classifiers, namely, SVM, ELM and PBL-McRBFN, for the subject-independent action recognition study.
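A leave-one-subject-out split of this kind can be expressed with scikit-learn's grouped cross-validation; the sketch below is a generic illustration, assuming a feature matrix X, action labels y and a per-sample subject-id array groups (names chosen here, not from the paper), and it uses an SVM as a stand-in classifier.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# X: (num_samples, 162) feature vectors, y: action labels, groups: subject id per sample.
def subject_independent_scores(X, y, groups):
    """Train on all subjects but one and test on the held-out subject, per subject."""
    logo = LeaveOneGroupOut()
    scores = {}
    for train_idx, test_idx in logo.split(X, y, groups):
        held_out = np.unique(groups[test_idx])[0]
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # stand-in classifier
        clf.fit(X[train_idx], y[train_idx])
        scores[held_out] = clf.score(X[test_idx], y[test_idx])
    return scores
```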
Table 3
VAL database: performance results of subject-independent action recognition study.

Test sub.   SVM K*   SVM ηo   SVM ηa   SVM ηg   ELM K   ELM ηo   ELM ηa   ELM ηg   PBL K   PBL ηo   PBL ηa   PBL ηg
1           1784     89.45    94.24    93.23    90      85.78    86.3     84.43    158     100      100      100
2           1653     76.64    79.12    69.19    90      73.14    73.99    70.3     207     78.38    85.46    83.73
3           1823     99.42    98.53    98.45    90      92.13    80.55    0        218     100      100      100
4           1754     92.94    87.98    84.85    90      92.06    90.57    90.08    178     92.06    87.87    85.58
5           1691     96.64    89.24    86.01    90      84.36    70.43    0        187     99.42    99.38    99.36
6           1661     79.53    77.9     0        90      87.06    85.89    80.52    202     89.88    87.03    85.22
7           1749     95.35    93.55    93.06    90      86.34    86.29    85.57    115     97.38    96.11    96
8           1670     85.78    88.97    87.42    90      84.15    88.44    86.92    123     99.77    99.88    99.88
Av.         1723     89.47    88.69    76.52    90      85.63    93.89    76.52    174     94.61    94.47    92.72

* Support vectors.
Table 4
MHAD: performance results of subject-independent action recognition study.

Test sub.   SVM K*   SVM train ηa   SVM test ηa   ELM K   ELM train ηa   ELM test ηa   PBL K   PBL train ηa   PBL test ηa
1           120      95.04          100           50      78.51          81.82         58      100            100
2           119      94.21          81.82         70      81.82          90.91         35      98.35          90.91
3           120      97.52          81.82         50      79.34          90.91         51      96.69          90.91
4           120      96.69          90.91         65      83.47          72.73         59      100            81.82
5           120      95.87          90.91         50      80.99          81.82         51      95.04          100
6           119      95.04          100           60      85.95          81.82         51      97.52          100
7           119      96.69          90.91         55      81.82          81.82         51      94.22          100
8           120      95.04          100           70      84.3           91.91         50      94.22          90.91
9           119      96.69          81.82         55      82.65          81.82         54      99.17          90.91
10          120      95.87          90.91         50      83.47          90.91         34      94.22          90.91
11          119      95.87          90.91         65      83.47          90.91         53      97.52          90.91
12          118      95.83          91.67         75      90.91          81.82         55      97.52          90.91
Av.         119      95.86          90.97         60      83.05          84.99         50      97.03          93.18

* Support vectors.
From the performance results, it can be observed that the overall efficiency of the PBL-McRBFN classifier is better than that of the SVM and ELM classifiers, by at least 5.14% and 1.41%, respectively. Further, the testing geometric mean accuracy of the SVM is 0 when subject 6 is held out of the training dataset, and that of the ELM classifier is 0 when subjects 3 and 5 are held out of the training dataset. It was observed that in these cases the classifiers failed to recognize the kicking action due to fewer samples in this class. However, the PBL-McRBFN classifier is able to recognize all the 8 actions, even when the subject is not represented in the training dataset and when the sample imbalance is high. Hence, it can be inferred that PBL-McRBFN can perform person independent action recognition using 3-D features efficiently.

The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the leave-one-out cross validation test using 3-D features is 9.5073, which is greater than the F-statistic at the 95% confidence level ($F_{2,14,0.05}$ = 3.739), i.e., 9.5073 > 3.739. Hence, the equality of means hypothesis can be rejected at the 95% confidence level. As the equality hypothesis is rejected, we conduct the Dunnett test using the PBL-McRBFN classifier as the control. Based on this test, the observed t values obtained by comparing against SVM and ELM are 2.4874 and 4.3454, respectively, while the critical t value is 2.46 ($t_{3,14,0.05}$). Hence, it can be inferred from the leave-one-out cross validation study that the PBL-McRBFN classifier performs significantly better than the SVM and ELM classifiers.
3.3.2. MHAD

Table 4 presents the testing efficiencies of the 3 classifiers, namely, SVM, ELM and PBL-McRBFN, for the subject-independent
action recognition study using the MHAD. From the performance results, it can be observed that the overall efficiency of the PBL-McRBFN classifier is better than that of the SVM and ELM classifiers, by at least 2.21% and 8.19%, respectively. The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the leave-one-out cross validation test using 3-D features is 6.2211, which is greater than the F-statistic at the 95% confidence level ($F_{2,22,0.05}$ = 3.44), i.e., 6.2211 > 3.44. Hence, the equality of means hypothesis can be rejected at the 95% confidence level. As the equality hypothesis is rejected, we conduct the Dunnett test using the PBL-McRBFN classifier as the control. Based on this test, the observed t values obtained by comparing against SVM and ELM are 0.8736 and 3.2620, respectively, while the critical t value is 2.36 ($t_{3,22,0.05}$). Hence, it can be inferred from the leave-one-out cross validation study that the PBL-McRBFN classifier performs significantly better than the ELM classifier. Although the overall efficiency of PBL-McRBFN is greater than that of the SVM by 2.21%, the statistical difference between these two classifiers is not very significant. However, it must be noted that in the SVM classifier all the training samples are used as support vectors, which might affect its generalization performance significantly.

3.4. Performance study: comparison using 2-D and 3-D features

Next, to show the advantage of using 3-D features, we conduct the 10-fold cross validation study and the subject-independent action recognition study on the best performing PBL-McRBFN using 3-D and 2-D features.

3.4.1. VAL database

The average of the overall, average and geometric mean efficiencies of the studies using 2-D and 3-D features of the VAL database are presented in Table 5. From the table, it can be seen that the performance of PBL-McRBFN is better while using 3-D features than while using the 2-D feature set. The performance of the action recognition task using 3-D features of the VAL database has improved by at least 4%, compared to that obtained with 2-D features, in the 10-fold cross validation study. Moreover, there is a substantial improvement in performance while using 3-D features over 2-D features in the subject-independent action recognition study; the improvement in performance is at least 17%. From the performance results of both studies, it can be inferred that action recognition using 3-D features is more efficient, and is less sensitive to the appearance of the person involved, compared to using only 2-D features for action recognition.

3.4.2. MHAD

The overall efficiency of the performance study of PBL-McRBFN using 2-D and 3-D features of the MHAD is presented in Table 6.
Table 5
VAL database: performance study on PBL-McRBFN with 2-D and 3-D features.

Test                                      2-D: K   ηo      ηa      ηg      3-D: K   ηo      ηa      ηg
Subject-independent action recognition    340      78.96   78.67   67.01   174      94.61   94.47   92.72
10-fold cross validation                  292.6    96.92   95.9    95.79   117.1    99.84   99.79   99.79
Table 6
MHAD: performance study on PBL-McRBFN with 2-D and 3-D features.

Test                                      2-D: K          ηo             3-D: K          ηo
Subject-independent action recognition    39.17 ± 4.84    80.98 ± 8.18   50.17 ± 7.86    93.18 ± 5.65
10-fold cross validation                  47.1 ± 12.4     87.88 ± 6.55   43.6 ± 7.68     91.82 ± 2.5
From the table, it can be observed that PBL-McRBFN classifies more efficiently using 3-D features than using 2-D features. The improvement in performance using 3-D features over 2-D features is at least 12.12% and 3.94% in the subject-independent action recognition and 10-fold cross validation studies, respectively. Thus, the following observations can be made from the performance results presented in this section:
- PBL-McRBFN outperforms SVM and ELM in recognizing actions using 3-D features.
- PBL-McRBFN shows better performance in recognizing actions using 3-D optical flow based features in the subject-independent scenario.
- The action recognition performance of PBL-McRBFN is much better while using 3-D features than while using 2-D features.
4. Conclusion

This paper presents an approach for action recognition using 3-D features obtained from the Kinect sensor. The 3-D optical flow is estimated from the 2-D optical flow and the depth information. Thus, the 3-D optical flow feature captures the dynamics of the actions in space–time. The 3-D features are then used to train a support vector machine, an extreme learning machine and a meta-cognitive radial basis function classifier trained using a projection based learning algorithm. The performances of these classifiers are compared using a 10-fold cross-validation study and a subject-independent action recognition study. The performance study on these classifiers shows that the PBL-McRBFN classifier outperforms the SVM and ELM classifiers. A statistical analysis using a one-way ANOVA test confirms the results from the quantitative analysis. Further, the significance of depth information is shown by training the best performing PBL-McRBFN classifier with and without the depth flow features. It is observed that the classifier performs substantially better with 3-D optical flow features, especially in the leave-one-out cross validation study and the 10-fold cross validation study. The proposed approach is evaluated using the publicly available VAL and MHAD databases. The results indicate that the depth flow features help to make the action recognition task independent of the person. The proposed approach can be adapted to various other applications including gesture recognition, emotion recognition and gait recognition.
Acknowledgements

The authors wish to express grateful thanks to the referees for their useful comments and suggestions to improve the presentation of this paper.

References

Ali, S., Shah, M., 2010. Human action recognition in videos using kinematic features and multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2), 288–303.
Babu, R.V., Suresh, S., 2011. Fully complex-valued ELM classifiers for human action recognition. In: Proceedings of the International Joint Conference on Neural Networks.
Babu, G.S., Suresh, S., 2013. Meta-cognitive RBF network and its projection based learning algorithm for classification problems. Applied Soft Computing 13 (1), 654–666.
Babu, R.V., Anantharaman, B., Ramakrishnan, K.R., Srinivasan, S.H., 2002. Compressed domain action classification using HMM. Pattern Recognition Letters 23 (10), 1203–1213.
Babu, G.S., Savitha, R., Suresh, S., 2012. A projection based learning in meta-cognitive radial basis function network for classification problems. In: Proceedings of the International Joint Conference on Neural Networks, Brisbane, Australia.
Bobick, A.F., Davis, J.W., 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3), 257–267.
Chong, E.K.P., Zak, S.H., 2001. An Introduction to Optimization. Wiley, New York (ISBN 0471391263).
Efros, A.A., Berg, A.C., Mori, G., Malik, J., 2003. Recognizing action at a distance. In: Proceedings of the International Conference on Computer Vision, vol. 2, pp. 726–733.
Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R., 2007. Actions as space–time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (12), 2247–2253.
Holte, M., Moeslund, T., Fihl, P., 2010. View-invariant gesture recognition using 3D optical flow and harmonic motion context. Computer Vision and Image Understanding 114 (12), 1353–1361.
Isaacson, R., Fujita, F., 2006. Metacognitive knowledge monitoring and self-regulated learning: academic success and reflections on learning. Journal of the Scholarship of Teaching and Learning 6 (1), 39–55.
Japkowicz, N., Shah, M., 2011. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press (ISBN 9780521196000).
Lucas, B.D., Kanade, T., 1981. An iterative image registration technique with an application to stereo vision. In: IJCAI, pp. 674–679.
Nakada, T., Kagami, S., Mizoguchi, H., 2008. Pedestrian detection using 3D optical flow sequences for a mobile robot. In: Proceedings of IEEE SENSORS, pp. 776–779.
Nelson, T.O., Narens, L., 1980. Metamemory: a theoretical framework and new findings. In: Nelson, T.O. (Ed.), Metacognition: Core Readings. Allyn and Bacon, Boston, pp. 9–24.
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R., 2013. Berkeley MHAD: a comprehensive multimodal human action database. In: Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV).
Ogale, A.S., Karapurkar, A., Aloimonos, Y., 2005. View-invariant modeling and recognition of human actions using grammars. In: Workshop on Dynamical Vision at ICCV'05, WDV.
Poppe, R., 2010. A survey on vision-based human action recognition. International Journal of Computer Vision 28 (2/3), 976–990.
Rivers, W.P., 2001. Autonomy at all costs: an ethnography of meta-cognitive self-assessment and self-management among experienced language learners. The Modern Language Journal 85 (2), 279–290.
Schuldt, C., Laptev, L., Caputo, B., 2004. Recognizing human actions: a local SVM approach. In: Proceedings of the IEEE International Conference on Pattern Recognition, vol. 3, pp. 32–36.
Suresh, S., Omkar, S.N., Mani, V., Prakash, T.N.G., 2003. Lift coefficient prediction at high angle of attack using recurrent neural network. Aerospace Science and Technology 7 (8), 595–602.
Suresh, S., Sundararajan, N., Saratchandran, P., 2008. Risk-sensitive loss functions for sparse multi-category classification problems. Information Sciences 178 (12), 2621–2638.
Weinland, D., Ronfard, R., Boyer, E., 2006. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104 (2), 249–257.
Weinland, D., Ronfard, R., Boyer, E., 2011. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 224–241.
Wenden, A.L., 1998. Meta-cognitive knowledge and language learning. Applied Linguistics 19 (4), 515–537.
Yamato, J., Ohya, J., Ishii, K., 1992. Recognizing human action in time-sequential images using hidden Markov model. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 379–385.
Yilmaz, A., Shah, M., 2008. A differential geometric approach to representing the human actions. Computer Vision and Image Understanding 119 (3), 335–351.
Zhang, T., 2004. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32 (1), 56–85.