Subject independent human action recognition using spatio-depth information and meta-cognitive RBF network

R. Venkatesh Babu (a,*), R. Savitha (b), S. Suresh (b), Bhuvnesh Agarwal (a)

(a) Video Analytics Lab, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India
(b) School of Computer Engineering, Nanyang Technological University, Singapore

* Corresponding author. Tel.: +91 80 22932900. E-mail address: [email protected] (R. Venkatesh Babu).
Article history: Received 13 June 2012; received in revised form 22 May 2013; accepted 10 July 2013.

Keywords: Action recognition; 3-D optical flow; Kinect depth sensor; Projection based learning; Meta-cognition and self-regulated learning.

Abstract

In this paper, we present a machine learning approach for subject independent human action recognition using a depth camera, emphasizing the importance of depth in the recognition of actions. The proposed approach uses the flow information in all three dimensions to classify an action. In our approach, we obtain the 2-D optical flow and use it along with the depth image to obtain the depth flow (Z motion vectors). The obtained flow captures the dynamics of the actions in space–time. Feature vectors are obtained by averaging the 3-D motion over a grid laid over the silhouette in a hierarchical fashion. These hierarchical fine-to-coarse windows capture the motion dynamics of the object at various scales. The extracted features are used to train a Meta-cognitive Radial Basis Function Network (McRBFN) that uses a Projection Based Learning (PBL) algorithm, referred to as PBL-McRBFN henceforth. PBL-McRBFN begins with zero hidden neurons and builds the network based on the best human learning strategy, namely, self-regulated learning in a meta-cognitive environment. When a sample is used for learning, PBL-McRBFN uses the sample overlapping conditions and a projection based learning algorithm to estimate the parameters of the network. The performance of PBL-McRBFN is compared to that of Support Vector Machine (SVM) and Extreme Learning Machine (ELM) classifiers with representation of every person and action in the training and testing datasets. The performance study shows that PBL-McRBFN outperforms these classifiers in recognizing actions in 3-D. Further, a subject-independent study is conducted using a leave-one-subject-out strategy and its generalization performance is tested. It is observed from the subject-independent study that McRBFN is capable of generalizing actions accurately. The performance of the proposed approach is benchmarked with the Video Analytics Lab (VAL) dataset and the Berkeley Multimodal Human Action Database (MHAD).
1. Introduction

In recent years, recognition of human actions has been a major concern in computer vision due to its immense applications in autonomous video surveillance, video retrieval and human–computer interaction. These applications require methods for recognizing human actions and gestures in various scenarios. Given a number of pre-defined actions, the action recognition problem can be stated as that of classifying a new action into one of these pre-existing actions. Many action recognition approaches utilize the human silhouette for extracting various features. In one of the initial works by Bobick and Davis (2001), the extracted silhouettes are used to construct binary motion energy image (MEI) and motion history image (MHI) templates for representing actions. Yamato et al. (1992) used grid-based silhouette mesh features to form a compact codebook of observations for representing actions
using hidden Markov models. Extracting silhouettes from real-life videos is challenging and prone to noise. Noisy silhouettes are handled by phase correlation (Ogale et al., 2005), or by constructing space–time volumes over silhouette images (Gorelick et al., 2007; Yilmaz and Shah, 2008). Optical flow is another major technique used for recognizing actions. Efros et al. (2003) calculate optical flow in person-centered images in order to model the relative motion among different locations of the object. Babu et al. (2002) utilized the readily available motion vectors from the compressed video stream for recognizing actions. Ali and Shah (2010) derive 11 kinematic features, such as divergence, vorticity, symmetry and gradient tensor features, from the optical flow, and principal component analysis is applied to determine the dominant kinematic modes. The actions are classified using multiple instance learning, in which each action video is represented by a bag of kinematic modes. Poppe (2010) and Weinland et al. (2011) have presented detailed surveys on human action recognition. Most of the action recognition algorithms are benchmarked using one of the following publicly available datasets: (i) KTH (Schuldt et al., 2004), (ii) Weizmann (Gorelick et al., 2007) and
(iii) IXMAS (Weinland et al., 2006). The KTH and Weizmann datasets are captured from a single camera, while the IXMAS dataset is created using 5 calibrated and synchronized cameras to capture actions in multiple views. Due to the lack of depth information, the actions were performed parallel to the image plane in order to capture the dynamics with less ambiguity. Hence, the action recognition algorithms developed on these datasets are mostly view dependent. Since 3-D data acquisition requires a special stereo camera setup or an expensive image capturing device, 3-D optical flow based techniques are rarely used in computer vision applications. Holte et al. (2010) used an expensive SwissRanger SR4000 camera to capture an RGB–Depth database for action recognition; they used 3-D optical flow for recognizing 4 different actions. A stereo camera has been used for detecting humans in Nakada et al. (2008). However, with the advancement of camera technology, we are now able to capture depth images that provide information in the third dimension, with which we can represent and recognize actions more accurately, at an affordable cost.

In this paper, we propose a method for human action recognition using spatio-depth information. The advantage of this approach is that multiple views are not required to capture the motion in all directions. Thus, using depth information, we are able to recognize actions that are difficult to recognize in the spatial domain alone. The 2-D optical flow combined with the depth flow gives us the complete information of motion in an action. The region of interest (silhouette) is extracted easily using the depth image. The 2-D optical flow is calculated on the gray scale images masked by the silhouette. We then utilize this 2-D optical flow information along with the depth images provided by the depth camera to compute the depth flow for the region of interest. Since the actions are performed over time, the 3-D optical flow between 2 consecutive frames is not sufficient to capture the information about an action. Hence, the 3-D optical flow is accumulated over N frames so that it contains enough details about an action. We use a hierarchical division of the silhouette region to find the average motion of an action at different scales. These average motion vectors are used as features for representing the actions. The proposed algorithm has been evaluated on our Video Analytics Lab (VAL) dataset and the publicly available MHAD dataset, both of which were captured using a Kinect sensor with RGB and depth information. The proposed approach can be easily adapted to various applications such as gesture recognition, emotion recognition and gait recognition.

These feature vectors are then used to classify the actions using a Projection Based Learning (PBL) algorithm of a Meta-cognitive Radial Basis Function Network (McRBFN). McRBFN emulates the Nelson and Narens model of human meta-cognition (Nelson and Narens, 1980), and has 2 components, namely, a cognitive component and a meta-cognitive component. A radial basis function network with a Gaussian activation function at the hidden layer is the cognitive component of McRBFN, and a self-regulatory learning mechanism is its meta-cognitive component. McRBFN begins with zero hidden neurons, and adds and prunes neurons until an optimum network structure is obtained.
The self-regulatory learning mechanism of McRBFN uses the best human learning strategy, namely, self-regulated learning (Wenden, 1998; Rivers, 2001; Isaacson and Fujita, 2006), to decide what-to-learn, when-to-learn and how-to-learn in a meta-cognitive framework. Based on its decision, samples are either deleted (sample deletion strategy), used in the learning process (sample learn strategy) or reserved for future use (sample reserve strategy). The sample deletion, sample learn and sample reserve strategies address the what-to-learn, how-to-learn and when-to-learn components of meta-cognition, respectively. Thus, the meta-cognitive component continuously assesses the knowledge of the cognitive component, identifies when new knowledge is required and controls the
learning ability of the cognitive component. Therefore, the network that is finally built is compact, is a more accurate representation of the training data, and is not over-trained. During the sample learn strategy, McRBFN either adds a neuron or updates the parameters of the existing neurons. While adding a neuron, the input/hidden layer parameters of the network are fixed based on the sample overlapping conditions, and the optimal output weights are estimated using a projection based learning algorithm. The problem of estimating the optimal output weights is formulated as a linear programming problem, which is then converted to a system of linear equations and solved by the projection based learning algorithm. While solving the system of linear equations, PBL estimates the output weights corresponding to the minimum energy point of the hinge-loss error function. On the other hand, when a sample is used to update the existing network parameters, a recursive least squares algorithm is used (Chong and Zak, 2001). The McRBFN using the PBL to address the how-to-learn component of meta-cognition will hereafter be referred to as the "Projection Based Learning algorithm of a Meta-cognitive Radial Basis Function Network (PBL-McRBFN)".

The performance of PBL-McRBFN in recognizing actions is evaluated by a 10-fold cross validation study and a subject-independent recognition study. First, a 10-fold cross validation study is conducted with 8 subjects and 8 actions. During this study, it is ensured that there is representation of all the subjects and all the actions in both the training and testing datasets. The datasets thus generated are used to study the action recognition performance of PBL-McRBFN in comparison with Support Vector Machine (SVM) and Extreme Learning Machine (ELM) classifiers. The results show the superior action recognition performance of PBL-McRBFN. The person-independent action recognition performance of the classifiers is studied by training the classifiers using 7 subjects and testing their generalization ability using the actions performed by the remaining subject. The results of this study show that PBL-McRBFN is able to generalize actions, independent of the representation of the subject in the training dataset. The performances of the classifiers are also studied statistically using a one-way ANOVA test, which indicates the superior performance of PBL-McRBFN. The performances of the classifiers are further verified on the Berkeley Multi-modal Human Action Database (MHAD) (Ofli et al., 2013).

The paper is organized as follows: Section 2 presents the overview of the proposed action recognition model using 3-D optical flow features. In Section 3, the data with 3-D optical flow features is described and the performance of PBL-McRBFN is studied in comparison with other classifiers from the literature. Finally, Section 4 summarizes this study on subject-independent human action recognition using 3-D optical flow features.
2. System overview

The overview of the proposed approach with the VAL database is illustrated in Fig. 1. First, the RGB and depth video feeds are calibrated to map the pixel locations between the two frames. The calibrated depth image is represented as 8-bit data by appropriately scaling the depth range. The normalized depth image is used for extracting the silhouette. The 2-D optical flow is extracted for the silhouette region using gray scale images. The 3-D optical flow is obtained using the estimated 2-D optical flow and the corresponding normalized depth images. Finally, 3-D optical flow based features are extracted from hierarchically arranged spatio-temporal windows for representing the actions. These extracted features are then used to train a meta-cognitive radial basis function network using a projection based learning algorithm. We explain each component of Fig. 1 in detail in the following sections.
2.1. Calibration of Kinect

The depth camera in the Kinect has a smaller field of view than the RGB camera. Hence, the obtained depth image is slightly magnified and translated with respect to the RGB image. In order to use both the RGB and depth images simultaneously, we have to map the pixels between the two images. For calibrating the Kinect, we use a fixed parallelogram. The coordinates of the corners of the parallelogram are obtained for both the depth and RGB images. A warp matrix obtained from these corresponding coordinates is used to map the depth image onto the corresponding RGB image.
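As an illustration, such a warp can be estimated from the four corner correspondences with OpenCV. The sketch below is a minimal example under the assumption that the corner coordinates have already been measured in both images; the specific coordinate values are placeholders, not measurements from the paper.

```python
import cv2
import numpy as np

# Corner coordinates of the calibration parallelogram, measured once in the
# depth image and in the RGB image (placeholder values, not from the paper).
corners_depth = np.float32([[102, 84], [518, 90], [510, 402], [96, 396]])
corners_rgb   = np.float32([[120, 100], [530, 108], [522, 420], [114, 412]])

# Warp (perspective) matrix that maps depth-image pixels onto RGB pixels.
warp = cv2.getPerspectiveTransform(corners_depth, corners_rgb)

def align_depth_to_rgb(depth, rgb_shape):
    """Warp a depth frame into the RGB camera's pixel grid."""
    h, w = rgb_shape[:2]
    return cv2.warpPerspective(depth, warp, (w, h),
                               flags=cv2.INTER_NEAREST)  # nearest keeps depth values intact
```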
2.2. Depth normalization

The depth information provided by the Kinect is represented by 11 bits, but processing all the images at 11 bits is computationally expensive. Hence, we have to scale them down to a lower bit depth,
which results in losing finer depth information in the region of interest. To overcome both problems, we find 2 flexible threshold values between which all the actions can be described completely for all users. The depth information of the far background, or of the region very close to the camera, does not contribute to recognizing the actions. We then scale down the depth values within this range to an 8-bit number, which preserves the finer details of depth in the desired region. Fig. 2 illustrates this normalization process.

2.3. Silhouette extraction using depth image

Detection and elimination of the background using only the 2-dimensional (RGB) image is difficult and inefficient. However, the background can be easily identified and removed with the help of the depth image. We make use of the fact that the subject is always at a particular distance from the background pixels. We find a suitable depth threshold value for all the subjects, above which we classify all pixels as background. Thus, we easily get the depth silhouette of the subject:

$$D(i, j, t) = \begin{cases} D'(i, j, t) & \text{if } D'(i, j, t) \le \zeta \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where i, j denote the row and column positions of the pixel in the image, t is the time stamp of the temporal frames, $\zeta$ is the background threshold depth value, $D'$ is the depth image and D is the depth silhouette of the subject:

$$G(i, j, t) = \begin{cases} 1 & \text{if } D(i, j, t) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where (i, j) denotes the row and column of the pixel location, t is the time stamp of the temporal frames, the mask G is the binary silhouette image and D is the silhouette of the depth image. Fig. 3 shows the extracted binary (G) and depth (D) silhouettes for a frame. Then, we use this binary mask (G) to extract the silhouette of the corresponding RGB image:

$$I'(i, j, t) = \begin{cases} I(i, j, t) & \text{if } G(i, j, t) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where I is the RGB image and I' is the silhouette of the RGB image. Executing the aforementioned simple steps, we can easily remove the background and extract only the region of interest for further analysis.
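A minimal NumPy sketch of the normalization and masking steps of Eqs. (1)-(3) is given below. The near and far depth thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

NEAR, FAR = 800, 3500   # illustrative 11-bit depth thresholds (not from the paper)

def normalize_depth(depth11):
    """Clip the 11-bit depth map to the region of interest and rescale to 8 bits."""
    d = np.clip(depth11.astype(np.float32), NEAR, FAR)
    return ((d - NEAR) / (FAR - NEAR) * 255.0).astype(np.uint8)

def extract_silhouettes(depth, rgb, zeta):
    """Eqs. (1)-(3): depth silhouette D, binary mask G and RGB silhouette I'."""
    D = np.where(depth <= zeta, depth, 0)      # Eq. (1): keep pixels nearer than zeta
    G = (D > 0).astype(np.uint8)               # Eq. (2): binary silhouette mask
    I_sil = rgb * G[..., None]                 # Eq. (3): mask applied to each RGB channel
    return D, G, I_sil
```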
Fig. 1. System overview.
2.4. 3-D optical flow estimation

This section provides details of the 3-D optical flow estimation. First, the 2-D optical flow between consecutive images is obtained for the silhouette region. The depth flow is then obtained from the 2-D optical flow and the corresponding depth images.
Fig. 2. (a) Original depth image. (b) Normalized depth image.
Fig. 3. (a) Binary silhouette. (b) Depth silhouette.
2.4.1. 2-D optical flow computation

Calculation of 2-D optical flow is a well known problem in computer vision. There are many algorithms to compute the optical flow, of which we use the pyramidal Lucas–Kanade algorithm (Lucas and Kanade, 1981) due to its speed and robustness. We obtain the gray scale image from the RGB image and then compute the 2-D optical flow by applying the pyramidal Lucas–Kanade algorithm on the silhouette of the gray scale images for every action video.

2.4.2. Depth flow computation

The depth image provided by the depth camera enables us to calculate the motion vectors in the Z direction as well. The depth motion vectors can be easily computed using the 2-D optical flow and the depth images. The Z motion vectors can be obtained by subtracting the depth values of the same point on the subject in 2 consecutive depth frames. Let us consider a point in one of the temporal depth frames. To get the new location of the point in the next depth frame, we add the 2-D optical flow of that point to its present location (XY coordinates) in the current frame. This gives the new location of that point in the next depth image:

$$x_n = x_o + M_y \qquad (4)$$

$$y_n = y_o + M_x \qquad (5)$$

where $(x_n, y_n)$ is the location of the new point in the second depth frame and $(x_o, y_o)$ is the location of the point under consideration in the first depth frame. $M_x$ and $M_y$ are the motion vectors along the horizontal and vertical directions, respectively, for the current (first) frame. The precision of optical flow vectors depends on various factors such as surface texture, occlusion, and covering and uncovering of image regions. At locations where the optical flow vectors are not correct, or where the IR depth sensor is blocked, subtracting the depth images directly may not give correct depth motion vectors. To tackle this problem, a local neighborhood in the second depth frame around the estimated point is considered. We make a basic practical assumption that the depth of a particular pixel does not vary arbitrarily in its immediate neighborhood, unless it is an edge pixel. The non-zero pixels in this window are averaged to get the average depth value of that part of the subject. We then subtract the depth value of the pixel under consideration from this average value. This provides a reliable depth (Z direction) motion for that pixel:

$$M_z(x_o, y_o) = D^n_{avg}(x_n, y_n) - D^{(n-1)}(x_o, y_o) \qquad (6)$$

where $M_z$ is the motion along depth (Z direction) for the current frame, $D^n_{avg}$ is the average depth value in the immediate small neighborhood of the pixel in the second depth image and $D^{(n-1)}$ is the current depth image under consideration.
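The sketch below strings these steps together for one pair of frames. It is an illustrative approximation: it uses OpenCV's dense Farnebäck flow as a stand-in for the pyramidal Lucas–Kanade flow used in the paper, and a 5x5 neighborhood for the depth averaging (the window size is an assumption, not specified above).

```python
import cv2
import numpy as np

def depth_flow(gray_prev, gray_next, depth_prev, depth_next, mask, win=2):
    """Return (Mx, My, Mz) on the silhouette pixels of the current frame."""
    # Dense 2-D flow; the paper uses pyramidal Lucas-Kanade, Farneback is a stand-in.
    flow = cv2.calcOpticalFlowFarneback(gray_prev, gray_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    Mx, My = flow[..., 0], flow[..., 1]
    Mz = np.zeros_like(depth_prev, dtype=np.float32)

    ys, xs = np.nonzero(mask)                       # silhouette pixels only
    H, W = depth_prev.shape
    for y, x in zip(ys, xs):
        # Displace the pixel by its 2-D flow (Eqs. (4)-(5); here x is the column, y the row).
        xn = int(round(x + Mx[y, x]))
        yn = int(round(y + My[y, x]))
        if not (0 <= xn < W and 0 <= yn < H):
            continue
        # Average the non-zero depths in a small neighborhood of the displaced location.
        patch = depth_next[max(0, yn - win):yn + win + 1,
                           max(0, xn - win):xn + win + 1]
        nz = patch[patch > 0]
        if nz.size:
            Mz[y, x] = nz.mean() - depth_prev[y, x]  # Eq. (6)
    return Mx, My, Mz
```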
2.5. Feature extraction

For the extraction of the features, we follow the approach proposed in Babu and Suresh (2011). First, a minimum bounding rectangle that captures the complete motion of an action is obtained. This bounding box is adaptive, depends on the current silhouette image, and is obtained by accumulating the motion vectors over the sequence of N frames. This gives a tight bounding box for each action segment. The bounding box is then hierarchically divided into 54 windows placed symmetrically with respect to the subject's center: it is divided into 6 × 6 windows, 3 × 3 windows, 2 × 2 windows, 2 × 1 windows, 1 × 2 windows and finally a 1 × 1 window of equal size. We then compute the average motion of each window by averaging the non-zero motion vectors of all 3 dimensions in each window. Hence, we get the average motion of each window inside the bounding box in all 3 dimensions, giving a feature vector of length 162 (54 × 3) for every frame, where the first 54 features represent the average x motion vectors, the next 54 features the average y motion vectors and the final 54 features the average z motion vectors at the different hierarchical levels. The average motion of a single frame does not contain enough information to represent an action. Hence, we use the average motion of N frames to obtain the feature vectors, which contains a good amount of information about the dynamics of an action. We sum the motion over 8 frames with an overlap of 4 frames to get the feature vectors for an action. Thus, the dataset can be represented by $\{(u^1, c^1), \ldots, (u^t, c^t), \ldots, (u^N, c^N)\}$, where $u^t \in \mathbb{R}^m = [u^t_1 \ldots u^t_m]$, $m = 162$, are the features along the 3 directions used to represent actions and $c^t \in [1, \ldots, A]$ refers to one of the A actions. In the next section, we present a brief description of the McRBFN classifier that is used to map the features to their corresponding actions.
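Before moving on to the classifier, a minimal sketch of the hierarchical window averaging described above is given below, assuming the accumulated per-pixel motion components Mx, My, Mz have already been cropped to the bounding box; the grid layout follows the 36 + 9 + 4 + 2 + 2 + 1 = 54-window division.

```python
import numpy as np

GRIDS = [(6, 6), (3, 3), (2, 2), (2, 1), (1, 2), (1, 1)]   # 36+9+4+2+2+1 = 54 windows

def window_means(M):
    """Average the non-zero values of one motion component over the 54 windows."""
    H, W = M.shape
    feats = []
    for rows, cols in GRIDS:
        for r in range(rows):
            for c in range(cols):
                block = M[r * H // rows:(r + 1) * H // rows,
                          c * W // cols:(c + 1) * W // cols]
                nz = block[block != 0]
                feats.append(nz.mean() if nz.size else 0.0)
    return np.asarray(feats)                                  # 54 values

def action_feature(Mx, My, Mz):
    """162-dimensional feature: 54 x-motion, 54 y-motion and 54 z-motion averages."""
    return np.concatenate([window_means(Mx), window_means(My), window_means(Mz)])
```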
2.6. Meta-cognitive radial basis function network

Let the dataset generated using the procedure described in Section 2.5 be given by $\{(u^1, c^1), \ldots, (u^t, c^t), \ldots, (u^N, c^N)\}$, where $u^t \in \mathbb{R}^m = [u^t_1, \ldots, u^t_m]^T$ are the m-dimensional input features and $c^t \in [1, \ldots, C]$ are the corresponding action class labels. The coded class labels for the action classes are given by

$$o^t_l = \begin{cases} 1 & \text{if } c^t = l \\ -1 & \text{otherwise} \end{cases}, \qquad l = 1, \ldots, C \qquad (7)$$

The objective of the neural network learning algorithm is to estimate the functional relationship between the action features and their corresponding coded class labels as accurately as possible. In this paper, we use the PBL-McRBFN developed in Babu et al. (2012). Analogous to the model of human meta-cognition proposed by Nelson and Narens (1980), McRBFN has 2 components, namely, a cognitive component and a meta-cognitive component, as shown in Fig. 4. We briefly discuss these components and the projection based fast learning algorithm of McRBFN in this section. For complete details, one may refer to Babu et al. (2012).

2.6.1. Cognitive component

A single hidden layer radial basis function network with a Gaussian activation function at its hidden layer is the cognitive component of McRBFN. The neurons in the input and output layers of the RBF network are linear. Without loss of generality, let us assume that the RBF network has K neurons after t-1 samples. The neurons in the hidden layer of McRBFN use the Gaussian activation function, and the response of the j-th hidden neuron for the t-th sample ($h^t_j$) is given by

$$h^t_j = \exp\left(-\frac{\|u^t - c_j\|^2}{2\sigma_j^2}\right) \qquad (8)$$

where $c_j \in \mathbb{R}^m$ is the center of the j-th hidden neuron and $\sigma_j \in \mathbb{R}$ is the Gaussian width of the j-th hidden neuron. The neurons in the output layer of the radial basis function network obtain the weighted sum of the hidden layer responses. Thus, the response of the output neurons is the output of the network, and the response of the l-th output neuron for the t-th sample ($\hat{o}^t_l$) is given by

$$\hat{o}^t_l = \sum_{j=1}^{K} w_{lj} h^t_j \qquad (9)$$

where $w_{lj}$ is the output weight connecting the j-th hidden neuron to the l-th output neuron. The action class label of the t-th sample can be obtained from this output as

$$\hat{c}^t = \arg\max_{l = 1, 2, \ldots, C} \hat{o}^t_l \qquad (10)$$

Since the hinge loss error function has been shown to estimate the posterior probability more accurately than the mean-square error function in solving classification problems (Zhang, 2004; Suresh et al., 2008), PBL-McRBFN also uses the hinge loss error function. The hinge loss error of the t-th sample is given by

$$e^t_j = \begin{cases} 0 & \text{if } o^t_j \hat{o}^t_j > 1 \\ o^t_j - \hat{o}^t_j & \text{otherwise} \end{cases}, \qquad j = 1, 2, \ldots, n \qquad (11)$$

The maximum absolute hinge error ($E^t$) is given by

$$E^t = \max_{j \in 1, 2, \ldots, n} |e^t_j| \qquad (12)$$

Projection based learning algorithm: The projection based learning algorithm works on the principle of minimization of an energy function and finds the network output parameters for which the energy function is minimum, i.e., the network achieves the minimum of the energy function. The considered energy function is the sum of squared errors at the McRBFN output neurons:

$$J_i = \sum_{j=1}^{n} (o^i_j - \hat{o}^i_j)^2, \qquad i = 1, \ldots, N \qquad (13)$$

For t training samples, the overall energy function is defined as

$$J(W) = \frac{1}{2} \sum_{i=1}^{t} J_i = \frac{1}{2} \sum_{i=1}^{t} \sum_{j=1}^{n} (o^i_j - \hat{o}^i_j)^2 \qquad (14)$$
Fig. 4. Schematic diagram of McRBFN classifier.
By substituting the predicted output $\hat{o}^i_j$ from Eq. (9) in Eq. (14), the energy function reduces to

$$J(W) = \frac{1}{2} \sum_{i=1}^{t} \sum_{j=1}^{n} \left( o^i_j - \sum_{k=1}^{K} w_{kj} h^i_k \right)^2 \qquad (15)$$

where $h^i_k$ is the response of the k-th hidden neuron for the i-th training sample. The optimal output weights ($W^* \in \mathbb{R}^{K \times n}$) are estimated such that the total energy reaches its minimum:

$$W^* := \arg\min_{W \in \mathbb{R}^{K \times n}} J(W) \qquad (16)$$

Accordingly, the optimal output weights are estimated using (Babu and Suresh, 2013)

$$W^* = A^{-1} B \qquad (17)$$

where the projection matrix $A \in \mathbb{R}^{K \times K}$ is given by

$$a_{kp} = \sum_{i=1}^{t} h^i_k h^i_p, \qquad k = 1, \ldots, K, \; p = 1, \ldots, K \qquad (18)$$

and the output matrix $B \in \mathbb{R}^{K \times n}$ is

$$b_{pj} = \sum_{i=1}^{t} h^i_p o^i_j, \qquad p = 1, \ldots, K, \; j = 1, \ldots, n \qquad (19)$$
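As an illustration of Eqs. (17)-(19), the sketch below computes the hidden-layer responses and solves for the output weights in one shot; the centers, widths and the plain (unregularized) linear solve are assumptions of this minimal example, not details prescribed by the paper.

```python
import numpy as np

def hidden_responses(U, centers, widths):
    """Gaussian responses h_k^i of Eq. (8) for all samples U (t x m)."""
    # Squared distances between every sample and every center (t x K).
    d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * widths ** 2))

def pbl_output_weights(U, O, centers, widths):
    """Projection based learning: W* = A^{-1} B (Eqs. (17)-(19))."""
    H = hidden_responses(U, centers, widths)   # t x K
    A = H.T @ H                                # Eq. (18): K x K projection matrix
    B = H.T @ O                                # Eq. (19): K x n output matrix
    return np.linalg.solve(A, B)               # solves A W = B, i.e. W* = A^{-1} B
```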
Next, we describe the meta-cognitive component of McRBFN and briefly explain the projection based learning algorithm of McRBFN that has been developed using this hinge loss error function.

2.6.2. Meta-cognitive component

The meta-cognitive component has knowledge about the knowledge of the cognitive component and controls the learning of the cognitive component. It contains a dynamic model of the cognitive component and comprises a self-regulatory learning mechanism to decide what-to-learn, when-to-learn and how-to-learn. As mentioned earlier, the cognitive component of McRBFN begins with zero hidden neurons, and the meta-cognitive component adds, prunes or updates neurons in the cognitive component until an optimum network structure is obtained. A projection based fast learning algorithm is used to fix the parameters of the neurons. Based on the error and the distance of the sample from the existing neurons, the meta-cognitive component chooses one of the following strategies for each sample in the dataset:

Sample delete strategy: If the knowledge contained in a sample is similar to that already present in the network, delete the sample from the training dataset. This strategy uses the following criterion to address the what-to-learn component of meta-cognition:

$$\text{If } c^t = \hat{c}^t \text{ AND } E^t \le \beta_d, \text{ then delete the sample} \qquad (20)$$

where $\beta_d$ is the delete threshold fixed at a desired accuracy.

Sample learn strategy: This strategy decides how-to-learn the training sample. Depending on the novelty of the knowledge contained in the sample, either the neuron growth strategy or the parameter update strategy is chosen.

Neuron growth strategy: When a new training sample has novel knowledge and the estimated class label is different from the actual class label, a new hidden neuron is added to represent the knowledge contained in the sample. The neuron growth criterion is given by

$$\text{If } (\hat{c}^t \ne c^t \text{ OR } E^t \ge \beta_a) \text{ AND } \psi_c(u^t) \le \beta_c, \text{ then add a neuron} \qquad (21)$$

Here, $\psi_c$ is a measure of the class-wise significance (Babu and Suresh, 2013) and is defined as

$$\psi_c = \frac{1}{K^c} \sum_{k=1}^{K^c} h^t_k\left(u^t, \mu^c_k\right) \qquad (22)$$

where $K^c$ is the number of neurons associated with class c, $h^t_k$ is the hidden layer response as defined in Eq. (8) and $u^t$ is the input feature of the t-th sample. The threshold $\beta_c$ is the meta-cognitive knowledge measurement threshold and $\beta_a$ is the self-adaptive meta-cognitive neuron addition threshold. These thresholds select samples with significant knowledge for building the network, so that the other samples can be used to fine tune the network parameters. The neuron addition threshold is self-adapted according to

$$\beta_a := \delta \beta_a + (1 - \delta) E^t \qquad (23)$$
where $\delta$ is the slope that controls the rate of self-adaptation and is set close to 1.

A training sample that is used to add a neuron may overlap with neurons in other classes or may form a distinct cluster far away from the nearest neuron in the same class. These conditions might affect the classification performance of a classifier significantly. Hence, McRBFN measures the distance from the current sample to the nearest neuron in the inter and intra class while assigning the new neuron parameters. Thus, the parameters of a new hidden neuron are initialized based on the overlapping and distinct cluster criteria. The nearest hidden neuron in the intra class (nrS) and the nearest hidden neuron in the inter class (nrI) are defined as

$$nrS = \arg\min_{l = c, \forall k} \|u^t - \mu^l_k\|, \qquad nrI = \arg\min_{l \ne c, \forall k} \|u^t - \mu^l_k\| \qquad (24)$$

The Euclidean distances from the new training sample to nrS and nrI are given as

$$d_S = \|u^t - \mu^c_{nrS}\|, \qquad d_I = \|u^t - \mu^l_{nrI}\| \qquad (25)$$
Using the nearest neuron distances, we determine the center and width of the new neuron based on the overlapping/non-overlapping conditions defined in Babu and Suresh (2013), so as to avoid misclassification. When there is no overlap of the sample with any neuron in any class, the center and width of the new neuron are initialized as

$$\mu^c_{K+1} = u^t, \qquad \sigma^c_{K+1} = \kappa \sqrt{u^{tT} u^t} \qquad (26)$$

Then, the output weights are estimated using the projection based learning algorithm described below. The size of matrix A is increased from $K \times K$ to $(K+1) \times (K+1)$:

$$A_{(K+1)\times(K+1)} = \begin{bmatrix} A_{K \times K} + (h^t)^T h^t & (h^t)^T h^t_{K+1} \\ a_{K+1} & a_{K+1,K+1} \end{bmatrix} \qquad (27)$$

where $h^t = [h^t_1, h^t_2, \ldots, h^t_K]$ is the vector of the existing K hidden neuron responses for the t-th training sample. The row vector $a_{K+1} \in \mathbb{R}^{1 \times K}$ is assigned as

$$a_{K+1,p} = \sum_{i=1}^{t} h^i_{K+1} h^i_p, \qquad p = 1, \ldots, K \qquad (28)$$

and the value $a_{K+1,K+1} \in \mathbb{R}^{+}$ is assigned as

$$a_{K+1,K+1} = \sum_{i=1}^{t} h^i_{K+1} h^i_{K+1} \qquad (29)$$

The size of matrix B is increased from $K \times n$ to $(K+1) \times n$:

$$B_{(K+1)\times n} = \begin{bmatrix} B_{K \times n} \\ b_{K+1} \end{bmatrix} \qquad (30)$$

where the matrix $B \in \mathbb{R}^{K \times n}$ is updated as

$$B = B + (h^t)^T (o^t)^T \qquad (31)$$

and $b_{K+1} \in \mathbb{R}^{1 \times n}$ is a row vector assigned as

$$b_{K+1,j} = \sum_{i=1}^{t} h^i_{K+1} o^i_j, \qquad j = 1, \ldots, n \qquad (32)$$

The vector $h^t$ in Eqs. (27) and (31) contains very small values, since the t-th sample is added as a hidden neuron that is significantly different from the existing hidden neurons. After neglecting the $h^t$ vector in Eqs. (27) and (31), the output weights are finally estimated as

$$W_{(K+1) \times n} = \begin{bmatrix} W_K \\ w_{K+1} \end{bmatrix}, \qquad w_{K+1} = \frac{b_{K+1} - a_{K+1} W_K}{a_{K+1,K+1}} \qquad (33)$$

where $W_K$ is the output weight matrix for the K hidden neurons, and $w_{K+1}$ is the vector of output weights for the new hidden neuron. It must be noted that the first sample is used as the first neuron of the network.
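A sketch of this neuron growth step is given below. To stay on safe ground it recomputes the enlarged A and B from their definitions in Eqs. (18)-(19) and solves the full system of Eq. (17) directly, rather than applying the neglect-$h^t$ shortcut of Eq. (33); the bookkeeping of all past samples U_past and their coded labels O_past is an assumption of this illustration.

```python
import numpy as np

def gaussian_responses(U, centers, widths):
    """h_k^i of Eq. (8) for all samples in U (t x m) and all neurons."""
    d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * widths ** 2))

def grow_neuron(U_past, O_past, centers, widths, u_t, kappa=0.5):
    """Add a hidden neuron at the current sample and re-estimate all output weights.

    kappa is the width-scaling constant of Eq. (26); its value here is illustrative.
    """
    centers = np.vstack([centers, u_t])                      # mu_{K+1} = u^t (Eq. (26))
    widths = np.append(widths, kappa * np.sqrt(u_t @ u_t))   # sigma_{K+1}

    H = gaussian_responses(U_past, centers, widths)          # t x (K+1)
    A = H.T @ H                                              # (K+1) x (K+1), cf. Eqs. (27)-(29)
    B = H.T @ O_past                                         # (K+1) x n,     cf. Eqs. (30)-(32)
    W = np.linalg.solve(A, B)                                # new output weights, Eq. (17)
    return centers, widths, W
```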
Parameter update strategy: The current (t-th) training sample is used for updating the output weights of the cognitive component ($W_K = [w_1, w_2, \ldots, w_K]^T$) if the following criterion is satisfied:

$$c^t = \hat{c}^t \text{ AND } E^t \ge \beta_u \qquad (34)$$

where $\beta_u$ is the self-adaptive meta-cognitive parameter update threshold. $\beta_u$ is self-adapted based on the prediction error as

$$\beta_u := \delta \beta_u + (1 - \delta) E^t \qquad (35)$$

where $\delta$ is the slope that controls the rate of self-adaptation during the parameter update and is typically set close to 1. When a sample is used for updating the output weight parameters, the PBL algorithm updates the output weights as given below. The matrices $A \in \mathbb{R}^{K \times K}$ and $B \in \mathbb{R}^{K \times n}$ are updated as

$$A = A + (h^t)^T h^t \qquad (36)$$

$$B = B + (h^t)^T (o^t)^T \qquad (37)$$

and the output weights are updated as

$$W_K = W_K + A^{-1} (h^t)^T (e^t)^T \qquad (38)$$

Sample reserve strategy: If the t-th sample does not satisfy any of the above criteria, then the sample is pushed to the rear of the training sequence. Since McRBFN modifies the strategies based on current sample knowledge, these samples may be used at a later stage.

We summarize the PBL-McRBFN below:

1. For each new training sample input ($u^t$), compute the output of the cognitive component ($\hat{o}^t$) using Eqs. (8) and (9).
2. Estimate the predicted class label of the cognitive component ($\hat{c}^t$), the maximum hinge error ($E^t$) and the class-wise significance measure ($\psi_c$) for the new training sample ($u^t$) using Eqs. (10), (12) and (22).
3. The meta-cognitive component selects one of the following strategies based on the above computed measures:
   (a) Sample delete strategy: If $c^t = \hat{c}^t$ AND $E^t \le \beta_d$, then delete the sample from the training dataset without learning.
   (b) Neuron growth strategy: If ($\hat{c}^t \ne c^t$ OR $E^t \ge \beta_a$) AND $\psi_c(u^t) \le \beta_c$, then allocate a new hidden neuron in the cognitive component. The new hidden neuron's width and center parameters are determined based on the intra and inter class nearest neuron distances. The output weight parameters for all hidden neurons are estimated with the PBL algorithm using Eq. (33). Also, update the self-adaptive meta-cognitive addition threshold using Eq. (23).
   (c) Parameter update strategy: If $c^t = \hat{c}^t$ AND $E^t \ge \beta_u$, then update the cognitive component output weight parameters with the PBL algorithm using Eq. (38). Also, update the self-adaptive meta-cognitive update threshold using Eq. (35).
   (d) Sample reserve strategy: When the new sample does not satisfy the deletion, growth and update criteria, push the sample to the reserve to be used later for learning.
4. The cognitive component executes the selected strategy.
5. Continue steps 1–4 until there are no more samples in the training dataset.

3. Results and discussions

In this section, we evaluate the performance of PBL-McRBFN in recognizing actions using 3-dimensional features. Two different studies are conducted: a 10-fold cross-validation study and a subject-independent action recognition study. In both these studies, the performance of PBL-McRBFN is compared with that of an SVM classifier and an ELM classifier. In all the experiments, the optimal number of support vectors of the SVM is obtained by optimizing c and γ in LIBSVM, and the number of hidden neurons in the ELM is obtained by the constructive-destructive procedure described in Suresh et al. (2003). The following measures are used to compare the performances of these classifiers:

Average classification efficiency ($\eta_a$):

$$\eta_a = \frac{1}{n} \sum_{l=1}^{n} \frac{q_{ll}}{N_l} \times 100\% \qquad (39)$$

where $q_{ll}$ is the number of correctly classified samples of class l in the training/testing dataset and $N_l$ is the total number of samples of class l.

Overall classification efficiency ($\eta_o$):

$$\eta_o = \frac{\sum_{l=1}^{n} q_{ll}}{N} \times 100\% \qquad (40)$$

Geometric mean efficiency ($\eta_g$):

$$\eta_g = \sqrt[n]{\prod_{l=1}^{n} \frac{q_{ll}}{N_l}} \times 100\% \qquad (41)$$
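The three measures of Eqs. (39)-(41) can be computed from a per-class confusion matrix; a minimal sketch is given below, assuming rows are true classes and columns are predicted classes.

```python
import numpy as np

def efficiencies(conf):
    """Return (eta_o, eta_a, eta_g) in percent from an n x n confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    q_ll = np.diag(conf)               # correctly classified samples per class
    N_l = conf.sum(axis=1)             # samples per (true) class
    N = conf.sum()                     # total number of samples
    recall = q_ll / N_l                # per-class classification rate
    eta_o = 100.0 * q_ll.sum() / N                          # Eq. (40): overall efficiency
    eta_a = 100.0 * recall.mean()                           # Eq. (39): average efficiency
    eta_g = 100.0 * recall.prod() ** (1.0 / len(recall))    # Eq. (41): geometric mean
    return eta_o, eta_a, eta_g
```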
First, we describe the datasets used in the study. Next, we present the results of the 10-fold cross-validation study, and then evaluate the generalization ability of the classifiers through a subject-independent action recognition study. In these studies, the performances of the classifiers are compared using the performance measures defined in Eqs. (39)–(41). Next, we conduct a one-way ANOVA test (Japkowicz and Shah, 2011) to compare the performances of these classifiers in the 10-fold cross validation tests. The ANOVA measure compares the mean of each individual experimental condition and checks whether these means differ significantly from the aggregate mean across all conditions. If the F-score is greater than the F-statistic at the 95% confidence level, then the hypothesis of equality of means (i.e., that the classifiers perform similarly on all the datasets) is rejected. If the equality hypothesis is rejected in the one-way ANOVA test, then pair-wise post hoc tests should be conducted to find which classifier is significantly different from the others. In this paper, a parametric Dunnett test is used to conduct the pair-wise comparison using the PBL-McRBFN classifier as the control. Finally, to highlight the essence of the 3-D features, the performance of the best performing PBL-McRBFN classifier using the 3-D features is compared with its performance using 2-D features.

3.1. Dataset

The proposed approach is evaluated using 2 datasets, namely, the Video Analytics Lab (VAL) database¹ and the Berkeley Multimodal Human Action Database (MHAD) (Ofli et al., 2013).
¹ http://val.serc.iisc.ernet.in:8080/project/VAL_Depth_Database.zip
Fig. 5. Actions considered in VAL database. Rows 1–2 (left to right): bending, bowling, boxing and jumping; rows 3–4 (left to right): kicking, stretching, swimming and waving.
The VAL database, recorded using the Kinect in static surrounding conditions, has been generated by us. Both the depth and RGB images are recorded at an average rate of 30 frames per second. The depth images are available as 11-bit images, but stored as 16-bit images. The resolution of both the depth and RGB images is 640 × 480. The Kinect is placed at a fixed height from the floor so as to capture the subject's entire body. The subjects are asked to perform the given task freely in front of the Kinect. The VAL database consists of 8 actions, namely, swimming, bending, waving, kicking, bowling, jumping, boxing and stretching. Each action is performed by 8 subjects approximately 3 times. The number of frames varies depending upon the speed of the person. Fig. 5 shows snapshots of some of the actions from our database.

The MHAD database, which contains 11 actions performed by 12 subjects, is the other dataset used in the study. The 11 actions are: jumping, jumping jacks, bending, punching, waving 2 hands, waving one hand, clapping, throwing, sit down/stand up, sit down and stand up. The database was captured by 5 different systems: an optical motion capture system, 4 multi-view
stereo vision camera arrays, 2 Microsoft Kinect cameras, 6 wireless accelerometers and 4 microphones. In our experiments, we have used only the information obtained from a single Kinect camera for recognizing actions.

3.2. Performance study: 10-fold cross-validation study

In the 10-fold cross validation test, 10 trials of experiments are conducted. In each of these trials, 75% of the samples in each action of all the subjects are randomly selected for developing the classifier and the remaining 25% of the samples in each action are used for testing the classifier. This approach is referred to as the "10-fold cross validation study". In this section, we present the results of the 10-fold cross validation study for the VAL database and the MHAD.

3.2.1. VAL database

We present the results of the 10-fold cross validation study for the VAL database in Table 1.
Table 1
VAL database: performance results of 10-fold cross validation test.

Classifier    K        Training ηo     Training ηa     Training ηg     Testing ηo      Testing ηa      Testing ηg
SVM           1589.1*  96.9 ± 2        96.85 ± 0.4     96.8 ± 0.45     96.8 ± 0.54     96.4 ± 1.3      96.27 ± 1.37
ELM           90       93.1 ± 0.6      88.42 ± 1.37    87.06 ± 1.92    93.2 ± 0.5      88.97 ± 1.2     87.77 ± 1.83
PBL-McRBFN    117.1    99.92 ± 0.12    99.92 ± 0.12    99.89 ± 0.14    99.84 ± 0.22    99.79 ± 0.34    99.79 ± 0.35

* Support vectors.
Table 2
MHAD: performance results of 10-fold cross validation test.

Classifier    K       Training ηo     Testing ηo
SVM           98.6*   95.15 ± 1.24    87.2 ± 4.99
ELM           71.5    93.64 ± 1.5     78.79 ± 5.34
PBL-McRBFN    43.6    99.89 ± 0.32    91.82 ± 2.49

* Support vectors.

Fig. 6. Neuron history for one trial (number of neurons K versus sample instance).
Fig. 7. Sample deletion history for one trial (number of deleted samples versus sample instance).

It can be observed from Table 1 that the PBL-McRBFN classifier outperforms the ELM and SVM classifiers in recognizing actions using 3-dimensional features. It is at least 3% better than the SVM classifier, and at least 6% better than the ELM classifier, in recognizing the human actions. Figs. 6 and 7 give the neuron history and the sample deletion history for one trial. From Fig. 6, it can be seen that the meta-cognitive component adds neurons to PBL-McRBFN during the training process. Further, it can be seen from Fig. 7 that PBL-McRBFN deletes 34 samples whose knowledge is similar to that already acquired by the network. It can also be seen that the sample deletion is more pronounced towards the end of the training. Hence, it can be observed that PBL-McRBFN has approximated the knowledge dynamics in the training dataset efficiently.

The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the 10-fold cross validation study using 3-D features is 550.1177. This is greater than the F-statistic at the 95% confidence level ($F_{2,18,0.05}$ = 4.560), i.e., 550.1177 > 4.560. Hence, the equality hypothesis of the ANOVA test can be rejected at the 95% confidence level. The observed t values obtained from the Dunnett test by comparing against SVM and ELM are 15.1705 and 33.1307, respectively, while the critical t value is 2.40 ($t_{3,18,0.05}$). Thus, the observed t values are much greater than the critical t value and, hence, it can be inferred that the PBL-McRBFN classifier significantly outperforms the SVM and ELM classifiers.
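As an illustration of this statistical comparison, the one-way ANOVA step can be reproduced with SciPy from the per-trial overall efficiencies of the three classifiers; the arrays below are placeholders for the ten per-trial η_o values, not the actual numbers from the experiments.

```python
import numpy as np
from scipy import stats

# Ten per-trial overall efficiencies (eta_o) per classifier -- placeholder values.
eta_svm = np.array([96.8, 96.5, 97.1, 96.2, 96.9, 97.0, 96.4, 96.7, 97.2, 96.6])
eta_elm = np.array([93.2, 92.8, 93.5, 93.0, 93.6, 92.9, 93.3, 93.1, 93.4, 93.2])
eta_pbl = np.array([99.8, 99.9, 99.7, 99.9, 99.8, 99.8, 99.9, 99.7, 99.8, 99.9])

# One-way ANOVA across the three classifiers (equality-of-means hypothesis).
f_score, p_value = stats.f_oneway(eta_svm, eta_elm, eta_pbl)
print(f"F = {f_score:.2f}, p = {p_value:.4f}")
# If the hypothesis is rejected, a post hoc Dunnett comparison against the
# PBL-McRBFN control would follow (e.g., scipy.stats.dunnett in SciPy >= 1.11).
```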
3.2.2. MHAD

Table 2 presents the results of the 10-fold cross validation study for the MHAD. From the table, it can be observed that the PBL-McRBFN classifier outperforms the ELM and SVM classifiers in recognizing actions using 3-D features by at least 4% and 13%, respectively. The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the 10-fold cross validation study using 3-D features is 26.8546, which is greater than the F-statistic at the 95% confidence level ($F_{2,18,0.05}$ = 4.560), i.e., 26.8546 > 4.560. Hence, the equality hypothesis of the ANOVA test can be rejected at the 95% confidence level. The observed t values obtained from the Dunnett test by comparing against SVM and ELM are 2.4263 and 6.8557, respectively, while the critical t value is 2.40 ($t_{3,18,0.05}$). Thus, the observed t values are greater than the critical t value and, hence, it can be inferred that the PBL-McRBFN classifier significantly outperforms the SVM and ELM classifiers.

3.3. Performance study: subject-independent action recognition study

In the subject-independent action recognition study, the actions performed by all subjects except one are used to develop the classifiers, and the generalization ability of the classifiers is tested using the actions performed by the untrained subject.

3.3.1. VAL database

Table 3 presents the testing efficiencies of the 3 classifiers, namely, SVM, ELM and PBL-McRBFN, for the subject-independent action recognition study.
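A leave-one-subject-out split of this kind can be expressed with scikit-learn's grouped cross-validation; the sketch below is a generic illustration, assuming a feature matrix X, action labels y and a per-sample subject-id array groups (names chosen here, not from the paper), and it uses an SVM as a stand-in classifier.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# X: (num_samples, 162) feature vectors, y: action labels, groups: subject id per sample.
def subject_independent_scores(X, y, groups):
    """Train on all subjects but one and test on the held-out subject, per subject."""
    logo = LeaveOneGroupOut()
    scores = {}
    for train_idx, test_idx in logo.split(X, y, groups):
        held_out = np.unique(groups[test_idx])[0]
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # stand-in classifier
        clf.fit(X[train_idx], y[train_idx])
        scores[held_out] = clf.score(X[test_idx], y[test_idx])
    return scores
```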
Table 3
VAL database: performance results of subject-independent action recognition study.

Test sub.   SVM K*   SVM ηo   SVM ηa   SVM ηg   ELM K   ELM ηo   ELM ηa   ELM ηg   PBL K   PBL ηo   PBL ηa   PBL ηg
1           1784     89.45    94.24    93.23    90      85.78    86.3     84.43    158     100      100      100
2           1653     76.64    79.12    69.19    90      73.14    73.99    70.3     207     78.38    85.46    83.73
3           1823     99.42    98.53    98.45    90      92.13    80.55    0        218     100      100      100
4           1754     92.94    87.98    84.85    90      92.06    90.57    90.08    178     92.06    87.87    85.58
5           1691     96.64    89.24    86.01    90      84.36    70.43    0        187     99.42    99.38    99.36
6           1661     79.53    77.9     0        90      87.06    85.89    80.52    202     89.88    87.03    85.22
7           1749     95.35    93.55    93.06    90      86.34    86.29    85.57    115     97.38    96.11    96
8           1670     85.78    88.97    87.42    90      84.15    88.44    86.92    123     99.77    99.88    99.88
Av.         1723     89.47    88.69    76.52    90      85.63    93.89    76.52    174     94.61    94.47    92.72

* Support vectors.
Table 4
MHAD: performance results of subject-independent action recognition study.

Test sub.   SVM K*   SVM train ηa   SVM test ηa   ELM K   ELM train ηa   ELM test ηa   PBL K   PBL train ηa   PBL test ηa
1           120      95.04          100           50      78.51          81.82         58      100            100
2           119      94.21          81.82         70      81.82          90.91         35      98.35          90.91
3           120      97.52          81.82         50      79.34          90.91         51      96.69          90.91
4           120      96.69          90.91         65      83.47          72.73         59      100            81.82
5           120      95.87          90.91         50      80.99          81.82         51      95.04          100
6           119      95.04          100           60      85.95          81.82         51      97.52          100
7           119      96.69          90.91         55      81.82          81.82         51      94.22          100
8           120      95.04          100           70      84.3           91.91         50      94.22          90.91
9           119      96.69          81.82         55      82.65          81.82         54      99.17          90.91
10          120      95.87          90.91         50      83.47          90.91         34      94.22          90.91
11          119      95.87          90.91         65      83.47          90.91         53      97.52          90.91
12          118      95.83          91.67         75      90.91          81.82         55      97.52          90.91
Av.         119      95.86          90.97         60      83.05          84.99         50      97.03          93.18

* Support vectors.
From the performance results, it can be observed that the overall efficiency of the PBL-McRBFN classifier is better than that of the SVM and ELM classifiers, by at least 5.14% and 1.41%, respectively. Further, the testing geometric mean accuracy of the SVM is 0 when subject 6 is held out of the training dataset, and that of the ELM classifier is 0 when subjects 3 and 5 are held out of the training dataset. It was observed that in these cases the classifiers failed to recognize the kicking action due to fewer samples in this class. However, the PBL-McRBFN classifier is able to recognize all the 8 actions, even when the subject is not represented in the training dataset and when the sample imbalance is high. Hence, it can be inferred that PBL-McRBFN can perform person independent action recognition using 3-D features efficiently.

The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the leave-one-out cross validation test using 3-D features is 9.5073, which is greater than the F-statistic at the 95% confidence level ($F_{2,14,0.05}$ = 3.739), i.e., 9.5073 > 3.739. Hence, the equality of means hypothesis can be rejected at the 95% confidence level. As the equality hypothesis is rejected, we conduct the Dunnett test using the PBL-McRBFN classifier as the control. Based on this test, the observed t values obtained by comparing against SVM and ELM are 2.4874 and 4.3454, respectively, while the critical t value is 2.46 ($t_{3,14,0.05}$). Hence, it can be inferred from the leave-one-out cross validation study that the PBL-McRBFN classifier performs significantly better than the SVM and ELM classifiers.
3.3.2. MHAD

Table 4 presents the testing efficiencies of the 3 classifiers, namely, SVM, ELM and PBL-McRBFN, for the subject-independent
action recognition study using the MHAD. From the performance results, it can be observed that the overall efficiency of the PBL-McRBFN classifier is better than that of the SVM and ELM classifiers, by at least 2.21% and 8.19%, respectively. The F-score based on the one-way ANOVA test on the 3 classifiers ($\eta_o$) for the leave-one-out cross validation test using 3-D features is 6.2211, which is greater than the F-statistic at the 95% confidence level ($F_{2,22,0.05}$ = 3.44), i.e., 6.2211 > 3.44. Hence, the equality of means hypothesis can be rejected at the 95% confidence level. As the equality hypothesis is rejected, we conduct the Dunnett test using the PBL-McRBFN classifier as the control. Based on this test, the observed t values obtained by comparing against SVM and ELM are 0.8736 and 3.2620, respectively, while the critical t value is 2.36 ($t_{3,22,0.05}$). Hence, it can be inferred from the leave-one-out cross validation study that the PBL-McRBFN classifier performs significantly better than the ELM classifier. Although the overall efficiency of PBL-McRBFN is greater than that of the SVM by 2.21%, the statistical difference between these two classifiers is not very significant. However, it must be noted that in the SVM classifier all the training samples are used as support vectors, which might affect its generalization performance significantly.

3.4. Performance study: comparison using 2-D and 3-D features

Next, to show the advantage of using 3-D features, we conduct the 10-fold cross validation study and the subject-independent action recognition study on the best performing PBL-McRBFN using 3-D and 2-D features.

3.4.1. VAL database

The average of the overall, average and geometric mean efficiencies of the studies using 2-D and 3-D features of the VAL database are presented in Table 5. From the table, it can be seen that the performance of PBL-McRBFN is better while using 3-D features than while using the 2-D feature set. The performance of the action recognition task using 3-D features of the VAL database has improved by at least 4%, compared to that obtained with 2-D features, in the 10-fold cross validation study. Moreover, there is a substantial improvement in performance while using 3-D features over 2-D features in the subject-independent action recognition study; the improvement in performance is at least 17%. From the performance results of both studies, it can be inferred that action recognition using 3-D features is more efficient, and is less sensitive to the appearance of the person involved, compared to using only 2-D features for action recognition.

3.4.2. MHAD

The overall efficiency of the performance study of PBL-McRBFN using 2-D and 3-D features of the MHAD is presented in Table 6.
Table 5
VAL database: performance study on PBL-McRBFN with 2-D and 3-D features.

Test                                      2-D: K   ηo      ηa      ηg      3-D: K   ηo      ηa      ηg
Subject-independent action recognition    340      78.96   78.67   67.01   174      94.61   94.47   92.72
10-fold cross validation                  292.6    96.92   95.9    95.79   117.1    99.84   99.79   99.79
Table 6
MHAD: performance study on PBL-McRBFN with 2-D and 3-D features.

Test                                      2-D: K          ηo             3-D: K          ηo
Subject-independent action recognition    39.17 ± 4.84    80.98 ± 8.18   50.17 ± 7.86    93.18 ± 5.65
10-fold cross validation                  47.1 ± 12.4     87.88 ± 6.55   43.6 ± 7.68     91.82 ± 2.5
From the table, it can be observed that PBL-McRBFN classifies more efficiently using 3-D features than using 2-D features. The improvement in performance using 3-D features over 2-D features is at least 12.12% and 3.94% in the subject-independent action recognition and 10-fold cross validation studies, respectively. Thus, the following observations can be made from the performance results presented in this section:
- PBL-McRBFN outperforms SVM and ELM in recognizing actions using 3-D features.
- PBL-McRBFN shows better performance in recognizing actions using 3-D optical flow based features in the subject-independent scenario.
- The action recognition performance of PBL-McRBFN is much better while using 3-D features than while using 2-D features.
4. Conclusion

This paper presents an approach for action recognition using 3-D features obtained from the Kinect sensor. The 3-D optical flow is estimated from the 2-D optical flow and the depth information. Thus, the 3-D optical flow feature captures the dynamics of the actions in space–time. The 3-D features are then used to train a support vector machine, an extreme learning machine and a meta-cognitive radial basis function classifier trained using a projection based learning algorithm. The performances of these classifiers are compared using a 10-fold cross-validation study and a subject-independent action recognition study. The performance study on these classifiers shows that the PBL-McRBFN classifier outperforms the SVM and ELM classifiers. A statistical analysis using a one-way ANOVA test confirms the results from the quantitative analysis. Further, the significance of depth information is shown by training the best performing PBL-McRBFN classifier with and without the depth flow features. It is observed that the classifier performs substantially better with 3-D optical flow features, especially in the leave-one-out cross validation study and the 10-fold cross validation study. The proposed approach is evaluated using the publicly available VAL and MHAD databases. The results indicate that the depth flow features help to make the action recognition task independent of the person. The proposed approach can be adapted to various other applications including gesture recognition, emotion recognition and gait recognition.
Acknowledgements

The authors wish to express grateful thanks to the referees for their useful comments and suggestions to improve the presentation of this paper.

References

Ali, S., Shah, M., 2010. Human action recognition in videos using kinematic features and multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2), 288–303.
Babu, R.V., Suresh, S., 2011. Fully complex-valued ELM classifiers for human action recognition. In: Proceedings of the International Joint Conference on Neural Networks.
Babu, G.S., Suresh, S., 2013. Meta-cognitive RBF network and its projection based learning algorithm for classification problems. Applied Soft Computing 13 (1), 654–666.
Babu, R.V., Anantharaman, B., Ramakrishnan, K.R., Srinivasan, S.H., 2002. Compressed domain action classification using HMM. Pattern Recognition Letters 23 (10), 1203–1213.
Babu, G.S., Savitha, R., Suresh, S., 2012. A projection based learning in meta-cognitive radial basis function network for classification problems. In: Proceedings of the International Joint Conference on Neural Networks, Brisbane, Australia.
Bobick, A.F., Davis, J.W., 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3), 257–267.
Chong, E.K.P., Zak, S.H., 2001. An Introduction to Optimization. Wiley, New York (ISBN 0471391263).
Efros, A.A., Berg, A.C., Mori, G., Malik, J., 2003. Recognizing action at a distance. In: Proceedings of the International Conference on Computer Vision, vol. 2, pp. 726–733.
Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R., 2007. Actions as space–time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (12), 2247–2253.
Holte, M., Moeslund, T., Fihl, P., 2010. View-invariant gesture recognition using 3D optical flow and harmonic motion context. Computer Vision and Image Understanding 114 (12), 1353–1361.
Isaacson, R., Fujita, F., 2006. Metacognitive knowledge monitoring and self-regulated learning: academic success and reflections on learning. Journal of the Scholarship of Teaching and Learning 6 (1), 39–55.
Japkowicz, N., Shah, M., 2011. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press (ISBN 9780521196000).
Lucas, B.D., Kanade, T., 1981. An iterative image registration technique with an application to stereo vision. In: IJCAI, pp. 674–679.
Nakada, T., Kagami, S., Mizoguchi, H., 2008. Pedestrian detection using 3D optical flow sequences for a mobile robot. In: Proceedings of IEEE SENSORS, pp. 776–779.
Nelson, T.O., Narens, L., 1980. Metamemory: a theoretical framework and new findings. In: Nelson, T.O. (Ed.), Metacognition: Core Readings. Allyn and Bacon, Boston, pp. 9–24.
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R., 2013. Berkeley MHAD: a comprehensive multimodal human action database. In: Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV).
Ogale, A.S., Karapurkar, A., Aloimonos, Y., 2005. View-invariant modeling and recognition of human actions using grammars. In: Workshop on Dynamical Vision at ICCV'05, WDV.
Poppe, R., 2010. A survey on vision-based human action recognition. International Journal of Computer Vision 28 (2/3), 976–990.
Rivers, W.P., 2001. Autonomy at all costs: an ethnography of meta-cognitive self-assessment and self-management among experienced language learners. The Modern Language Journal 85 (2), 279–290.
Schuldt, C., Laptev, L., Caputo, B., 2004. Recognizing human actions: a local SVM approach. In: Proceedings of the IEEE International Conference on Pattern Recognition, vol. 3, pp. 32–36.
Suresh, S., Omkar, S.N., Mani, V., Prakash, T.N.G., 2003. Lift coefficient prediction at high angle of attack using recurrent neural network. Aerospace Science and Technology 7 (8), 595–602.
Suresh, S., Sundararajan, N., Saratchandran, P., 2008. Risk-sensitive loss functions for sparse multi-category classification problems. Information Sciences 178 (12), 2621–2638.
Weinland, D., Ronfard, R., Boyer, E., 2006. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104 (2), 249–257.
Weinland, D., Ronfard, R., Boyer, E., 2011. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 224–241.
Wenden, A.L., 1998. Meta-cognitive knowledge and language learning. Applied Linguistics 19 (4), 515–537.
Yamato, J., Ohya, J., Ishii, K., 1992. Recognizing human action in time-sequential images using hidden Markov model. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 379–385.
Yilmaz, A., Shah, M., 2008. A differential geometric approach to representing the human actions. Computer Vision and Image Understanding 119 (3), 335–351.
Zhang, T., 2004. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32 (1), 56–85.