A Two-fold Transformation Model for Human Action Recognition using Decisive Pose
Dinesh Kumar Vishwakarma
Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Bawana Road, Delhi-110042, India
[email protected] Abstract: Human action recognition in videos is a tough task due to the complex background, geometrical transformation and an enormous volume of data. Hence, to address these issues, an effective algorithm is developed, which can identify human action in videos using a single decisive pose. To achieve the task, a decisive pose is extracted using optical flow, and further, feature extraction is done via a two-fold transformation of wavelet. The two-fold transformation is done via Gabor Wavelet Transform (GWT) and Ridgelet Transform (RT). The GWT produces a feature vector by calculating first-order statistics values of different scale and orientations of an input pose, which have robustness against translation, scaling and rotation. The orientation-dependent shape characteristics of human action are computed using RT. The fusion of these features gives a robust unified algorithm. The effectiveness of the algorithm is measured on four publicly datasets i.e. KTH, Weizmann, Ballet Movement, and UT Interaction and accuracy reported on these datasets are 96.66%, 96%, 92.75% and 100%, respectively. The comparison of accuracies with similar state-of-the-arts shows superior performance. Keywords: Gabor Wavelet Transform; Human Action and Activity Recognition; Decisive Pose Estimation; Ridgelet Transform. 1. Introduction Automatic human action recognition using computer vision is becoming very popular due to the critical applications such as security and surveillance [1] [2], daily living activities [3], robotics, elderly health care system [4] [5] [6], human-computer interaction [7], physical sciences [8], human object tracking and detection [9] [10], Industrial academic [11], Smart Home [12], and human-machine interface [13]. Surveillance system plays a significant role in the monitoring of a person such as a patient in hospitals or a group of individuals in public places like railway station or airport. The human action recognition task is to identify human activity in an unknown video sequence, and this task is very challenging and complex due to the challenges involved such as the
similarity between different actions, lighting variations, viewpoint variations, temporal variations, cluttered backgrounds and occlusion of the human body [1] [14] [15]. Therefore, our key goal is to develop a unique process that can proficiently characterize and label human actions and activities under challenging environmental conditions such as illumination variation, which is crucial for the recognition of indoor and outdoor activity. In the case of outdoor activity, recognition is profoundly affected by lighting variation, while indoor activity recognition is more tractable due to uniform illumination. The proposed work is capable of recognizing indoor and outdoor activities, as can be seen from the experimental section, where the KTH [16], Weizmann human action [17] and UT Interaction [18] datasets are recorded outdoors, while the Ballet [19] dataset is recorded indoors.
The recognition of human activity depends on the acquisition of images and videos, and the captured images/videos may be grayscale, colour, or depth [16] [17] [22]. A detailed analysis [18] [19] of these image sequences gives us an understanding of the scene. The recognition of human action in a binary image is straightforward, but it is very prone to noise, and segmentation is very challenging when the background is cluttered. Colour images carry more information and may be more effective than binary images, but processing colour images is difficult and challenging due to significant inter-class variations, interference from irrelevant parts of the body, and imperfections in preprocessing [24]. However, with the advent of the Microsoft Kinect sensor, human action recognition using depth videos has become very popular and convenient due to the depth and skeleton evidence, and it is also robust under low illumination due to the integrated IR sensor. The segmentation of an object from a cluttered background also benefits from the use of depth evidence. A single still image can be used for action recognition by choosing the decisive pose of the action from the videos [20].
There are a number of robust features used for understanding and tracking objects. In [29], a real-time approach is developed that uses continuous features for the detection of rivers, roads, and highways in satellite images. A depth-dataset-based HAR approach is developed that utilizes dense RGB-D map features [30], and this approach is very reliable for various realistic scenes. In [31], an efficient approach is outlined that uses a Bag of Features for the detection of a person in videos collected from different cameras; the approach is robust against low resolution, occlusion, and pose, viewpoint and lighting variations. Hybrid features [32] are used for the effective recognition of human activities in challenging depth images, and these features are very effective when the
background is complex [33] [34]. To address the problem of viewpoint in vision, a multi-view algorithm is developed [35] that utilizes a dataset captured from multiple cameras at multiple points.
The features extracted from a video sequence are divided into two broad groups, i.e. shape-based features and motion-based features. Shape-based features consist of the contour or silhouette information of the human body [21]. These features can be extracted using a simple operation like the energy of the silhouette or different local descriptors like HOG, SIFT, DIFT, etc. The histogram of oriented gradients (HOG) gives the histogram of the direction of the gradient of the body contour, while the scale-invariant feature transform (SIFT) offers features that are invariant to scale and viewpoint. These descriptors provide a considerable amount of discrimination between different activities but require a proper segmentation process to extract the human body from the background. Motion-based features are obtained between two successive video frames; thus, the class can be identified by using information about the previous frames. There are numerous kinds of motion-based features used for HAR, such as motion energy images (MEI), motion history images (MHI), trajectories, and optical flow [1] [22] [23]. The MHI and MEI are 2D templates deduced from video sequences, and these templates carry spatiotemporal information. A recent development in 3D to 2D models is the AESI [24] (Average Energy Silhouette Images). The evidence of time and shape together makes a spatiotemporal feature, but it is hard to correlate the points in space. As these methods require past video sequences to determine the action of the human body, it is impossible to decide on the action from any instantaneous video frame. The method presented in this work does not require any segmentation, thus reducing the complexity; therefore, it provides an instant way of determining the class of human action. The main contributions of the work are summarized as follows:
A single decisive pose is selected based on the most significant spatial change in subsequent frames, where the spatial variation is estimated using optical flow.
Visually inspired GWT and RT features are computed from the decisive pose.
A novel and robust hybrid descriptor is developed by concatenating orientation-dependent shape characteristics with first-order statistics computed over different scales and orientations.
The performance of the proposed descriptor is measured on public human action datasets, and comparison with earlier state-of-the-art methods exhibits superior performance.
The rest of the paper is arranged as follows: Section 2 reviews the earlier state of the art in human action and activity recognition; the details of the feature extraction approach and the proposed model are explained in Section 3; Section 4 provides the experimental work and a discussion of the results; finally, the paper is summarized.
2. Related Work
The Gabor filter was initially introduced by Dennis Gabor in 1946, and later Daugman [25] utilized the Gabor filter to represent the texture of an image. The evidence extracted through the GWT filter bank is obtained for an image at different scales and orientations, and these features capture the energy content of the image. It can also be said that the sum of all the GWT magnitudes at a particular scale and orientation gives the energy content, and the maximum energy indicates the maximum evidence at that scale and orientation. An image is represented using the 2D Gabor wavelet [26]. Lee [26] outlines a technique to obtain a filter bank for Gabor wavelets and to determine its frame bound, and also concludes that a tight frame can represent high-resolution images, based on experiments with severely quantized Gabor coefficients. Jiang et al. [27] present Gabor wavelets to identify edges in images and show the approach to be exceptionally useful in terms of detection accuracy and computational efficiency. Arivazhagan et al. [28] describe a method for the analysis of different texture patterns using Gabor wavelets, where scale- and rotation-invariant features are obtained by determining first-order statistics of the filtered image.
The Finite Ridgelet Transform, developed by Do and Vetterli [36], computes the shape information of an image. It also offers the orientation-based evidence present in the image and is hence orientation-dependent. The RT is obtained by applying a discrete wavelet transform to each of the Radon projections. Arivazhagan et al. [37] describe Ridgelet-based texture classification and its rotation-invariance properties; features such as energy, contrast, and similarity were derived from the sub-bands of the Ridgelet decomposition, and the co-occurrence matrix of each Ridgelet decomposition is determined to compute the features easily. In [38], the Ridgelet transform is used for face recognition, and the features are acquired from the decomposed face image. A clip-based approach for the recognition of human activity in video sequences is offered by [39], which obtains three layers of the segmented videos that are then organized into a binary tree. Inspired by these earlier works, a decisive-pose-based approach is presented that does not require segmentation, and the visual evidence is extracted using a two-fold feature extraction based on GWT and RT.
The methodology described in this paper consists of a unique manner of incorporating GWT and RT on the decisive pose image of the action. The performance of the developed approach is computed on four public datasets, and its effectiveness is demonstrated through a comparison of accuracy with earlier techniques, which shows worthy outcomes.
3. Proposed Methodology
The proposed method is built on the inference of a decisive pose from the video sequence and on a multiple-feature-based approach [40] [41]. The frame that conveys knowledge about the action performed in the video is considered the decisive pose, and it is determined using an optical flow approach. The features are extracted after the selection of the decisive pose from the video. The scale- and rotation-invariant characteristics are derived from the GWT, while the RT gives the shape-based contour information. The flow of the framework is shown in Fig. 1.
[Fig. 1 blocks: Input Video Sequence → Decisive Pose Extraction → Bounding Box & Normalization → Pose Feature Extraction (Gabor Wavelet Transform; The Finite Ridgelet Transform) → Activity Classification]
Fig. 1: Flow diagram of the proposed methodology
A video is a collection of multiple frames, and many of them are redundant. Hence, the decisive pose is selected from the video to distinctly represent the action, and then the feature extraction process is performed. The features obtained from the two different transforms are concatenated to develop a new feature vector. A bank of features is created for all the different classes, and the classification is performed using the K-NN classifier. The raw videos captured by the cameras may be of low resolution and noisy; hence, preprocessing is required for the effective representation of the object and the extraction of features. The use of a single pose requires less preprocessing than whole video sequences, and median filtering is used to remove the noisy elements from the decisive pose, which results in an enhanced decisive pose.
3.1 Decisive Pose Extraction
A video consists of several frames, and most of the frames contain the same information; in other words, a redundant set of frames is available. The concept of decisive pose selection is adopted to avoid heavy processing of videos, because most activities can be decided from a single frame through visual information. It is also important to mention that in some frames only a part of the human body, rather than the whole body, may be present, which is why it becomes essential to select a keyframe. Selecting the most suitable frame, one that contains the whole body, and extracting information from it boosts the performance of the descriptor. Optical flow is a prevalent and effective method [42] for determining the motion of an object in videos; hence, an optical flow approach is used to choose the decisive pose of an action. Consider an image sequence f(x, y, t) in a video, where (x, y) is the location of a pixel and t denotes time or the frame number. The optical flow vector between two frames is computed under the brightness constancy assumption, expressed as Eq. 1:
f(x, y, t) = f(x + dx, y + dy, t + dt)   (1)
where x + dx, y + dy and t + dt denote the position of the pixel in the subsequent frame. To compute the optical flow vector, Eq. 1 is expanded using a first-order Taylor series and differentiated with respect to time, which yields Eq. 2:
f_x u + f_y v + f_t = 0   (2)
In Eq. 2, the subscripts denote partial derivatives, and u and v are the two unknown terms; they cannot be uniquely determined from the single equation Eq. 2, which is referred to as the aperture problem. For non-vanishing image gradients, only the flow component parallel to ∇f := (f_x, f_y)^T, i.e. normal to image edges, can be computed. This is also called the normal flow and is given by Eq. 3:
n = − (f_t / |∇f|) (∇f / |∇f|)   (3)
To deal with the aperture problem, Lucas and Kanade developed an approach that determines the values of u and v at a location using a least-squares method [42]. In our work, the flow of motion of the pixels is determined in subsequent frames of a video, and a single pose is selected based on the maximum number of flow pixels in a set of frames. Further, the presence of the human body is determined using optical flow, as demonstrated in Fig. 2, and whether the human body is occluded is determined by thresholding the optical flow vectors. If the object in the thresholded image lies along the boundary of the frame, the frame is considered redundant; thus, a keyframe is chosen in which the object lies entirely inside the frame boundary. In videos where the position of the human body does not change, as in hand waving, boxing, clapping, jumping, etc., the keyframe is chosen as the frame with the maximum energy of the thresholded image. The efficacy of the system is significantly reliant on the decisive pose captured from a video during both the training and testing phases, and the keyframes obtained for training provide clear criteria for constructing the classifier.
Fig. 2: Decisive pose extraction using optical flow. For a whole-body motion (jogging) and a part-body motion (hand waving), the panels show the consecutive video sequence, the optical flow representation, and the thresholded images; thresholded images lying on the frame boundary are discarded, and the frame with the highest-energy thresholded image is retained as the decisive pose.
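As an illustration only, the following Python sketch shows one way the decisive-pose selection described above could be realised. It assumes OpenCV and NumPy are available and uses Farneback dense optical flow as a stand-in for the Lucas-Kanade least-squares flow of [42]; the function name decisive_pose, the flow threshold and the border width are hypothetical choices, not the author's implementation.

```python
import cv2
import numpy as np

def decisive_pose(video_path, flow_thresh=1.0, border=5):
    """Pick the frame whose thresholded optical-flow mask has the most
    moving pixels (energy), skipping frames whose motion touches the border."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError("cannot read video")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    best_frame, best_energy = prev, 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # dense optical flow between consecutive frames
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mask = (mag > flow_thresh).astype(np.uint8)
        # discard frames where the moving region touches the frame boundary
        touches_border = (mask[:border, :].any() or mask[-border:, :].any() or
                          mask[:, :border].any() or mask[:, -border:].any())
        energy = float(mask.sum())
        if not touches_border and energy > best_energy:
            best_energy, best_frame = energy, frame
        prev_gray = gray
    cap.release()
    # median filtering to suppress noise in the selected pose
    return cv2.medianBlur(best_frame, 3)
```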
3.2 Gabor Wavelet Transform
The Gabor filter has been widely used to determine texture features in grayscale images. The Gabor filter provides optimal localization in 2D space and spatial frequency. Its 2D function forms a non-orthogonal basis set, which is considered the mother wavelet. It can compute the evidence at different scales and orientations by creating a filter bank. Geometrical properties such as spatial locality, orientation selectivity and spatial frequency characteristics are preserved in the GWT [37] [43]. A 2D Gabor filter is defined as a Gaussian function modulated by an oscillating wave of frequency f and is expressed as Eq. 4:
G_(f,θ)(x, y) = (1 / (2π σ_x σ_y)) e^(−(1/2)(x²/σ_x² + y²/σ_y²)) e^(2πjf(x cosθ + y sinθ))   (4)
The oscillating frequency f is used to compute the critical parameters (σ, θ) of the GWT. A filter bank of GWT is developed from these parameters and from rotations and dilations of the mother wavelet. When an input image ℑ(x, y) is passed through this filter bank, a set of output images is obtained at different scales and orientations. The output of the filter bank is described by Eq. 5:
G_mn(x, y) = ℑ(x, y) ∗ α^(−m) G_(f,θ)(x′, y′)   (5)
where m = 0, …, M − 1 and n = 0, …, N − 1 index the scales and orientations, respectively, and α is the scaling factor of the wavelet. The rotated and scaled coordinates x′ and y′ are given by Eq. 6:
x′ = α^(−m)(x cosθ + y sinθ) and y′ = α^(−m)(−x sinθ + y cosθ), for α > 1 and θ = nπ/N   (6)
Gabor Feature Representation: The magnitude responses at various scales and orientations are computed from the Gabor filter responses for a given input image. The response consists of local frequencies at different scales and orientations. The feature vector is formulated by computing the mean and variance of the filtered images as in [44] [43]. The filtered images preserve the energy content, which is computed for each pair of m and n using Eq. 7:
ξ(m, n) = Σ_x Σ_y |G_mn(x, y)|   (7)
The variation across the transformed images indicates the variation of energy content, and to represent these images the mean (μ_mn) and standard deviation (σ_mn) are computed using Eq. 8 and 9:
μ_mn = ξ(m, n) / (PQ)   (8)
σ_mn = √( Σ_x Σ_y (|G_mn(x, y)| − μ_mn)² ) / (PQ)   (9)
where P × Q is the size of the filtered image. A feature vector F_G representing the filtered images is constituted from these statistical parameters [37] and is expressed as Eq. 10:
F_G = [μ_00, μ_01, …, μ_(M−1)(N−1), σ_00, σ_01, …, σ_(M−1)(N−1)]   (10)
The robustness of the Gabor features is demonstrated in Fig. 3, where the GWT features are invariant to scale, view and rotation. Hence, the developed features are robust against these variations.
Fig. 3: Gabor feature invariant characteristics: (a) original image, (b) scaled, (c) rotated, (d) translated, (e) view change
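For concreteness, the sketch below is a minimal Python example (assuming OpenCV; the kernel size, base sigma and base wavelength are illustrative parameters not given in the paper) that builds a bank of five scales and eight orientations and collects the mean and standard deviation of the magnitude responses into an 80-element vector in the spirit of Eq. 7-10.

```python
import cv2
import numpy as np

def gabor_feature_vector(pose, scales=5, orientations=8,
                         ksize=31, base_sigma=2.0, base_lambda=4.0):
    """Mean/std of Gabor magnitude responses over a bank of
    scales x orientations filters, arranged as in Eq. 10 (length 80)."""
    gray = cv2.cvtColor(pose, cv2.COLOR_BGR2GRAY).astype(np.float32)
    mus, sigmas = [], []
    for m in range(scales):
        sigma = base_sigma * (2 ** m)        # dilation of the mother wavelet
        lambd = base_lambda * (2 ** m)       # wavelength grows with scale
        for n in range(orientations):
            theta = n * np.pi / orientations
            # real and imaginary (quadrature) Gabor kernels
            k_re = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5, 0).astype(np.float32)
            k_im = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5, np.pi / 2).astype(np.float32)
            re = cv2.filter2D(gray, cv2.CV_32F, k_re)
            im = cv2.filter2D(gray, cv2.CV_32F, k_im)
            mag = cv2.magnitude(re, im)      # |G_mn(x, y)|
            mus.append(float(mag.mean()))    # mu_mn   (Eq. 8)
            sigmas.append(float(mag.std()))  # sigma_mn (Eq. 9, up to normalisation)
    return np.asarray(mus + sigmas)          # F_G = [mu_00, ..., sigma_(M-1)(N-1)]
```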
3.3 Ridgelet Transform
The Ridgelet transform is used to determine evidence along a line in a particular direction, whereas the wavelet is used to determine point information. The RT therefore gives information along a specific orientation. To obtain point information in terms of projected lines, the method of Do and Vetterli [36] is used: the RT coefficients are obtained by applying a DWT to each of the Radon projections. The RT is rotation invariant and directionally sensitive, thus providing shape information about the object of interest in the image, as in [36] [45].
3.3.1 The Finite Ridgelet Transform
A discrete form of the RT is termed the Finite Ridgelet Transform (FRIT), which is determined by a fixed number of orientations. The number of orientations depends on the size of the image, and in many cases the orientations are downsampled for a large-scale image. The process for calculating the RT is outlined in Fig. 4. Initially, the Finite Radon Transform (FRAT) is computed for the image, and then a DWT is computed for each of the Radon projections, as in [37] [46] [47].
[Fig. 4 block diagram: IMAGE → FRAT → DWT → FRIT]
Fig. 4: Deduction of the FRIT
The FRAT sums the pixel values along a group of fixed lines at a specific projection angle. Consider an image of size m × m, characterized over Z_m, where m is a prime number; m + 1 projections are required to compute the FRAT. The FRAT of an image ℑ[i, j] on a grid of size m × m is defined by Eq. 11:
r_p[k] = FRAT_ℑ(p, k) = (1/√m) Σ_{(i,j) ∈ L_(p,k)} ℑ[i, j]   (11)
where L_(p,k) denotes the collection of points that defines a line on the image ℑ[i, j], expressed as Eq. 12 and 13:
L_(p,k) = {(i, j) : j = pi + k (mod m), i ∈ Z_m},  0 ≤ p < m   (12)
L_(m,k) = {(k, j) : j ∈ Z_m}   (13)
where L_(m,k) represents the case of vertical lines. The FRIT coefficients are computed by applying a 1D DWT to the FRAT sequence of each direction, (r_p[0], r_p[1], r_p[2], …, r_p[m − 1]), as can be seen in Fig. 4, and expressed as Eq. 14:
FRIT_ℑ(p, l) = DWT(r_p)   (14)
3.3.2 Ridgelet Feature Extraction
An image can be represented using the 2D coefficients of the FRIT, which can be collapsed into a 1D space to represent features. The feature vector is formulated by squaring and summing the FRIT coefficients along each projection. Hence, for a specific projection p, the feature point is expressed as Eq. 15:
F_R(p) = Σ_l FRIT²(p, l)   (15)
The number of projections in the FRIT is m + 1, which means that for an input image of size m × m the feature vector is of size m + 1. The feature vector obtained is orientation-dependent and able to discriminate different types of human actions, as can be seen in Fig. 5. The variation of the RT can be observed for dissimilar types of actions, and it can be correlated with the pattern of the RT. Hence, by concatenating the RT features with the GWT features, a hybrid feature vector can be developed that is capable of representing different human actions.
Fig. 5: Demonstration of the discriminative characteristics of the RT for different human actions
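A minimal sketch of the FRIT-based feature computation is given below, assuming a square m × m input with m prime (e.g. 67, giving a 68-element vector), the 1/√m normalisation used in Eq. 11, and the Haar ('db1') wavelet for the 1D DWT via PyWavelets; the wavelet choice and the function names frat and frit_features are assumptions, as the paper does not specify them.

```python
import numpy as np
import pywt

def frat(img):
    """Finite Radon Transform of an m x m image (m prime): sums pixel
    values along the modular lines L_(p,k) of Eq. 12-13."""
    m = img.shape[0]
    assert img.shape == (m, m), "FRAT expects a square m x m image"
    r = np.zeros((m + 1, m))
    for p in range(m):                        # 'ordinary' directions, Eq. 12
        for i in range(m):
            row = img[i]
            for k in range(m):
                r[p, k] += row[(p * i + k) % m]
    r[m, :] = img.sum(axis=1)                 # vertical lines L_(m,k), Eq. 13
    return r / np.sqrt(m)

def frit_features(img, wavelet='db1'):
    """Ridgelet feature vector: 1D DWT of each Radon projection, then the
    sum of squared coefficients per projection (Eq. 14-15)."""
    proj = frat(img)
    feats = []
    for p in range(proj.shape[0]):
        coeffs = pywt.wavedec(proj[p], wavelet)           # DWT(r_p), Eq. 14
        feats.append(sum(float((c ** 2).sum()) for c in coeffs))
    return np.asarray(feats)                              # length m + 1 (68 for m = 67)
```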
3.4 Hybrid Feature Formation
The effective representation of human action and activity in video sequences is always a challenging task due to scaling, translation, and rotation [15]. Hence, motivated by these issues, an effective descriptor is developed through the fusion of two robust features, GWT and RT. It is also well established that multiple features [48] give better performance than a single feature for the representation of human activity [49] [41]. As seen in Fig. 3, the GWT features are robust against scaling, translation, and rotation, and similarly, Fig. 5 shows that the RT is very effective in representing different activities. Therefore, the features extracted from both transforms are united by concatenation, and a final feature vector is obtained for a single image. Eq. 10 and Eq. 15 describe the feature vectors for GWT and RT, respectively. The GWT gives a feature vector of length 80 × 1, and the RT gives a feature vector of size 68 × 1. The final feature vector is given by Eq. 16:
F = [F_G  F_R]   (16)
4. Experiment and Result
An experiment is conducted to evaluate the performance of the proposed algorithm using publicly available human action datasets, chosen on the basis of the challenges involved, such as view variation, lighting conditions, and clothing variations. The datasets are KTH [50], Weizmann [51], Ballet [52] and the UT Interaction dataset [53]. The performance on these datasets is measured in terms of recognition accuracy using the K-nearest neighbour classifier [54]. The recognition accuracy is defined as given in Eq. 17:
Recognition Accuracy = ((TP + TN) / (TP + TN + FN + FP)) × 100   (17)
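To make the evaluation protocol concrete, the following sketch (assuming scikit-learn; the number of neighbours k = 1 and the function name evaluate are assumptions, since the paper does not report the value of K) concatenates the GWT and RT descriptors as in Eq. 16 and computes the recognition accuracy under the leave-one-out protocol described below.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def evaluate(gabor_feats, ridgelet_feats, labels, k=1):
    """Concatenate the GWT and RT descriptors (Eq. 16) and report the
    leave-one-out recognition accuracy with a K-NN classifier."""
    X = np.hstack([gabor_feats, ridgelet_feats])   # shape: (n_samples, 80 + 68)
    y = np.asarray(labels)
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    return 100.0 * scores.mean()                   # average recognition accuracy (%)
```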
A confusion matrix is obtained for each dataset in terms of TP, TN, FP, and FN, which denote true positives, true negatives, false positives and false negatives, respectively. The average recognition accuracy (ARA) for each dataset is computed and compared with the similar state of the art. The experimental setting used for computing the GWT features is five scales and eight orientations, and the size of the images used for RT computation is 67 × 67. The classification strategy used for computing the ARA is leave-one-out cross-validation (LOOCV).
4.1 KTH Dataset
Schuldt et al. [50] developed the KTH dataset, which consists of six types of human actions, i.e. "running," "jogging," "walking," "hand clapping," "hand waving" and "boxing." Each action type is performed by twenty persons in four different scenarios. Sample frames of the activities are displayed in Fig. 6.
Fig. 6: Sample frames of the KTH human action dataset: (a) walking; (b) running; (c) jogging; (d) hand clapping; (e) hand waving; and (f) boxing
For this dataset, 20 keyframes are inferred from each video. The training image set is obtained by extracting the decisive poses from the videos. The classification result for this dataset is shown in the confusion matrix in Fig. 7. The prediction rates for the different actions vary: a high prediction rate can be observed for actions such as hand clapping, hand waving and boxing, because these actions have little inter-class similarity with the others. Lower prediction rates are observed for actions such as walking, running, and jogging, owing to the high inter-class similarity between them.
Fig. 7: Classification result on the KTH human action dataset
The ARA achieved on this dataset is 96.66%, which is compared with similar state-of-the-art methods in Table 1. The ARA of 96.66% is significantly higher than that of many earlier approaches [55] [56] [57].
Table 1: Comparison of ARA with the techniques of others on the KTH dataset
Method | Features Used | ARA %
Tian et al. [55] | Hierarchical Filtered Motion | 94.50
Modarres and Soryani [56] | Body Posture Graph | 94.60
Sheng et al. [57] | HOG Feature Directional Pairs | 94.99
Our Method | Gabor-Ridgelet Transform | 96.66
Tian et al. [55] proposed a method using motion features and Harris points to detect the movement of the human body. Modarres and Soryani [56] proposed a graph-based approach in which the vertices and edges of a graph model the body posture. Sheng et al. [57] proposed a methodology that relates local descriptors such as HOG, cuboidal, and SIFT features. The results are compared with these techniques in Table 1, which shows that our method is superior to the other methods.
4.2 Weizmann Dataset
The dataset is provided by the Weizmann Institute of Science [51] and is based on space-time shapes. The background in the scene is relatively simple, and only one person is acting in each video sequence. The dataset consists of 10 human actions: bending (W1), jumping jack (W2), jumping (W3), jumping in place (W4), running (W5), galloping sideways (W6), skipping (W7), walking (W8), one-hand waving (W9), and two-hands waving (W10). These actions are performed by 9 actors; thus, there are ten videos for each action. To classify these actions, a leave-one-out classification strategy is used, where nine decisive poses of each action are used for training and one pose is used for testing, iteratively. Sample frames of the dataset are given in Fig. 8.
Fig. 8: Sample Frames of Weizmann Human Action dataset
The detailed classification result for this dataset is presented in Fig. 9 as a confusion matrix.
The overall ARA achieved on this dataset is 96%, which is much higher than that of other approaches [58] [59] [60]. The lower recognition rate for skip is due to the similarity of its pose with run, walk and jump.
Fig. 9. Classification Result on Weizmann Human Action Dataset
Table 2: Comparison of ARA with the techniques of others on the Weizmann dataset
Method | Features Used | ARA %
Bebars and Hemayed [58] | STIP Features | 91.9
Arunnehru and Geetha [59] | Frame Differencing Motion Feature | 92
Li et al. [60] | Spatio-Temporal Descriptors | 92.53
Our method | Gabor-Ridgelet Transform | 96
Bebars and Hemayed [58] proposed a methodology using spatiotemporal interest points based on motion and SIFT features, creating a codebook for classification; the classification rate obtained is 91.9%. Arunnehru and Geetha [59] proposed a methodology that extracts motion information from the ROI, and the accuracy achieved is 92%. Li et al. [60] proposed 3D SURF features to determine the human activity, and the recognition rate obtained is 92.53%. Table 2 shows that the accuracy of the proposed method is approximately 4% higher than that of Li et al. [60], which is encouraging.
4.3 Ballet Dataset
The dataset is collected from an instructional ballet DVD [52]. The background of the dataset is simple, but the actions performed are complex, and each video sequence contains only one actor. There are 44 video sequences in the dataset, and each sequence is pre-labelled into eight different actions: left-to-right hand opening, right-to-left hand opening, standing hand opening, leg swinging, jumping, turning, hopping, and standing still. Sample frames of the dataset are shown in Fig. 10.
Fig. 10. Sample Images of the Ballet dataset
Fig.11. Classification Result on Ballet Human Movement Dataset
Table 3: Comparison of ARA with the techniques of others on the Ballet Human Movement dataset
Method | Features Used | ARA %
Xia et al. [61] | Skeleton graph matching | 90.88
Guha and Ward [62] | Spatio-temporal features | 91.1
Iosifidis et al. [63] | HOG/HOF on STIP | 91.1
Our method | Gabor-Ridgelet Transform | 92.75
The confusion matrix obtained is shown in Fig. 11. The confusion between jumping, turning and hopping is higher due to the similarity in their poses, and this inter-class similarity causes the drop in ARA. The ARA obtained is 92.75%, which is much higher than that of the other techniques developed for the Ballet dataset. Guha and Ward [62] proposed a methodology based on spatiotemporal interest points to describe the motion, and the accuracy obtained is 91.1%. Iosifidis et al. [63] proposed a BoW- and DBoW-based method using HOG/HOF features, and the recognition rate achieved is 91.1%. Table 3 compares our approach with the techniques of others and shows that our method is superior to the present state of the art.
4.4 UT Interaction Dataset
The UT Interaction dataset [53] consists of six different action groups (Handshaking-U1, Hugging-U2, Kicking-U3, Pointing-U4, Punching-U5, and Pushing-U6), and ten different pairs of actors perform each action. Example frames of the dataset are shown in Fig. 12. The actions are carried out in two distinct scenarios: in the first, the background is static with little camera jitter; in the second, a lawn background in windy conditions is used, so the background moves slightly and there is extra camera jitter. The recognition rate is obtained separately for the two scenarios. The confusion matrices for both sets are given in Fig. 13 (a) and 13 (b). The ARA achieved is 100% for Set 1 and 90% for Set 2.
Fig. 12: UT Interaction Dataset
Fig. 13: (a) Classification result on the UT Interaction dataset Set 1 (static background with little camera jitter); (b) classification result on the UT Interaction dataset Set 2 (with more camera jitter and a windy environment)
Table 4: Comparison of ARA with the techniques of others for the UT Interaction dataset
Method | Features Used | ARA % (Set 1) | ARA % (Set 2)
Zhang et al. [64] | Spatio-Temporal Feature | 73 | 53
Li et al. [65] | STIP | 92 | 85
Meng et al. [66] | Semantic Spatial Relation | 92 | 84
Liu et al. [67] | 3D SIFT | 92.17 | 85.3
Wang and Ji [68] | Deep hierarchical context model | 95 | —
Our Method | Gabor-Ridgelet Transform | 100 | 90
Zhang et al. [64] proposed a methodology based on local spatiotemporal features and codebook generation for this dataset; its accuracy of 73% and 53% is the lowest among all. Li et al. [65] proposed a method based on motion context features and spatiotemporal interest points (STIPs) for determining the interaction between humans. Meng et al. [66] proposed a methodology based on the semantic relationships between objects to determine the class of interaction. The accuracy achieved by the proposed algorithm is significantly higher than that of the earlier approaches [64] [65] [67] [68]. The main reason for the higher accuracy is the robustness of our descriptor, as demonstrated. Overall, the effectiveness of our algorithm can be seen from the experimental work as well as from the comparison of our results with the earlier state of the art.
5. Conclusion
In this work, a human pose based model is developed using a multi-feature fusion-based approach. The computation of the GWT gives orientation- and scale-invariant characteristics, which are further fused with the shape characteristics computed through the RT. The
unification of these features provides a new model for human activity recognition, and the attainment of the unified descriptor is assessed on different openly available human action dataset, which contributes to improved performance. The data sets tested using these methods are KTH, Weizmann, Ballet and UT interaction dataset. The benefit of this method is that it does not require segmentation of the human silhouette, thus reducing the computational cost. The KNN classifier provides excellent inter-class discrimination, as it is a kernel-based method. Increasing the number of keyframes per video may increase the redundancy and will make the system slow as well as less efficient and hence to improve the efficiency dimensionality reduction techniques needs to be applied. In future, the work can be used to develop an intelligent and autonomous system for real-time applications such as sports actions analysis, yoga analysis, and control appliances through human body pose etc. Also, the developed approach may be tested under more realistic challenges such as complex background, lightening variations, zoom in zoom out etc. 6. References [1] J. K. Aggarwal and M. S. Ryoo, "Human Activity Analysis: A Review," ACM Computing Surveys (CSUR), vol. 43, no. 3, pp. 16-43, 2011. [2] G. Tripathi, K. Singh and D. K. Vishwakarma, "Convolutional neural networks for crowd behaviour analysis: a survey," The Visual Computer, vol. 35, no. 5, pp. 753-776, 2019. [3] H. Wu, W. Pan, X. Xiong and S. Xu, "Human activity recognition based on the combined SVM&HMM," in International Conference on Information and Automation, Hailar, 2014. [4] C. Dhiman and D. K. Vishwakarma, "A review of state-of-the-art techniques for abnormal human activity recognition," Engineering Applications of Artificial Intelligence, vol. 77, pp. 21-45, 2018. [5] C. Dhiman and D. K. Vishwakarma, "A Robust Framework for Abnormal Human Action Recognition using RTransform and Zernike Moments in Depth Videos," IEEE Sensors Journal, vol. 19, no. 13, pp. 5195-5203, 2019. [6] A. Jalal, M. Quaid and A. S. Hasan, "Wearable Sensor-Based Human Behavior Understanding and Recognition in Daily Life for Smart Environments," in International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 2018. [7] M. S. Bakli, M. A. Sakr and T. H. Soliman, "A spatiotemporal algebra in Hadoop for moving objects," Geospatial Information Science, vol. 21, no. 2, pp. 102-114, 2016. [8] M. . M. Awad, "Forest mapping: a comparison between hyperspectral and multispectral images and technologies," Journal of Forestry Research, vol. 29, no. 5, pp. 1395-1405, 2018. [9] W. Zhao, L. Yan and Y. Zhang, "Geometric-constrained multi-view image matching method based on semiglobal optimization," Geo-spatial Information Science, vol. 21, no. 2, pp. 115-26, 2018. [10] Z. . S. Abdallah, M. M. Gaber, B. Srinivasan and S. Krishnaswamy, "Adaptive mobile activity recognition system with evolving data streams," Neurocomputing, vol. 150, pp. 304-317, 2015. [11] Y. Lee , T. J. Choi and C. W. Ahn, "Multi-objective evolutionary approach to select security solutions," CAAI Transactions on Intelligence Technology, vol. 2, no. 2, pp. 64-67, 2017. [12] A. Jalal, M. A. K. Quaid and M. A. Sidduqi, "A Triaxial Acceleration-based Human Motion Detection for Ambient Smart Home System," in International Bhurban Conference on Applied Sciences and Technology, Islamabad, Pakistan, 2019. [13] D. K. Vishwakarma, R. Kapoor and A. Dhiman, "A proposed unified framework for the recognition of human
[14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36]
activity by exploiting the characteristics of action dynamics," Robotics and Autonomous Systems, vol. 77, pp. 25-38, 2016. R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, pp. 976990, 2010. T. Singh and D. K. Vishwakarma, "Video benchmarks of human action datasets: a review," Artificial Intelligence Review, pp. 1-48, 2018. C. Schuldt, I. Laptev and B. Caputo, "Recognizing human actions: a local SVM approach," in International Conference on Pattern Recognition, 2004. M. Blank, L. Gorelick, E. Shechtman, M. Irani and R. Basri, "Actions as space-time shapes," in Tenth IEEE International Conference on Computer Vision (ICCV'05), Beijing,, 2005. M. S. Ryoo and J. K. Aggrawal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in International Conference on Computer Vision, Kyoto, 2009. A. Fathi and G. Mori, "Action recognition by learning mid-level motion features," in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 2008. S. Kamal, A. Jalal and D. Kim, "Depth Images-based Human Detection, Tracking and Activity Recognition Using Spatiotemporal Features and Modified HMM," Journal of Electrical Engineering and Technology, vol. 11, no. 6, pp. 1857-1862, 2016. K. Buys, C. Cagniart, A. Baksheev, T. . D. Laet, J. D. Schutter and C. Pantofaru, "An adaptable system for RGBD based human body detection and pose estimation," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 39-52, 2014. S. Arivazhagan, R. N. Shebiah, R. Harini and S. Swetha, "Human action recognition from RGB-D data using complete local binary pattern," Cognitive Systems Research, vol. 58, pp. 94-104, 2019. L. M. G. Fonseca, L. M. Namikawa and E. F. Castejo, "Digital Image Processing in Remote Sensing," in Tutorials of the XXII Brazilian Symposium on Computer Graphics and Image Processing, Rio de Janeiro, 2009. A. Prochazka, M. Kolinova, J. Fiala, P. Hampl and K. Hlavaty, "Satellite image processing and air pollution detection," in IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey. Y. Zhang, C. Chou, S. Yu and T. Chen, "Object color categorization in surveillance videos," in IEEE International Conference on Image Processing, Brussels, 2011. X. Zhao, G. Xu, D. Liu and X. Zuo, "Second-order DE algorithm," CAAI Transactions on Intelligence Technology, vol. 2, no. 2, pp. 80-92, 2017. M. M. Rathore, A. Ahmad, A. Paul and J. Wu, "Real-time continuous feature extraction in large size satellite images," Journal of Systems Architecture, vol. 64, pp. 122-132, 2016. A. Farooq, A. Jalal and S. Kamal, "Dense RGB-D Map-Based Human Tracking and Activity Recognition using Skin Joints Features and Self-Organizing Map," KSII Transactions on Internet and Information Systems, vol. 9, no. 5, pp. 1856-1869, 2015. Q. Huang, J. Yang and Y. Qiao, "Person re-identification across multi-camera system based on local descriptors," in International Conference on Distributed Smart Cameras, Hong Kong, 2012. S. Kamal and A. Jalal, "A Hybrid Feature Extraction Approach for Human Detection, Tracking and Activity Recognition Using Depth Sensors," Arabian Journal for Science and Engineering, vol. 41, no. 3, pp. 1043-1051, 2016. A. Jalal, S. Kamal and D. Kim, "Individual detection-tracking-recognition using depth activity images," in International Conference on Ubiquitous Robots and Ambient Intelligence, Goyang, 2015. L. Piyathilaka and S. 
Kodagoda, "Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features," in Conference on Industrial Electronics and Applications, Melbourne, 2013. H. Yoshimoto, N. Date and S. Yonemoto, "Vision-based real-time motion capture system using multiple cameras," in IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Tokyo, Japan. D. K. Vishwakarma and R. Kapoor, "Hybrid classifier based human activity recognition using the silhouette and cells," Expert Systems with Applications, vol. 42, no. 20, pp. 6957-6965, 2015. A. F. Bobick and J. W. Davis, "The recognition of human movements using temporal templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, 2001. Y. Sheik, M. Sheik and M. Shah, "Exploring the space of a human action," in IEEE International conference on computer vision, 2005.
[37] D. K. Vishwakarma and K. Singh, "Human Activity Recognition based on Spatial Distribution of Gradients at Sub-levels of Average Energy Silhouette Images," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 4, pp. 316-327, 2017. [38] J. Daugman, "Two-dimensional spectral analysis of cortical receptive field profiles," Vision Research, vol. 20, no. 10, pp. 847-856, 1980. [39] T. S. Lee, "Image representation using 2D Gabor wavelets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 959-971, 1996. [40] W. Jiang, K. M. Lam and T. Z. Shen, "Efficient Edge Detection Using Simplified Gabor Wavelets," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 4, pp. 1036-1047, 2009. [41] S. Arivazhagan, L. Ganesh and S. P. Priyal, "Texture classification using Gabor wavelets based rotation invariant features," Pattern Recognition Letters, vol. 27, pp. 1976-1982, 2006. [42] M. N. Do and M. Vetterli, "The Finite Ridgelet Transform for Image Representation," IEEE Transactions on Image Processing, vol. 12, no. 1, pp. 16-28, 2003. [43] S. Arivazhagan, L. Ganesan, T. Kumar and G Subash, "Texture classification using ridgelet transform," Pattern Recognition Letters, vol. 27, no. 16, pp. 1875-1883, 2006. [44] S. Kautkar, R. Koche, T. Keskar, A. Pande, M. Rane and G. A. Atkinson, "Face Recognition Based on Ridgelet Transforms," Procedia Computer Science, vol. 2, pp. 35-43, 2010. [45] Y. Zheng, H. Yao, X. Sun, X. Jiang and F. Porikli, "Breaking video into pieces for action recognition," Multimedia Tools and Applications, pp. 1-18, 2017. [46] H. Aggrawal and D. K. Vishwakarma, "Covariate conscious approach for Gait recognition based upon Zernike moment invariants," IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 2, pp. 397-407, 2018. [47] D. K. Vishwakarma, P. Rawat and R. Kapoor, "Human Activity Recognition Using Gabor Wavelet Transform and Ridgelet Transform," Procedia Computer Science, vol. 57, pp. 630-636, 2015. [48] A. Bruhn, J. Weickert and C. Schnörr, "Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods," International Journal of Computer Vision, vol. 61, no. 3, pp. 211-231, 2005. [49] A. Jain and G. Healey, "A multiscale representation including opponent color features for texture recognition," IEEE Transactions on Image Processing, vol. 7, no. 1, pp. 124-128, 1998. [50] X. Liu, Q. Liu and S. Oe, "Recognizing Non-rigid Human Actions using Joints tracking in space-time.," in IEEE International confrence on Information Technology:Coding and Computing, 2004. [51] A. G. Zuniga, J. B. Florindo and O. M. Bruno, "Gabor wavelets combined with volumetric fractal dimension applied to texture analysis," Pattern Recognition Letters, vol. 36, pp. 135-143, 2014. [52] W. Pana, T. D. Buib and C. Y. Suena, "Rotation invariant texture classification by ridgelet transform and frequency-orientation space decomposition," Signal Processing, vol. 88, no. 1, pp. 189-199, 2008. [53] S. Yang, W. Min, L. Zhao and Z. Wang, "Image noise reduction via geometric multiscale ridgelet support vector transform and dictionary learning," IEEE Transactions on Image Processing, vol. 22, no. 11, pp. 4161-4169, 2013. [54] B. Huang, G. Tian and F. Zhou, "Human typical action recognition using gray scale image of silhouette sequence," Computers & Electrical Engineering, vol. 38, no. 5, pp. 1177-1185, 2012. [55] D. K. Vishwakarma, R. Kapoor and A. 
Dhiman, "Unified framework for human activity recognition: An approach using spatial edge distribution and R-transform," AEU-International Journal of Electronics and Communications, vol. 70, no. 3, pp. 341-353, 2016. [56] T. Cover and P. Hart, "Nearest neighbour pattern classification," IEEE Transactions on Information Theory , vol. 13, no. 1, p. 1967, 21-27. [57] Y. L. Tian, L. Cao, Z. Liu and Z. Zang, "Hierarchical Filtered Motion for Action Recognition in Crowded Videos," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 3, pp. 313-323, 2012. [58] A. F. A. Modarres and M. Soryani, "Body posture graph: a new graph-based posture descriptor for human behaviour recognition," IET Computer Vision, vol. 7, no. 6, pp. 448-499, 2013. [59] B. Sheng, W. Yang and C. Sun, "Action recognition using direction-dependent feature pairs and non-negative low rank sparse model," Neurocomputing, vol. 158, pp. 73-80, 2015. [60] A. A. Bebars and E. E. Hemayed, "Comparative study for feature detectors in human activity recognition," in 9th International Computer Engineering Conference, Giza, 2013.
[61] J. Arunnehru and M. K. Geetha, "Behavior recognition in surveillance video using temporal features," in Fourth International Conference on Computing, Communications and Networking Technologies, Tiruchengode, 2013. [62] C. Li, S. Bailiang, Y. Liu, H. Wang and J. Wang, "Human action recognition using spatio-temoporal descriptor," in 6th International Congress on Image and Signal Processing (CISP), Hangzhou, 2013. [63] L.-m. Xia, J.-x. Huang and L.-z. Tan, "Human action recognition based on chaotic invariants," Journal of Central South University, vol. 20, no. 11, pp. 3171-3179, 2013. [64] T. Guha and R. Ward, "Learning Sparse Representations for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1576-1588, 2012. [65] A. Iosifidis, A. Tefas and I. Pitas, "Discriminant Bag of Words based representation for human action recognition," Pattern Recognition Letters, vol. 49, pp. 185-192, 2014. [66] X. Zhang, J. Cui, L. Tian and H. Zha, "Local spatio-temporal feature based voting framework for complex human activity detection and localization," in The First Asian Conference on Pattern Recognition, Beijing, 2011. [67] N. Li, X. Cheng, H. Guo and Z. Wu, "A hybrid method for human interaction recognition using spatio-temporal interest points," in International Conference on Pattern Recognition, Stockholm, 2014. [68] L. Meng, Q. P. Yang, J. Miao, X. Chen and D. N. Metaxas, "Activity recognition based on semantic spatial relation," in International Conference on Pattern Recognition , 2012. [69] H. Liu, Q. Zhang and Q. Sun, "Human action classification based on sequential bag-of-words model," in IEEE International Conference on Robotics and Biomimetics, Bali, 2014. [70] X. Wang and Q. Ji, "Video event recognition with deep hierarchical context model," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.
CONFLICT OF INTEREST It is declared that there is no conflict of interest with this work.