RGB-D action recognition using linear coding

Huaping Liu, Mingyi Yuan, Fuchun Sun

Department of Computer Science and Technology, Tsinghua University, Beijing, China; State Key Laboratory of Intelligent Technology and Systems, Beijing, China; Tsinghua National Laboratory of Information Science and Technology, Beijing, China

Corresponding author: Huaping Liu, [email protected]
Article info
Article history: Received 1 July 2013; received in revised form 11 November 2013; accepted 18 December 2013.

Abstract
In this paper, we investigate action recognition using an inexpensive RGB-D sensor (Microsoft Kinect). First, a depth spatial-temporal descriptor is developed to extract local regions of interest in the depth images. Such descriptors are very robust to illumination changes and background clutter. Then the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor are combined and fed into a linear coding framework to obtain an effective feature vector, which can be used for action classification. Finally, extensive experiments are conducted on a publicly available RGB-D action recognition dataset and the proposed method shows promising results.
Keywords: RGB-D; action recognition; linear coding
1. Introduction

Recognition of human actions has been an active research topic in computer vision. In the past decade, research has mainly focused on learning and recognizing actions from video sequences captured by a single camera, and a rich literature can be found in a wide range of fields including computer vision, pattern recognition, machine learning and signal processing. Recently, several approaches have used local spatio-temporal descriptors together with the bag-of-words model to represent actions. Since these approaches do not rely on any preprocessing techniques, e.g. foreground detection or body-part tracking, they are relatively robust to changes of viewpoint, noise, background, and illumination. However, most existing work on action recognition is based on color video, which leads to relatively low accuracy even when there is no clutter. Different from these works, our motivation is driven by the application of the mass-produced consumer electronics device Kinect, which provides a depth stream and a color stream. Kinect has been applied in extensive fields including people detection and tracking [1,2]. Currently there exist very few works that utilize the color-depth sensor combination for human action recognition. For example, Ref. [3] used the depth information but totally ignored the color information. In fact, as we will analyze, the color information and depth information can be complementary since human actions are in essence three-dimensional. However, how to effectively fuse the color and depth information remains a challenging problem. In this paper, we
extract local descriptors from the color and depth videos and utilize the linear coding framework to integrate the color and depth information. The main contributions are summarized as follows:

1. The conventional STIP descriptor is extended by incorporating depth information to deal with depth video. Such descriptors are very robust to illumination changes and background clutter.
2. A linear coding framework is developed to fuse the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor into a robust feature vector. In addition, we further exploit the temporal structure of the video sequence and design a new pooling technique to improve the description performance.
3. Extensive experiments are conducted on a publicly available RGB-D action recognition dataset and the proposed method shows promising results.

The organization of this paper is as follows: in Section 2 we introduce the feature extraction. Sections 3 and 4 present the coding and pooling methods, respectively. The experimental results are given in Section 5. Finally, Section 6 gives some conclusions.
2. Feature extraction

There are several schemes applied to time-consistent scene recognition problems. Some of them are statistics-based approaches, such as Hidden Markov Models, the Latent-Dynamic Discriminative Model [4], and so on. In contrast, Space-Time Interest Points (STIPs) [5] regard the temporal axis in the same way
as the spatial axes and look for features along the temporal axis as well. We prefer the latter because the time parameter of a sample is essentially the same as the space parameters in the mathematical sense. Since we have plenty of reliable mathematical tools and feature construction schemes, extensions of existing feature schemes can be safely applied to such time-relevant problems. Meanwhile, those schemes can be naturally extended to more complex tasks.

STIPs is an extension of SIFT (Scale-Invariant Feature Transform) to three-dimensional space and uses one of Harris3D, Cuboid or Hessian as the detector. For a given video, dense sampling is performed at regular positions and scales in space and time to get 3D patches. We perform sampling over 5 dimensions x, y, t, σ and τ, where σ and τ are the spatial and temporal scales, respectively. Usually, the minimum size of a 3D patch is 18 × 18 pixels by 10 frames. Spatial and temporal sampling are done with 50% overlap. Multi-scale patches are obtained by multiplying σ and τ by a factor of √2 for consecutive scales. In total, there are 8 spatial and 2 temporal scales since the spatial scale is more important than the temporal scale. Different spatial and temporal scales are combined so that each video is sampled 16 times with different σ and τ parameters. The detector is applied to each video and locates interest points as well as the corresponding scale parameters. After that, the HOG-HOF (Histogram of Oriented Gradients - Histogram of Optical Flow) descriptors are calculated at the detected interest points and the sample features are generated.

In our work, the feature descriptors are extracted from both the RGB image and the depth image. To apply the STIPs detector and descriptor to the depth information, we scale the depth value from a 16-bit unsigned integer to an 8-bit unsigned integer by searching for the maximum and the minimum (above 0) of the depth values in the whole sample video, and transforming each depth pixel linearly as

$$
d_{\mathrm{new}} = \begin{cases} 0, & d = 0 \\ 255\,\dfrac{d - d_{\min}}{d_{\max} - d_{\min}}, & d > 0 \end{cases} \qquad (1)
$$

where d_max is the maximal depth of the video sample and d_min is the minimal depth above 0 of the video sample. We save the matrices of d_new as a gray-scale depth video and use it in the same way as the RGB one.

For a typical video, there are several hundred frames of RGB and depth image pairs and thousands of detected STIPs descriptors. Each STIPs descriptor is a 162-dimensional vector composed of a 72-dimensional HOG [6] descriptor and a 90-dimensional HOF [7] descriptor. The HOG and HOF descriptors are computed at the detected interest points with the associated scale factors. The STIPs descriptor describes the local variation characteristics well in the x-y space as well as along the t axis. Fig. 1 shows the features detected in one frame of both the RGB image and the depth image. The circles are centered at the interest points and the radius of each circle is proportional to the scale factor σ of that interest point. It can be seen that the STIPs features on the RGB image and the depth image cover different regions of the subjects because of the different pixel variations in the two types of data. In fact, the brightness of the depth image varies most near the contour of the subject, including the head, arms and legs, etc. On the other hand, the brightness variation of the RGB image appears at the boundaries of the texture of the subject. So the STIPs features in the RGB images disclose more details of the subjects themselves, while in the depth images they capture more characteristics of the shape of the subjects. In conclusion, both features are useful and equally important for classification.
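The rescaling in Eq. (1) is straightforward to implement. The following minimal sketch, written in Python with NumPy (the function name and array layout are illustrative assumptions, not part of the original implementation), converts a 16-bit depth video into the 8-bit gray-scale video that is then fed to the same STIP detector and descriptor as the RGB stream:

```python
import numpy as np

def depth_video_to_gray(depth_video):
    """Rescale a 16-bit depth video to 8-bit gray frames as in Eq. (1).

    depth_video: uint16 array of shape (T, H, W); the value 0 marks missing depth.
    Returns a uint8 array of the same shape.
    """
    depth = depth_video.astype(np.float64)
    valid = depth > 0                         # d = 0 means no measurement
    d_min = depth[valid].min()                # minimal depth above 0 over the whole sample
    d_max = depth.max()                       # maximal depth over the whole sample
    scale = 255.0 / max(d_max - d_min, 1e-6)  # guard against a constant-depth video
    gray = np.zeros_like(depth)
    gray[valid] = scale * (depth[valid] - d_min)
    return np.clip(gray, 0, 255).astype(np.uint8)
```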
3. Coding approaches

A popular method for coding is the vector quantization (VQ) method, which solves the following constrained least-squares fitting problem:

$$
\min_{C} \sum_{i=1}^{M} \| x_i - B c_i \|_2^2 \quad \text{s.t.} \;\; \| c_i \|_0 = 1, \; \| c_i \|_1 = 1, \; c_i \succeq 0, \; \forall i, \qquad (2)
$$
where C = [c_1, c_2, ..., c_M] is the set of codes for X = [x_1, x_2, ..., x_M]. The cardinality constraint ||c_i||_0 = 1 means that there is only one non-zero element in each code c_i, corresponding to the quantization id of x_i. The non-negativity and ℓ1 constraints ||c_i||_1 = 1, c_i ⪰ 0 mean that the coding weight for x_i is 1. In practice, the single non-zero element can be found by searching for the nearest neighbor.

VQ provides an effective way to treat an image as a collection of local descriptors, quantize those descriptors into discrete "visual words", and then compute a compact histogram representation of the image for classification tasks. For the action recognition task, one can use all the STIPs descriptors extracted from the video samples as the candidate vectors for building the codebook. In our implementation, the standard k-means [8] method is employed to cluster the feature descriptors and the cluster centers are selected as the codewords to form the codebook. The codebook size K can be kept small to save computation time (e.g., K ≤ 512). For each feature descriptor, the nearest codeword under the Euclidean distance is selected as the coding vector, in the form of a K-dimensional vector with all components zero except one set to 1. After a sum pooling (or sum normalization) [9] process, the local feature vectors are combined into a global histogram representation over the codewords. This is the framework of the well-known Bag-of-Words model [10]. Finally, a χ²-kernel [11] SVM is used for classification. The χ² kernel is defined as

$$
K(H_i, H_j) = e^{-\frac{1}{A} D(H_i, H_j)}, \qquad (3)
$$

where H_i = {h_in} and H_j = {h_jn} are two histogram vectors, D(H_i, H_j) = (1/2) Σ_{n=1}^{K} (h_in − h_jn)² / (h_in + h_jn) is the χ² distance between H_i and H_j, and A is the mean value of the distance between every two pooling vectors of all the samples.
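For concreteness, the following sketch illustrates the VQ coding of Eq. (2) with sum pooling and the χ² kernel of Eq. (3). It assumes the descriptors and histograms are available as NumPy arrays and uses scikit-learn's KMeans for codebook construction; the function names are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, K=512, seed=0):
    """Cluster STIP descriptors with k-means; the cluster centers form the codebook."""
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(descriptors)
    return km.cluster_centers_                          # shape (K, 162)

def vq_histogram(descriptors, codebook):
    """VQ coding (Eq. (2)) followed by sum pooling into a normalized histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                         # quantization id of each descriptor
    hist = np.bincount(nearest, minlength=len(codebook)).astype(np.float64)
    return hist / max(hist.sum(), 1.0)                  # sum normalization

def chi2_kernel(H, A):
    """Chi-square kernel of Eq. (3) between all pairs of histograms (rows of H)."""
    num = (H[:, None, :] - H[None, :, :]) ** 2
    den = H[:, None, :] + H[None, :, :] + 1e-12
    D = 0.5 * (num / den).sum(-1)
    return np.exp(-D / A)                               # A: mean pairwise chi-square distance
```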
Fig. 1. STIPs features on RGB image (left) and depth image (right).
One disadvantage of VQ is that it introduces significant quantization errors since only one element of the codebook is selected to represent each descriptor. To remedy this, one usually has to design a nonlinear SVM as the classifier, which tries to compensate for the quantization errors. However, with nonlinear kernels the SVM pays a high training cost in both computation and storage. Considering the above defects, locality-constrained linear coding (LLC), a more accurate and efficient coding approach [9], is adopted to replace VQ in this paper. The optimization objective function of LLC is

$$
\min_{C} \sum_{i=1}^{M} \| x_i - B c_i \|_2^2 + \lambda \| d_i \odot c_i \|_2^2 \quad \text{s.t.} \;\; \mathbf{1}^T c_i = 1, \; \forall i, \qquad (4)
$$
where ⊙ denotes element-wise multiplication, and d_i ∈ R^K is the locality adaptor that gives a different degree of freedom to each basis vector, proportional to its similarity to the input descriptor x_i. Specifically,

$$
d_i = \exp\!\left( \frac{\mathrm{dist}(x_i, B)}{\sigma} \right), \qquad (5)
$$

where dist(x_i, B) = [dist(x_i, b_1), ..., dist(x_i, b_K)]^T and dist(x_i, b_j) is the Euclidean distance between x_i and b_j. The parameter σ adjusts the weight decay speed of the locality adaptor. The constraint 1^T c_i = 1 follows the shift-invariance requirement of the LLC code. To solve (4), the parameters λ and σ must be determined, which is a nontrivial task in practice. Noticing that the LLC solution has only a few significant values, the authors of [9] developed a faster approximation of LLC to speed up the encoding process. Instead of solving (4), they simply use the k (k ≪ K) nearest neighbors of x_i to form the local bases B̃_i and solve a much smaller linear system to get the coding vector. The resulting coding coefficients are then combined into a global representation of the sample during a pooling process, after which one can use classifiers such as the SVM to classify the samples. The next section introduces the pooling approach.

4. Pooling strategy

As with the VQ coding approach, the LLC coding coefficients c_i are expected to be combined into a global representation of the sample for classification. In early work on VQ and LLC, the SPM framework [12] is frequently used for pooling coding coefficients. In SPM, the image is first subdivided at several different levels of resolution; then, for each level of resolution, the coding coefficients that fall in each spatial bin are summed; finally, all the spatial histograms are concatenated with weights. In the action recognition application, since each sample is an image sequence, the time axis should also be divided into several bins. To this end, the feature space of a sample is separated into m regions by equally dividing the time interval of the sample into m + 1 parts. Each region contains two adjacent time parts, so each pair of adjacent regions has 50% overlap. Then the features (coding coefficients) in each region are combined by the max pooling approach of [9] to get a representation of the corresponding region. Finally, all the pooled features are concatenated into a feature vector of the sample. Fig. 2 illustrates the partition and pooling processes. The proposed algorithm is summarized in Algorithm 1.

Algorithm 1. Proposed action recognition algorithm.

Input: RGB-D sample set {{f_t^i}, {g_t^i} | t = 1, 2, ..., T_i}_{i=1}^N, where {f_t^i} and {g_t^i} are the corresponding RGB image sequence and depth image sequence.
Output: labels of the testing samples.
1: Detect STIPs in {f_t^i} and {g_t^i} respectively and extract the HOG-HOF feature descriptors {X_f^i} and {X_g^i}.
2: Use k-means to cluster {X_f^i} and {X_g^i} respectively and get the codebook B_f of the RGB sequences and the codebook B_g of the depth sequences.
3: Perform LLC coding and feature pooling respectively on {X_f^i} and {X_g^i} to obtain the representation {y_f^i} of the RGB data and {y_g^i} of the depth data.
4: Concatenate each y_f^i and y_g^i into the representation y^i of the sample.
5: Use a linear SVM to train and classify the representations {y^i}_{i=1}^N.
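A minimal sketch of steps 3-4 of Algorithm 1 is given below, covering the approximate LLC coding of [9] (k nearest codewords followed by a small constrained least-squares system) and the overlapping temporal max pooling of Section 4. The array layouts, parameter names and the regularization constant are illustrative assumptions rather than the authors' exact implementation:

```python
import numpy as np

def llc_code(x, codebook, k=10, beta=1e-4):
    """Approximate LLC coding of one descriptor using its k nearest codewords [9]."""
    d2 = ((codebook - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]                    # k nearest codewords
    z = codebook[idx] - x                       # shift codewords to the descriptor
    G = z @ z.T                                 # local covariance
    G += beta * np.trace(G) * np.eye(k)         # regularization for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                # enforce the constraint 1^T c = 1
    c = np.zeros(len(codebook))
    c[idx] = w
    return c

def temporal_max_pool(codes, timestamps, duration, m=4):
    """Max-pool LLC codes over m overlapping temporal regions and concatenate.

    The time interval is split into m + 1 equal parts; region r covers parts r and r + 1,
    so adjacent regions overlap by 50%, as described in Section 4.
    """
    part = duration / (m + 1)
    pooled = []
    for r in range(m):
        lo, hi = r * part, (r + 2) * part
        sel = (timestamps >= lo) & (timestamps < hi)
        region = codes[sel].max(axis=0) if sel.any() else np.zeros(codes.shape[1])
        pooled.append(region)
    return np.concatenate(pooled)               # final K*m dimensional representation
```

The RGB representation y_f^i and the depth representation y_g^i are each produced this way from their own codebooks and then concatenated (step 4) before the linear SVM.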
5. Experimental results

In this section, we first introduce the details of the utilized dataset and the compared methods. Then we present extensive experimental results.

5.1. Data set

We use the RGBD-HuDaAct [1] video database for performance evaluation. The database is composed of 30 people performing daily activities of 13 categories, including 12 named categories and 1 background category. Each sample is recorded in an indoor environment with a Microsoft Kinect sensor at a fixed position for a few seconds. Each video sample consists of synchronized and calibrated RGB-D frame sequences, which contain in each frame an RGB image and a depth image. The RGB and depth images in each frame have been calibrated with a standard stereo-calibration method available in OpenCV, so that points with the same coordinates in the RGB and depth images correspond to each other. For each video sample, the dimension of the RGB and depth images is 640 × 480 pixels. The color image is formatted as a standard 24-bit RGB image, while the depth image is represented by a single-channel 16-bit integer matrix. Fig. 3 shows some snapshots of the video samples. Since the video database primarily addresses elderly people's everyday living, the categories of activities are defined broadly, which means each category may contain different concrete actions or different combinations of meta-actions. For example, to make a phone call one can stand up or sit down, and use the left hand or the right hand. Therefore, the recognition and classification tasks can be quite challenging.

The authors of this database proposed a baseline approach which only trivially uses the depth information for classification. The differences between the proposed method and the baseline algorithm lie in the following three aspects:

1. The baseline method extracts feature descriptors from the RGB image sequence only, while the proposed method extracts feature descriptors from both the RGB and depth sequences. Therefore the proposed method exploits more information.
2. The baseline method utilizes VQ to realize linear coding, while the proposed method adopts the LLC method, which produces smaller reconstruction errors.
3. The baseline method pools the local coding vectors according to the depth information, while the proposed method pools the coding vectors according to the temporal information.

According to the suggestions in [1], we randomly selected 709 samples of 18 subjects out of the total 1189 samples of 30 subjects (60% sampling) as the experimental data. Each subject contains videos of almost all the pre-defined categories of activities performed by that subject.
Fig. 2. Pooling process: the practical video is shown in the right-top panel. For this video sequence, the temporal axis is divided into two regions (see the left-top panel). Each region contains two adjacent parts (denoted as d1 and d2). From the d1 part we extract coding coefficients (see the blue histogram in the left-bottom panel); from the d2 part we extract coding coefficients (see the red histogram in the left-bottom panel). Then the two histograms are concatenated into one new histogram (see the right-bottom panel). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
Fig. 3. Some examples of the RGB and depth image from the RGBD-HuDaAct dataset [1].
We repeated the experiment 18 times. In each round, we selected one of the subjects as the testing data and the other 17 subjects as the training data. There was no duplication among the 18 splits. We recorded the testing results for each round and collected all the results to calculate the overall recognition rate.

Since the resolution of the original videos is high, we down-sampled the video samples along all three axes to reduce the
dimensions in x, y and t. The input video was then 320 × 240 pixels at 15 fps, which is one-eighth of the size of the original. The number of features extracted from one preprocessed video sample ranged from about 0 to 2000. The STIPs detector and descriptor were implemented by the authors of STIPs, and the binary code was downloaded directly from the website [13]. The default parameters of the detector and descriptor were adopted: the Harris3D detector, with scale factors σ² = 4, 8, 16, 32, 64, 128, 256 and τ² = 2, 4; the minimal block size of the descriptor was (δx, δy, δt) = (18, 18, 10); and the dimensionality of HOG and HOF was 72 and 90, respectively. To save computation time, we set the size of the codebook to K = 128, 256 and 512. Standard k-means clustering was adopted to generate the codebook from the feature descriptors, and the cluster centers were selected as the codewords. For VQ in the baseline method, the nearest-neighbor algorithm was employed for coding. For LLC, the size of the neighborhood was set to k = 10. The pooling method for VQ was sum normalization according to [1], which is actually a normalized histogram of the distribution of the codewords in the sample. For LLC, the number of time splits was set to m = 1, 2, 3 and 4, respectively. Finally, we used the linear support vector machine implemented in [14] for classification. The regularization parameter C was empirically set to 1.

5.2. Results

First, we performed experiments on RGB videos, depth videos, and both. Table 1 shows the recognition rate of single-region pooling (i.e., m = 1). The results show that the depth-only method is better than the RGB-only method. The reason is that the spatial-temporal features of the depth sequences reflect more distinctions among activities. Moreover, the combination of RGB and depth data contributes the best results (much better than those on single-modality data). This verifies that RGB and depth data describe different aspects of daily activities and the classification performance can be dramatically improved when the two kinds of features are combined. The results in Table 1 also show that the recognition results improve when the size of the codebook increases. The reason is that a larger codebook leads to more accurate representations of the local feature descriptors. However, the time cost also increases when the codebook becomes larger, so for practical applications a tradeoff is always desired.

Table 2 shows the recognition results for each category using the LLC coding scheme. The abbreviations for each category
are B (go to bed), D (put on the jacket), E (exit the room), G (get up), I (sit down), K (drink), L (enter the room), M (eat meal), N (take off the jacket), O (mop the floor), P (make a phone call), T (stand up), and BG (background). The size of the codebook was set to K = 128, and the time axis was split into m = 1, 2, 3 and 4 parts. The results show that when m becomes larger, the results of most subcategories improve. This verifies that the time stamps of the meta-actions are indeed very important clues for the recognition of human activities.

In the baseline method proposed in [1], the depth information is used to pool the local descriptors within different regions. We re-implemented this method according to the settings suggested in [1]. The results are shown in Table 3, where the parameter L is the number of depth divisions. The last row of this table shows the results using a spatial-pyramid-matching pooling scheme similar to [12]. From these results we find that even the best baseline results are inferior to the results of the proposed method with no feature-space division (m = 1). This clearly indicates the advantage of LLC coding and depth features.

Tables 4 and 5 show the confusion matrices of the proposed method with m = 1 and m = 4, respectively. It can be seen from the tables that when m = 1, i.e., when no feature-space division exists, some categories are prone to be misclassified. For example, in Table 4 there is considerable confusion between action D (put on the jacket) and action N (take off the jacket). Fig. 4 shows some representative frames of the two actions. The local meta-actions of the two categories are indeed quite similar and the only significant difference lies in the temporal order. When it turns to the case of m = 4, i.e., the time axis is divided into 4 parts, the results of actions D and N improve significantly.
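For completeness, a sketch of the leave-one-subject-out evaluation protocol described above is given below (one test subject per round, the remaining subjects for training, overall accuracy accumulated over all rounds). It assumes the pooled RGB-depth representations, labels and subject ids are available as arrays, and uses scikit-learn's LinearSVC with C = 1 as a stand-in for the LIBSVM implementation [14] mentioned in the text:

```python
import numpy as np
from sklearn.svm import LinearSVC

def leave_one_subject_out(features, labels, subjects, C=1.0):
    """Leave-one-subject-out protocol of Section 5.

    features: (N, D) pooled RGB+depth representations y^i
    labels:   (N,) action labels, subjects: (N,) subject ids
    Returns the overall recognition rate over all rounds.
    """
    correct, total = 0, 0
    for s in np.unique(subjects):
        test = subjects == s                           # one subject held out per round
        clf = LinearSVC(C=C).fit(features[~test], labels[~test])
        pred = clf.predict(features[test])
        correct += (pred == labels[test]).sum()
        total += test.sum()
    return correct / total
```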
6. Conclusion

In this paper, we perform action recognition using an inexpensive RGB-D sensor. A depth spatial-temporal descriptor is developed to extract local regions of interest in the depth images. Such descriptors are very robust to illumination changes and background clutter. Further, the intensity spatial-temporal descriptor and the depth spatial-temporal descriptor are combined in the linear coding framework and an effective feature vector is constructed for action classification. Finally, extensive experiments are conducted on a publicly available RGB-D action recognition dataset and the proposed method shows promising results.
Table 1. Recognition results for m = 1.

Codebook size K    128    256    512
RGB only           0.74   0.84   0.86
Depth only         0.78   0.86   0.88
RGB and Depth      0.83   0.89   0.91

Table 3. Recognition results of the baseline method.

Codebook size      128    256    512
L = 1              0.69   0.77   0.79
L = 2              0.72   0.77   0.81
L = 4              0.74   0.78   0.79
L = 8              0.77   0.79   0.79
SPM                0.78   0.81   0.81
Table 2. Recognition results of the proposed method with different values of m.

Cate.   B     D     E     G     I     K     L     M     N     O     P     T     BG    Overall
m = 1   1.00  0.74  0.96  0.93  0.78  0.88  1.00  0.81  0.61  0.83  0.74  0.90  0.69  0.83
m = 2   1.00  0.80  0.96  0.98  0.91  0.92  1.00  0.85  0.81  0.83  0.81  0.87  0.71  0.88
m = 3   1.00  0.80  0.96  1.00  0.91  0.92  1.00  0.87  0.87  0.83  0.79  0.96  0.71  0.89
m = 4   1.00  0.87  0.94  1.00  0.93  0.96  1.00  0.91  0.93  0.85  0.83  0.94  0.76  0.92
Table 4. Confusion matrix for m = 1 (rows: actual action; columns: predicted action).

Action  B     D     E     G     I     K     L     M     N     O     P     T     BG
B       1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
D       0.00  0.74  0.00  0.00  0.02  0.00  0.00  0.00  0.15  0.00  0.00  0.00  0.09
E       0.00  0.02  0.96  0.00  0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.00
G       0.07  0.00  0.00  0.93  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
I       0.00  0.00  0.00  0.00  0.78  0.07  0.00  0.00  0.04  0.00  0.00  0.04  0.07
K       0.00  0.00  0.00  0.00  0.00  0.88  0.00  0.03  0.01  0.00  0.04  0.00  0.04
L       0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
M       0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.81  0.00  0.00  0.06  0.00  0.00
N       0.00  0.19  0.00  0.00  0.06  0.00  0.00  0.00  0.61  0.03  0.00  0.04  0.07
O       0.00  0.07  0.00  0.00  0.02  0.00  0.00  0.00  0.02  0.83  0.00  0.00  0.06
P       0.00  0.01  0.00  0.00  0.00  0.15  0.00  0.01  0.01  0.00  0.74  0.00  0.08
T       0.00  0.00  0.00  0.00  0.04  0.00  0.00  0.00  0.02  0.00  0.00  0.90  0.04
BG      0.00  0.07  0.00  0.00  0.00  0.02  0.00  0.00  0.04  0.07  0.11  0.00  0.69
Table 5. Confusion matrix for m = 4 (rows: actual action; columns: predicted action).

Action  B     D     E     G     I     K     L     M     N     O     P     T     BG
B       1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
D       0.00  0.87  0.00  0.00  0.02  0.00  0.00  0.00  0.02  0.07  0.00  0.00  0.02
E       0.00  0.02  0.94  0.00  0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.02
G       0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
I       0.00  0.02  0.00  0.00  0.93  0.02  0.00  0.00  0.00  0.00  0.00  0.00  0.03
K       0.00  0.00  0.00  0.00  0.00  0.96  0.00  0.01  0.00  0.00  0.01  0.00  0.02
L       0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
M       0.00  0.00  0.00  0.00  0.00  0.02  0.00  0.91  0.00  0.00  0.07  0.00  0.00
N       0.00  0.00  0.00  0.00  0.04  0.00  0.00  0.00  0.93  0.00  0.00  0.00  0.03
O       0.00  0.06  0.00  0.00  0.00  0.00  0.00  0.00  0.02  0.85  0.00  0.00  0.07
P       0.00  0.01  0.00  0.00  0.00  0.04  0.00  0.05  0.04  0.00  0.83  0.00  0.03
T       0.00  0.00  0.00  0.00  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.94  0.04
BG      0.00  0.07  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.02  0.15  0.00  0.76
Fig. 4. Snapshots of actions performed by the same person. Top: action put on the jacket. Bottom: action take off the jacket.
Acknowledgement

This work was supported by the National Key Project for Basic Research of China (Grant no. 2013CB329403), the National Natural Science Foundation of China (Grant nos. 61075027, 91120011 and 61210013), the Tsinghua Self-innovation Project (Grant no. 20111081111), and in part by the Tsinghua University Initiative Scientific Research Program (Grant no. 20131089295).

References

[1] B. Ni, G. Wang, P. Moulin, RGBD-HuDaAct: a color-depth video database for human daily activity recognition, in: IEEE International Conference on Computer Vision Workshops, IEEE, Barcelona, Spain, 2011, pp. 1147–1153.
[2] J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from RGBD images, in: IEEE International Conference on Robotics and Automation, IEEE, St. Paul, MN, USA, 2012, pp. 842–849.
[3] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2010.
[4] L. Morency, A. Quattoni, T. Darrell, Latent-dynamic discriminative models for continuous gesture recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Minneapolis, USA, 2007, pp. 1–8.
[5] I. Laptev, On space-time interest points, Int. J. Comput. Vis. 64 (2005) 107–123.
[6] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE, San Diego, USA, 2005, pp. 886–893.
[7] N. Dalal, B. Triggs, C. Schmid, Human detection using oriented histograms of flow and appearance, in: European Conference on Computer Vision, Springer, Graz, Austria, 2006, pp. 428–441.
[8] A. Jain, M. Murty, P. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (1999) 264–323.
[9] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, San Francisco, CA, USA, 2010, pp. 3360–3367.
[10] L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE, San Diego, USA, 2005, pp. 524–531.
[11] H. Wang, M. Ullah, A. Klaser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, in: British Machine Vision Conference, 2009.
[12] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE, New York, NY, USA, 2006, pp. 2169–2178.
[13] Public STIPs Binaries, 〈http://www.di.ens.fr/ laptev/download.html〉, 2005.
[14] C. Chang, C. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 27.
Huaping Liu received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 2004. He is currently an Associate Professor in the Department of Computer Science and Technology at Tsinghua University. His research interests include intelligent control and robotics.
Fuchun Sun received the Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1998. He is now a full professor in this department. He serves as an associate editor of IEEE Transactions on Fuzzy Systems and Mechatronics, and as a member of the editorial boards of the International Journal of Robotics and Autonomous Systems, the International Journal of Control, Automation, and Systems, Science in China Series F: Information Science, and Acta Automatica Sinica. His research interests include intelligent control, neural networks, fuzzy systems, and robot teleoperation.
Mingyi Yuan received the Bachelor's degree from the Department of Physics at Peking University in 2007, and the Master's degree from the Department of Computer Science and Technology at Tsinghua University in 2013. He is now with the Microsoft Asia-Pacific R&D Group. His research interests include computer vision and machine learning.