Learning action patterns in difference images for efficient action recognition


Guoliang Lu a,b,*, Mineichi Kudo a

a Graduate School of Information Science and Technology, Hokkaido University, Sapporo 060-0814, Japan
b School of Mechanical Engineering, Shandong University, Jinan 250061, China

* Corresponding author at: Graduate School of Information Science and Technology, Hokkaido University, Sapporo 060-0814, Japan. Tel.: +81 11 706 6854. E-mail addresses: [email protected], [email protected] (G. Lu), [email protected] (M. Kudo).

Article history: Received 3 January 2013; received in revised form 19 June 2013; accepted 25 June 2013; available online 22 August 2013. Communicated by Liang Wang.

Abstract

A new framework is presented for single-person oriented action recognition. The framework requires neither detection/localization of human-body bounding boxes nor motion estimation in each frame. A novel descriptor/pattern for action representation is learned from local temporal self-similarities (LTSSs) derived directly from difference images. The bag-of-words framework is then employed for action classification, taking advantage of these descriptors. We investigated the effectiveness of the framework on two public human action datasets: the Weizmann dataset and the KTH dataset. The proposed framework achieves a recognition rate of 95.6% on the Weizmann dataset and 91.1% on the KTH dataset, both competitive with state-of-the-art approaches, while having a high potential for faster execution.

Keywords: Action patterns; Efficient action recognition; Temporal self-similarities; Bag-of-words

1. Introduction

Vision-based human activity/action analysis has received increasing attention in recent years. Many datasets can be used as benchmarks. Some of them include a large variety of scenes in which more than one performer, sometimes a crowd of people, act simultaneously, as in the CMU action detection dataset, the Hollywood realistic action dataset and the INRIA pedestrian detection dataset. Other datasets include a single performer per scene, as in the KTH dataset, the Weizmann dataset and the Keck gesture dataset. In this study, we concentrate on the latter type of dataset, driven by the natural demand for it in single-person oriented applications such as human-computer interaction (HCI), home elderly-assistance systems, motion retrieval and so on.

In this direction, significant research efforts have been made to extract effective action representations for recognition. Detailed surveys can be found in [1-3]. In this study, we divide these methods into two groups according to whether they require bounding boxes of the human body. In the former group, bounding boxes are first segmented in each frame by either background subtraction or human detection/tracking, and features are then extracted from the normalized bounding boxes based on, e.g., motion flow [8,10,11,28], geometrical modeling of body parts [4-7,39], or color/binary appearance [11-13,31-33,37]. This kind of method has shown good recognition accuracy on some public datasets, but it also has two common disadvantages:

- It needs a prior procedure of modeling the background/human body for background subtraction or human detection. In addition, detection of bounding boxes in real applications is often computationally expensive.
- Its performance relies heavily on the quality of the found bounding boxes. Unfortunately, it is not easy to obtain high-quality results because of noise and variations between training and observation images.

The latter group of methods needs neither pre-segmentation nor tracking of individuals in a video. Instead, they rely on space-time interest points (STIPs) [14,15,35], spatio-temporal features (e.g., obtained by combining a 3D gradient descriptor and an optical flow descriptor [20] or by using the 3D DT-CWT [21]), volumetric analysis of video frames [16,18,19,34,36,37], difference images [9,22,23] and so on. For space-time interest points, the sparse interest points detected, e.g., by [14] are sometimes not sufficient to characterize human actions [20,35], while the larger number of interest points detected, e.g., by [35] incurs a larger computational cost because of the higher dimensionality of the action video, even with sparse detection [24]. Spatio-temporal features relax the requirement on space-time interest points but need a relatively complex computation. For volumetric analysis, a large number of space-time patches may also result in a high computational load. On the other hand, the difference image has been shown to have the potential power of action discrimination [22,23] and can be obtained very fast by frame subtraction.

1.1. The previous work

Temporal self-similarity (TSS) is one of the attractive cues for vision-based action recognition by virtue of the following properties:

- TSS has a good capacity for absorbing intra-class variations, e.g., the different visual appearances of individuals and different performing styles of actions [31,32,41,42].
- With TSS features, fewer training samples are required for action modeling; e.g., in [19], only one video sequence is used to model one action.

Typical TSSs seek action patterns from the original frame description [10,41], human silhouettes [10], histograms of gradients [31,32], optical flow [31,32] or body trajectories [31,32], all of which require subject detection in each frame. A new TSS, called GTSS, was proposed in our previous work [25]; it seeks motion patterns between all pairs of difference images in a given video sequence without requiring bounding boxes and has demonstrated advantages over some typical TSSs. It still needs improvement for practical usage in two respects: (1) the GTSS bypasses time-consuming subject detection for extracting action patterns, but it still requires a relatively large computational cost, namely O(N^2) for a video sequence with N frames, which is not suitable for real-time applications, especially for long videos. (2) In [25], we justified the use of GTSS for dynamic-sequence-matching-based action recognition; the experimental results showed its superiority, in recognition rate, over conventional TSSs. The recognition rate was, however, rather low: only 77.8% on the Weizmann dataset, which is far from what real-world applications require.

1.2. Contributions of this paper

To cope with these two issues, in this paper we extend the previous work by proposing a new pattern-extraction method using local temporal self-similarities (LTSSs) derived from difference images, and introduce the bag-of-words framework to assemble the obtained LTSSs for action recognition. Our main contributions are twofold:

- The newly proposed LTSS, like the GTSS, does not require time-consuming subject detection, and meanwhile inherits the general properties of conventional TSSs, e.g., robustness against intra-class variations. On the other hand, it needs less computation than the GTSS (see Section 3.4 for details), which satisfies the requirements of practical usage.
- We employ the bag-of-words framework [15,24] for action classification by representing an action as a collection of the extracted LTSSs, i.e., a codebook trained from all training video sequences. It is experimentally shown that the proposed LTSS with the bag-of-words representation is comparable, in recognition rate, with state-of-the-art work, while promising higher execution efficiency (see Section 5.3 for comparison).

The rest of the paper is organized as follows. An overview of the proposed framework is given in Section 2. Section 3 describes the proposed LTSS. In Section 4, the bag-of-words framework for action recognition is introduced, followed by experimental results and analysis in Section 5. In Section 6, we discuss the proposed approach. Finally, we conclude the paper and present one possible direction for improvement in Section 7.

2. Overview of the proposed scheme

First, we show the overview of the proposed framework in Fig. 1. In the training phase, the LTSSs are computed from every training video sequence and described by block-based descriptors, and then the bag-of-words framework is employed to model actions. In the testing phase, the LTSSs of a test sequence are computed by the same procedure as in training, and the action is then classified with the trained human action models in the bag-of-words framework. The detailed procedures are described in the following.

Fig. 1. Overview of the proposed framework.


3. Proposed local self-similarities in difference images

For computing LTSSs, we make the following assumptions: (1) the sampling rate (25 fps in the experiments) is sufficiently high for generating difference images, and (2) the camera view is stationary or its motion can be estimated.

3.1. Difference image generation and representation

As in [25], the proposed LTSS is computed directly from difference images. A difference image expresses the change of the current frame from its preceding frame and can be obtained very fast, usually without requiring background subtraction or object detection/tracking. The generating process is as follows. First, the gray-level difference image ΔI(x, y, t) between two consecutive frames I(t) and I(t−1) is computed as

\Delta I(x, y, t) = | I(x, y, t) - I(x, y, t-1) |,   (1)

where I(x, y, t) ∈ [0, 255] is the pixel value at (x, y) of the t-th frame in a given frame sequence. As a pre-processing step, we remove small differences with a threshold λ: ΔI(x, y, t) = 0 if |I(x, y, t) − I(x, y, t−1)| ≤ λ, where λ was set to 5 in the experiments. In addition, we remove isolated spatial noise prior to (1) by applying a smoothing filter: I(t) ← I(t) * G, where G is a filter (a 3 × 3 mean filter in the experiments) and '*' denotes convolution.

Next, for each difference image ΔI(x, y, t) of size n × m, we project the difference pixel values onto the X- and Y-axes (see Fig. 2), respectively, for image representation. Then we normalize these histograms to a maximum value of one as

H_t^X(x) = \frac{\sum_{y=1}^{n} \Delta I(x, y, t)}{\max_x \sum_{y=1}^{n} \Delta I(x, y, t)}, \quad x = 1, 2, \ldots, m,   (2)

H_t^Y(y) = \frac{\sum_{x=1}^{m} \Delta I(x, y, t)}{\max_y \sum_{x=1}^{m} \Delta I(x, y, t)}, \quad y = 1, 2, \ldots, n,   (3)

where H_t^X is the histogram of the X-axis projection in the t-th frame and H_t^Y is that of the Y-axis. The temporal volumes V^X and V^Y, formed by H_t^X(x) and H_t^Y(y), respectively, show discriminative shapes of actions. For example, as shown in Fig. 3, for forward movements such as run, walk and skip we can clearly see the movement direction in V^X and periodicities in V^Y, whereas for periodic movements in one position, such as jack and wave, we can see periodicities in both H_t^X(x) and H_t^Y(y).
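To make these steps concrete, the following sketch implements Eqs. (1)-(3) with NumPy/SciPy under the settings stated above (3 × 3 mean filter, λ = 5). The function and variable names are our own illustration, not the authors' code.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def projection_histograms(prev_frame, curr_frame, lam=5):
    """Difference image (Eq. 1) and its normalized X- and Y-axis projections (Eqs. 2-3)."""
    # Smooth each gray-level frame with a 3x3 mean filter to suppress isolated noise.
    prev_s = uniform_filter(prev_frame.astype(np.float64), size=3)
    curr_s = uniform_filter(curr_frame.astype(np.float64), size=3)

    # Eq. (1): absolute frame difference, with small changes (<= lambda) set to zero.
    diff = np.abs(curr_s - prev_s)
    diff[diff <= lam] = 0.0

    # Project the difference pixels onto the X-axis (per column) and Y-axis (per row).
    proj_x = diff.sum(axis=0)
    proj_y = diff.sum(axis=1)

    # Eqs. (2)-(3): normalize each projection so that its maximum value is one.
    h_x = proj_x / proj_x.max() if proj_x.max() > 0 else proj_x
    h_y = proj_y / proj_y.max() if proj_y.max() > 0 else proj_y
    return diff, h_x, h_y
```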

3.2. Local temporal self-similarity computation

For computing the LTSS D(t) at time t, we collect neighboring frames to form a subsequence Q(t) with a given margin Δt (set to 7 empirically in the experiments): Q(t) = {ΔI(x, y, t−Δt), …, ΔI(x, y, t), …, ΔI(x, y, t+Δt)}, represented by {H_{t−Δt}^Z, …, H_t^Z, …, H_{t+Δt}^Z}, where Z denotes X or Y. Then D(t)^Z is constructed as the anti-similarity (distance) between every pair of the i-th and j-th frames in Q(t):

D^Z = [d_{i,j}^Z] = \begin{pmatrix} d_{1,1}^Z & \cdots & d_{1,T}^Z \\ \vdots & \ddots & \vdots \\ d_{T,1}^Z & \cdots & d_{T,T}^Z \end{pmatrix}, \qquad d_{i,j}^Z = \| H_i^Z - H_j^Z \|,   (4)

where ‖·‖ is the Euclidean distance between the histograms of two frames in Q(t), i, j ∈ [t−Δt, …, t+Δt], and the size T of D^Z is (2Δt + 1).
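The LTSS of Eq. (4) is simply a small pairwise-distance matrix over the window Q(t). A minimal NumPy sketch (helper names are ours; 0-based indices, no boundary handling) follows; as illustrated in Fig. 4 below, in an on-line setting only the last row and column of D(t) actually need to be recomputed from D(t−1).

```python
import numpy as np

def ltss(histograms, t, dt=7):
    """Local temporal self-similarity D(t)^Z of Eq. (4).

    histograms: array of shape (N, B) holding H_1^Z ... H_N^Z for one axis
                (either the X- or the Y-projection histograms).
    Returns the (2*dt+1) x (2*dt+1) Euclidean distance matrix over the
    window Q(t) = {t-dt, ..., t, ..., t+dt}.
    """
    window = histograms[t - dt:t + dt + 1]           # T = 2*dt + 1 frames
    diff = window[:, None, :] - window[None, :, :]   # all pairwise histogram differences
    return np.sqrt((diff ** 2).sum(axis=-1))         # symmetric, zero diagonal
```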

Fig. 2. An example of difference image and its X- and Y-axis projections.

Fig. 4. Calculation of D(t) from D(t−1): given two consecutive LTSSs, D(t−1) and D(t), each depicted as a square of small blocks, where each block is the anti-similarity between one pair of frames in the corresponding subsequence, only the last row and column need to be calculated to obtain D(t) from D(t−1).

Fig. 3. The temporal volumes V^X and V^Y, formed by H_t^X(x) and H_t^Y(y), respectively, of nine actions from the Weizmann dataset.


Fig. 5. Procedure for obtaining the block-based descriptor of an LTSS: given an LTSS at time i (a), its gradient (b) is first computed. Then an upper-right triangular mask consisting of 15 blocks (c) [10] is placed over the local TSS. Within every block q ∈ {1, 2, …, 15}, we compute an 8-bin histogram b_q of gradient directions [31,32]. The block-based descriptor p_i of frame i is generated by concatenating all 15 b's into a vector of 8 × 15 = 120 dimensions: p_i = [b_1; b_2; …; b_15]^T, where T stands for transpose. The block size was set to 3 × 3 in the experiments. (a) Local TSS, (b) gradient, and (c) block-based descriptor.

Apparently, D^Z is symmetric and its diagonal elements are always zero.

Practical issue: Note that there is a large overlap between two consecutive LTSSs, D(t−1) and D(t). Therefore, most entries d_{i,j} need not be recomputed when obtaining D(t), as illustrated in Fig. 4. This suits on-line computation in practical usage.

3.3. Block-based descriptor extraction

To capture the structures embedded in the obtained LTSS, we employ a block-based descriptor, as illustrated in Fig. 5, in order to account for local noise/fluctuations in the temporal extent of a video (a sketch of this extraction is given after Section 3.4). Let p denote this kind of descriptor; one video sequence F is then represented by a sequence of these descriptors, P(F) = [p_1, …, p_i, …, p_N], where p_i corresponds to the i-th difference image and N is the number of frames. Note that, since we compute the local TSS on the X- and Y-axes separately, there are two independent descriptors, p_i^X and p_i^Y, computed from D(i)^X and D(i)^Y for the i-th difference image. The resulting descriptor p_i in P(F) is obtained by concatenating p_i^X and p_i^Y, i.e., p_i = [p_i^X; p_i^Y]^T.

3.4. Comparison with GTSS

We compare the proposed LTSS with the previously proposed GTSS [25] in terms of computational complexity and practical usage.

Computational complexity: Localizing the TSS, i.e., the LTSS, is beneficial in two respects. First, the time complexity is reduced from O(N^2) to O(N), because a constant-size window of frames is analyzed. Second, the flexibility increases: the global TSS is suited to modeling one periodic action over the whole observation, whereas the LTSS can adapt to several different actions, periodic or non-periodic, within a single video sequence.

Practical usage: On the other hand, the LTSS seeks action patterns only locally in the temporal extent of the video. This limits its usage in settings such as action-to-action comparison [42] and time-frequency analysis of one periodic action [10,41], where the GTSS can be used.
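The following sketch illustrates the block-based descriptor of Section 3.3 under our reading of Fig. 5 (gradient of the LTSS, an upper-right triangular mask of 3 × 3 blocks, one 8-bin direction histogram per block). The bin layout and helper names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def block_descriptor(D, block=3, nbins=8):
    """Block-based descriptor of one LTSS (cf. Fig. 5).

    D: (T, T) LTSS matrix; with T = 15 (Delta_t = 7), a grid of 3x3 blocks
       gives 5x5 = 25 blocks, of which the upper-right triangle holds 15.
    Returns one 8-bin gradient-direction histogram per triangular block,
    concatenated into a vector of 15 * nbins = 120 values.
    """
    gy, gx = np.gradient(D.astype(float))            # gradient of the LTSS surface
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)      # gradient directions in [0, 2*pi)

    hists = []
    nb = D.shape[0] // block                         # blocks per side (5 when T = 15)
    for bi in range(nb):
        for bj in range(bi, nb):                     # upper-right triangle incl. diagonal
            patch = ang[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block]
            h, _ = np.histogram(patch, bins=nbins, range=(0, 2 * np.pi))
            hists.append(h)
    return np.concatenate(hists)

# Frame-level descriptor: concatenate the X- and Y-axis versions,
# p_i = [p_i^X; p_i^Y], e.g. np.concatenate([block_descriptor(Dx), block_descriptor(Dy)]).
```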

4. Bag-of-words framework for action representation and classification

The low computational cost of the LTSS makes it more attractive than the GTSS, especially in time-critical or even real-time applications. However, as described already, because of its limited temporal scope, the LTSS is a low-level description of a human action; by itself it does not have a strong capacity for discriminating different actions. In this section we therefore employ the standard bag-of-words framework to assemble these low-level local descriptors, motivated by its recent successes [15,24]. The typical procedure of employing such a framework includes (see Fig. 1): (1) codebook (i.e., vocabulary of words) formation and video description, and (2) action modeling and classification.

4.1. Codebook formation and video description

Assume that every training video sequence has been represented as a sequence of block-based descriptors P, as described in Section 3.3. By collecting these descriptor sequences, we obtain R = ⋃_{j=1}^{M} P_j, where P_j corresponds to the j-th training sequence and M is the number of sequences. As illustrated in Fig. 6, the codebook (i.e., the vocabulary of words) is then constructed by clustering R with the K-means algorithm, and the codewords are defined as the centers of the resulting clusters. Each frame of a training video sequence is assigned to one codeword, i.e., one cluster center, by minimizing the Euclidean distance over all codewords of the codebook. (In the recent work [17], kernel density estimation is proposed for frame assignment to reduce the information loss caused by quantization errors; in this paper we use the typical strategy.) Finally, every training video sequence is described as a histogram of assigned codewords. The effect of the codebook size K on action recognition was investigated in the experiments (see Tables 1 and 2).

4.2. Action modeling and classification

Assume we have M_c histograms of codewords for all training video sequences of action c ∈ {1, 2, …, C}. For a newly observed video sequence F′, described by a histogram h′ of codewords, we apply either a k-nearest-neighbor classifier (k-NNC) or a support vector machine (SVM) to classify the performed action.

k-NNC: We compare h′ with the k nearest histograms {h_1^c, …, h_k^c} of every action c, using a distance metric dist, typically the Euclidean distance. The most similar class is then chosen as

F' \to c^*, \quad \text{where} \; c^* = \arg\min_{c} \sum_{j=1}^{k} \mathrm{dist}(h', h_j^c), \quad h_j^c \in \{h_i^c : i \in \{1, 2, \ldots, M_c\}\},   (5)

SVM: We train linear SVMs in a one-against-all framework to handle multi-class classification.
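As an illustration only (not the authors' implementation, which used MATLAB and the OSU-SVM toolbox), the codebook formation, video description and one-against-all SVM classification could be sketched with scikit-learn as follows; the k-NNC rule of Eq. (5) can be implemented analogously by comparing h′ with the stored training histograms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_codebook(descriptor_sequences, k=100, seed=0):
    """Cluster all frame descriptors R (union of the P_j) into K codewords."""
    stacked = np.vstack(descriptor_sequences)        # every p_i from every training video
    return KMeans(n_clusters=k, random_state=seed).fit(stacked)

def video_histogram(descriptors, codebook):
    """Describe one video as a normalized histogram of assigned codewords."""
    words = codebook.predict(descriptors)            # nearest cluster center per frame
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_svm(train_histograms, train_labels):
    """Linear SVMs; LinearSVC handles the one-against-all scheme internally."""
    return LinearSVC().fit(np.vstack(train_histograms), train_labels)
```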

5. Experiments

We investigated the performance of the proposed framework on two publicly available human action datasets: the Weizmann dataset [29] and the KTH dataset [30].

Weizmann dataset: The Weizmann dataset (see Fig. 7(a)) contains 10 categories of human actions: bend, jack, jump, pjump, run, side, skip, walk, wave1 and wave2, performed by each of nine subjects.


Fig. 6. Procedures of codebook formation and video description in a bag-of-words framework.

Table 1. Recognition rate (%) of the proposed framework on the Weizmann dataset.

K        70     75     80     85     90     95     100    105    110    115    120
1-NNC    91.11  88.89  91.11  86.67  86.67  83.33  88.89  87.78  85.56  86.67  83.33
3-NNC    90.00  93.33  94.44  90.00  87.78  90.00  92.22  91.11  92.22  93.33  88.89
5-NNC    90.00  94.44  94.44  88.89  87.78  91.11  95.56  90.00  92.22  93.33  88.89
SVM      91.11  93.33  92.22  92.22  92.22  92.22  92.22  95.56  93.33  91.11  94.44

Table 2. Recognition rate (%) of the proposed framework on the KTH dataset.

(a) KTH_s1
K        70     75     80     85     90     95     100    105    110    115    120
1-NNC    88.67  88.00  86.00  83.33  88.67  87.33  88.67  89.33  89.33  86.67  90.00
3-NNC    88.67  88.67  87.33  88.67  91.33  90.00  91.33  90.67  89.33  89.33  91.33
5-NNC    89.33  86.67  88.67  87.33  92.67  92.67  90.67  90.00  92.00  90.00  91.33
SVM      95.33  90.00  92.67  89.33  92.00  95.33  92.00  94.67  91.33  93.33  92.00

(b) KTH_s3
K        70     75     80     85     90     95     100    105    110    115    120
1-NNC    80.00  78.67  79.33  76.00  77.33  80.67  80.00  86.67  82.67  83.33  84.67
3-NNC    82.67  82.67  83.33  82.00  84.00  82.67  83.33  86.67  84.67  86.67  85.33
5-NNC    82.00  81.33  80.67  82.67  83.33  82.00  81.33  86.00  84.00  86.00  82.67
SVM      82.67  86.00  83.33  86.67  84.67  87.33  85.33  86.00  86.67  84.67  85.33

(c) KTH_s4
K        70     75     80     85     90     95     100    105    110    115    120
1-NNC    75.33  90.00  86.67  80.67  76.00  83.33  79.33  82.00  83.33  82.00  82.00
3-NNC    78.67  90.67  86.00  82.00  81.33  86.67  84.00  88.00  86.00  82.67  83.33
5-NNC    79.33  88.67  86.00  84.00  82.00  88.00  83.33  86.67  86.67  82.67  86.00
SVM      80.67  84.00  85.33  85.33  87.33  88.67  82.00  86.00  90.67  85.33  84.67

The resolution of each image is 180 × 144 pixels. The two actions wave1 and wave2 have very similar flow and are easily confused with each other by flow-based approaches [40], including the proposed one. We therefore regard them as one action, wave, as done in [31,32].

KTH dataset: The KTH dataset (see Fig. 7(b)) contains six categories of human actions: boxing, handclapping, handwaving, jogging, running and walking, performed by each of 25 subjects under four scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors with shadows (s4). The resolution of each image is 160 × 120 pixels. In the experiments, we excluded scenario s2, since the computation of difference images is based on the assumption of a stationary camera view.

5.1. Implementation and settings

In the experiments, we used the same parameter settings as described above. In addition, we tested codebook sizes K from 70 to 120 with a step of 5. We tested 1-NNC, 3-NNC and 5-NNC for k-NNC-based action classification. We used the OSU-SVM toolbox (http://www.ece.osu.edu/~maj/osu_svm) in MATLAB for the SVM-based action classification, with the parameters set to their default values. The classification performance was estimated by leave-one-out cross-validation.
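For illustration of the evaluation protocol only (not the authors' code), a leave-one-out loop over video-level codeword histograms, assuming the helpers sketched in Section 4, could look like this:

```python
import numpy as np
from sklearn.svm import LinearSVC

def leave_one_out_accuracy(histograms, labels):
    """Leave-one-out cross-validation over video-level codeword histograms.

    Note: strictly, the codebook would also be re-trained without the
    held-out video; this sketch only re-trains the classifier.
    """
    histograms, labels = np.asarray(histograms), np.asarray(labels)
    correct = 0
    for i in range(len(labels)):
        mask = np.arange(len(labels)) != i                     # hold out video i
        clf = LinearSVC().fit(histograms[mask], labels[mask])
        correct += int(clf.predict(histograms[i:i + 1])[0] == labels[i])
    return correct / len(labels)
```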

5.2. Performance

The recognition rates of the proposed framework, defined as the ratio of the number of correctly recognized videos to the total number of videos in the dataset, are shown in Tables 1 and 2. In the Weizmann dataset (see Table 1), the recognition rate of the proposed framework reached 95.56% at K = 100 using 5-NNC, the same as that obtained using SVM at K = 105. In the KTH dataset (see Table 2), the proposed framework shows only small differences in recognition rate between k-NNC-based classification and SVM: 92.67% (at K = 90, 95 using 5-NNC) versus 95.33% (at K = 95 using SVM) in s1; 86.67% (at K = 105 using 1-NNC and at K = 105, 115 using 3-NNC) versus 87.33% (at K = 95 using SVM) in s3; and the same 90.67% (at K = 75 using 3-NNC and at K = 110 using SVM) in s4. Overall, the SVM classifier outperformed k-NNC.


Fig. 7. Example images from the datasets. (a) Weizmann dataset and (b) KTH dataset.

The recognition performance is strongly affected by the codebook size K, both with k-NNC classification and with SVM, which is consistent with the report in [33]. This is probably because the number of training samples is not sufficient to describe the actions, or at least not sufficient to generate stable codewords. The confusion matrices of action recognition using SVM classification are shown in Figs. 8 and 9. In the Weizmann dataset (see Fig. 8), the action skip has been reported to have relatively low discriminability in [8,11,15,26,31,32,40], because its style strongly depends on the performer. Nevertheless, the proposed framework performed better than these works and matched the best result [27], as shown in Table 3. In the KTH dataset (see Fig. 9), walking is the easiest action to recognize, at 100% in all scenarios. Discrimination between jogging and running is, however, relatively unsatisfactory in s3 with the proposed method, at only 72% and 76%. This is not surprising, because in this scenario the performers were asked to wear different clothes (e.g., sports clothes, dust coats), which affects to some extent the difference images used by the LTSS.

Fig. 8. Confusion matrix of action recognition on Weizmann by SVM.

5.3. Comparison

Many approaches [26-28,33,38,39] combine multiple features for action recognition. This strategy yields more satisfactory recognition performance, but it unavoidably results in a larger computational load. Since we use only a single kind of feature (the LTSS) for time-saving recognition, we compare our performance with that of state-of-the-art approaches that also use a single feature, for a fair and objective comparison. Tables 4 and 5 show the results; for the compared approaches, the recognition rates are taken from the reference papers. In recognition performance, the proposed framework performs the same as the best work [27] on the Weizmann dataset; on the KTH dataset, it achieves the best result in scenario s1 and is, in total, comparable to the best work [34]. It should be noted, however, that [34] obtained its performance with computationally expensive biologically inspired features, which may not be suitable for real-time applications. In contrast, our framework has a satisfactory processing speed, which is investigated in the following.

It is hard to compare real computation costs, because of the difficulty of faithfully implementing and fairly executing those approaches under the same conditions, but the proposed framework has a high potential for faster execution. This is because the LTSS employed for frame representation does not need relatively time-consuming pre-processing to locate the human body in the observed scene; instead, it utilizes difference images, which can be generated very fast. In Table 6, in our computational environment (CPU 3.10 GHz, RAM 4.00 GB), we measured the computation time of the three phases of the proposed framework and compared them with the computation time of human-body location. We chose one video sequence from the Weizmann dataset, lena_walk, with a length of 72 frames. In the proposed framework, we employed SVM classification using the bag-of-words framework with a codebook size of K = 105. For human-body location in the comparison, we initialized the actor scope by hand in the first frame and then performed typical kernel-based tracking [43] to track the actor through the spatio-temporal extent frame by frame.


Table 4. Comparison with state-of-the-art approaches on Weizmann. The bold value indicates the best recognition performance.

Method           Year        Description                      Ave. (%)
Junejo [31,32]   2008, 2011  Image-based TSSs + NNC           93.5
                             Trajectory-based TSSs + NNC      94.9
Wang [27]        2012        Single interest feature + SDSM   94.5
                             Compound feature + SDSM          95.6
                             Motion segment + SDSM            94.5
Jiang [33]       2012        Motion + prototype tree          88.9
                             Shape + prototype tree           81.1
Ours                         LTSS + SVM                       95.6

Table 5. Comparison with state-of-the-art approaches on KTH. The bold value indicates the best recognition performance.

Method        Year  Description                 s1    s3    s4    Ave. (%)
Dollar [35]   2005  Cuboid prototypes           88.2  78.5  90.2  85.6
Jhuang [34]   2007  Gradient based GrC3 + SVM   91.3  90.3  93.2  91.6
                    Flow based OfC3 + SVM       92.3  91.7  92.0  92.2
Yang [36]     2009  Motion edge history         83.7  82.6  92.4  86.2
Jiang [33]    2012  Motion + prototype tree     92.8  89.4  83.6  88.6
                    Shape + prototype tree      71.9  53.0  57.4  60.8
Ours                LTSS + SVM                  95.3  87.3  90.4  91.1

Table 6. Comparison of computation time (s) of the three phases of the proposed framework and of the method based on human-body location (DI generation: difference-image generation; LTSS: computation and description of the proposed LTSS; Clas: classification; HB location: human-body location; FE: feature extraction).

Method     Phase 1                  Phase 2       Phase 3      Total
Proposed   DI generation: 0.015     LTSS: 0.109   Clas: 0.16   0.28
Compared   HB location [43]: 4.7    FE: N/A       Clas: N/A    4.7+

Note that, in the proposed framework, most of the computation time is consumed in the LTSS computation and classification phases, and the total time is 0.28 s overall, whereas the human-body location process alone needs 4.7 s. This result implies that we have succeeded in bypassing the process with the largest computation cost.

Fig. 9. Confusion matrices of action recognition on KTH by SVM: (a) confusion matrix on KTH s1; (b) confusion matrix on KTH s3; and (c) confusion matrix on KTH s4.

Table 3. Comparison of skip recognition in the Weizmann dataset.

Method            Year        Description                                 Rec. (%)
Dhillon [11]      2009        Appearance and motion based feature + SVM   69
Guo [8]           2010        Optical flow + sparse representation        74
Junejo [31,32]    2008, 2011  Image/trajectory-based TSSs + NNC           70
Benmokhtar [26]   2012        Feature fusion + SVM                        56
Wang [27]         2012        Compound feature + SDSM                     89
Ours                          LTSS + SVM                                  89

6. Discussion

Although we have assumed a single performer in this study, the proposed framework can be utilized even in the case of multiple performers. We may implement this with one of the following two techniques:

- Adopting a sliding-window approach over the video, by which we first assume only one performer in each searched 3D subvolume and then compute the SSM by the proposed method; the action in this subvolume is then recognized.
- Keeping only the target performer by excluding the other, uninterested performers in the observed scene, so that the action of this performer is recognized as stated already. In this case, the target performer has to be located prior to action recognition, but we only require an approximate position of the performer so as to exclude the other performer(s), which can be obtained by a naive human detection algorithm. This relaxes the requirement on the detection algorithm.

7. Conclusion

In this study, a bag-of-words based framework has been presented for single-person oriented action recognition. Its promise has been investigated, in comparison with some state-of-the-art approaches, on two public datasets. The framework has a high potential for fast execution, since it uses difference images instead of locating the human body. To further improve the recognition performance, combining other fast-to-compute descriptors is one direction for our future work.

Acknowledgments

We would like to thank Professor Shunichi Kaneko and Professor Hideyuki Imai, from the Graduate School of Information Science and Technology of Hokkaido University, for their comments and discussions when we were preparing this work.

References

[1] R. Poppe, Vision-based human motion analysis: an overview, Computer Vision and Image Understanding 108 (2007) 4-18.
[2] P. Turaga, R. Chellappa, V.S. Subrahmanian, O. Udrea, Machine recognition of human activities: a survey, IEEE Transactions on Circuits and Systems for Video Technology 18 (11) (2008) 1473-1488.
[3] R. Poppe, A survey on vision-based human action recognition, Image and Vision Computing 28 (6) (2010) 976-990.
[4] G. Mori, J. Malik, Estimating human body configurations using shape context matching, in: Proceedings of the International Conference on ECCV, 2002, pp. 150-180.
[5] F. Lv, R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in: Proceedings of the International Conference on CVPR, 2007, pp. 1-8.
[6] Y. Shen, H. Foroosh, View-invariant action recognition using fundamental ratios, in: Proceedings of the International Conference on CVPR, 2008, pp. 1-6.
[7] P. Natarajan, V.K. Singh, R. Nevatia, Learning 3d action models from a few 2d videos for view invariant action recognition, in: Proceedings of the International Conference on CVPR, 2010, pp. 2006-2013.
[8] K. Guo, P. Ishwar, J. Konrad, Action recognition using sparse representation on covariance manifolds of optical flow, in: Proceedings of the International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2010, pp. 188-195.
[9] M. Yang, F. Lv, W. Xu, K. Yu, Y. Gong, Human action detection by boosting efficient motion features, in: Proceedings of the International Conference on ICCV Workshop, 2009, pp. 522-529.
[10] C. BenAbdelkader, R. Cutler, L.S. Davis, Motion-based recognition of people in eigengait space, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 2002, pp. 267-274.
[11] P.S. Dhillon, S. Nowozin, C.H. Lampert, Combining appearance and motion for human action classification in videos, in: Proceedings of the International Conference on CVPR Workshop, 2009, pp. 22-29.
[12] L. Shao, D. Wu, X. Chen, Action recognition using correlogram of body poses and spectral regression, in: Proceedings of the International Conference on Image Processing (ICIP), 2011, pp. 209-212.
[13] L. Shao, X. Chen, Histogram of body poses and spectral regression discriminant analysis for human action categorization, in: Proceedings of the International Conference on BMVC, 2010, pp. 1-11.
[14] I. Laptev, On space-time interest points, International Journal of Computer Vision 64 (2) (2005) 107-123.
[15] J.C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, International Journal of Computer Vision 79 (3) (2008) 299-318.
[16] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: Proceedings of the International Conference on ICCV, 2005, pp. 1395-1402.
[17] D. Wu, L. Shao, Silhouette analysis based action recognition via exploiting human poses, IEEE Transactions on Circuits and Systems for Video Technology 4 (2012) 1-7.
[18] E. Shechtman, M. Irani, Matching local self-similarities across images and videos, in: Proceedings of the International Conference on CVPR, 2007, pp. 1-8.
[19] H.J. Seo, P. Milanfar, Detection of human actions from a single example, in: Proceedings of the International Conference on ICCV, 2009, pp. 1965-1970.
[20] L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari, G. Serra, Recognizing human actions by fusing spatio-temporal appearance and motion descriptors, in: Proceedings of the International Conference on ICIP, 2009, pp. 3569-3572.
[21] R. Minhas, A. Baradarani, S. Seifzadeh, Q.M. Jonathan Wu, Human action recognition using extreme learning machine based on visual vocabularies, Neurocomputing 73 (2010) 1906-1917.
[22] M.B. Holte, T.B. Moeslund, P. Fihl, View invariant gesture recognition using the CSEM SwissRanger SR-2 camera, International Journal of Intelligent Systems Technologies and Applications 5 (3) (2008) 295-303.
[23] M. Zobl, F. Wallhoff, G. Rigoll, Action recognition in meeting scenarios using global motion features, in: Proceedings of the International Conference on PETS-ICVS, 2003, pp. 32-36.
[24] T. Kobayashi, N. Otsu, Motion recognition using local auto-correlation of space-time gradients, Pattern Recognition Letters 33 (9) (2012) 1188-1195.
[25] G. Lu, M. Kudo, Self-similarities in difference images: a new cue for single-person oriented action recognition, IEICE Transactions on Information and Systems 96 (5) (2013) 1238-1242.
[26] R. Benmokhtar, Robust human action recognition scheme based on high-level feature fusion, Multimedia Tools and Applications (2012) 1-23.
[27] H. Wang, C. Yuan, W. Hu, C. Sun, Supervised class-specific dictionary learning for sparse modeling in action recognition, Pattern Recognition 45 (11) (2012) 3902-3911.
[28] A. Fathi, G. Mori, Action recognition by learning mid-level motion features, in: Proceedings of the International Conference on CVPR, 2008, pp. 1-8.
[29] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: Proceedings of the International Conference on ICCV, 2005, pp. 1395-1402.
[30] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in: Proceedings of the International Conference on ICPR, 2004, pp. 32-36.
[31] I.N. Junejo, E. Dexter, I. Laptev, P. Pérez, Cross-view action recognition from temporal self-similarities, in: Proceedings of the International Conference on ECCV, 2008, pp. 293-306.
[32] I.N. Junejo, E. Dexter, I. Laptev, P. Pérez, View-independent action recognition from temporal self-similarities, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1) (2011) 172-185.
[33] Z. Jiang, Z. Lin, L.S. Davis, Recognizing human action by learning and matching shape-motion prototype trees, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3) (2012) 533-547.
[34] H. Jhuang, T. Serre, L. Wolf, T. Poggio, A biologically inspired system for action recognition, in: Proceedings of the International Conference on ICCV, 2007, pp. 1-8.
[35] P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: Proceedings of the International Conference on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 65-72.
[36] M. Yang, F. Lv, W. Xu, K. Yu, Human action detection by boosting efficient motion features, in: Proceedings of the International Conference on ICCV Workshop, 2009, pp. 522-529.
[37] L. Shao, L. Ji, Y. Liu, J. Zhang, Human action segmentation and recognition via motion and shape analysis, Pattern Recognition Letters 33 (4) (2012) 438-445.
[38] M. Marín-Jiménez, N. Pérez de la Blanca, M. Mendoza, Learning features for human action recognition using multilayer architectures, in: Proceedings of the International Conference on Pattern Recognition and Image Analysis, 2011, pp. 338-346.
[39] M. Ahmad, S.W. Lee, Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition 4 (7) (2008) 2237-2252.
[40] P. Scovanner, S. Ali, M. Shah, A 3-dimensional sift descriptor and its application to action recognition, in: Proceedings of the International Conference on Multimedia, 2007, pp. 357-360.
[41] R. Cutler, L.S. Davis, Robust real-time periodic motion detection, analysis, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 781-796.
[42] A.A. Efros, A.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: Proceedings of the International Conference on ICCV, 2003, pp. 726-733.
[43] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (5) (2003) 564-577.


Guoliang Lu received the B.E. and M.E. degrees from Shandong University, Jinan, China, in 2006 and 2009, respectively. He received his doctoral degree from the Graduate School of Information Science and Technology of Hokkaido University, Sapporo, Japan, in March 2013. He is currently an associate research professor at Shandong University. His research interests include human action/activity analysis from images and video.

Mineichi Kudo received his Dr. Eng. degree in Information Engineering from Hokkaido University in 1988. At Hokkaido University, he was an instructor (1988-1994) and an associate professor (1994-2001), and has been a professor since 2001. In 1996, he visited the University of California, Irvine. In 2001, along with Professor Jack Sklansky, he received the 27th Annual Pattern Recognition Society award for the most original manuscript among all 2000 Pattern Recognition issues. His current research interests include the design of pattern recognition systems, image processing, data mining and computational learning theory. He is a member of the Pattern Recognition Society and the IEEE.

Mineichi Kudo received his Dr. Eng. degree in Information Engineering from the Hokkaido University in 1988. At Hokkaido university, he was an instructor (1988– 1994), an associate professor (1994–2001) and is currently a professor (2001–). In 1996, he visited at University of California, Irvine. In 2001, along with professor Jack Sklansky, he received the 27th Annual Pattern Recognition Society award for the most original manuscript from all 2000 Pattern Recognition issues. His current research interests include design of pattern recognition systems, image processing, data mining and computational learning theory. He is a member of the Pattern Recognition Society and the IEEE.