Engineering Applications of Artificial Intelligence 26 (2013) 924–935
Automatic scene calibration for detecting and tracking people using a single camera

David Perdomo, Jesús B. Alonso, Carlos M. Travieso, Miguel A. Ferrer
Institute for Technological Development and Innovation in Communications, ULPGC, 35017 Las Palmas, Spain
Article history: Received 8 February 2012; received in revised form 1 July 2012; accepted 26 August 2012; available online 15 October 2012.

Keywords: People tracking; Multiple people segmentation; Active vision; Surveillance; Scene calibration

Abstract: Segmenting and tracking people is an important aim in video analysis, with numerous applications. Scene calibration enables the system to process the input video differently depending on the camera position and the scene characteristics, improving the final results. In complex situations there are extended occlusions, shadows and/or reflections, so an appropriate calibration is required in order to achieve advanced people segmentation as well as a robust tracking algorithm. In most cases, once the system has been installed in a given scene, it is difficult to obtain the calibration information of the scene. In this paper, an automatic method to calibrate the scene for people detection and tracking systems is presented, based on measurements from video sequences captured with a stationary camera.
Corresponding author: D. Perdomo ([email protected]); Tel.: +34 928 452863; fax: +34 928 451243. This work has been supported by the Spanish Government Project TEC2009/14123/C04.

1. Introduction

At present, the automatic analysis of video sequences has become a very active research area in which a large number of technological developments are used. A good monitoring system should be able to detect and track the various elements of the scene in a robust way. Furthermore, it should keep information about the trajectories for later analysis of the behavior of the moving subjects. Typical problems are partial or total occlusion, merging into and splitting from groups, or lighting changes that hinder proper detection and subsequent tracking of people. Therefore, the scene should be properly calibrated in order to optimize the results of every processing step.

Usually, monocular systems for detecting and tracking people are limited by the camera position and its perspective conditions in order to avoid complexity in the segmentation and tracking modules. The objective of this paper is to present an automatic scene calibration for this type of system. It aims to enable the system to adjust automatically to the perspective and the scene conditions. Thanks to this adjustment it is possible to optimize the system modules notably and, in consequence, the final results.

Since the term calibration can be used for different purposes, it is important to distinguish several concepts. Intrinsic camera calibration is the determination of the set of parameters that describe the image formation process
(focal length, central point, distortion) (Du and Brady, 1993). On the other hand, extrinsic camera calibration refers to the parameters that relate the position and orientation of the camera to a fixed reference in the chosen coordinate system (position, orientation) (Faugeras, 1995). A very common problem is 3D reconstruction (Jih-Gau, 1997), which is solved with multiple/stereo-camera systems, camera motion (structure from motion), or perspective projection. In contrast, scene calibration (Avitzour, 2004) defines the geometry of a monocular system (with a stationary camera), which enables the system to estimate distances between points and the heights of objects assuming a preferred 2-D surface in the scene on which the objects rest. The proposed scene calibration aims to solve this problem automatically by using calibration video sequences. In this sense, the procedure provides the system with the information necessary to interpret the size of the objects (in pixels) and how they change along the scene. In the calibration procedure, we assume that the camera parameters are known and no camera calibration is necessary. In this paper, we first give an overview of people detection and tracking systems and the main techniques used. Section 3 describes the proposed scene calibration process and the algorithms developed using this calibration. Experimental results using the proposed method are given in Section 4. Finally, we conclude the paper in Section 5.
2. Detecting and tracking people systems

The main purpose of these systems is to detect the presence of passersby and to track their paths.
From a broad point of view, it is possible to distinguish different types of systems according to how they operate. On the one hand, there are systems based on the detection (and tracking) of biometric features, such as facial or body part detection (Xi et al., 2009; Chengbin and Huadong, 2010). These systems use a machine learning approach for feature detection, which is considered a subfield of artificial intelligence. This approach processes labeled training data to learn a model of the data and then uses the learned model to predict on test data. The key is the use of statistical reasoning to find approximate solutions for tackling difficulties such as feature matching and indexing, 3D reconstruction from a single image, subject segmentation, pattern classification, etc. Although this type of system can achieve good results, the biometric features must be properly registered in such a way that the system is able to detect them. This condition can limit these systems with regard to the position of the camera and the scene conditions (e.g. if the system is developed to detect the faces of passersby, changing the perspective could impede a proper detection of the faces).

On the other hand, there are systems based on the detection of regions of motion and subsequent segmentation (Snidaro et al., 2005; Haritaoglu et al., 2000) and tracking. For such systems it is essential to develop a people segmentation algorithm. In this case, the robustness is subject to a good motion detector conforming to the proposed requirements. Detecting, segmenting and tracking objects may be considered the primary steps, depending on the way the system works. Furthermore, subsequent tasks (e.g. classification of objects, registration and evaluation of events) could also be considered main processing steps. Saykol et al. (2010) provide a multi-layered data model to handle the information, in which different object features are considered in order to achieve the extraction, tracking and classification of objects, the annotation of events and meta-data extraction. Their model contains information classified at different levels (pixel-data orientation, class type, color, shape features, primitive object actions and conceptual object predicates).

Stereo systems determine the three-dimensional characteristics of a scene from two (or more) cameras with different points of view. Despite the complexity of adding stereo to this type of system, the main advantages include segmentation, locating objects in 3D, and handling occlusion events. Furthermore, the most significant advantage is the acquisition of information, because stereo leads to direct range measurements and, unlike monocular approaches, does not infer depth or orientation through the use of photometric and statistical assumptions. For example, Muñoz-Salinas et al. (2008) use stereo vision to obtain body silhouettes with depth information and, in this way, to carry out the registration of complex body poses without having to track feature points. In Dittrich and Koerich (2012), multiple cameras are used to mitigate the problem of occlusion, establishing a weighting scheme of corner motion statistics that takes into account the position (perspective and distance) of the people in relation to the camera position.

Although vision-based techniques are widely known as the traditional method used to detect and track people, it is worth briefly mentioning laser-based systems.
This type of system tries to avoid the usual problems (conditions of the camera setting, changes of illumination, weather conditions, etc.). The laser range finder has been developed as a novel sensor technology which aims to measure the range distance from the sensor to a target object. Lee et al. (2007) present a people tracking system employing two laser range finders in order to detect trespassers crossing through a security door. They use a discrete human model in Kalman filtering to estimate the state of the human object. The automatic scene calibration presented in this paper was developed for a simple monocular system. The use of only one camera hinders an exact 3D reconstruction of the scene, so that
no information about distances and dimensions is available. In our system, the three techniques motion detection, object segmentation and tracking are considered as sequential steps, operationally independent but closely related to each other. The following subsections briefly introduce a simple classification of these main techniques.

2.1. Main motion detection techniques

Taking into account that the robustness of the system depends strictly on the performance of the motion detection, selecting a proper technique plays an important part. At present there is a wide range of motion detection algorithms, but it is possible to classify them into three categories according to the way of processing:

2.1.1. Algorithms based on the temporal gradient

This is perhaps the simplest kind of motion detection. Assuming that the camera is static, the algorithm detects movement through the differentiation of two consecutive frames. The subtraction generates a difference image in which noticeable intensity changes indicate movement (Bouthemy and Lalande, 1993; Ma et al., 2008). This method allows adaptation to changing environments, but it depends on the size and speed of the moving objects and provides very poor results when it comes to extracting all the relevant pixels.

2.1.2. Algorithms based on background subtraction

This technique also requires the camera to be static, since moving objects are identified by differencing each video frame with a background reference model (Dawson-Howe, 1996; Wang et al., 2008). The background image is a frame in which only the static elements of the scene appear. The pixels of the processed frame that deviate significantly from the values of the fixed model are considered moving objects (foreground). This technique achieves better results than the previous one, since it obtains more pixels of a moving object; hence it was chosen for implementing our system.

2.1.3. Algorithms based on optical flow

This is another of the conventional methods used to detect moving objects, with the additional advantage that it can work with camera movements. The approach is based on the calculation of the local relative motion (Barron et al., 1994), which is also used for segmentation of space (Ranchin and Dibos, 2004). Although valuable information is obtained, this method is computationally very heavy and therefore usually requires specific hardware.

2.2. People segmentation

On the one hand there are segmentation algorithms based on the contour of people, i.e., analyzing the edges of the person's silhouette (Ranchin and Dibos, 2004), and on the other hand there are algorithms based on the regions of motion (Zhang and Chen, 2007). This classification is shown in Fig. 1. Methods based on the shape of the silhouette are divided into two groups: systems based on a model of the complete person silhouette (Hessein et al., 2006; Koenig, 2007; Zhou and Hoang, 2005) and systems based on a model of joined parts of the same human shape (Haritaoglu et al., 1998; Wu and Nevatia, 2005). These systems need to extract the outline very precisely, which can be complicated depending on the scene, lighting changes, presence of shadows, movement speed, etc. Methods based on motion regions can also be subdivided: systems that model the person as a single region (Cui et al., 2007), and those that detect the different regions within a person and their
subsequent union (Harasse et al., 2006; Sprague and Luo, 2002). By contrast, systems based on features extracted from the regions of motion do not usually need a well-defined region; a good approximation is sufficient.

Fig. 1. Classification of main people segmentation methods.

Fig. 2. Classification of main tracking techniques.

2.3. Tracking

Primarily there are three different types of tracking techniques (Yilmaz et al., 2006): point, kernel and silhouette tracking. This classification is shown in Fig. 2.

2.3.1. Point tracking

This type of technique is used when the object model is based on points, and the goal is to relate the points in frame n to the same points in frame n−1. This method is divided into two main categories:
Deterministic: The matching is achieved using a set of motion constraints. The most appropriate matching is chosen with optimal assignment methods, e.g. the Hungarian algorithm (Kuhn, 1955), or with greedy search methods (a minimal sketch of such an assignment is given after this list).

Probabilistic: The matching is achieved taking into account some uncertainty (quantified in the form of error) measurements (Isard and Blake, 1998), using the state space approach to model object properties such as position, velocity and acceleration. For example, the Kalman filter is used to estimate the state of a linear system where the state is assumed to be normally distributed (Gaussian). The limitation of this assumption can be overcome by using particle filtering (Tanizaki, 1987), in which the conditional state density is represented by a set of samples (not necessarily Gaussian). After an initialization, this method iterates three steps: resampling (new particles are propagated), computation (the samples are evaluated) and prediction (a given number of particles is selected to be propagated in the next iteration). Using these steps the position of the subject can be estimated.
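As an illustration of the deterministic variant, the following sketch matches centroids between two consecutive frames with the Hungarian algorithm. It is a minimal example under our own assumptions (SciPy available, Euclidean cost, a simple gating radius as the motion constraint); the names are illustrative and not taken from any specific system.

```python
# Minimal sketch of deterministic point matching between two frames using
# the Hungarian algorithm (Kuhn, 1955). The gating distance and all names
# are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_centroids(prev_pts, curr_pts, max_dist=50.0):
    """Return a list of (prev_index, curr_index) pairs for matched centroids."""
    prev_pts = np.asarray(prev_pts, dtype=float)   # shape (M, 2)
    curr_pts = np.asarray(curr_pts, dtype=float)   # shape (N, 2)
    # Cost matrix: Euclidean distance between every previous and current centroid.
    cost = np.linalg.norm(prev_pts[:, None, :] - curr_pts[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)       # optimal assignment
    # Discard assignments that violate a simple motion constraint (gating radius).
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

if __name__ == "__main__":
    prev_centroids = [(100, 200), (300, 180)]
    curr_centroids = [(305, 185), (104, 207)]
    print(match_centroids(prev_centroids, curr_centroids))  # [(0, 1), (1, 0)]
```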
2.3.2. Kernel tracking

This technique tracks primitive regions and their parameters (perimeter, texture features, etc.) from one frame to the next.

Template: This tracking is based on the use of a template to find the object in the image, usually through correlations (Schweitzer et al., 2002) or some similar method (a small correlation-based sketch follows this list). This method can be problematic when the object's appearance varies during the scene. However, it works very well for tracking objects that do not change their shape.

Appearance models: This approach searches for certain characteristics of the object, such as color histograms, gradients or others, from which density models are obtained, so that the direction of motion of the object can be predicted without scanning the entire image. One such method is the Mean Shift (Comaniciu et al., 2003).

Multiview appearance models: Multiple views of the object are learned offline in order to make the tracking algorithm less susceptible to drastic changes in the shape of the object. For this purpose, a subspace-based method called Eigenspace (Black and Jepson, 1998), based on Principal Component Analysis (PCA) (Shlens, 2005), is used.
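As a small illustration of template-based kernel tracking, the sketch below locates an object patch in a new frame by normalized cross-correlation with OpenCV. It is only a sketch under our assumptions (grayscale inputs, no template update policy); names are ours.

```python
# Minimal sketch of template tracking by correlation (cf. Schweitzer et al.,
# 2002): the object's patch from the previous frame is searched in the new
# frame with normalised cross-correlation. Names are illustrative assumptions.
import cv2

def track_template(frame_gray, template_gray):
    """Return the top-left corner (x, y) of the best match and its score."""
    result = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_loc, max_val
```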
2.3.3. Tracking based on silhouettes

This approach deals with tracking complex objects for which a simple geometric shape cannot be used. These techniques can be subdivided into two groups:

Shape matching: Complex shapes are used as search patterns (Cole et al., 2004).

Contour tracking: This method does not consider a static shape, but tries to follow the evolution of the contour of each object frame by frame in order to obtain a more accurate tracking of objects that tend to change their shape during the scene (Cole et al., 2004).

Although particle filters have recently become popular in computer vision for object detection and tracking, our tracking algorithm uses the relationship between the centers of mass of the segmented subjects to accomplish a deterministic correspondence of the points. Thanks to this method and its flexibility, the system is able to incorporate concepts of cognitive video. In this sense, the way the system works is adapted in a novel and simple way that conforms to the proposed requirements.
3. Proposed automatic scene calibration

The described problem and the proposed solution focus on the height of the detected objects and its relationship with their position on the screen. Considering the width of the objects is convenient in situations where the width changes significantly when the objects move horizontally. Nevertheless, we assume that the camera is positioned parallel to the horizontal plane of the scene, so that the changes in the objects' width are not appreciable compared with the changes in height when moving vertically. However, the same approaches explained below could also be applied to the width of the objects and how it changes along the screen.

3.1. Problem description

In many systems, a view from the top is chosen for the camera in order to avoid problems caused by occlusions (see Fig. 3(a)). This requirement can be considered a limitation of the system's capabilities and applications (e.g. registering facial images of every incoming passerby, analyzing people's silhouettes and behaviors, etc.).
Fig. 3. Different views depending on the position of the camera: (a) vertical view and (b) horizontal view with perspective.
Fig. 4. Example of two different scenes with different perspective projections.
On the other hand, a frontal placement of the camera with perspective (Fig. 3(b)) does not have these limitations, but it generates partial and total occlusions, so that the complexity of the segmentation and tracking algorithms is considerably increased. The proposed calibration aims to optimize the tracking algorithm and to help the system handle multiple-people segmentation with occlusions independently of the position of the camera. As can be seen in Figs. 3 and 4, the way the system should process the video sequence depends strictly on the position of the camera and the scene conditions. Other systems deal with the segmentation problem by calculating the perspective as a scalar factor for every different scene. For example, to compute the number of horizontally occluded people, Ma et al. (2008) use a scaling factor (according to the perspective) and a representative width of a single person. This solution can be appropriate in scenes in which the width of the passersby changes proportionally to the scaling factor over a considerable area of the screen (e.g. the excerpt of the sequence in Fig. 4(a)). On the other hand, this proposal is not suitable for every scene condition: for the excerpt of the sequence shown in Fig. 4(b) this method would not be enough. Hence, it is necessary to achieve a more particular calibration of the scene, so that the system is able to interpret what the size of a person should be at any point of the screen, independently of the position of the camera. Hereby, a further analysis of the subject's height is required. This should be achieved as a function of the coordinates of the center of mass of its generated region of motion (hereafter called the centroid). As a graphical example, Fig. 5 shows an excerpt of a sequence in which a passerby is crossing the scene. This example is suitable to
help to understand some determining concepts used in this paper. For every frame, a bounding box is represented as an estimation of the real size of the subject (obviously, it is not always possible to display this bounding box inside the screen). On the one hand, the blue arrow on the right side of the bounding box indicates the estimation of the real height of the subject. On the other hand, the left arrow, in red, indicates the height of the region of motion generated by the subject. It is important to notice that two centroids are perceptible in the images: the one of the region of motion (colored in red) and the one of the estimated real height of the subject (colored in blue). Thanks to Figs. 5 and 6 it is possible to explain graphically the relationship between the position of the centroid, the height of the subject and the corresponding parameters. In the first frame of Fig. 5 the subject is not yet totally inside the screen; thus, the generated region of motion does not correspond to the bounding box. In the second frame, the person is totally inside the screen; therefore, both heights and centroids coincide. At this position, the centroid reaches point A. In the third frame both heights coincide as well, until the centroid reaches point B. As the subject continues walking in the fourth frame, it goes out of the screen and the centroids and heights no longer coincide. As can be seen in Fig. 6(a) and (b), there are three different zones (z1, z2, z3) according to how the region of motion changes as a function of the position of its centroid on the y-axis. The importance of this behavior is reflected in the relationship between the vertical positions of both centroids (colored in blue and red, respectively).
Fig. 5. Excerpt of sequence: a single person crossing the scene as a graphical example of the analysis of the height of the subject. (For interpretation of the references to the color in this figure, the reader is referred to the web version of this article).
Fig. 6. Denominated height-position curve (a) and representation of the different zones in the screen according to the height-position curve (b). (For interpretation of the references to the color in this figure, the reader is referred to the web version of this article).
In Fig. 6(a) both mentioned heights are represented as a function of the vertical coordinate (on the y-axis) of the centroid of the generated region of motion. The resulting curve is defined as the height-position curve. Fig. 6(b) shows a frame with three subjects whose centroids are in the three different zones. It is important to remark that the three different zones (z1, z2, z3) are delimited by the points A and B. This information is helpful for carrying out the segmentation of passersby in the regions of motion and thereby solving the perspective problems. However, the calibration is not only essential for carrying out the segmentation, but also for the tracking algorithm. The distance covered by a single subject between consecutive frames increases as the subject approaches the camera (due to the perspective of the scene), so that the tracking algorithm also needs the calibration parameters in order to optimize its results.
3.2. Scene calibration process: parameter estimation

The goal of the described process is to compute the calibration model automatically, just by giving several calibration video sequences as inputs. In other words, from several video sequences the calibration process automatically generates the calibration model. The calibration model is the set of parameters that mathematically describes the behavior of the real size (in pixels) of the subject and of the size of the generated region of motion, based on the position of their centroids, for every pixel in the screen. This calibration process is designed to calculate the calibration model independently of the position of the camera, provided a simple condition is met: the resulting height-position curve
should consider the three different zones (z1, z2, z3) explained in Fig. 6. Namely, this condition entails that the position of the camera produces at least a minimal perspective (i.e. a 100% vertical view from the top is not possible). If the scene (due to the position of the camera) has these features, it is irrelevant how large the zones are and how deep the perspective is. Assuming that the mentioned conditions are met, the video sequences used for the calibration process should have the following features, due to the importance of the motion detection:
- The sequence should contain a single passerby (or several passersby) crossing the scene without occlusions. The path and direction of the subjects are irrelevant provided that there is no occlusion between them.
- The motion detection should work appropriately: the region of motion generated for every subject should represent the size of the person accordingly. That is to say, the region of motion should not include existing shadows and should not omit parts of the subject.
As previously mentioned, the calibration aims to obtain the set of parameters that describes the calibration model. Hence, it is necessary to calculate the two inflection points (A and B) of the so-called height-position curve in order to compute the perspective of the scene.

Fig. 7. Flowchart of the scene calibration process.

As can be seen in Fig. 7, the whole process is structured in two main steps: the motion detection and the analysis of the regions. As mentioned above, the motion detection is achieved by using the background subtraction technique and a background updating model. To model the background, a simple running average technique (Benezeth et al., 2008) is used:

BG(n,p) = a·BG(n−1,p) + (1−a)·FR(n,p)    (1)

where BG(n,p) is the background for the pixel p at the instant n, FR(n,p) is the corresponding frame from the video sequence and a is the adapting factor (between 0 and 1). The adapting factor determines how much the frame acts on the background model. If a low value of a is chosen, the background is rapidly updated, so that moving objects can be taken into account as part of the background. Hence the value of the adapting factor should be near 1 in order to obtain an appropriate adaptation to the slow motions of the video sequences. Once the background is updated, the subtraction is achieved using a method based on the relationship between the current pixel and its neighborhood (Spagnolo et al., 2004):

C(x,y) = [ Σ_{i=−a..a} Σ_{j=−a..a} Q(x+i, y+j)·H(i,j) ] / sqrt( [ Σ_{i=−a..a} Σ_{j=−a..a} FR(x+i, y+j)²·H(i,j) ] · [ Σ_{i=−a..a} Σ_{j=−a..a} BG(x+i, y+j)²·H(i,j) ] )    (2)

where BG(x,y) is the background pixel at position (x,y), FR(x,y) is the pixel of the analyzed frame, Q(x,y) is BG(x,y)·FR(x,y), H(i,j) is the correlation mask of size n × n and a is (n−1)/2. An appropriate thresholding is applied to C(x,y) to generate the binary image B(x,y), which represents the regions of motion. Thereafter, further morphological processing is applied for the enhancement and improvement of the blobs in B(x,y). These blobs represent the size of the subjects inside the screen.

The samples {y_k, h_k} are the heights h_k of every region of motion over all the calibration video sequences, obtained as a function of the vertical position y_k of its centroid. This set of points is used to obtain the height-position curve, which determines the calibration model (see Fig. 6(a)). After the samples are obtained, they are interpolated {ŷ_n, ĥ_n} and successively smoothed {ŷ_n^s, ĥ_n^s} in order to avoid eventual deviations of the registered heights (see a real example in Fig. 14). After that, the aim is to calculate the inflection points A (ŷ_A^s, ĥ_A^s) and B (ŷ_B^s, ĥ_B^s) in order to compute mathematically the height-position curve h(y):

h(y) = y·a1^{z1} + a0^{z1}   for 0 ≤ y < A   (zone 1)
h(y) = y·a1^{z2} + a0^{z2}   for A ≤ y < B   (zone 2)
h(y) = y·a1^{z3} + a0^{z3}   for B ≤ y < N   (zone 3)    (3)

where N indicates the number of lines of the screen and therefore the last value of y, a0^{zi} is the h-intercept and a1^{zi} is the slope of the line of zone i.

The point A (ŷ_A^s, ĥ_A^s) is calculated by finding the first local maximum of the first derivative of h with respect to y, Local max(dĥ_n^s / dŷ_n^s). In other words, this point indicates the first appreciable change of h with respect to the change in y (so that the slope of the curve changes for the first time). On the other hand, the point B (ŷ_B^s, ĥ_B^s) is easily obtained by finding the global maximum of the interpolated and smoothed curve, Global max({ŷ_n^s, ĥ_n^s}).
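To make the calibration pipeline of Fig. 7 more concrete, the sketch below strings together Eq. (1) (background update), Eq. (2) (neighborhood correlation) and the estimation of A, B and the three zone lines of Eq. (3) from the collected samples {y_k, h_k}. It is a minimal sketch under our own assumptions (NumPy/SciPy, a uniform correlation mask H, fixed thresholds and window sizes), not the authors' exact implementation.

```python
# Minimal sketch of the scene calibration pipeline (Fig. 7): Eq. (1) background
# update, Eq. (2) neighbourhood correlation, then estimation of points A, B
# and the three zone lines of Eq. (3). Mask shape, thresholds and window sizes
# are illustrative assumptions.
import numpy as np
from scipy.ndimage import uniform_filter

def update_background(bg, frame, a=0.95):
    """Eq. (1): BG(n,p) = a*BG(n-1,p) + (1-a)*FR(n,p), with a close to 1."""
    return a * bg + (1.0 - a) * frame

def correlation_map(bg, frame, n=5):
    """Eq. (2) with a uniform mask H: local means are used instead of sums,
    which is equivalent because the constant factor cancels in the ratio."""
    bg, fr = bg.astype(float), frame.astype(float)
    num = uniform_filter(bg * fr, size=n)              # local mean of Q = BG*FR
    den = np.sqrt(uniform_filter(fr ** 2, size=n) *
                  uniform_filter(bg ** 2, size=n))
    return num / (den + 1e-9)

def foreground_mask(bg, frame, n=5, thr=0.9):
    """Low correlation with the background marks a pixel as moving."""
    return correlation_map(bg, frame, n) < thr

def estimate_height_position_model(y_samples, h_samples, step=1.0, win=15):
    """From the samples {y_k, h_k} (centroid row, blob height) collected over
    the calibration sequences, locate A and B and fit one line per zone."""
    y = np.asarray(y_samples, float)
    h = np.asarray(h_samples, float)
    order = np.argsort(y)
    y_grid = np.arange(y.min(), y.max(), step)
    h_interp = np.interp(y_grid, y[order], h[order])               # interpolation
    h_smooth = np.convolve(h_interp, np.ones(win) / win, "same")   # smoothing
    dh = np.gradient(h_smooth, y_grid)
    # Point A: first local maximum of dh/dy (first appreciable slope change).
    a_idx = next((i for i in range(1, len(dh) - 1)
                  if dh[i] > dh[i - 1] and dh[i] >= dh[i + 1]), 1)
    # Point B: global maximum of the smoothed height-position curve.
    b_idx = int(np.argmax(h_smooth))
    A, B = y_grid[a_idx], y_grid[b_idx]
    zones = [y_grid < A, (y_grid >= A) & (y_grid < B), y_grid >= B]
    lines = [np.polyfit(y_grid[m], h_smooth[m], 1) for m in zones]  # (a1, a0) per zone
    return A, B, lines
```

In an actual deployment, the foreground mask would feed a blob labeling step (e.g. connected components) whose bounding-box heights and centroid rows provide the samples passed to estimate_height_position_model.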
Thanks to the calibration model, the system has concrete information about the relationship between the estimated real height of a single person crossing the scene and the generated region of motion. In other words, it is able to estimate the real height of a person and the height of the generated region of motion just from the position of its centroid (taking into account the three different zones z1, z2, z3). Thanks to this information it is also possible to determine whether the head of the subject is inside the screen and where it is. For example, once a scene is properly calibrated, and assuming that a detected incoming passerby is walking normally and approaching the camera, the system knows where the face should be. As described in more detail in the following sections, the calibration enables the optimization of the segmentation and tracking algorithms.
3.3. Segmentation algorithm using the proposed calibration

As explained graphically in Fig. 4, the position of the camera influences the dimensions of a person across the image, so the segmentation algorithm should be able to estimate the position of people in the image just from the information of the centroids of their regions of motion. This knowledge is possible thanks to the calibration model. However, this segmentation algorithm is designed to segment blobs assuming that they are generated by several passersby of a standard size (determined from the calibration). In other words, the algorithm will ignore generated blobs with an implausible size (e.g. blobs generated by animals, objects, etc.).

The developed algorithm individually examines every region of motion and takes the corresponding decision according to the size and/or characteristics of the region until an appropriate segmentation is carried out. A simplified flowchart is shown in Fig. 8. The segmentation algorithm is explained graphically with an example in which five passersby have different sizes due to their positions and the perspective projection. Fig. 9 shows the process of the algorithm for a binary image generated by five people. The first step is to analyze every region of motion F1(i) in order to split off the oversized blobs by comparing the background with the frame (just for the analyzed region) and further thresholding using Otsu's method (Otsu, 1979). If, after this first segmentation, the (sub-)region F2(j) is still oversized, local maxima are searched for in its vertical projection in order to estimate the positions of people (Ma et al., 2008; Zhang and Chen, 2007). For every estimated person, an appropriate ellipse is added to the corresponding output image B2.
Fig. 9. Process of the segmentation algorithm with a given example.
Fig. 10. Detecting subjects by local maximums in vertical projection of F2.
Fig. 8. Simplified flowchart of segmentation algorithm.
Since the local maxima do not always indicate every person in the blob (as happens in the given example), the image M2 is created by subtracting B2 from F2 (M2 = F2 − B2) in order to indicate the persons not yet detected. This process is shown in Fig. 10. As can be seen in the real results in Fig. 11, thanks to the information of the calibration model the algorithm is able to achieve accurate results even in complicated situations.
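To illustrate the person-estimation step on an oversized region F2, the sketch below searches for local maxima in the vertical projection of a binary blob and keeps peaks separated by roughly one calibrated person width. The separation rule, the 0.8 factor and the function name are our own assumptions, not the paper's exact procedure.

```python
# Sketch of locating individual people inside an oversized blob F2 by
# searching local maxima in its vertical projection (column sums), as in
# Fig. 10. The expected person width would come from the calibration model;
# names and the peak-separation rule are illustrative assumptions.
import numpy as np

def people_positions_from_projection(blob, expected_width):
    """blob: 2-D binary array of one region of motion (1 = foreground).
    Returns the x positions of the estimated people."""
    projection = blob.sum(axis=0).astype(float)      # vertical projection
    min_sep = max(1, int(expected_width * 0.8))      # peaks closer than this merge
    candidates = [x for x in range(1, len(projection) - 1)
                  if projection[x] >= projection[x - 1]
                  and projection[x] > projection[x + 1]]
    # Keep peaks in decreasing height order, discarding those too close to
    # an already accepted peak.
    peaks = []
    for x in sorted(candidates, key=lambda i: projection[i], reverse=True):
        if all(abs(x - p) >= min_sep for p in peaks):
            peaks.append(x)
    return sorted(peaks)
```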
Fig. 11. Real result of the developed segmentation algorithm: (a) actual processed frame; (b) generated regions of motion and (c) result of segmentation algorithm.
Fig. 13. Example of characterization of the relationship of two relevant centroids.
Fig. 12. Simplified flowchart of the tracking algorithm.
3.4. Tracking algorithm using the proposed calibration

The developed algorithm uses the information of the centroids of the segmented subjects. It seeks to incorporate concepts of cognitive video by using analysis techniques that take into account the information not only from the current frame, but also from the sequence of previous frames. This helps to determine the relationship between the centroids taking previous information into account. It tries to answer questions such as: Where does the centroid come from? How far is the centroid from its precedent? What is the relationship between the current centroids and the previously recognized paths?
Thanks to the calibration, the algorithm is able to estimate in which zone the centroids are (z1, z2 or z3), what the size of the blob should be at its position, and how often and how far apart the centroids corresponding to a person walking normally through the scene should appear. As a result of the segmentation, the tracking module receives the coordinates of the centroid of every segmented person. Basically, the tracking is based on an appropriate registration of every centroid and the connections between them. In other words, the algorithm should register every centroid and determine where it comes from. For every incoming centroid, the main five steps are: search for the relevant centroids, characterize the relationship with them, join the best candidate, assign an identification number (ID) according to the previously connected centroids, and validate the labeling. A simplified flowchart of the tracking algorithm is shown in Fig. 12. The first step for every current centroid (a) is to search for and register the relevant centroids. They are searched for within a variable search radius, defined thanks to the calibration model depending on the position of the current centroid, over an adjustable number of previous frames (according to the processing frame rate). Afterwards, the relationship of every relevant centroid (i) with the current one is characterized as shown in Fig. 13.
Fig. 14. Same scene with two different calibrations due to different camera positions.
The characterization of the relationship G is computed as follows:

G_i = Δ∠_i · P_a + Δdist_i · P_d,   with P_a + P_d = 1    (4)

where Δ∠_i is the relative angle difference and Δdist_i is the relative distance:

Δ∠_i = |α_i − β_i| / 180°,   Δdist_i = r_i / R    (5)
The parameters P_a and P_d are adjustable weighting factors used to give more relevance to the relative angle difference or to the relative distance as appropriate. For the example displayed in Fig. 13, the current centroid (represented with a white point) has two relevant centroids (i = 1 and i = 2). In contrast to centroid 2, the path of centroid 1 together with a forms a straight line, so the relative angle difference has a very small value (α_1 − β_1 ≈ 0). In this example, the algorithm considers that the current centroid comes from centroid 1. Once the characterization G is computed, the algorithm joins the current centroid a with the best relevant one and assigns a corresponding identification number (ID), so that every centroid from the same path has the same ID. To assign this ID, the algorithm takes into account how many times (N) the current centroid a has been joined with a preceding centroid e (see the flowchart in Fig. 12). The centroid a is labeled with ID = 0 if it has been joined fewer than N times (e.g. if a new subject appears in the scene or there is a mistake in the segmentation process). A path of centroids is considered a new path only if it appears N times, in which case the algorithm assigns a new ID. If the centroid has been joined to the path more than N times, the same ID is assigned to the incoming centroid because it corresponds to the same path. This assignment uses the variable N in order to avoid errors from isolated centroids generated in previous steps (for example, motion detection or subject segmentation). The final step of the tracking algorithm is the validation of the assignments in order to handle errors (possible crossings of people, complex overlapping, etc.). Thanks to these techniques, the algorithm is able to track the paths of several subjects independently of their directions or of how they change their speed (e.g. in case they stay in the same position for a relatively short time, provided that the motion detection does not consider the subject as background). If two subjects cross their paths normally through the scene, they will be tracked properly thanks to the characterization of the relationship between the centroids (see Fig. 17). Erroneous assignments are thus avoided provided that the passersby behave normally.
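The sketch below reproduces the characterization of Eqs. (4) and (5) for a set of relevant centroids and selects the best precedent (lowest G). The weights, the search radius R, the geometric interpretation of α_i and β_i, and all names are our own assumptions, not the authors' exact code.

```python
# Sketch of Eqs. (4)-(5): characterise the relationship between the current
# centroid and each relevant precedent centroid, then keep the best (lowest G).
# Weights Pa/Pd, the search radius R and all names are illustrative assumptions.
import math

def relationship(current, candidate, prev_direction_deg, R, Pa=0.5, Pd=0.5):
    """G_i = delta_angle_i * Pa + delta_dist_i * Pd, with Pa + Pd = 1.
    prev_direction_deg: direction (beta_i) of the candidate's previous path."""
    dx = current[0] - candidate[0]
    dy = current[1] - candidate[1]
    r_i = math.hypot(dx, dy)
    alpha_i = math.degrees(math.atan2(dy, dx))       # direction candidate -> current
    diff = abs(alpha_i - prev_direction_deg) % 360.0
    diff = min(diff, 360.0 - diff)                   # smallest angle difference
    delta_angle = diff / 180.0                       # Eq. (5), normalised to [0, 1]
    delta_dist = r_i / R                             # Eq. (5)
    return delta_angle * Pa + delta_dist * Pd        # Eq. (4)

def best_precedent(current, candidates, R):
    """candidates: list of (centroid, previous_path_direction_deg) tuples."""
    scored = [(relationship(current, c, beta, R), i)
              for i, (c, beta) in enumerate(candidates)]
    return min(scored)[1] if scored else None
```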
Fig. 15. Examples of real height-position curves.
4. Results

As mentioned in previous sections, the aim of the proposed calibration is to enable the system to adjust automatically the way it works in different scenes with different positions of the camera. In this manner, it helps the system to achieve the appropriate segmentation and tracking of the people passing through the scene. Taking this into account, it is suitable to distinguish two concepts with regard to the results. On the one hand, there are the results of the proposed automatic calibration itself, in which the
height-position curve and its parameters are automatically estimated for the corresponding scene and position of the camera. On the other hand, it is interesting to mention the results of the developed system using the proposed calibration. Although both concepts are strongly linked, they can also be seen from different points of view and considered separately.
Fig. 16. Results view: (a) main processed frames; (b) result of the segmentation algorithm; (c) result of the tracking algorithm and (d) registration of people’s faces with face detection algorithm.
In this manner, the approaches of the proposed automatic calibration could also be implemented in other systems that work on similar concepts. Nevertheless, it could also be interesting to consider the possibility of combining the proposed approach with different types of systems and/or techniques (e.g. multi-view systems, machine learning algorithms, etc.). As an example of the proposed calibration results, Figs. 14 and 15 illustrate two real calibrations (cases (a) and (b)) for the same scene but changing the position of the camera. This example helps to understand graphically the relevance of the calibration process depending on the perspective and the scene conditions. On the one hand, Fig. 14 shows both backgrounds with their corresponding horizontal lines indicating the positions of the points A and B on the screen. On the other hand, Fig. 15 shows both corresponding real height-position curves (a) and (b). It is easy to perceive that the perspective is stronger in example (a) than in (b), so that the distance between points A and B is also larger. In order to understand the relationship between different camera positions and the image of the scene, it is appropriate to consider some limit situations with regard to the possible values of A and B.
Fig. 17. Result example: four subjects through the scenario.
Fig. 18. Result example: two subjects crossing their paths.
In this sense, the difference between both points tends to 0 in the limit situation without perspective, in which the position of the camera would produce almost a view from the top (the points A and B would have close values). However, the point A would tend to 0 in the limit situation in which the vanishing point coincides with the first line of the screen (the top). As explained in detail below, the approaches of the proposed automatic calibration have been tested in a monocular system based on background subtraction, segmentation and tracking of objects. The improvement in the way the system works due to the calibration can be described graphically with excerpts of sequences of real results. In order to show the whole working process, the results are shown not just as a processed frame, but as described in Fig. 16. This way of representing the results makes it possible to understand the results of the different modules of the system that use the calibration model. Note that the images of the faces are presented in (d) only if the centroid of the subject is in zone z2 or z3. In the given examples the passersby are approaching the camera (coming from the top to the bottom of the screen), so the direction is considered positive and consequently the images of the faces are displayed. In case some passersby are walking away (the direction is negative), they are processed normally but the images of the faces are not shown. Fig. 17 illustrates an excerpt of a sequence with four subjects crossing the scene. Thanks to the proposed calibration, the system knows whether the subjects are totally inside the screen or not (and
therefore it is possible to estimate the position of the faces of the subjects, provided that they are walking normally), so the face images are registered only when the faces are inside the screen (from the 3rd to the 6th frames shown). In the 4th and 5th frames, the influence of the segmentation algorithm is perceptible due to the inserted ellipses. Fig. 18 shows an excerpt of a sequence in which two subjects cross their paths (observe the result of the tracking algorithm and the assigned IDs). Note that in the 4th frame there is a total occlusion between them, so that just one blob is processed by the segmentation algorithm. Nevertheless, once the blobs are separated (5th frame shown), the tracking algorithm is able to assign the correct IDs to the subjects taking into account their previous paths. The proposed automatic calibration has been tested in different environments with different positions of the camera. Some processed and corresponding original video sequences are available for public access (http://downloaddpdsvideos.idetic.eu, accessed 25.11.11).
5. Conclusion

This paper presents an automatic calibration method for people detection and tracking systems. This calibration aims at the
improvement of the main modules of the system in order to optimize the results in complex situations or scenarios. The calibration was implemented in the described detecting and tracking people system. The proposed calibration is based on the automatic processing of several calibration video sequences of a given scene in order to characterize it, generating the calibration model from the determined height-position curve and its parameters. This is achieved through the analysis of the relationship between the real height of the subjects and their generated regions of motion inside the screen. Thanks to this calibration, the system is able to estimate the real size of a person according to the position of the centroid of its region of motion. Consequently, the position of the face can be estimated, so that it is possible to take advantage of this knowledge to use the facial images of the detected subjects (e.g. face detection, biometric facial recognition). Furthermore, the results of the tracking module could be used for further analysis of the detected subjects' paths (speed of passersby, abnormal behavior, etc.).

References

Avitzour, D., 2004. Novel scene calibration procedure for video surveillance systems. IEEE Trans. Aerosp. Electron. Syst. 40 (3), 1105–1110.
Barron, J.L., Fleet, D.J., Beauchemin, S., 1994. Performance of optic flow techniques. Int. J. Comput. Vision 12 (1), 43–77.
Benezeth, Y., Jodoin, P.M., Emile, B., Laurent, H., Rosenberger, C., 2008. Review and evaluation of commonly-implemented background subtraction algorithms. In: Proceedings of the 19th International Conference on Pattern Recognition.
Black, M., Jepson, A., 1998. Eigentracking: robust matching and tracking of articulated objects using a view-based representation. Int. J. Comput. Vision 26 (1), 63–84.
Bouthemy, P., Lalande, P., 1993. Recovery of moving object masks in an image sequence using local spatiotemporal contextual information. Opt. Eng. 32 (6), 1205–1212.
Chengbin, Z., Huadong, M., 2010. Robust head–shoulder detection by PCA-based multilevel HOG–LBP detector for people counting. In: Proceedings of the 20th International Conference on Pattern Recognition (ICPR), pp. 2069–2072.
Cole, L., Austin, D., Cole, L., 2004. Visual object recognition using template matching. In: Proceedings of the Australasian Conference on Robotics and Automation.
Comaniciu, D., Ramesh, V., Meer, P., 2003. Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25 (5), 564–577.
Cui, X., Liu, Y., Shan, S., Chen, X., Gao, W., 2007. 3D Haar-like features for pedestrian detection. In: Proceedings of the International Conference on Multimedia & Expo (ICME), pp. 1263–1266.
Dawson-Howe, K.M., 1996. Active surveillance using dynamic background subtraction. Technical Report TCD-CS-96-06, Trinity College.
Dittrich, F., Koerich, A.L., 2012. People counting in crowded scenes using multiple cameras. In: Proceedings of the 19th International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 138–141.
Du, F., Brady, M., 1993. Self calibration of the intrinsic parameters of cameras for active vision systems. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, New York, pp. 477–482.
Faugeras, O., 1995. Stratification of three-dimensional vision: projective, affine, and metric representations. J. Opt. Soc. Am. A 12 (3), 465–484.
Harasse, S., Bonnaud, L., Desvignes, M., 2006. Human model for people detection in dynamic scenes. Comput. Vis. Pattern Recognition, 335–338.
Haritaoglu, I., Harwood, D., Davis, L.S., 1998. Ghost: a human body part labelling system using silhouettes. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 77–82.
Haritaoglu, I., Harwood, D., Davis, L.S., 2000. W4: real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Mach. Intell. 22 (8), 809–830.
Hessein, M., Abd-Almageed, W., Ran, Y., Davis, L.S., 2006. Real-time human detection, tracking, and verification in uncontrolled camera motion environments. In: Proceedings of the International Conference on Vision Systems (ICVS), pp. 41–47.
http://downloaddpdsvideos.idetic.eu (accessed 25.11.11).
Isard, M., Blake, A., 1998. CONDENSATION—conditional density propagation for visual tracking. Int. J. Comput. Vision 29 (1), 5–28.
Jih-Gau, J., 1997. Parameter estimation in the three-point perspective projection problem in computer vision. In: Proceedings of the IEEE International Symposium on Industrial Electronics, vol. 3, pp. 1065–1070.
Koenig, N.T., 2007. Real-time human detection and tracking in diverse environments. In: Proceedings of the International Conference on Development and Learning (ICDL), pp. 324–328.
Kuhn, H., 1955. The Hungarian method for solving the assignment problem. Nav. Res. Logistics Quart., 83–87.
Lee, J.H., Kim, Y., Kim, B.K., Ohba, K., Kawata, H., Ohya, A., Yuta, S., 2007. Security door system using human tracking method with laser range finders. In: Proceedings of the 2007 IEEE International Conference on Mechatronics and Automation (ICMA2007), pp. 2060–2065.
Ma, H., Lu, H., Zhang, M., 2008. A real-time effective system for tracking passing people using a single camera. In: Proceedings of the 7th World Congress on Intelligent Control and Automation, pp. 6173–6177.
Muñoz-Salinas, R., Medina-Carnicer, R., Madrid-Cuevas, F.J., Carmona-Poyato, A., 2008. Depth silhouettes for gesture recognition. Pattern Recognition Lett. 29, 319–329.
Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybernetics 9 (1), 62–66.
Spagnolo, P., Leo, M., D'Orazio, T., Distante, A., 2004. Robust moving objects segmentation by background subtraction. In: Proceedings of the 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2004). Available from: http://www.img.lx.it.pt/WIAMIS2004/.
Ranchin, F., Dibos, F., 2004. Moving objects segmentation using optical flow estimation. In: Proceedings of the Workshop on Mathematics and Image Analysis.
Saykol, E., Gudukbay, U., Ulusoy, O., 2010. Scenario-based query processing for video surveillance archives. Eng. Appl. Artif. Intell. 23 (3), 331–345.
Schweitzer, H., Bell, J.W., Wu, F., 2002. Very fast template matching. In: Proceedings of the 7th European Conference on Computer Vision, Part IV.
Shlens, J., 2005. Tutorial on principal component analysis. Systems Neurobiology Laboratory.
Snidaro, L., Micheloni, C., Chiavedale, C., 2005. Video security for ambient intelligence. IEEE Trans. Syst. Man Cybernetics A 35, 133–144.
Sprague, N., Luo, J., 2002. Clothed people detection in still images. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 585–589.
Tanizaki, H., 1987. Non-Gaussian state-space modeling of nonstationary time series. J. Am. Stat. Assoc. 82, 1032–1063.
Wang, W., Yang, J., Gao, W., 2008. Modeling background and segmenting moving objects from compressed video. IEEE Trans. Circ. Syst. Video Technol. 18 (5), 670–681.
Wu, B., Nevatia, R., 2005. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In: Proceedings of the International Conference on Computer Vision (ICCV), pp. 90–97.
Xi, Z., Delleandrea, E., Liming, C., 2009. A people counting system based on face detection and tracking in a video. In: Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 67–72.
Yilmaz, A., Javed, O.M., Shah, M., 2006. Object tracking: a survey. ACM Comput. Surv. 38 (4), Article 13. http://doi.acm.org/10.1145/1177352.1177355.
Zhang, E., Chen, F., 2007. A fast and robust people counting method in video surveillance. In: Proceedings of the International Conference on Computational Intelligence and Security, pp. 339–343.
Zhou, J., Hoang, J., 2005. Real time robust human detection and tracking system. Comput. Vis. Pattern Recogn., 149–156.