TSINGHUA SCIENCE AND TECHNOLOGY
ISSN 1007-0214 02/19 pp423-433
Volume 14, Number 4, August 2009
Spatially Adaptive Subsampling for Motion Detection*

Charley Paulus, ZHANG Yujin**

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Abstract: Many video surveillance applications rely on efficient motion detection. However, the algorithms are usually costly since they compute a background model at every pixel of the frame. This paper shows that, in the case of a planar scene with a fixed calibrated camera, a set of pixels can be selected for computing the background model while the other pixels are ignored, for accurate but less costly motion detection. The calibration is used first to define a volume of interest in the real world and to project it onto the image, and then to define a spatially adaptive subsampling of this region of interest whose density depends on the distance to the camera; indeed, farther objects need to be analyzed with more precision than closer objects. Tests on many video sequences integrated this adaptive subsampling into various motion detection techniques.

Key words: motion detection; background modeling; adaptive subsampling; calibration
Introduction

Most motion detection techniques in video sequences are based on background subtraction[1]. The current frame of the sequence is compared to a model of the background scene. Pixels that are significantly different from the model are marked as foreground. Moving objects are then extracted as connected components of these foreground pixels.

A basic way to model the background is to compute the average of the n previous frames[2]. The memory requirement can be reduced by modeling the background as the running average of the sequence. A more complex technique proposed by Wren et al.[3] models each pixel of the background as a Gaussian distribution.

Received: 2008-05-27; revised: 2008-10-27
* Supported by the National Natural Science Foundation of China (No. 60872084) and the Specialized Research Fund for the Doctoral Program of Higher Education of MOE, China (No. 20060003102)
** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-62781430
For a given pixel, if the difference between its current value and its mean is bigger than a certain multiple of its standard deviation, the pixel is marked as foreground. However, these two techniques are very sensitive to illumination variations, which leads to unstable behavior, especially if the analyzed scene is outdoors with many changes in the lighting conditions.

To cope with illumination variations, Stauffer and Grimson proposed a technique using a mixture of Gaussians[4]. The scene is modeled by a multimodal Gaussian distribution, with each mode (usually 3 to 5 modes) associated with a certain weight. At each frame, the weights are updated depending on how well the modes match the current frame. The modes are ranked according to their weights, with the first mode chosen as the background. This technique is popular in motion detection since it is robust in scenes with illumination variations, but it is computationally costly since several Gaussian distributions have to be updated and ranked for every pixel of every frame. Achieving real-time performance is therefore very difficult when working on sequences with large images.

In these algorithms, the background is modeled
individually for each pixel. This paper focuses on such techniques rather than on techniques that model the background globally, such as the eigenbackgrounds approach[5,6].

This paper presents a method that uses a short preprocessing step, after the camera calibration and before the motion detection, which significantly lightens the background modeling. The goal is to reduce the number of pixels to be monitored during motion detection with the previously presented techniques, while still modeling the background accurately, by defining a selected set of pixels where the background model is computed. Pixels that do not belong to this subsample are ignored during the background modeling.

One way to select a subset of pixels is to subsample the image in a regular manner[7], for example by computing the background model at every other pixel. A more advanced method is to use predefined subsampling patterns[8], but this still leads to analysis of unnecessary pixels. The current method uses the camera calibration data to compute an area of interest on the image and then reduces the number of pixels even more by computing a spatially adaptive subsampling of this area.

The computation of this adaptive subsampling is based on the camera calibration performed using the Tsai method[9,10]. Here, the monitored area is a planar surface, so the result is a two-way correspondence between the 3-D world coordinates and the 2-D image coordinates. The current method first draws a polygon of interest on the floor in the 3-D world. Giving a height to this polygon defines the volume of interest. The projection of this volume on the image (the area of interest) is used to compute the background model. Within this area, a spatially adaptive subsampling is defined, keeping in mind that the motion detection for objects at the front of the scene needs less precision than for objects at the back of the scene.
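For reference, the kind of per-pixel model this paper aims to lighten can be sketched as follows: a minimal single-Gaussian background subtractor in the spirit of Ref. [3], written with OpenCV. This is an illustrative sketch, not the paper's code; the learning rate alpha and threshold k are illustrative values.

// Minimal per-pixel single-Gaussian background model (illustrative sketch).
#include <opencv2/opencv.hpp>

class GaussianBackground {
public:
    GaussianBackground(double alpha = 0.01, double k = 2.5)
        : alpha_(alpha), k_(k) {}

    // Returns a binary foreground mask (255 = foreground) for a grayscale frame.
    cv::Mat apply(const cv::Mat& gray) {
        cv::Mat f;
        gray.convertTo(f, CV_32F);
        if (mean_.empty()) {                      // first frame initializes the model
            mean_ = f.clone();
            var_  = cv::Mat(f.size(), CV_32F, cv::Scalar(25.0f));
        }
        cv::Mat diff   = f - mean_;
        cv::Mat dist2  = diff.mul(diff);
        cv::Mat thresh = k_ * k_ * var_;
        cv::Mat fg     = dist2 > thresh;          // |f - mean| > k * sigma, per pixel
        // Running (exponential) update of the mean and variance.
        mean_ = (1.0 - alpha_) * mean_ + alpha_ * f;
        var_  = (1.0 - alpha_) * var_  + alpha_ * dist2;
        return fg;
    }

private:
    double alpha_, k_;
    cv::Mat mean_, var_;
};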
1 Spatially Adaptive Subsampling

1.1 Volume of interest
The first step for reducing the number of monitored pixels during the motion detection is to restrict the area of interest on the image. Usually when monitoring a planar area, the camera is set so that the whole area is included in the field of view. Thus, the camera also
captures parts that do not belong to this area, which can be ignored when computing the background model. Defining the volume of interest requires the assumption that the objects to be detected are in contact with the floor, or close to it (people, vehicles, abandoned objects, etc.).

With this assumption, a polygon of interest P on the planar surface in the real world is first defined. People walking (or objects moving) on this polygon P will be monitored by the motion detection. People or objects outside of P can be ignored (they might still be detected, but what matters is that the desired objects are). P is characterized by an ordered list of coplanar vertices, defined by selecting their corresponding pixels on the image; the vertices of P are conventionally listed counterclockwise. Giving a height to P defines the volume of interest V. The height is set according to the requirements: to detect people or cars, a height of 2 m to 3 m should be sufficient.

The projection of V on the image defines the region of interest R. Therefore, the projection of any object inside the volume of interest V will be entirely included in the region of interest R, and all the pixels lying outside of R will be ignored during the motion detection. As mentioned previously, objects outside of the volume of interest might also be detected; for example, a bird flying above V may have its projection inside R and thus be detected. The goal here is not to restrict the motion detection to a volume in the 3-D world, but to reduce the number of monitored pixels.

The faces of the volume V are the bottom face (P), the top face (parallel to P, lying at the chosen height), and the vertical faces F1, ..., Fn, where n is
the number of vertices of P. The faces can be divided into two sets: the front faces, composed of the top face (the camera should be higher than the top face) and the vertical faces that show their outside surface to the camera; and the back faces, composed of the bottom face (P) and the vertical faces that show their inside surface to the camera, as shown in Fig. 1. On the image, the region of interest R is totally covered by the projection of the front faces, and equally by the projection of the back faces. For accuracy reasons, the back
faces should be used to compute the spatially adaptive subsampling, since back faces are farther from the camera than front faces. As noted later, scanning an area that is far from the camera yields a higher subsampling density and thus more accurate motion detection.
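For illustration, rasterizing R can be sketched as follows, assuming a convex polygon of interest (so that R is the convex hull of the projected vertices of V); project() is an assumed helper wrapping the calibrated 3-D to 2-D mapping p(X, Y, Z), not part of the paper's published code.

// Sketch: rasterize the region of interest R as a binary mask, assuming the
// polygon of interest is convex. project() is an assumed calibration helper.
#include <opencv2/opencv.hpp>
#include <functional>
#include <vector>

using Project = std::function<cv::Point(double X, double Y, double Z)>;

cv::Mat regionOfInterest(const std::vector<cv::Point2d>& polygon, // vertices of P
                         double height,                           // height of V (m)
                         const Project& project, cv::Size imageSize) {
    std::vector<cv::Point> pts;
    for (const auto& v : polygon) {
        pts.push_back(project(v.x, v.y, 0.0));       // bottom face vertices
        pts.push_back(project(v.x, v.y, height));    // top face vertices
    }
    std::vector<cv::Point> hull;
    cv::convexHull(pts, hull);                       // R = hull of all projections
    cv::Mat mask = cv::Mat::zeros(imageSize, CV_8U);
    cv::fillConvexPoly(mask, hull, cv::Scalar(255)); // pixels of R set to 255
    return mask;
}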
Fig. 1 Polygon of interest P and corresponding volume of interest V
1.2 Computing the adaptive subsampling
After reducing the number of pixels to be monitored by defining the region of interest R on the image, this number is reduced even more by extracting a subset of pixels within R. For that purpose, a subsampling of R is determined in such a way that the motion detection, computed at these pixels only, will still be accurate. The motion detection does not need to be computed at every pixel: the motion of objects located at the back of the scene needs to be analyzed at full precision, i.e., at every pixel, but the motion of objects located at the front of the scene, which appear bigger in the image, can be analyzed at only a subset of pixels. The subsampling density of each area on the image should therefore depend on the distance between the camera and the corresponding 3-D points.

This spatially adaptive subsampling is computed by first defining a step distance d (in meters) in the real world coordinates. Then, the back faces of V (as described later) are scanned, and every d meters in both scanning directions, the corresponding pixel on the image is marked. In areas far from the camera, two points a distance of d apart appear very close to each other on the image, which results in a high subsampling density. On the other hand, in an area that is close to the camera, two points a distance of d apart appear far from each other on the image, which results in a low subsampling density, as shown in Fig. 2. The distance d can be chosen directly, or derived from the lowest subsampling density needed for the front of the scene by using the 2-D to 3-D correspondence obtained in the calibration to compute the corresponding step distance.

Fig. 2 Spatially adaptive subsampling
If a person stands on the polygon of interest P, only his feet, in contact with P, will be analyzed with the desired subsampling density. His head will be projected higher in the region of interest R, where the subsampling density is higher, in an area that may belong to the projection of P or to the projection of a vertical back face. This is not a problem here: the adaptive subsampling is computed offline and does not take moving objects into account, and the accuracy of the motion detection is not affected, since every object will be analyzed with at least the desired precision while fewer pixels are monitored. In a post-processing step, moving objects could be detected using the same subsampling density from bottom to top, but this is not the objective of this paper.

An object should be detected with the appropriate subsampling density. Therefore, if the projection of a vertical back face overlaps with the projection of P or with the projection of another vertical back face, the adaptive subsampling should be computed accordingly. Denote the projection of P on the image as P' and the projection of the back face Fi as Fi'. If P' and Fi' overlap, a pixel in this overlapping area may represent a 3-D point that lies on a line of sight between Fi and P. Since P is farther from the camera than Fi, the subsampling density of P' should be used to guarantee the maximum precision. In the same way, if the projections Fi' and Fj' of two back faces overlap,
the subsampling density of the farthest back face from the camera is used, as shown in Fig. 3.
The spatially adaptive subsampling is computed by first scanning P and then the back faces one after another, starting with the farthest one from the camera and ending with the closest one. The scan keeps track of which areas have already been sampled. Thus, if a part of a back face overlaps with a part of R that has already been sampled, it is not resampled; the first subsampling is kept. Algorithm 1 computes the adaptive subsampling, with the function p(X, Y, Z) giving the pixel corresponding to the 3-D point (X, Y, Z).

Algorithm 1 Spatially adaptive subsampling algorithm
  Initialize adaptiveSampling // empty frame
  Initialize alreadyScannedArea // empty frame
  // Scan P
  Set Xmin, Ymin, Xmax, and Ymax as the lower and upper bounds of P
  for X = Xmin to Xmax step d do
    for Y = Ymin to Ymax step d do
      if p(X, Y, 0) ∈ P then
        Add p(X, Y, 0) to adaptiveSampling
      end if
    end for
  end for
  Add P to alreadyScannedArea
  // Scan vertical back faces
  Hmax ← height of V
  m ← number of vertical back faces
  Sort the vertical back faces so that F1 is the farthest and Fm is the closest to the camera
  for i = 1 to m do
    for point A running along the bottom edge of Fi by step d do
      for Z = 0 to Hmax step d do
        if p(XA, YA, Z) ∉ alreadyScannedArea then
          Add p(XA, YA, Z) to adaptiveSampling
        end if
      end for
    end for
    Add Fi to alreadyScannedArea
  end for
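A C++ transcription of Algorithm 1 could look as follows. This is a sketch only: project() wraps the calibrated mapping p(X, Y, Z), and the back faces are assumed to be supplied as bottom edges already sorted from the farthest to the closest to the camera; all names are illustrative.

// Illustrative C++/OpenCV sketch of Algorithm 1.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

using Project = std::function<cv::Point(double X, double Y, double Z)>;
struct Edge { cv::Point2d a, b; };            // bottom edge of a vertical back face

static void cover(cv::Mat& img, const std::vector<cv::Point>& poly) {
    cv::fillPoly(img, std::vector<std::vector<cv::Point>>{poly}, cv::Scalar(255));
}

cv::Mat adaptiveSubsampling(const std::vector<cv::Point2f>& P,   // vertices of P
                            const std::vector<Edge>& backFaces,  // sorted far -> near
                            double hMax, double d,               // height of V, step (m)
                            const Project& project, cv::Size size) {
    cv::Mat sampling = cv::Mat::zeros(size, CV_8U);   // adaptiveSampling
    cv::Mat scanned  = cv::Mat::zeros(size, CV_8U);   // alreadyScannedArea
    cv::Rect bounds(0, 0, size.width, size.height);

    // Scan P: walk its world bounding box with step d, keep points inside P.
    double xMin = P[0].x, xMax = P[0].x, yMin = P[0].y, yMax = P[0].y;
    std::vector<cv::Point> imgP;
    for (const auto& v : P) {
        xMin = std::min(xMin, (double)v.x); xMax = std::max(xMax, (double)v.x);
        yMin = std::min(yMin, (double)v.y); yMax = std::max(yMax, (double)v.y);
        imgP.push_back(project(v.x, v.y, 0.0));
    }
    for (double X = xMin; X <= xMax; X += d)
        for (double Y = yMin; Y <= yMax; Y += d)
            if (cv::pointPolygonTest(P, cv::Point2f((float)X, (float)Y), false) >= 0) {
                cv::Point q = project(X, Y, 0.0);
                if (q.inside(bounds)) sampling.at<uchar>(q) = 255;
            }
    cover(scanned, imgP);                             // add P to alreadyScannedArea

    // Scan the vertical back faces, from the farthest to the closest.
    for (const Edge& e : backFaces) {
        double len = std::hypot(e.b.x - e.a.x, e.b.y - e.a.y);
        if (len <= 0.0) continue;
        for (double t = 0.0; t <= len; t += d) {
            cv::Point2d A = e.a + (t / len) * (e.b - e.a);
            for (double Z = 0.0; Z <= hMax; Z += d) {
                cv::Point q = project(A.x, A.y, Z);
                if (q.inside(bounds) && scanned.at<uchar>(q) == 0)
                    sampling.at<uchar>(q) = 255;      // keep the first subsampling
            }
        }
        cover(scanned, {project(e.a.x, e.a.y, 0.0), project(e.b.x, e.b.y, 0.0),
                        project(e.b.x, e.b.y, hMax), project(e.a.x, e.a.y, hMax)});
    }
    return sampling;
}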
Fig. 3 Volume of interest and corresponding spatially adaptive subsampling with overlapping back faces

2 Integrated Motion Detection
2.1 Subsampling mask
An easy yet efficient approach for integrating the adaptive subsampling into a motion detection algorithm is to use the subsampling region as a mask. For each frame of the video stream to be analyzed, all the pixels that do not belong to the subsampling region are first masked by setting their values to 0, which gives a masked frame. This masked frame is then input into the motion detection algorithm. Pixels that do not belong to the subsampling are still processed, but since their value is always 0, the computation is fast. For example, motion detection based on a Gaussian distribution computes the mean and the variance of each pixel; for pixels outside the subsampling, the mean and the variance are always 0, which is easy to compute and update. In an algorithm based on the mixture of Gaussians, such a pixel is attributed to only one Gaussian distribution (of mean and variance 0), so the algorithm does not need to sort several distributions for this pixel, which saves time. Figure 4 shows the integration of spatially adaptive subsampling for motion detection using the subsampling region as a mask.
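As an illustration, a sketch of the mask-based integration, with OpenCV's built-in MOG2 background subtractor standing in for the mixture-of-Gaussians model of Ref. [4]; 'sampling' is assumed to be the CV_8U mask computed offline.

// Sketch: run a mixture-of-Gaussians detector on masked frames.
#include <opencv2/opencv.hpp>

void runMasked(cv::VideoCapture& cap, const cv::Mat& sampling) {
    cv::Ptr<cv::BackgroundSubtractorMOG2> mog = cv::createBackgroundSubtractorMOG2();
    cv::Mat frame, masked, fg;
    while (cap.read(frame)) {
        masked = cv::Mat::zeros(frame.size(), frame.type());
        frame.copyTo(masked, sampling);  // pixels outside the subsampling stay 0
        mog->apply(masked, fg);          // constant-0 pixels are cheap to model
        fg &= sampling;                  // keep only the monitored pixels
        cv::imshow("foreground", fg);
        if (cv::waitKey(1) == 27) break; // Esc quits
    }
}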
Fig. 4 Integration of spatially adaptive subsampling for motion detection using the subsampling region as a mask
2.2 1-D array of pointers to pixels
Although the use of the adaptive subsampling region as a mask on the input image reduces the computations, a faster approach is to ignore the pixels outside of the subsampling region completely. For that purpose, pointers to the sampled pixels are stored in a 1-D array and the background is modeled only at these pixels. Thus, one frame of the analyzed video stream is processed through a single "for" loop, which implies that all the data structures used in the motion detection algorithm also need to be 1-D arrays. In an algorithm that uses Gaussian distributions, for instance, the means and variances of the pixels of each frame should also be stored as 1-D arrays. A 1-D array is also used to represent the foreground (an array that contains 1 if a pixel belongs to a moving object and 0 if it belongs to the background). These values are put back into a frame structure only when the foreground is displayed. Figure 5 shows the integration of spatially adaptive subsampling in motion detection when using the subsampling as a 1-D array.

Fig. 5 Integration of spatially adaptive subsampling for motion detection using the subsampling as a 1-D array
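A sketch of this organization, with one Gaussian per sampled pixel kept in flat arrays (illustrative names; a single-Gaussian model stands in for whichever per-pixel technique is used):

// Sketch of the 1-D-array integration: per-pixel Gaussian state in flat
// arrays, visiting the sampled pixels only.
#include <opencv2/opencv.hpp>
#include <cstddef>
#include <vector>

struct SparseGaussianModel {
    std::vector<int>   idx;          // flat indices of the sampled pixels
    std::vector<float> mean, var;    // one Gaussian per sampled pixel
    std::vector<uchar> fg;           // 1 = foreground, 0 = background

    explicit SparseGaussianModel(const cv::Mat& sampling) {  // CV_8U mask
        CV_Assert(sampling.isContinuous() && sampling.type() == CV_8U);
        for (int i = 0; i < sampling.rows * sampling.cols; ++i)
            if (sampling.data[i]) idx.push_back(i);
        mean.assign(idx.size(), 0.f);    // could be seeded from the first frame
        var.assign(idx.size(), 25.f);
        fg.assign(idx.size(), 0);
    }

    // One "for" loop per frame, over the sampled pixels only.
    void apply(const cv::Mat& gray, float alpha = 0.01f, float k = 2.5f) {
        CV_Assert(gray.isContinuous() && gray.type() == CV_8U);
        for (std::size_t j = 0; j < idx.size(); ++j) {
            float v = gray.data[idx[j]];
            float d = v - mean[j];
            fg[j] = (d * d > k * k * var[j]) ? 1 : 0;
            mean[j] += alpha * d;                  // running updates
            var[j]  += alpha * (d * d - var[j]);
        }
    }

    // Scatter the 1-D foreground back into a frame only for display.
    cv::Mat render(cv::Size size) const {
        cv::Mat out = cv::Mat::zeros(size, CV_8U);
        for (std::size_t j = 0; j < idx.size(); ++j)
            if (fg[j]) out.data[idx[j]] = 255;
        return out;
    }
};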
2.3 Adding a cell to each pixel of the subsampling
The goal of background modeling is to detect not only moving pixels, but also moving objects. Once pixels are marked as belonging to the foreground, they should be grouped into connected components. The adaptive subsampling set is sparse, i.e., most of its pixels are not connected, so a moving object detected using this subsampling will also be a group of unconnected pixels. One way to connect the pixels of each object is to apply a dilation to the image; however, this would not connect pixels that are too sparse. Therefore, a cell is associated with each pixel of the adaptive subsampling, defined as a group of contiguous pixels around the selected pixel. The pixels of the cells are chosen so that two neighboring cells form a connected group of pixels, without overlapping. Figure 6 shows the subsampling pixels and their associated cells.

Fig. 6 Pixels of the subsampling (white pixels) and their associated cells (gray scales in the figure for visualization)

One way to compute the cells is to make them grow progressively around the pixels of the adaptive subsampling. Thus, the adaptive subsampling set is read entirely several times, with each cell grown by one unit if necessary in each pass. During the first pass, analyze the eight neighboring pixels of the square that surrounds the desired pixel. If these pixels do not belong
to any cell, add them to the cell for this pixel. Pixels that belong to the subsampling or to another cell are ignored. During the second pass, analyze the 16 pixels that form the perimeter of the next larger square and add the free pixels as described previously. Mark a cell as "full" when all the analyzed pixels already belong to another cell or to the subsampling set. Figure 7 shows the first three passes of the algorithm.
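The pass-based growth could be sketched as follows (illustrative: labels(q) holds the index of the sampled pixel whose cell q belongs to, or -1 if q is free, and sampled pixels are assumed pre-labeled with their own index).

// Sketch of the pass-based cell growing.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cstdlib>
#include <vector>

void growCells(cv::Mat_<int>& labels, const std::vector<cv::Point>& samples) {
    std::vector<bool> full(samples.size(), false);
    cv::Rect bounds(0, 0, labels.cols, labels.rows);
    for (int ring = 1;
         !std::all_of(full.begin(), full.end(), [](bool f) { return f; });
         ++ring) {
        for (std::size_t s = 0; s < samples.size(); ++s) {
            if (full[s]) continue;
            bool grew = false;
            // Analyze the perimeter of the (2*ring+1) x (2*ring+1) square.
            for (int dy = -ring; dy <= ring; ++dy)
                for (int dx = -ring; dx <= ring; ++dx) {
                    if (std::max(std::abs(dx), std::abs(dy)) != ring) continue;
                    cv::Point q = samples[s] + cv::Point(dx, dy);
                    if (q.inside(bounds) && labels(q) == -1) {
                        labels(q) = (int)s;   // free pixel: add it to the cell
                        grew = true;
                    }
                }
            if (!grew) full[s] = true;        // everything analyzed is taken
        }
    }
}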
Each cell is stored as an array of extra pixels associated with a pixel of the subsampling set. After running the motion detection on the pixels of the subsampling set only, the detected objects are represented by the detected pixels and their associated cells, so a moving object appears as a group of connected pixels. The quality of the connected components is improved by performing a light morphological operation (an opening, then a closing) after adding the associated cells to the detected pixels. The resulting connected components have smoother contours, and cells due to noise pixels are removed. Figure 8 shows moving objects before and after adding associated cells to the detected pixels.
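In OpenCV terms, the opening-then-closing cleanup mentioned above could be written as follows (a sketch; the kernel size is illustrative):

// Light morphological cleanup (an opening, then a closing) of the foreground
// mask after the associated cells have been added.
#include <opencv2/opencv.hpp>

void cleanForeground(cv::Mat& fg) {
    cv::Mat k = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(3, 3));
    cv::morphologyEx(fg, fg, cv::MORPH_OPEN, k);    // remove isolated noise cells
    cv::morphologyEx(fg, fg, cv::MORPH_CLOSE, k);   // smooth component contours
}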
Fig. 7 Growing cells around pixels of the adaptive subsampling: (a) pass 1; (b) pass 2; (c) pass 3 (in this case all cells are full after three passes)

Fig. 8 Moving objects (a) before and (b) after adding associated cells to the detected pixels
Figure 9 describes the integration of the spatially adaptive subsampling algorithm for motion detection when using the subsampling as a 1-D array and adding cells to the detected foreground pixels.
Fig. 9 Integration of spatially adaptive subsampling for motion detection using the subsampling as a 1-D array and adding cells to the detected foreground pixels
3 Experiments

3.1 Protocol
The algorithm was implemented with Visual C++ and the OpenCV library. A first application handles the calibration, the definition of the volume of interest, and the computation of the spatially adaptive subsampling; a second application handles the motion detection.

Experiments were done on eight sequences of 30 s or 1 min. The frame rate was 25 fps and the image size was 320×240 pixels. The scenes were recorded outdoors on a planar area with mini-DV cameras from different points of view: three from the northeast corner of the area, one from the southeast corner, and four from the southwest corner. The area was quite large (20 m × 20 m), so the distance between the front and the back of the scene was significant. The recorded signal was interlaced, with more or less noise depending on the global illumination, and with significant lens distortion. Some of the eight sequences have only light traffic with isolated people, bikes, and cars (no more than four objects at a time in the frame), while others have heavy traffic with groups of people and overlapping vehicles. Figure 10 shows sample images from the eight sequences with the corresponding volumes of interest.

The calibration was performed using the Tsai algorithm[9]. About 20 control points were defined for each point of view, matching their 3-D and 2-D coordinates. This matching was very convenient since the area was paved with a regular pattern, and the calibration results were quite accurate. Since the area is planar, the calibration provides a two-way 3-D to 2-D correspondence.

After calibration, a polygon of interest P was defined for each point of view. The polygons of interest were quite large, covering the whole planar surface of the scenes. The camera position and orientation are assumed to be set so that the analyzed scene covers a maximum part of the frame. Then, a volume of interest V and a region of interest R were selected on the image for each point of view. Some of the volumes of interest had overlapping back faces.

The main parameter to be varied is the subsampling density of the region of interest R, to determine the minimum density that still provides accurate motion detection.
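Returning to the calibration step: since the monitored area is planar, the two-way correspondence can be illustrated with a homography estimated from such matched control points. This is only a sketch with made-up coordinates; the actual Tsai calibration additionally models lens distortion, which matters here given the reported distortion.

// Sketch: planar world <-> image correspondence via a homography estimated
// from matched control points (the coordinates below are made up).
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<cv::Point2f> world = {{0, 0}, {20, 0}, {20, 20}, {0, 20}};   // m
    std::vector<cv::Point2f> image = {{42, 222}, {298, 214}, {258, 61}, {73, 52}};
    cv::Mat H    = cv::findHomography(world, image);  // world (Z = 0) -> pixels
    cv::Mat Hinv = H.inv();                           // pixels -> world (Z = 0)

    std::vector<cv::Point2f> in = {{5.f, 5.f}}, out;
    cv::perspectiveTransform(in, out, H);             // pixel of world point (5, 5, 0)
    std::cout << "p(5, 5, 0) = " << out[0] << std::endl;
    return 0;
}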
Fig. 10 Sample images from the eight sequences ((a)-(h) correspond to sequences 1-8) with examples of volumes of interest. Sequences 1 to 3 were recorded from the northeast corner of the area, sequence 4 from the southeast corner, and sequences 5 to 8 from the southwest corner.
Therefore, various subsamplings of R were generated, from very dense (with a scanning distance of 2 cm) to very sparse (with a scanning distance of 1 m). The order of magnitude of the scanning distance depends on the scene configuration and can vary from one set of tests to another. Each generated subsampling is best characterized by the ratio of the number of pixels it contains to the total number of pixels of R: the densest subsamplings contained 100% of the pixels of R and the sparsest contained less than 5%. The motion detection was evaluated for each sequence and each subsampling density with the running average algorithm[2], the Gaussian distribution algorithm[3], and the mixture of Gaussians algorithm[4].
Figure 11 shows the subsampled scenes and the corresponding detected objects for various subsamplings in a low traffic scene and a heavy traffic scene.
Fig. 11 Subsampled scenes in the case of (a) low traffic and (b) heavy traffic, and the corresponding detected objects for a subsampling containing 100% (top), 30% (middle), and 6% (bottom) of the pixels in the region of interest R
3.2 Evaluation
The quality of the motion detection was first analyzed qualitatively. Analysis of the detection process for all the sequences with the three algorithms and eight different subsampling densities (from 100% to less than 5% of the pixels in R) showed that the detection remained accurate with subsamplings containing more than 40% of the pixels in R. Below 40%, the cells corresponding to the foreground pixels became too big and the connected components had larger areas than the objects they represented. Moreover, the connected components started to fragment because of bad connections between cells: although the cells of two neighboring pixels in the subsampling are defined to be contiguous, if a pixel of a moving object fails to be detected, its corresponding cell is not displayed, and if the subsampling is too sparse, that missing cell may be the only link between two parts of a connected component, which then fragments.

To evaluate the algorithm more quantitatively, a criterion is needed to rate the motion detection result. Analysis of the qualitative observations showed two significant features: the average detected area of the connected components and the average detected number of connected components. Therefore, the total area of the connected components (the number of white pixels displayed in the result frame) and the total number of connected components were computed for each frame of a given sequence and for a given subsampling, and their average values were computed over the whole sequence. Denote ai as the average area of the connected components detected using the subsampling that contains i% of the pixels of the region of interest R, and ni as the average number of connected components detected using that subsampling. a100 and n100 are then the average area and the average number of connected components for the motion detection that analyzes all the pixels of R; those two values are used as references. Tables 1, 2, and 3 show the number of pixels in the various subsamplings of the region of interest, the area percentage of the region of interest, the area percentage of the whole frame, and the detected areas and numbers of connected components for the three motion detection techniques for three different scenes. To find how much i, the percentage of pixels of R in the subsampling, can be reduced while still obtaining accurate motion detection, ai and ni are plotted in Fig. 12. For clarity, the normalized average area of connected components ai/a100 (in %) is plotted, with a100 used as the reference.
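The two per-frame statistics can be extracted with a connected-components pass; a sketch using OpenCV's current API (the paper's 2009 implementation predates this helper):

// Sketch: per-frame total area and count of connected components in the mask.
#include <opencv2/opencv.hpp>

void componentStats(const cv::Mat& fg, double& totalArea, int& count) {
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(fg, labels, stats, centroids);
    count = n - 1;                                    // label 0 is the background
    totalArea = 0.0;
    for (int i = 1; i < n; ++i)
        totalArea += stats.at<int>(i, cv::CC_STAT_AREA);
}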
Table 1 Subsampling sizes and detection results for sequence 1 (recorded from the northeast corner). Samples: number of pixels in the subsampling; R (%): samples as a percentage of the pixels of the region of interest; Frame (%): samples as a percentage of the pixels of the whole frame. For each algorithm (RA: running average[2]; GD: Gaussian distribution[3]; MoG: mixture of Gaussians[4]), Area is the average detected area of the connected components in pixels and Number their average count, with the percentage of the full-density reference in parentheses.

Samples | R (%)    | Frame (%) | RA Area        | RA Number      | GD Area        | GD Number      | MoG Area       | MoG Number
34 275  | 100.0000 | 44.629    | 2406 (100.000) | 2.41 (100.000) | 1041 (100.000) | 1.90 (100.000) | 1888 (100.000) | 2.77 (100.000)
28 444  |  82.9876 | 37.037    | 2412 (100.250) | 2.42 (100.415) | 1047 (100.576) | 1.92 (101.053) | 1892 (100.212) | 2.76 ( 99.639)
19 549  |  57.0357 | 25.454    | 2414 (100.330) | 2.45 (101.660) | 1054 (101.249) | 2.00 (105.263) | 1907 (101.006) | 2.77 (100.000)
14 262  |  41.6105 | 18.570    | 2407 (100.040) | 2.44 (101.245) | 1068 (102.594) | 2.15 (113.158) | 1923 (101.854) | 2.90 (104.693)
10 056  |  29.3392 | 13.094    | 2399 ( 99.709) | 2.52 (104.564) | 1091 (104.803) | 2.44 (128.421) | 1952 (103.390) | 3.19 (115.163)
 5 580  |  16.2801 |  7.266    | 2385 ( 99.127) | 2.56 (106.224) | 1154 (110.855) | 2.94 (154.737) | 1960 (103.814) | 3.50 (126.354)
 2 181  |   6.3632 |  2.840    | 2440 (101.410) | 2.86 (118.672) | 1275 (122.478) | 3.44 (181.053) | 2084 (110.381) | 3.95 (142.600)
   556  |   1.6222 |  0.724    | 2685 (111.600) | 3.13 (129.876) | 1410 (135.447) | 3.08 (162.105) | 2383 (126.218) | 4.29 (154.874)
Table 2 Subsampling sizes and detection results for sequence 4 (recorded from the southeast corner); columns as in Table 1.

Samples | R (%)    | Frame (%) | RA Area        | RA Number      | GD Area        | GD Number      | MoG Area       | MoG Number
33 299  | 100.0000 | 43.3581   | 1242 (100.000) | 2.91 (100.000) |  499 (100.000) | 1.64 (100.000) | 1314 (100.000) | 2.50 (100.0)
30 969  |  93.0028 | 40.3242   | 1242 (100.000) | 2.91 (100.000) |  501 (100.401) | 1.66 (101.220) | 1315 (100.076) | 2.51 (100.4)
22 624  |  67.9420 | 29.4583   | 1247 (100.400) | 2.92 (100.344) |  520 (104.203) | 1.74 (106.098) | 1322 (100.609) | 2.52 (100.8)
17 622  |  52.9205 | 22.9453   | 1257 (101.210) | 2.96 (101.718) |  549 (110.020) | 1.88 (114.634) | 1335 (101.598) | 2.56 (102.4)
14 005  |  42.0583 | 18.2357   | 1268 (102.090) | 2.97 (102.062) |  591 (118.437) | 2.11 (128.659) | 1346 (102.435) | 2.58 (103.2)
 9 025  |  27.1029 | 11.7513   | 1276 (102.740) | 2.99 (102.749) |  696 (139.479) | 2.72 (165.854) | 1349 (102.664) | 2.70 (108.0)
 4 116  |  12.3607 |  5.35938  | 1371 (110.390) | 3.25 (111.684) |  970 (194.389) | 4.24 (258.537) | 1516 (115.373) | 3.74 (149.6)
 1 087  |   3.26436|  1.41536  | 1579 (127.130) | 4.11 (141.237) | 1262 (252.906) | 4.11 (250.610) | 1889 (143.760) | 4.83 (193.2)
Table 3 Subsampling sizes and detection results for sequence 5 (recorded from the southwest corner); columns as in Table 1.

Samples | R (%)    | Frame (%) | RA Area       | RA Number      | GD Area        | GD Number      | MoG Area       | MoG Number
34 428  | 100.0000 | 44.8281   | 1255 (100.00) | 2.86 (100.000) |  850 (100.000) | 2.13 (100.000) | 1682 (100.000) | 3.16 (100.000)
32 420  |  94.1675 | 42.2135   | 1258 (100.24) | 2.86 (100.000) |  856 (100.706) | 2.13 (100.000) | 1687 (100.297) | 3.16 (100.000)
22 940  |  66.6318 | 29.8698   | 1258 (100.24) | 2.89 (101.049) |  865 (101.765) | 2.13 (100.000) | 1713 (101.843) | 3.13 ( 99.051)
17 361  |  50.4270 | 22.6055   | 1256 (100.08) | 2.93 (102.448) |  870 (102.353) | 2.21 (103.756) | 1720 (102.259) | 3.08 ( 97.468)
13 360  |  38.8056 | 17.3958   | 1267 (100.96) | 2.97 (103.846) |  895 (105.294) | 2.30 (107.981) | 1749 (103.983) | 3.18 (100.633)
 8 248  |  23.9572 | 10.7396   | 1288 (102.63) | 3.04 (106.294) |  928 (109.176) | 2.47 (115.962) | 1763 (104.816) | 3.16 (100.000)
 3 535  |  10.2678 |  4.60286  | 1356 (108.05) | 3.41 (119.231) | 1041 (122.471) | 3.11 (146.009) | 1844 (109.631) | 3.62 (114.557)
   924  |   2.68386|  1.20313  | 1494 (119.04) | 3.32 (116.084) | 1184 (139.294) | 2.81 (131.925) | 2048 (121.760) | 3.78 (119.620)
Figure 12 shows the results for a low traffic scene (two to four isolated objects moving in the scene). For subsamplings containing more than 40% of the pixels of R , the average detected area remains roughly the same as when 100% of the pixels in R are monitored. Below 40%, the average detected area starts to increase because cells associated with detected pixels become bigger than the objects they represent. The average
number of connected components also starts to increase when using subsamplings that contain less than 40% of the pixels of R . With such subsamplings, connected components start to fragment into smaller groups of cells since the cells are no longer well connected. Of the three motion detection algorithms, the mixture of Gaussians gives the most accurate results. The
single Gaussian distribution is very sensitive to noise: noisy cells tend to merge components that should remain separate. Thus, this
algorithm detects fewer objects, with larger corresponding areas. Figure 13 shows the results for a heavy traffic scene.
Fig. 12 Evolution of the normalized average area of the connected components (a) and of the number of connected components (b) for various numbers of pixels in the subsampling (% of pixels of R ) for a low traffic scene
Fig. 13 Evolution of the normalized average area of the connected components (a) and of the number of connected components (b) for various numbers of pixels in the subsampling (% of pixels of R ) for a heavy traffic scene
In tests with all eight sequences, the curves of the normalized average area and of the average number of connected components for the various subsampling densities all have a similar shape: they remain almost constant for subsamplings containing more than 40% of the pixels of R, and start to increase at lower subsampling densities.
4 Conclusions
The spatially adaptive subsampling method presented in this paper is a preprocessing step before motion detection which reduces the number of pixels to be analyzed for the background modeling. This adaptive subsampling is the same for all the frames of a given input video stream since it is only based on the geometry of the empty scene and does not change when
objects come into the scene. If an object stands on P, only its bottom, in contact with the floor, will be analyzed with the desired subsampling density; the upper part will be analyzed with a denser subsampling since its projection is higher on the image. Post processing could compensate for this by drawing a bounding box around an object once it is detected as a connected group of pixels, and then applying the subsampling density found at the bottom of the box to the whole box. However, if an object located behind this object enters this bounding box, it will be analyzed with a density that is too low, so care is needed when bounding boxes overlap. A uniform subsampling density for a whole object assumes that the bottom of its bounding box corresponds to the bottom of the object. However, this is not
true if the object's shadow points towards the bottom of the image. The shadow could be removed by using the calibration data: the size in pixels of the bounding box including the shadow, the scene geometry, and the sun orientation could be used with the 2-D to 3-D correspondence to calculate the part of the bounding box that corresponds to the shadow. This part could then be removed by cropping the bottom of the box, and the subsampling density at the bottom of the new bounding box would be used for the whole object.

References

[1] Piccardi M. Background subtraction techniques: A review. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. The Hague, Netherlands, 2004, 4: 3099-3104.
[2] Lo B P L, Velastin S A. Automatic congestion detection system for underground platforms. In: Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing. Hong Kong, 2001: 158-161.
[3] Wren C R, Azarbayejani A, Darrell T, et al. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(7): 780-785.
[4] Stauffer C, Grimson W E L. Adaptive background mixture models for real-time tracking. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Ft. Collins, Colorado, USA, 1999, 2: 246-252.
[5] Pentland A P, Rosario B, Oliver N M. A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 831-843.
[6] Rymel J, Renno J P, Greenhill D R, et al. Adaptive eigen-backgrounds for object detection. In: Proceedings of the International Conference on Image Processing. Singapore, 2004, 3: 1847-1850.
[7] Chumerin N, Van Hulle M. Cue and sensor fusion for independent moving objects detection and description in driving scenes. In: Signal Processing Techniques for Knowledge Extraction and Information Fusion. 2008: 161-180.
[8] Alzoubi H, Pan W D. Efficient global motion estimation using fixed and random subsampling patterns. In: Proceedings of the International Conference on Image Processing. San Antonio, Texas, USA, 2007, 1: 477-480.
[9] Tsai R Y. An efficient and accurate camera calibration technique for 3D machine vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami Beach, Florida, USA, 1986: 364-374.
[10] Tsai R Y. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 1987, 3(4): 323-344.