Dynamic task decomposition for decentralized object tracking in complex scenes

Computer Vision and Image Understanding 134 (2015) 89–104

Tao Hu a,b, Stefano Messelodi a, Oswald Lanz a,*

a FBK Fondazione Bruno Kessler, Trento, Italy
b ICT Doctoral School, University of Trento, Italy

Article history: Received 24 January 2014; Accepted 7 February 2015

Keywords: Multi-camera tracking, Object tracking, Distributed tracking, Camera selection, Resource allocation, Task assignment, Performance evaluation

Abstract

The employment of visual sensor networks for video surveillance has brought in as many challenges as advantages. While the integration of multiple cameras into a network has the potential advantage of fusing complementary observations from sensors and enlarging visual coverage, it also increases the complexity of tracking tasks and poses challenges to system scalability. For real-time performance, a key approach to tackling these challenges is the mapping of the global tracking task onto a distributed sensing and processing infrastructure. In this paper, we present an efficient and scalable multi-camera multi-people tracking system with a three-layer architecture, in which we formulate the overall task (i.e., tracking all people using all available cameras) as a vision-based state estimation problem and aim to maximize utility and sharing of available sensing and processing resources. By exploiting the geometric relations between sensing geometry and people's positions, our method is able to dynamically and adaptively partition the overall task into a number of nearly independent subtasks with the aid of occlusion reasoning, each of which tracks a subset of people with a subset of cameras (or agencies). The method hereby reduces task complexity dramatically and helps to boost parallelization and maximize the system's real-time throughput and reliability while accounting for intrinsic uncertainty induced, e.g., by visual clutter and occlusions. We demonstrate the efficiency of our decentralized tracker on challenging indoor and outdoor video sequences.

1. Introduction

The technology of people tracking for visual surveillance has developed rapidly over the past years. With the increasing demand for security and surveillance as well as other applications such as ambient assisted living, sports analysis, traffic control and smart museums, visual surveillance nowadays often involves monitoring large, open areas (e.g., airports) with multiple networked cameras. The employment of visual sensor networks in surveillance systems has brought in as many challenges as advantages. While the integration of multiple cameras into a network has the potential advantage of fusing complementary observations from sensors and enlarging visual coverage, it also increases the complexity of tracking tasks and poses challenges to system scalability. Traditional methods that track all the targets with a joint likelihood easily incur the curse of dimensionality as the number of targets increases. On the other hand, tracking with totally independent particle filters often results in the problem of hijacking [4]. Another issue arises from the enlarged area to be

monitored and the increased number of views to be processed. A larger area means that more computational resources must be allocated, and more views mean more data to be processed, hence engendering more computational overhead. How to curtail the computational load so as to maintain real-time tracking without losing frames is then a key issue. Besides, camera networks usually have limited resources such as communication bandwidth, and sensors typically have limited or even no computational capabilities, so the way information is gathered, shared and processed among cameras is of crucial importance. A possible approach to tackling these challenges is to dynamically map the demanding global task onto a decentralized or distributed sensing and processing infrastructure. In this paper, we demonstrate that the overall tracking task can be split into nearly independent subtasks corresponding to the tracking of subsets of people with subsets of cameras, thereby reducing computational load and saving communication bandwidth thanks to branched image transfer. Our dynamic task decomposition strategy gives rise to a three-layer architecture for the tracking system (Fig. 1), in which cameras constitute the bottom layer and are dynamically grouped into clusters (or agencies,


the middle layer) which track a subset of targets using existing base trackers. The top layer is a supervisor process which takes care of target detection, track termination, and task decomposition. For real-time operation, we design a simple but effective cost function which measures the cost induced by decomposing a task. Two methods for task decomposition are proposed. One is based on exhaustive search, i.e., enumerating all feasible combinations of agencies and subsets of targets under given constraints and selecting the one with the least cost. The other formulates the problem as a minimum cost flow problem.

Fig. 1. The three-layer architecture of our tracking system.

The main contributions of this paper develop around a novel formulation of the many-to-many assignment problem in multi-target multi-camera tracking. It is based on a task decomposition cost that combines the computational complexity of base trackers with a camera utility measure in a parameter-free objective function. Camera utility goes beyond the concept of visual coverage of a target, and considers an information-theoretic measure of predicted estimation uncertainty capturing essential information from the multi-view geometry of the camera network available via calibration. The assignment problem defined over the objective function is efficiently solved by a minimum cost flow method that exploits a graph encoding the structure of the function and occlusion constraints. Preliminary results have been published in conference proceedings [47,48].

The remainder of the paper is structured as follows. Section 2 gives a literature review of related work. Section 3 presents the framework of our tracking method and how we decentralize the overall tracking task. Section 4 details how we implement dynamic task decomposition. In Section 5 we present a comprehensive evaluation of the proposed algorithms through a number of experiments, and we draw conclusions in Section 6.

2. Related work

Visual tracking is a broad topic and has been studied for decades. While providing an exhaustive literature review is not possible, here we focus on the context of multi-camera multi-object tracking with overlapping fields of view (FOVs). We classify existing approaches into two main categories, centralized and distributed tracking approaches, and further divide the latter category roughly into three sub-categories: tracking-fusing approaches (i.e., each camera performs tracking independently and then the outputs from different cameras are fused), individually tracking approaches (i.e., each target is tracked separately) and group tracking approaches (i.e., targets are tracked in groups). A good related work discussion can also be found in [13].

2.1. Centralized tracking approaches

These approaches track all the targets together in a centralized framework, whether they use Kalman filtering [32,33], particle filtering [34,39], or other methods such as a probabilistic occupancy map [35]. A major problem with these trackers is that they are not suitable for real-time tracking in large, complex scenes where many people occlude each other. For example, the tracker proposed in [35] can track only up to six people according to the authors. The limited computational resources as well as the limited bandwidth within a camera network make it difficult for centralized trackers to track a large number of targets in real time. Moreover, particle filtering-based trackers suffer from the curse of dimensionality as the number of targets goes up. Some approaches have been proposed to reduce computational cost and data flow, for example, by only evaluating high-quality particles [24] or dynamically deactivating some cameras [19]. Other approaches tackle the curse of dimensionality directly. [20,22] propose a Hybrid Joint-Separable (HJS) filter, an approximate Bayesian tracker that propagates independent representations for each target with a joint likelihood to manage occlusions. The HJS filter has a computational complexity upper bound that grows quadratically (instead of exponentially as for the standard PF) with the number of targets. However, these approaches cannot solve the scalability problems of centralized trackers in a fundamental way. As the number of targets and the number of cameras increase beyond a certain level, the performance of a centralized tracker (even the more efficient HJS filter) is likely to decline.

2.2. Distributed tracking approaches

In order to perform reliable real-time tracking, balance system computational load and data flow, and thus enhance system scalability, we have to resort to a distributed tracking framework or a decentralized one (which can be seen as a special case of the distributed framework). Here we further divide these approaches into three sub-categories: tracking-fusing approaches, individually tracking approaches and group tracking approaches.

2.2.1. Tracking-fusing approaches

These approaches first track the targets in each camera independently, and then fuse the tracking results. A representative approach is the in-network aggregation approach proposed in [16]. Each camera is assigned a set of particles and the posterior density is approximated separately using Gaussian Mixture Models (GMMs) whose order is selected by optimal matching according to KL-distance. Before being sent to the base station, the mixture model parameters from different cameras are reduced using in-network aggregation to limit communication overhead. Another example is the weighted average method proposed in [5], where targets are tracked by each camera independently and the belief of target states computed by each camera is fused with those of its neighboring cameras (cameras within a communication range) through a weighted average; the belief weight for each camera is determined by its observation confidence and an occlusion indicator. Instead of using particle filtering as in [16,5], the Kalman-Consensus filter is employed by [17] in a pan-tilt-zoom (PTZ) camera network which aims to selectively track certain targets at desired resolutions while keeping the entire area in view. Each camera works as an autonomous agent and has an embedded tracking module, a control module and a Kalman-Consensus filter which enables neighboring cameras to reach a consensus about the estimated states of the targets. Consensus is reached through point-to-point communications between neighboring cameras which are defined in dynamically generated topological


graphs for each camera. The approaches [16,5,17] have the advantage of saving system bandwidth since every camera performs tracking independently (no image transfer is needed). However, these approaches require that the cameras have processing power (i.e., smart cameras). Besides, the tracking performance of each camera is not guaranteed when there are a large number of people, since a single camera lacks 3D information about the scene, and consequently the results after fusion may be affected.

2.2.2. Individually tracking approaches

Typically these approaches assign the best camera(s) to each target and track each target independently [10,9,11,6]. An important issue is to design a camera utility function quantifying how well the camera(s) can track the object, which usually considers factors such as target orientation, resolution and distance from the camera. Another key issue is the assignment of cameras to targets, which can be modeled as a game [10,9], a market mechanism [7], a constraint satisfaction problem (CSP) [11], a distributed constraint satisfaction problem (DCSP) or even a Stable Marriage Problem [14], etc. A comparison of methods for camera selection and handoff can be found in [12]. More recently, [45] investigates the problem of online selection of a monocular view sequence in a calibrated camera network and derives an objective function for the quality of a view sequence given a task, based on visual coverage and a smoothness criterion. In this paper we go beyond the concept of visual coverage, and derive a solution for the many-to-many assignment problem in multi-target multi-camera tracking based on an information-theoretic criterion of predicted estimation quality induced by multi-view geometry.

While the methods [10,9,26,7,45] assign a single camera to each target, assigning multiple cameras to each target helps increase accuracy and robustness. Although we could assign all the cameras viewing a target to it, in practice not all cameras are equally needed for the tracking task. We could therefore instead select a subset of cameras for tracking each target, which may save computational resources and system bandwidth substantially. Tessens et al. [13] propose to select a subset of cameras for tracking each target independently in a distributed smart camera network. The selection of the subset is done by computing a suitability value based on the Dempster–Shafer (DS) theory of evidence. They also demonstrate that the reduction of observations does not decrease performance. In [8], cameras are grouped into surveillance clusters to which surveillance tasks are allocated. The allocation problem is formulated as a distributed constraint satisfaction problem (DCSP) and solved by minimizing a cost function. In [23], a cluster-based Kalman filter approach is proposed. Once a new target is detected, cameras which can observe the same target communicate with each other to form a cluster and elect a cluster head. The cluster head aggregates all the local measurements of the target acquired by members of the cluster. The estimate of the target position is then computed by the cluster head via Kalman filtering and transmitted to a base station. Another example is the decentralized system composed of a number of hierarchical Fault Containment Units (FCUs) proposed in [15], where the target is tracked by an FCU comprised of a camera pair and is switched to another FCU in case of a tracking failure.

2.2.3. Group tracking approaches

As discussed above, tracking all targets together in a centralized fashion is not suitable for tracking in complex scenes. On the other hand, tracking each target independently may suffer from the problem of hijacking, especially when there are serious occlusions among targets. In the worst case, a target is occluded in all views; if this continues for some time, the tracker is likely to fail. Therefore, a better choice is to divide the targets into groups and track each group in a centralized way.


Most existing group tracking approaches [36–38] focus on the detection and evolution (merging, splitting, etc.) of groups. Here our goal is to track each target by dividing people into groups. Kembhavi et al. [18] propose to divide targets into interaction groups according to a similarity score and perform tracking with a joint filter for each group of targets. The approach helps circumvent the curse of dimensionality and performs better than using independent particle filters for each target or a single joint filter. Most recently, the decentralized particle filter [27] has been proposed to increase the level of parallelism in particle filtering and is used in [25] for joint tracking of individual targets and groups. However, the approaches [18,27] are built on single-view tracking and thus are not suitable for tracking in complex scenes due to intrinsic uncertainties induced by occlusions and clutter. In this paper, we dynamically divide the targets into groups, each of which is tracked by an HJS filter [22] (a subtask). We divide targets into groups according to occlusion constraints (introduced later), not according to spatial proximity and appearance similarity among targets as in [18]. The reconfiguration of subtasks is managed by a supervisor process. Our tracking system is thereby decentralized, which has the advantage of requiring less information fusion compared to distributed ones [29].

3. Framework for probabilistic tracking with dynamic task decomposition

The application scenario addressed in this paper suggests the choice of a probabilistic framework, since measurements may convey intrinsic uncertainty which cannot be eliminated due to occlusion, target similarity and background clutter. Besides, for efficiency reasons, illumination effects (e.g., shadows) are neglected and only coarse shape and appearance models can be used, leading to a noisy measurement process. Estimates are therefore inherently inaccurate and a deterministic framework would not adequately account for this.

3.1. Probabilistic tracking by sequential Bayesian state estimation

In this paper, object tracking is interpreted as a state estimation problem. The dynamic components of interest of the monitored environment are described with a vector x of numbers, the state, which evolves in continuous time and is observed at discrete time instants t through a measurement vector z_t. More specifically, in this paper the state is composed of the targets' 2D positions on the ground plane. The aim is then to estimate the posterior distribution p(x_t|z_{1:t}) at time t conditioned on the sequence of observations z_{1:t} obtained up to t. In order to support the sequential estimation imposed by real-time applications, the signal x_t is modeled as a first-order Markov process, and the observations z_{1:t} are assumed to be conditionally independent given a sequence of states x_{1:t}. This enables us to compute a new estimate p(x_t|z_{1:t}) from the actual observation z_t and the previous estimate p(x_{t-1}|z_{1:t-1}) using stochastic propagation and Bayes' law

$$p(x_t|z_{1:t}) \propto p(z_t|x_t) \int p(x_t|x_{t-1})\, p(x_{t-1}|z_{1:t-1})\, dx_{t-1}$$

Due to the highly non-linear relation between the targets' ground positions x_t and the resulting image measurements z_t, especially under occlusion, the resulting posteriors p(x_t|z_{1:t}) are complex and no closed-form solution is available for computing the recursion above. In such cases, sampling methods such as particle filtering [3] can be used effectively in combination with generic non-parametric likelihood models to propagate a sample-based representation of the posteriors, e.g., for color-based tracking [1,2].
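As an illustration of this recursion, the following is a minimal sketch of a bootstrap particle filter for a single target on the ground plane; the random-walk motion model, the generic likelihood callable and the parameter values are illustrative assumptions, not the HJS models used in this paper.

import numpy as np

def particle_filter_step(particles, weights, z_t, likelihood, v_max=1.5, dt=1.0/15):
    """One step of the recursion p(x_t|z_1:t) ~ p(z_t|x_t) * integral of p(x_t|x_t-1) p(x_t-1|z_1:t-1).

    particles : (N, 2) ground-plane positions representing p(x_{t-1}|z_{1:t-1})
    weights   : (N,) normalized importance weights
    z_t       : current observation, passed to the user-supplied likelihood
    likelihood: callable(particles, z_t) -> (N,) values of p(z_t|x_t)
    """
    N = len(particles)
    # 1. Resample according to the previous weights (bootstrap filter).
    idx = np.random.choice(N, size=N, p=weights)
    particles = particles[idx]
    # 2. Stochastic propagation with a simple bounded random-walk motion model
    #    (an assumption; the paper uses an MRF collision-avoidance dynamical model).
    particles = particles + np.random.normal(scale=v_max * dt, size=particles.shape)
    # 3. Bayes update: re-weight the propagated particles by the observation likelihood.
    weights = likelihood(particles, z_t)
    weights = weights / np.sum(weights)
    # Point estimate: the highest-weighted (MAP-like) particle.
    x_map = particles[np.argmax(weights)]
    return particles, weights, x_map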


Fig. 2. Overall system architecture supported by the proposed task decomposition framework: leveraging a distributed/cloud computing infrastructure (left), and exploiting a next-generation smart camera network (right) to relieve bandwidth requirements – a major limitation and scalability issue in centralized monitoring of large environments. For our experiments in Section 5.4 we use a simulated smart camera environment as depicted in this figure (right).

However, since the number of samples (particles) required to properly represent generic posteriors increases exponentially with the size of x, i.e., with the number of targets, a fully joint formulation of the multi-target problem becomes computationally prohibitive even with a moderate number of targets. To break this curse of dimensionality under explicit occlusion reasoning, the Hybrid Joint-Separable (HJS) model has been proposed for color-based tracking [22]: instead of a fully joint representation of p(x_t|z_{1:t}), the marginals for each target are propagated over time, resulting in a computational complexity with a quadratic (instead of exponential) upper bound in the number of tracked targets. We use [22] as the baseline tracking framework with systematic occlusion handling in this work.

3.2. Distributed estimation via dynamic factorization

The task of monitoring a large environment may be formalized as a single, joint state estimation problem. However, it is easy to imagine that applying this to real-time monitoring in real-world scenarios such as the one depicted in Fig. 10, where many people interact in a large, unstructured environment, may be unaffordable even with the aforementioned HJS tracking. Rather than attempting to track all people with a single HJS filter in a centralized fashion, a more feasible solution is to instantiate several sub-filters with reduced, local competence (e.g., by dividing targets into groups associated with the subsets of cameras in which they are best observable) which together may perform better not only in terms of overall computational complexity, but also in terms of real-time behavior (different sub-filters can be executed on different processing units) as well as data flow management (image transfer can be routed intelligently in a distributed implementation).

From a theoretical point of view it is easy to show that tracking with independent sub-filters is exact when the dynamical model p(x_t|x_{t-1}) and the observation model p(z_t|x_t) factorize over the chosen set of camera-target associations, i.e., if a_i is the subset of cameras (which we call an agency in the following) used to track a group of targets g_i, we have

$$p(x_t|x_{t-1}) = \prod_{g_i} p(x_t^{g_i}|x_{t-1}^{g_i}), \qquad p(z_t|x_t) = \prod_{\{a_i,g_i\}} p(z_t^{a_i}|x_t^{g_i}) \qquad (1)$$

While assuming independence among the motion patterns of targets may be acceptable at least for short predictions (the Bayes filter is first order in time), this is certainly not true for the observation model when people arrange in groups causing frequent and persistent occlusions. Under occlusion, the appearance of one target can only be explained in relation to that of the occluder, which therefore requires a joint analysis of the captured images. On the other hand, the same scene may be seen occlusion-free from another viewpoint where the resulting image can indeed be processed independently for locating each target,

thus, at lower computational cost and in parallel. This is the key observation leading to our task decomposition algorithm described in the next section: given the known view geometry (camera calibration) and the targets' estimated positions provided online by tracking, we determine (and continuously update) camera-target associations that maximize parallelization while guaranteeing consistent occlusion handling.

3.3. System overview

Fig. 2 gives an overview of the overall tracking system architecture supported by the proposed task decomposition framework. HJS base trackers are implemented by two types of processes: (i) the HJS particle filter (PF) process implements stochastic propagation with an MRF collision-avoidance dynamical model [22], and (ii) the HJS likelihood process is connected to a specific camera, receives a set of propagated particles from a PF process, computes their HJS likelihoods [22] utilizing a color model for each target extracted automatically from images upon detection, and sends them back to the HJS PF for multi-view Bayesian integration and resampling. The HJS likelihood computation has a complexity that scales quadratically with the number of occluding targets, and is the computationally most demanding process. There can be several HJS likelihood process instances connected to the same camera serving different HJS PFs. All HJS likelihood processes can run in parallel¹ as they are based on conditionally independent models given the predicted particles computed by the HJS PFs (this is a common assumption in multi-camera tracking and, further, a valid assumption given the occlusion reasoning in the task decomposition method, Section 4.2).

Detection of new targets is assumed to be carried out by an external multi-view people detection process; in our implementation we use the motion-based method in [43], which utilizes the current estimates to allocate computations to, and only when there are, observations not explained by tracked targets, consequently avoiding multiple detections of the same target. The detector signals the presence of a new target to the global task decomposition process. At a fixed frequency (3 Hz in our experiments), the decomposition process receives the current tracking estimates from the HJS PFs, computes the new decomposition and, if required, sends reconfiguration instructions to the HJS PFs, i.e., which HJS likelihood processes each HJS PF can use for tracking.

While the proposed task decomposition framework can leverage distributed processing using existing CCTV infrastructure (Fig. 2, left), interesting opportunities arise with next-generation smart cameras that are expected to provide high-performance front-ends to run dedicated embedded applications.

¹ We use parallel HJS likelihood processes even with the centralized HJS PF experiments in Section 5.4.


Our experiments are based on a simulated smart camera environment (Fig. 2, right) where all HJS likelihood processes are executed onboard the camera and image transfer over the network is avoided. This definitively overcomes the scalability issue due to bandwidth requirements in centralized approaches to monitoring large environments. At the same time, it opens up new research directions for computer vision previously explored in sensor networks, such as how to optimize load distribution in camera networks with local processing power [44].

4. Task decomposition

As mentioned in the previous section, a principled approach to distributed tracking may be derived by exploiting dependency relations among environment dynamics (target trajectories) and measurements (utilized cameras). While such dependencies may appear in many different facets (group behavior, illumination effects such as shadows), the most essential one arises from occlusions. The approach proposed next accounts for this, providing a solution derived from knowledge about camera placement and estimated target positions under given constraints.

4.1. Task decomposition: an example

An example is shown in Fig. 3: four targets have to be tracked with three cameras. Since all four targets may be visible in all three cameras, a trivial solution is to instantiate a single task equipped with all three cameras. However, the closeness of targets B, C and D suggests that they should be grouped together, whereas target A might be handled separately. This can be derived from occlusion reasoning. The closeness of targets B, C and D indicates potential occlusions among them in all three views within a certain time interval, as can be seen from the intersecting dashed red circles. Due to the potential occlusions, targets B, C and D should be tracked together. With this division, no occlusions between target A and group (B, C, D) are allowed, a feature that depends on the relative placement of the selected cameras. Camera 1 observes (B, C, D) occluded by A and can therefore not be used to track (B, C, D) only (i.e., if camera 1 is used to track (B, C, D), it must also track A at the same time). On the other hand, camera 1 could take care of tracking A only, as (B, C, D) are visible but do not influence the appearance of A. Cameras 2 and 3 could both be associated to both

Fig. 3. Tracking four targets with three cameras: how to optimally assign cameras and targets to independent tracking processes? The red dashed circles centered around the targets are the predicted supports, which define all possible positions a target can reach within a short time interval Δt assuming a maximum velocity v_max of the targets. Δt is associated with the time interval for which such an assignment is assumed to be valid (see Table 1 for the values used in the experiments).

(A) and (B, C, D), as there are no occlusions among them. An appropriate choice would then be to instantiate two subtasks,

Sub1 = {(A); (1, 2, 3)} and Sub2 = {(B, C, D); (2, 3)}.

It can be concluded that grouping purely based on target proximity is not sufficient to find feasible subtasks, and view geometry needs to be taken into account in the definition of subtasks: indeed, in Fig. 3 the circle of target A does not overlap with any of B, C, D but, nonetheless, camera 1 cannot be used to track (B, C, D) alone because A can occlude B, C or D, i.e., the independence assumption in Eq. (1) is violated.

This solution does not consider constraints on resources. In practice, we may have requirements on (1) the minimum number of cameras in an agency, to ensure tracking performance; (2) the maximum number of cameras in an agency, to limit computational overhead; and (3) the maximum number of agencies (which should not be larger than the number of targets), to ensure there are enough available processing units for every subtask to be mapped onto. Besides, this solution is only the snapshot configuration of Fig. 3. Task decomposition cannot be carried out at a frequency comparable to the observation rate; a lower rate may be employed while accounting for potential occlusions within a given time interval. To sum up, a task decomposition algorithm must be able to (1) detect potential interactions (i.e., occlusions) among targets, and (2) reason about efficiency (or cost) and suitability of a possible allocation, which will be detailed in the next sections.

4.2. Occlusion reasoning

As outlined in Section 3.2, the allocation of a set of agencies remains sustainable as long as the measurement model remains separable over the set of positions (or supports) that can be reached from the current estimates within a given time window. Verifying this at full resolution is not required: we may limit ourselves to detecting potential occlusions while reasoning on the ground (on a 2D world observed with linear cameras). To compute each target's support for occlusion reasoning over a given time interval one can use the dynamical model p(x_t|x_{t-1}) and stochastic propagation (see Section 3.1). To further reduce complexity, we instead approximate the current estimate by a circular support centered around the estimated target position, i.e., around the maximum a posteriori (MAP) particle. Its predicted support is then obtained as the envelope of all the positions that can be reached within a time span at a predefined maximal velocity. This corresponds to drawing a circle around the current

Fig. 4. Potential occlusion detection. The arcs A0A1 and B0B1 are the projections of the support outlines of targets A and B onto a circle centered at the viewpoint of the camera. (a) No occlusion between A and B when arcs A0A1 and B0B1 have no intersection. (b) Occlusion is detected between A and B when arcs A0A1 and B0B1 overlap.


Fig. 5. Overview of the ground occupancy map generation process. The process uses a calibrated model of the imaging process and inherits uncertainty modes owing to occlusion, perspective, and resolution (see also examples in Fig. 12).

position with a radius equal to the product of the maximum velocity and the time span (the dashed red circles in Fig. 3). To detect potential occlusions in each view, we draw a circle around the camera viewpoint and project the support outline of each target onto the circle. If the projected outlines intersect, then a potential occlusion is detected (Fig. 4).
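A minimal sketch of this test, under the simplifying assumption that each circular support projects to an angular interval as seen from the camera viewpoint (all names and the interval representation are illustrative):

import math

def angular_interval(camera_xy, target_xy, radius):
    """Angular interval subtended by a circular support as seen from the camera viewpoint."""
    dx = target_xy[0] - camera_xy[0]
    dy = target_xy[1] - camera_xy[1]
    dist = math.hypot(dx, dy)
    center = math.atan2(dy, dx)
    # Half-angle of the support circle; clamp for targets closer than the radius.
    half = math.asin(min(1.0, radius / max(dist, 1e-6)))
    return center - half, center + half

def arcs_overlap(a, b):
    """True if two angular intervals on the view circle overlap (handles wrap-around)."""
    two_pi = 2.0 * math.pi
    shift = (b[0] - a[0]) % two_pi  # offset of b's start relative to a's start
    return shift <= (a[1] - a[0]) or shift >= two_pi - (b[1] - b[0])

def potential_occlusion(camera_xy, pos_a, pos_b, radius):
    """Potential occlusion between targets A and B in this view (the test of Fig. 4)."""
    arc_a = angular_interval(camera_xy, pos_a, radius)
    arc_b = angular_interval(camera_xy, pos_b, radius)
    return arcs_overlap(arc_a, arc_b)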

4.3. Efficiency measure for task allocation

Since in most cases there are many feasible ways of grouping cameras and targets and associating them to form subtasks, even under a number of constraints, we need to quantify the cost of each configuration; the optimal solution is the one with the lowest cost. Efficient task allocation relies on the choice of an appropriate function to quantify this cost. It should account for computational load as well as for sensing and networking issues. While possible solutions to the latter issue in the context of active vision were preliminarily investigated by the authors in [28], in this paper the focus is on the computational and sensing aspects. When we split a joint filter into agencies with competence over subsets of K_i targets, the computational complexity reduces from (Σ K_i)^2 to Σ K_i^2 with HJS particle filtering (in the case of joint base trackers, from e^{Σ K_i} to Σ e^{K_i}). Beyond complexity, another factor that impacts tracking performance is the number and spatial distribution of the associated cameras. A higher number guarantees better visual coverage, but at the increased cost of analyzing more images (besides power consumption and networking load due to image transfer). Two cameras that observe the same target from orthogonal directions convey more information about its position than two parallel or anti-parallel cameras do. Also, a target may be observed (partially) occluded in one view but occlusion-free in another view, and the amount of information about its position that can be provided by different cameras may vary significantly even if occlusions are handled properly as in HJS tracking.

To account for all these factors in defining the camera utility component of the cost function, we combine the imaging geometry of an agency, available via calibration, with a generative model of object observation to predict the estimation uncertainty associated with localizing a subset of targets with the subset of cameras composing the agency. Interestingly, this allows establishing a link to the computational complexity component of the cost function via information theory, and we obtain a parameter-free cost function to be minimized for optimal task decomposition, which can be transformed into a graph-based assignment problem efficiently solved by a minimum cost flow method. To do so, we perform three steps: first, we generate predicted observations of the targets in the different views based on the MAP ground positions provided by tracking; then, we estimate probabilistic ground occupancy maps from those predicted observations; and, finally, we compute the entropy of such ground occupancy maps to evaluate the camera utility for each target.

4.3.1. Generating observations

We generate observations by projecting a coarse 3D shape model placed at the targets' MAP ground positions to each camera. This projection results in the outline of the projected 3D shape for the considered view. For each camera, these projections are done in order, from the closest target to the farthest. When an occlusion is detected in the projection, i.e., when an outline pixel falls within the silhouette of an already visited target, instead of retaining that pixel for the outline, we sample a new pixel randomly within the occluding object's silhouette (see Fig. 5). In this way we consistently inject noise into the generated observations induced by occlusion, which results in occupancy uncertainty and, consequently, in a penalization of the camera utility value (see Fig. 13(f)).

4.3.2. From observations to ground occupancy maps

The synthetic contours generated this way for each camera in the agency can be back-projected to probabilistic ground occupancy maps by the method in [21]. The method implements a generalized-Hough-like transform to back-project object contours to ground occupancy under a 3D rigid shape model hypothesis,² combined with a kernel density estimation on the ground to accumulate evidence from the different views.³ The kernel bandwidth hereby accounts for inaccuracies induced by modeling people with a 3D rigid shape.

² More precisely, the same 3D shape model is used for generating predicted observations (4.3.1) and back-projecting them to ground occupancy maps (4.3.2), as well as for detection [43] and color likelihood computation with the HJS base trackers [22].
³ The method [21] was originally proposed for detection in multi-camera environments (indeed we use a previous version [43] of it for detection in the experiments): apparent motion extracted via frame differencing is used to generate ground occupancy maps of moving people under a 3D shape model hypothesis. Detection occurs when a peak above threshold is detected in the map. For the purpose of estimating camera utility, we use the outlines of 3D shape projections (4.3.1) instead of apparent motion to generate the maps.


Further, in the context of estimating agency utility, we can adapt the bandwidth to account for other uncertainty sources such as the prediction time of stochastic propagation and the length of the time interval for which a decomposition is assumed to be valid, as well as for inaccurate calibration and poor frame synchronization across views as observed in the PETS_2009_S2_L1 multi-view dataset (see Section 5.4.3). Fig. 5 sketches the ground occupancy map generation process.
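The following is a simplified sketch of the last two steps of this pipeline (accumulating a ground occupancy map from back-projected evidence, and computing its entropy), assuming the per-view evidence has already been reduced to candidate ground points; the Gaussian kernel accumulation and the grid parameters are illustrative and not the exact generalized-Hough transform of [21].

import numpy as np

def ground_occupancy_map(ground_points_per_view, grid_shape=(100, 100),
                         cell_size=0.05, bandwidth=0.20):
    """Accumulate a probabilistic ground occupancy map from per-view ground evidence.

    ground_points_per_view : list (one entry per camera in the agency) of (M, 2) arrays of
                             candidate ground positions in meters, e.g. obtained by
                             back-projecting predicted outlines under a 3D shape hypothesis.
    Returns a normalized occupancy map (a pdf over grid cells).
    """
    H, W = grid_shape
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs * cell_size, ys * cell_size], axis=-1)  # (H, W, 2) cell centers
    occ = np.zeros(grid_shape)
    for points in ground_points_per_view:
        view_map = np.zeros(grid_shape)
        for p in points:
            d2 = np.sum((grid - p) ** 2, axis=-1)
            view_map += np.exp(-0.5 * d2 / bandwidth ** 2)  # Gaussian kernel accumulation
        occ += view_map / max(view_map.sum(), 1e-12)        # fuse evidence across views
    return occ / occ.sum()

def occupancy_entropy(occ, cell_area=0.05 ** 2):
    """Differential entropy H(t_j | a_i) of the occupancy pdf, in nats."""
    p = occ[occ > 0]
    density = p / cell_area                      # convert per-cell mass to density
    return float(-np.sum(p * np.log(density)))   # H = -sum p * log(p / cell_area)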

4.3.3. Combining uncertainty in ground occupancy with computational complexity via information theory

It is important to note that the above generation of ground occupancy maps from synthesized observations relies on a calibrated model of the multi-view imaging process, and inherits all uncertainty modes induced by the imaging process associated with the agency and the relative locations of the targets to be assigned to it. These include distortions due to perspective and occlusion, and even the resolution with which observations and occupancy maps are rendered. The effect of these on the ground occupancy can be observed in Fig. 5.

From the occupancy maps it is straightforward to compute, for each target t_j, the entropy H(t_j|a_i) as a principled measure of predicted localization uncertainty under agency a_i. Further, the relation of entropy to the volume of the high probability region of the occupancy map (its typical set) is formalized by the Asymptotic Equipartition Property theorem [40]. In [41] this relation has been exploited for adapting the number of particles to be propagated over time by a particle filter to the estimation uncertainty: the number of i.i.d. samples (particles) required to consistently represent an estimate with entropy H is N = q e^H, where q is a constant density of particles per unit area (e.g., q = 20 x 20 = 400 particles per square meter). The computational cost of HJS particle filtering with adaptive sample size then becomes q Σ_{a_j,g_j} K_j Σ_{t_i in g_j} e^{H(t_i|a_j)}, with K_j = Card(g_j) denoting the cardinality of target set g_j.⁴ Since our base trackers operate with a fixed number N_0 of particles, after algebraic steps we can conveniently split this function into a sum of two terms

$$\underbrace{\sum_{\{a_j,g_j\}} N_0\,\mathrm{Card}(g_j)^2}_{\text{HJS complexity}} \;+\; \underbrace{q \sum_{\{a_j,g_j\}} \mathrm{Card}(g_j) \sum_{t_i \in g_j} \left(e^{H(t_i|a_j)} - e^{H_0}\right)}_{\text{camera utility term}}$$

where e^{H_0} = N_0/q. The first term is the computational cost of the HJS base trackers (i.e., of the real computations performed during tracking), while the second term accounts for the additional computations if the particle size were adapted to the estimation uncertainty, which depends on the chosen set of cameras (agencies) observing the targets. While this second term is derived from computational cost, these computations are never executed; they are instead counted for the purpose of penalizing decompositions that result in predicted uncertainty that cannot be coped with using the constant number of particles of the base trackers. Therefore, without impairing the effectiveness of the cost function, we can introduce two modifications to the camera utility term.

• We lower-bound the difference e^{H(t_i|a_j)} - e^{H_0} in the inner summation to 0; this is equivalent to lower-bounding the minimum number of particles in the adaptation to N_0. Indeed, N_0 particles are always used by the base trackers and this way we ignore negative contributions from 'virtual savings' when fewer particles would suffice. If not ignored, these negative contributions could compensate for the higher predicted uncertainty of other targets, leading to sub-optimal decompositions.

⁴ In HJS tracking with K occluding targets, for every particle the likelihood is computed from K evaluations, which is the dominating computational cost.

Fig. 6. The camera utility term of the cost function in Eq. (2) is proportional to the Bayes risk of tracking failure: if the characteristic function of an estimate (the ground occupancy pdf) has support (typical set) TS(t_i|a_j) and with N_0 particles we can populate a subset of volume e^{H_0}, then the Bayes risk of failure is approximated by the volume difference e^{H(t_i|a_j)} - e^{H_0}.

• We replace the summation weights Card(g_j) with 1; this is equivalent to charging additional particles at the standard particle filtering cost, i.e., particle filtering without explicit occlusion handling. Note that we already take potential occlusions into account in the generation of the ground occupancy maps, and thus in the entropy estimate used for computing the utility term. Most importantly, keeping these weights independent of the groupings g_j allows us to avoid an exhaustive search for the minimum cost solution, which would introduce a scalability issue (see also Fig. 11); instead, we can solve for the task decomposition efficiently with the cost flow method in Section 4.5.

With these modifications, the cost function for task decomposition is

$$\sum_{g_j} \mathrm{Card}(g_j)^2 + \frac{q}{N_0} \sum_{t_i} \max\left(0,\; e^{H(t_i|a_j)} - e^{H_0}\right) \qquad (2)$$
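As an illustration, the cost of a candidate decomposition according to Eq. (2) can be evaluated as sketched below, given the per-(target, agency) entropies computed from the occupancy maps; the function and variable names are ours, not part of the published implementation.

import math

def decomposition_cost(assignment, entropy, q=400.0, n0=150):
    """Cost of a task decomposition following Eq. (2).

    assignment : list of (agency, group) pairs, where group is a list of target ids
                 and agency is any hashable identifier of a camera subset
    entropy    : dict mapping (target_id, agency) -> predicted entropy H(t_i | a_j)
    q          : particle density per square meter (400 = 20 x 20 in the paper)
    n0         : fixed number of particles used by the base trackers
    """
    e_h0 = n0 / q                  # e^{H0} = N0 / q, the volume N0 particles can populate
    complexity = sum(len(group) ** 2 for _, group in assignment)
    utility = sum(
        max(0.0, math.exp(entropy[(t, agency)]) - e_h0)
        for agency, group in assignment
        for t in group
    )
    return complexity + (q / n0) * utility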

To further elaborate on its information-theoretic grounding, Fig. 6 gives an alternative interpretation of camera utility defined this way, in terms of Bayes risk minimization of tracking failure.

4.4. Task decomposition by exhaustive searching

A trivial solution for finding the best configuration of agencies and targets is exhaustive search. Although the number of possible partitions of cameras and targets grows exponentially, it is computationally tractable when there are not too many cameras or targets. Let us denote the number of cameras by N_c, the number of targets by N_t, the maximum number of agencies by MAX_a, and the minimum and maximum number of cameras each agency can have by MIN_c and MAX_c. The idea is first to partition the camera set {C_11, ..., C_1M_1, ..., C_i1, ..., C_iM_i, ..., C_Nc1, ..., C_NcM_Nc} to generate agencies (subsets of cameras), and to partition the target set {t_1, t_2, ..., t_Nt} to generate groups of targets, where C_i1, ..., C_iM_i are multiple copies of camera C_i and M_i (i = 1, ..., N_c) is the maximum number of agencies camera C_i can be assigned to (we introduce multiple copies to allow each camera to be assigned to more than one agency). Camera partitions that do not meet the constraints MAX_a, MIN_c and MAX_c are discarded. In this way, we obtain a camera partition set S_c and a target partition set S_t. Each camera partition contains a set of agencies and each target partition contains a set of target groups. Then, for each camera partition and each target partition, we enumerate all the possible configurations of subtasks (combinations of



Fig. 7. The flow network with N_t targets and N_a agencies. The edge label "x|y" means the cost of the edge is x and its capacity is y. "c_ij" is the cost of assigning target i to agency j (i.e., (q/N_0) max(0, e^{H(t_i|a_j)} - e^{H_0})).

agencies and target groups) and check whether they meet the required occlusion constraint; if so, we compute the cost according to Eq. (2). Finally, the configuration with the least cost is selected. In practice, the partitions of cameras and targets and the combinations of agencies and target groups can be computed offline and stored in a lookup table.
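A minimal sketch of this exhaustive enumeration is given below; it reuses the decomposition_cost function sketched after Eq. (2), assumes precomputed partition lists, and delegates the occlusion check of Section 4.2 to a hypothetical callable, so the names and the one-to-one pairing of agencies with groups are illustrative assumptions.

from itertools import permutations

def exhaustive_decomposition(camera_partitions, target_partitions,
                             entropy, violates_occlusion):
    """Enumerate all subtask configurations and return the cheapest feasible one.

    camera_partitions : list of partitions; each partition is a list of agencies
                        (an agency is a tuple of camera ids), precomputed offline
    target_partitions : list of partitions; each partition is a list of target groups
    entropy           : dict (target_id, agency) -> predicted entropy H(t_i | a_j)
    violates_occlusion: callable(agency, group, other_groups) -> bool, true if a target
                        outside `group` may occlude a member of `group` in some view
                        of `agency` (the occlusion constraint of Section 4.2)
    """
    best_cost, best_config = float("inf"), None
    for agencies in camera_partitions:
        for groups in target_partitions:
            if len(agencies) != len(groups):
                continue
            # Try every pairing of agencies with target groups.
            for paired_groups in permutations(groups):
                config = list(zip(agencies, paired_groups))
                if any(violates_occlusion(a, g, [o for o in paired_groups if o is not g])
                       for a, g in config):
                    continue
                cost = decomposition_cost(config, entropy)
                if cost < best_cost:
                    best_cost, best_config = cost, config
    return best_config, best_cost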

4.5. Task decomposition by minimum cost flow

Finding the optimal task decomposition can be modeled as an assignment problem [42] that we map into a minimum-cost flow problem [30]. First, the set of partitions of available cameras is computed as in Section 4.4. Each partition P consists of a set of N_a agencies A = {a_1, ..., a_Na}. The problem is to find a surjective mapping M from the set T = {t_1, ..., t_Nt} of targets to the set of agencies,

$$M : T \to A,$$

which minimizes the cost function (2), rewritten in the following way:

$$\min_M \left( \sum_{a_j \in A} \mathrm{Card}(M^{-1}(a_j))^2 + \frac{q}{N_0} \sum_{t_i \in T} \max\left(0,\; e^{H(t_i|M(t_i))} - e^{H_0}\right) \right).$$

For each partition we construct a weighted network whose minimum cost flow yields the best mapping M. A weighted network N = (V, E, cap, cost) consists of a directed graph G = (V, E), a capacity function cap: E -> R+ and a cost function cost: E -> R. The set of vertices V = {s, t, B_1, B_2} ∪ V_t ∪ V_a ∪ V_s is composed of two special vertices, the source s and the sink t, two nodes B_1 and B_2 that constrain the map to be surjective, and the following three subsets (see Fig. 7):

• V_t = {T_i | i = 1 ... N_t}, a set of N_t vertices corresponding to the targets;
• V_a = {A_j | j = 1 ... N_a}, a set of N_a vertices corresponding to the agencies;
• V_s = {A_i,j | i = 1 ... N_a, j = 1 ... K-1}, a set of N_a(K-1) auxiliary vertices, where K = N_t - N_a.

The total number of vertices is N_v = N_t + N_a(N_t - N_a) + 4. In the case K < 2 the set V_s is empty and the number of vertices is N_v = N_t + N_a + 4. The set of edges E = E_s ∪ E_m ∪ E_b ∪ E_a1 ∪ E_a2 ∪ E_t is composed of the following six subsets:

• E_s = {(s, T_i) | i = 1 ... N_t}, a set of N_t edges connecting the source node to each target vertex in V_t (blue edges in Fig. 7);
• E_m = {(T_i, A_j) | i = 1 ... N_t, j = 1 ... N_a}, a set of N_t · N_a edges connecting each target vertex to each agency vertex (yellow edges);
• E_b = {(A_j, B_1) | j = 1 ... N_a}, a set of N_a edges connecting each agency vertex in V_a to the B_1 node (green edges);
• E_a1 = {(A_i,j, B_2) | i = 1 ... N_a, j = 0 ... K-1}, where A_i,0 ≡ A_i, a set of N_a·K auxiliary edges (brown edges);
• E_a2 = {(A_i,j, A_i,j+1) | i = 1 ... N_a, j = 0 ... K-2}, another set of N_a(K-1) auxiliary edges (purple edges);
• E_t = {(B_1, t), (B_2, t)}, two edges connecting nodes B_1 and B_2 to the sink node (red edges).

The total number of edges is N_e = 2 + N_t + 3 N_a N_t - 2 N_a^2; in the special case K < 2 it is N_e = 2 + N_t + N_a + N_a N_t. The capacity and cost functions are defined as follows:

Edge | cap(e) | cost(e)
e ∈ E_s | 1 | 0
e ∈ E_m | 1 | (q/N_0) max(0, e^{H(t_i|a_j)} - e^{H_0})
e ∈ E_b | 1 | 1
e = (A_i,j, B_2) ∈ E_a1 | 1 | 2(j+1) + 1
e = (A_i,j, A_i,j+1) ∈ E_a2 | K - j | 0
e = (B_1, t) | N_a | 0
e = (B_2, t) | N_t - N_a | 0

A feasible (s, t)-flow is a function f: E -> R that satisfies:

(1) 0 ≤ f(e) ≤ cap(e) for all e ∈ E;
(2) Σ_{e ∈ in(v)} f(e) = Σ_{e ∈ out(v)} f(e) for all v ∈ V \ {s, t};

where in(v) (out(v)) is the set of edges entering (leaving) vertex v.


The value of the flow f is

$$val(f) = \sum_{e \in out(s)} f(e) - \sum_{e \in in(s)} f(e),$$

while

the cost of the flow is cost(f) = Σ_{e ∈ E} f(e) cost(e). The minimum cost flow (MCF) problem consists in finding a flow function f with val(f) = V, for a given V, having minimal cost. In our case the desired flow value is V = N_t. The complexity of the problem is O(V N_e log(N_v) / max(1, log(N_e/N_v))), which in our case is equivalent to O(N_a N_t^2 log(N_a K)).

The capacity values are selected in such a way as to guarantee that every target is assigned to an agency and every agency has at least one target assigned to it. In fact, the unitary capacity of the E_s edges guarantees that, in order to reach the desired flow from source to sink, every target is assigned to an agency. On the other hand, the capacities of edges (B_1, t) and (B_2, t) require that exactly N_a units of flow pass through node B_1 and, because of the unitary capacities of the E_b edges, every agency node has to contribute to that flow.

The costs of edges in E_m account for the second term of the cost function (2): each target t_i sends a unitary flow towards an agency a_j = M(t_i), contributing the cost (q/N_0) max(0, e^{H(t_i|a_j)} - e^{H_0}) to the total cost. The costs of edges in E_a1 contribute to the first term of the cost function (2): for each agency a_j a cost term is added equal to the square of the number of targets assigned to the agency. Let us denote n_j = Card(M^{-1}(a_j)) and rewrite n_j^2 as the sum of the first n_j odd numbers: n_j^2 = Σ_{h=0}^{n_j-1} (2h + 1). The edges in E_a1 encode this decomposition; in fact:

• as they have capacity one, if a flow of n_j units reaches node A_j, meaning that n_j targets have been assigned to agency a_j, it has to be split into n_j unitary flows: one follows the edge (A_j, B_1) of cost 1, the other n_j - 1 follow different E_a1 edges to reach node B_2;
• as these edges have increasing costs 2(h + 1) + 1 with h = 0, ..., K - 1, the minimum flow follows the cheapest edges, for a total cost 1 + Σ_{h=0}^{n_j-2} (2(h + 1) + 1) = 1 + Σ_{h=1}^{n_j-1} (2h + 1) = n_j^2.

An example is illustrated in Fig. 8. Note, however, that the graph in Fig. 7 computes the optimal mapping from targets to agencies without taking into account the occlusion constraints.
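The sketch below builds the network of Fig. 7 for one camera partition and solves it with an off-the-shelf min-cost-flow solver (networkx); the solver choice, the integer scaling of the fractional costs and all identifiers are our assumptions, and occlusion constraints are handled separately as described next.

import math
import networkx as nx

def solve_assignment_mcf(targets, agencies, entropy, q=400.0, n0=150, scale=1000):
    """Build the flow network of Fig. 7 for one camera partition and solve it.

    targets  : list of target ids (N_t of them)
    agencies : list of agency ids (N_a of them), N_a <= N_t
    entropy  : dict (target, agency) -> predicted entropy H(t_i | a_j)
    Returns a dict target -> agency (the mapping M).
    """
    nt, na, k = len(targets), len(agencies), len(targets) - len(agencies)
    e_h0 = n0 / q
    G = nx.DiGraph()
    G.add_node("s", demand=-nt)   # source pushes N_t units of flow
    G.add_node("t", demand=nt)    # sink absorbs them
    for ti in targets:
        G.add_edge("s", ("T", ti), capacity=1, weight=0)
        for aj in agencies:
            # Second term of Eq. (2), scaled to integers for the solver.
            c = (q / n0) * max(0.0, math.exp(entropy[(ti, aj)]) - e_h0)
            G.add_edge(("T", ti), ("A", aj, 0), capacity=1, weight=int(round(scale * c)))
    for aj in agencies:
        G.add_edge(("A", aj, 0), "B1", capacity=1, weight=1 * scale)
        for j in range(k):        # auxiliary edges encode n_j^2 as a sum of odd numbers
            G.add_edge(("A", aj, j), "B2", capacity=1, weight=(2 * (j + 1) + 1) * scale)
            if j < k - 1:
                G.add_edge(("A", aj, j), ("A", aj, j + 1), capacity=k - j, weight=0)
    G.add_edge("B1", "t", capacity=na, weight=0)
    G.add_edge("B2", "t", capacity=nt - na, weight=0)
    flow = nx.min_cost_flow(G)
    return {ti: aj for ti in targets for aj in agencies
            if flow[("T", ti)].get(("A", aj, 0), 0) > 0}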


Fig. 8. An example illustrating how the graph structure accounts for the first term of the cost function (2). Let us assume K = N_t - N_a = 4. If the flow reaching node A_j is equal to 3 (i.e., Σ_{e ∈ in(A_j)} f(e) = 3), the minimum cost of the flow leaving A_j is obtained by pushing the flow along the three cheapest edges, having costs 1 (towards B_1), 3 and 5 (towards B_2), for a total of 9. The edge labels are composed of three fields: cost, capacity and flow (in red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Handling occlusion constraints by decomposing graphs. In the case where target t_3 is occluded by t_1 in a_1, the initial graph (a) is decomposed into two separate graphs (b) and (c) (only V_t and V_a nodes are drawn).

We deal with dependencies among occluding targets by decomposing the MCF problem into subproblems. First we build an occlusion table which lists all the occlusion triples (a_i, t_j, t_k), meaning that target t_j occludes target t_k in one or more of the views in agency a_i. Then we build a recursive algorithm that takes the occlusion triples into consideration one by one. Fig. 9(a) gives an example of a weighted network for tracking 3 targets with 2 agencies. Let the first occlusion triple be (a_1, t_1, t_3), i.e., target t_1 occludes target t_3 in agency a_1, meaning that if target t_3 is tracked by agency a_1, then target t_1 must also be tracked by a_1. According to whether target t_3 is tracked by a_1 or not, we can divide the MCF problem into two subproblems. The first one manages the case of assigning t_3 to agency a_1 by removing all the edges going from t_3 and t_1 to the agencies other than a_1 (Fig. 9(b)). The second subproblem deals with the case of not assigning t_3 to a_1 by just removing the corresponding edge (Fig. 9(c)). For each occlusion constraint, a problem is thus decomposed into two smaller subproblems; in many cases this reduction procedure results in subproblems having no solution. This is checked by counting the number of edges leaving V_t nodes (T_cc(i)) and the number of edges entering V_a nodes (A_cc(j)); if one of these counters goes to zero the problem has no solution and we can stop the procedure along that path. Algorithms 1 and 2 show the pseudo code of the dynamic task decomposition and of the recursive procedure.

Algorithm 1. Task decomposition by minimum cost flow.

  compute the camera partition set S_c  // can be done offline
  generate occupancy maps for each target-camera pair
  cost_min = ∞
  for each camera partition PC_i ∈ S_c do
    compute E_m costs from the set of occupancy maps
    build the weighted network WN
    build the occlusion table O_tab, containing O_n triples
    set the occlusion index O_idx = 1
    initialize counters T_cc to N_a
    initialize counters A_cc to N_t
    {call recursive procedure}
    (cost, map) = Rec_MCF(WN, O_idx, T_cc, A_cc)
    if cost < cost_min then
      cost_min = cost
      map_best = map
    end if
  end for
  return map_best


Algorithm 2. Recursive procedure to deal with occlusions.

  Rec_MCF(WN, O_idx, T_cc, A_cc)
  if one of the counters in T_cc or A_cc is zero then
    return (∞, ∅)
  else if O_idx == O_n then
    {all occlusion triples have been managed; solve the MCF problem obtaining minimum cost and best map}
    (cost, map) = MCF(WN)
    return (cost, map)
  else
    {subproblem 1}
    update counters T_cc, A_cc
    build the modified network WN_1
    (cost_1, map_1) = Rec_MCF(WN_1, O_idx + 1, T_cc, A_cc)
    restore counters T_cc, A_cc
    {subproblem 2}
    update counters T_cc, A_cc
    build the modified network WN_2
    (cost_2, map_2) = Rec_MCF(WN_2, O_idx + 1, T_cc, A_cc)
    restore counters T_cc, A_cc
    if cost_1 < cost_2 then
      return (cost_1, map_1)
    else
      return (cost_2, map_2)
    end if
  end if
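A compact Python rendering of this recursion is sketched below; it represents the network only by its set of admissible (target, agency) edges and delegates the unconstrained solve to a callable such as the min-cost-flow sketch of Section 4.5, so it is an illustration rather than the authors' implementation.

def rec_mcf(edges, triples, idx, solve_mcf):
    """edges     : set of admissible (target, agency) assignment pairs
    triples   : list of occlusion triples (agency, occluder, occluded)
    solve_mcf : callable(edges) -> (cost, mapping), solving the unconstrained MCF
    Returns (cost, mapping) for the cheapest occlusion-consistent assignment."""
    if idx == len(triples):
        return solve_mcf(edges)
    a, occluder, occluded = triples[idx]
    # Subproblem 1: assign `occluded` to `a`, which forces `occluder` to `a` as well.
    sub1 = {(t, ag) for (t, ag) in edges if t not in (occluder, occluded) or ag == a}
    # Subproblem 2: do not assign `occluded` to `a`.
    sub2 = edges - {(occluded, a)}
    parent_targets = {t for t, _ in edges}
    parent_agencies = {ag for _, ag in edges}
    results = []
    for sub in (sub1, sub2):
        # Prune branches where some target or agency has no admissible edge left
        # (the T_cc / A_cc counter check of Algorithm 2).
        if parent_targets <= {t for t, _ in sub} and parent_agencies <= {ag for _, ag in sub}:
            results.append(rec_mcf(sub, triples, idx + 1, solve_mcf))
    return min(results, key=lambda r: r[0], default=(float("inf"), None))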

5. Experiments and results

To evaluate the performance of our algorithm, we carried out a number of experiments with two challenging datasets that we make available with this publication, and with the standard PETS_2009_S2_L1 tracking benchmark using View_001 to View_004. We give a brief description of the implementation details of the overall tracking system and of the resources utilized to carry out real-time tracking experiments on these datasets. We evaluate the proposed methods from three aspects: algorithm complexity, effectiveness of the cost measure, and tracking performance and real-time behavior, with an experimental comparison to

a state-of-the-art camera set selection method for distributed tracking [13].

5.1. Datasets

With this paper, we make publicly available two high-quality multi-camera datasets, FBK LAB and FBK HALLWAY, for evaluating multiple people tracking in challenging multi-camera environments. Each dataset features 15 Hz frame-synchronized jpeg recordings from four cameras observing the scene from a distance with orthogonal viewing directions. Both datasets come with high-quality calibration data and manually annotated references (ground location of people in a common reference aligned with calibration) for evaluation.

FBK LAB sequence (9 K 4-view frames): acquired in a lab environment of about 5 × 6 m, at resolution 640 × 480. Throughout the sequence, people enter, walk around, sit down and exit the room randomly, causing frequent and persistent occlusions in all four views. The maximum number of people in the scene at the same time is 7. Annotations (one per second) are provided for the most challenging part of the sequence, of about 3.5 min.

FBK HALLWAY sequence (11 K 4-view frames): acquired in a large open space equipped with chairs, desks and poster walls, at resolution 1280 × 1024. The sequence shows people freely interacting in an exhibition and informal meeting space inside a large hallway, with people entering the area, moving, standing still, forming groups and interacting with each other, sitting on chairs and leaving the area. The cameras capturing the scene are about 12–19 m apart, and people's bodies may appear in some views at resolutions as low as 15 × 60 pixels. The maximum number of people present in the scene at the same time is 9. Annotations (one per second) are provided for the most challenging part of the sequence, of about 5 min.

Fig. 10 shows sample images from the LAB and HALLWAY recordings.

5.2. Experiments on complexity of the algorithms

To analyze the complexity of the algorithms in a realistic scenario, we measured the time used for task decomposition. The experiments were done with the ground truth annotations of the HALLWAY dataset using the parameters reported in Table 1. We computed the average time used for task

Fig. 10. Sample images from the FBK LAB and FBK HALLWAY datasets with tracking particles and MAP estimates. Not all cameras are always used to track all target groups (MIN_c = 3; see the second view in the second row). The datasets can be downloaded at http://tev.fbk.eu/datasets/multi-camera-people-tracking.


Table 1
Parameters for the tracking experiments: frame rate, image resolution used for the experiments (may differ from the dataset's original resolution, as for HALLWAY), size of the tracking area, maximum inter-camera distance d, number of particles N_0, bandwidth σ of the Gaussian kernel for ground occupancy map generation, minimum number of cameras per agency MIN_c, and radius R of the support circles for occlusion prediction in task decomposition. R = Δt · v_max + R0, with Δt = 333 ms (the task decomposition update frequency is 3 Hz), v_max = 1.5 m/s the maximum speed of people (the same value is used in the prediction step of the HJS PF), and R0 the variance of the HJS base trackers (15 cm for LAB and HALLWAY, 30 cm for PETS_2009_S2_L1). Smart camera PC is the computer used to simulate the Smart Camera Infrastructure in Fig. 2, while the Host computer is a 3.30 GHz Bi-QuadCore in all experiments.

Dataset            fps  Image       Ground (m²)    d (m)  N_0  σ (m)  MIN_c  R (m)  Smart camera PC
FBK LAB            15   640 × 480   4.75 × 6.12    7.22   150  0.20   2      0.65   3.50 GHz Bi-QuadCore
PETS_2009_S2_L1    7    768 × 576   12.0 × 13.0    128.   300  0.35   3      0.80   2.27 GHz 4-Core
FBK HALLWAY        15   800 × 600   14.5 × 15.0    18.7   150  0.20   3      0.65   3.50 GHz Bi-QuadCore
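As a quick consistency check of the R column, the values follow from the formula in the caption (a small sketch using the figures stated above):

```python
# R = dt * v_max + R0, with dt = 0.333 s and v_max = 1.5 m/s (see the caption).
dt, v_max = 0.333, 1.5
for dataset, r0 in [("FBK LAB", 0.15), ("FBK HALLWAY", 0.15), ("PETS_2009_S2_L1", 0.30)]:
    print(dataset, round(dt * v_max + r0, 2))   # -> 0.65, 0.65, 0.80 m
```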

Fig. 11. Comparison of time used for task decomposition with the exhaustive searching method and the minimum cost flow method. (Computed on ground truth data of HALLWAY dataset, i.e., for exact task decomposition on the sequence.)

We computed the average time used for task decomposition for different numbers of targets in the scene (the measured time covers all steps of the decomposition, including rendering and computation of the ground occupancy maps). Fig. 11 shows that for the numbers of targets encountered in the experiments (N_t ≤ 9), the decomposition time of our minimum cost flow method grows almost linearly with the number of targets, while for the exhaustive search method it grows exponentially. When the number of targets is smaller than 5, however, the exhaustive search method takes less time, because the number of target partitions is small and the minimum cost flow method spends time building the graphs. Since there is normally no need to decompose the task when only a few targets are present, in the following experiments we only used the minimum cost flow method.
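To convey why exhaustive enumeration degrades so quickly, the snippet below counts how many ways n targets alone can be partitioned into groups (the Bell numbers); the actual search space of the exhaustive method is larger still, since every group must also be paired with an admissible camera subset. This is only an illustration of the combinatorial growth, not the search procedure itself.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bell(n):
    """Number of ways to partition a set of n targets into non-empty groups,
    computed with the recurrence B(n) = sum_k C(n-1, k) * B(k)."""
    if n == 0:
        return 1
    return sum(comb(n - 1, k) * bell(k) for k in range(n))

for n in range(1, 10):
    print(n, bell(n))   # 1, 2, 5, 15, 52, 203, 877, 4140, 21147
```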

5.3. Experiments on effectiveness of the cost measure

To verify the effectiveness of the proposed cost function in Eq. (2), we performed an experiment with the ground truth data of the


HALLWAY sequence. In the experiment the following constraints are imposed: MAX_a = 3, MIN_c = 2, MAX_c = 4, M_i = 2 (i ∈ {1, 2, 3, 4}). We computed the best task decomposition for each ground truth entry, as shown in Fig. 12. At frame 2660 (Fig. 12(a)), targets 2, 3, 4, 5 and targets 6, 7, 8, 9 form two groups (g_1 and g_2, respectively), while target 1 stands apart from them. Initially the system allocates 3 agencies to track them separately. From frame 2660 on, target 1 walks towards group g_2, passing group g_1. During this period, it is first tracked together with g_1 (Fig. 12(c)) when they are close, and then together with g_2 when it joins g_2 at frame 2885 (Fig. 12(e)). After a while, it leaves g_2 and is tracked separately again (Fig. 12(f)). Given the imposed occlusion constraint, this experiment shows that the proposed cost measure generates reasonable decompositions of the task.

We should note that we do not group people for tracking merely according to their mutual proximity. The generation of agencies and groups of people, as well as their optimal association, is jointly determined by the cost function, the occlusion constraints and the resource constraints imposed (e.g., MAX_a). It can happen that two targets that are far apart are still tracked together under the imposed constraints. This does not degrade performance as long as, within the subtask, there are no occluders belonging to people tracked by other subtasks. Moreover, the radius R of the estimated supports of targets affects the potential occlusions among targets and the camera utility rate used in the cost function and, thus, the resulting optimal task decomposition. In this paper, we assume a constant speed for all targets within a small time interval (333 ms in all our experiments), so R is fixed (R = 65 cm for the HALLWAY dataset used in this experiment, see Section 5.4.1).

Fig. 13 shows two interesting cases that illustrate the benefit of deriving camera utility from the computational cost of adapting the number of particles to the uncertainty of the ground occupancy predictions. Considering only computational cost with fixed-size particle filtering, the cheapest solution satisfying the occlusion constraints for the targets in (a)–(c) is to divide the 5 targets into 3 groups, as in (b) and (c): complexity ∝ 3² + 1 + 1 = 11. With solution (c), however, there is large predicted uncertainty, making fixed-size particle tracking with these groupings unreliable, and a better solution with equal complexity is (b). However, our

Fig. 12. Reconfiguration of subtasks at 6 sample frames: (a) 2660, (b) 2690, (c) 2720, (d) 2750, (e) 2885, (f) 3125. C_1–C_4 are the four cameras. Each circle depicts the support of a target. The lines connecting targets and cameras indicate that the targets are tracked by the corresponding cameras. Different target colors indicate association to different subtasks.


Fig. 13. Ranking of decomposition hypotheses with the proposed cost measure goes beyond simply selecting the best cameras for tracking with the lowest computational cost: it considers predicted uncertainty (a)–(c) as well as occlusions (d)–(f) to self-regulate the trade-off between complexity and robustness. Panels: (a) best, (b) 2nd-best, (c) sub-optimal decomposition; (d) best, (e) best without occlusion reasoning, (f) 2nd-best without occlusion reasoning.

definition of the cost measure goes beyond simply selecting the best cameras for tracking with the lowest computational cost: the optimal solution is found to be a more expensive 2² + 3² = 13 decomposition (a), which is far more convenient in terms of tracking-failure risk. Our parameter-free cost measure self-regulates this trade-off between complexity and robustness. In (d)–(f) we observe a similar case: occlusion-aware generation of the ground occupancy maps ranks the solutions differently than when occlusions are ignored.
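As a minimal illustration of the quadratic complexity term used above, the snippet below scores two candidate groupings of the five targets of Fig. 13(a)–(c) by the sum of squared group sizes; the group memberships are invented for illustration, and the full cost of Eq. (2) additionally weighs each agency with the entropy based camera utility term, which is what makes the more expensive decomposition (a) optimal.

```python
def complexity(groups):
    # Quadratic cost of jointly tracking the targets within each group
    # (an HJS base tracker scales quadratically with group size),
    # summed over the groups of a candidate decomposition.
    return sum(len(g) ** 2 for g in groups)

# Hypothetical groupings of 5 targets (memberships chosen for illustration only).
grouping_a = [{1, 2}, {3, 4, 5}]       # 2^2 + 3^2       = 13
grouping_c = [{1, 2, 3}, {4}, {5}]     # 3^2 + 1^2 + 1^2 = 11

print(complexity(grouping_a), complexity(grouping_c))   # -> 13 11
```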

5.4. Tracking performance of the proposed method

We demonstrate the tracking performance of our approach by measuring the standard MOT (Multiple Object Tracking) metrics proposed in [31], which provide a systematic way to compare the performance of different trackers in terms of tracking accuracy and consistent labeling.

MOT Accuracy combines false alarms (false-positive rate in the result tables, in %), missed targets (miss rate, in %) and identity switches (mismatches, count); the higher the better. MOT Precision (in cm) measures the misalignment of the predicted track with respect to the ground truth trajectory; the lower the better. We used the original implementation of the MOT evaluation tool provided by the authors of [31]. We further evaluate real-time performance in terms of tracking rate, i.e., the number of particle filter iterations (or position updates) per second per target computed by the HJS PF process (tracking rate ≤ nominal frame rate of the dataset).

We compare our approach ('Proposed method' in the result tables) with the centralized HJS tracker in [22], where a single HJS base tracker is executed to track all targets using all cameras ('Centralized'); also in this case, all (four) HJS likelihood processes run in parallel. To verify how close we are to nominal performance ('Baseline'), we ran the centralized HJS tracker with N_0 = 500 in off-line mode, i.e., giving the tracker enough time and resources to fine-process every single frame of the sequence.
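As a rough illustration of how these scores are aggregated, the toy sketch below follows the CLEAR MOT definitions of [31]; it is not the evaluation tool used in the experiments, and the per-frame field names are our own.

```python
def clear_mot(frames):
    """Toy CLEAR MOT aggregation over per-frame matching results.

    Each frame dict is assumed to come from a matcher that pairs hypotheses
    and ground truth within a distance threshold (0.5 m standard, 0.8 m for
    PETS_2009_S2_L1 here) and reports:
      'gt'       -- number of ground-truth targets in the frame
      'miss'     -- unmatched ground-truth targets
      'fp'       -- unmatched hypotheses (false positives)
      'mismatch' -- identity switches in this frame
      'dists'    -- distances (m) of the matched pairs
    """
    gt = sum(f['gt'] for f in frames)
    miss = sum(f['miss'] for f in frames)
    fp = sum(f['fp'] for f in frames)
    mme = sum(f['mismatch'] for f in frames)
    dists = [d for f in frames for d in f['dists']]
    mota = 1.0 - (miss + fp + mme) / gt        # MOT Accuracy
    motp = sum(dists) / len(dists)             # MOT Precision (mean misalignment)
    return mota, motp
```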


Table 2
Real time tracking results on LAB dataset: centralized (first row) and proposed decentralized tracking (third row). For the comparison with the 2D-coverage based camera utility term in [48] (second row) the cost function used for decomposition is Σ_{g_j} Card(g_j)² + α Σ_{t_i} (1 − Cov(t_i|a_i)), where Cov(t_i|a_i) is the 2D-coverage measure in Fig. 15, with α = 5 found empirically to be a good value. Results are given also for real time experiments with the proposed method on 1.3 times faster host processors (simulated). Every experiment, except for Baseline (in italics), is executed in real time on the same computing infrastructure with equal parameter setting (Table 1), and repeated 5 times. See Fig. 14 for tracking rate.

                                      Precision (cm)  Miss (%)    False-positive (%)  Mismatches  MOT accuracy (%)
Centralized                           6.3 ± 0.5       20.1 ± 2.7  6.0 ± 4.4           7.8 ± 2.2   73.8 ± 6.5
Camera selection based on [48]        8.7 ± 0.4       9.9 ± 1.5   3.1 ± 0.6           6.4 ± 4.9   87.0 ± 1.5
Proposed method                       6.6 ± 0.1       6.9 ± 0.3   2.4 ± 0.3           4.2 ± 1.9   90.6 ± 0.5
Proposed method on 1.3× faster CPU    6.8 ± 0.1       5.1 ± 0.5   2.4 ± 0.2           5.2 ± 2.1   92.5 ± 0.6
Baseline                              6.5 ± 0.2       3.6 ± 0.4   2.6 ± 0.5           3.6 ± 1.4   93.7 ± 0.8

5.4.1. Implementation details

Parameters and configurations used for the experiments are listed in Table 1. We simulate the smart camera network plus host computer environment of Fig. 2 using two connected PCs as specified in the table. Additional parameters not listed in the table are: MAX_c = 4, MAX_a = 3, M_i = 3, R = 0.8 m, ρ = 400/m², decomposition update frequency 3 Hz, grid size for ground occupancy maps 0.10 m; predicted target outlines for ground occupancy generation are rendered at half the image resolution. Detection and track termination parameters are set empirically for each sequence. We pre-loaded the image sequences in memory (jpeg format) to eliminate accesses to physical storage that could impact real time behavior.

On the choice of the MAX_c, MIN_c, MAX_a parameters. MAX_c and MIN_c (maximum and minimum number of HJS likelihood processes per HJS PF process) are upper-bounded by the number N_c of cameras capturing the environment; all our experiments have N_c = 4 cameras. In theory, one can choose a higher value, but that would mean that the same camera is 'cloned' in the agency, i.e., the same image is used multiple times to compute likelihoods for the targets in a base tracker, resulting in useless additional computations (no further information gain). In our task decomposition algorithms we eliminate agencies with clones when we compute the camera partition set S_c (see Algorithm 1 and Section 4.4). MIN_c is the expected minimum number of views needed to obtain reliable estimates from the base trackers. It mostly depends on the acquisition setup (image resolution, distance and relative orientations with which the cameras observe the scene); in our cases 2 (for LAB, a small environment) and 3 (for HALLWAY and PETS_2009_S2_L1) are reasonable values. MAX_a, the maximum number of base trackers instantiated by the decomposition, is upper-bounded by MAX_a ≤ Σ_i M_i / MIN_c, where M_i is the number of HJS likelihood processes associated to camera i (e.g., with N_c = 4, M_i = 3 and MIN_c = 3 this bound is 12/3 = 4). MAX_a and M_i essentially depend on two factors: (i) the available processing resources to run the HJS PF and likelihood processes associated to the agencies, and (ii) the scenario, i.e., the maximum number of non-interacting groups of people one may expect in the monitored scene. In practice, one can set MAX_a to meet the scenario and dimension the computing infrastructure to execute the resulting MAX_a HJS PF processes and Σ_i M_i HJS likelihood processes.

5.4.2. Tracking performance on LAB sequence

Table 2 shows the evaluation results over 5 runs of real time tracking. As can be seen, the average MOT Accuracy score of the proposed approach shows a gain of more than 16 percentage points over the centralized approach, resulting from a significant reduction in miss and false-positive rate. This is due to the loss of frames with the centralized tracker: as the number of targets gets large, the centralized tracker is no longer able to handle the computational burden and has to drop frames, thus producing unreliable tracking results (the MOT Accuracy score is sometimes very low, as can be seen in the table). The proposed approach, in contrast, is able to decompose the task into a number of subtasks (at most N_a = 3) and map them onto different processors.

Fig. 14. Tracking rate on LAB dataset for the experiments in Table 2.

Fig. 15. For the experimental comparison with [48] in Table 2 (second row), we replaced the camera utility term in Eq. (2) with a measure of visual coverage of a target on the ground: Cov(A|C1, C2) = L_ACBD / C, where L_ACBD is the length of the arc ACBD and C is the circumference of the support outline.

Therefore, the evaluation scores are very stable. They are very close to baseline performance, with only about a 4% drop in MOT Accuracy, and this difference reduces to about 2% with 1.3× faster processors. Fig. 14 compares the tracking rate of the different experiments, confirming that with the decomposition close-to-nominal performance is reached.

While task decomposition with the entropy based camera utility measure results in no notable difference in MOT Precision (around 6.5 cm for all experiments in Table 2), if the simpler utility measure of Fig. 15 is used instead, the localization error increases by about 40% (though it remains below 10 cm). Also, MOT Accuracy decreases by more than 3 points, indicating the importance of designing an expressive utility measure that captures the key aspects of the imaging process impacting performance.
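To make the simpler comparison term concrete, the sketch below computes a 2D ground-coverage fraction of the kind depicted in Fig. 15; it reflects our reading of that measure (the covered arc is taken as the union of the arcs visible from the selected cameras, ignoring occlusion by other targets) and is not the exact implementation used in the experiments.

```python
import math

def coverage_2d(center, radius, cameras, n_samples=3600):
    """Fraction of a target's ground-support circle visible from a camera set."""
    cx, cy = center
    visible = 0
    for i in range(n_samples):
        phi = 2.0 * math.pi * i / n_samples        # sample point on the outline
        dx, dy = math.cos(phi), math.sin(phi)      # outward normal at the sample
        for qx, qy in cameras:
            vx, vy = qx - cx, qy - cy              # centre-to-camera direction
            d = math.hypot(vx, vy)
            if d <= radius:                        # camera inside the support
                visible += 1
                break
            # visible iff the angular offset from the camera direction is
            # within the half-angle arccos(radius / d) of the visible arc
            if (dx * vx + dy * vy) / d >= radius / d:
                visible += 1
                break
    return visible / n_samples

# e.g. two distant, roughly orthogonal cameras cover about 3/4 of the outline:
print(coverage_2d((0.0, 0.0), 0.3, [(10.0, 0.0), (0.0, 10.0)]))   # ~0.74
```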


Fig. 16. Sample of PETS_2009_S2_L1 View_001–View_004 sequence as used in the evaluation. Tracking area is shown in yellow, and we back-projected two people from View_001 to the other views using the 3D shape model to verify calibration quality. Further, color based tracking is challenged by the different color response of the cameras. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 3
Real time tracking results on PETS_2009_S2_L1 using View_001–View_004: centralized (first row) and proposed decentralized tracking (second row). To account for calibration and synchronization issues with the dataset we have changed the threshold used in the MOT evaluation tool to establish correspondence between tracking hypotheses and ground truth annotations from the standard value of 50 cm to 80 cm. Every experiment, except for Baseline (in italics), is executed in real time on the same computing infrastructure with equal parameter setting (Table 1), and repeated 10 times. See Fig. 17 for tracking rate.

                   Precision (cm)  Miss (%)    False-positive (%)  Mismatches  MOT accuracy (%)
Centralized        27.1 ± 1.1      32.1 ± 2.0  2.3 ± 0.6           16.8 ± 2.0  64.8 ± 2.1
Proposed method    26.5 ± 0.5      19.6 ± 1.3  3.5 ± 0.6           20.1 ± 1.8  76.2 ± 1.0
Baseline           66.8 ± 1.2      15.5 ± 0.6  3.0 ± 0.5           24.0 ± 3.0  80.7 ± 0.5

Fig. 17. Tracking rate on PETS_2009_S2_L1 for the experiments in Table 3.

5.4.3. Tracking performance on PETS_2009_S2_L1 dataset using View_001 to View_004

With the PETS_2009_S2_L1 multi-view dataset we observed two main issues which make the evaluation of multi-camera ground tracking approaches challenging. Firstly, the overall calibration quality appears just sufficient for View_001 to View_004 (back-projections of scene points from those views, especially near the image periphery, disagree by up to several tens of centimeters), while the calibration quality of View_005 to View_008 is too poor to be used with our approach to ground tracking; we therefore use the 12 × 13 m² central area around the crossing, with View_001 to View_004, as the tracking domain in our experiments, see Fig. 16. Secondly, synchronization issues exist across the views (this was noted also in [46]): many frame drops were observed in the different views which, combined with the low frame rate of the recordings (7 Hz), pose additional challenges to prediction based tracking approaches such as particle filtering. We manually synchronized View_002 to View_004 by replicating the previous image whenever a frame drop was observed relative to View_001.

To obtain the references for evaluation, we (i) back-projected to ground coordinates the mid-point of the base segment of the 2D bounding boxes provided with the dataset, and (ii) manually corrected the ground positions in the tracking area obtained this way by visual inspection in View_001 to View_004 using our annotation tool. We increased the kernel bandwidth in the ground occupancy map generation (see Table 1) to account for the calibration and synchronization issues discussed above. For the evaluation on this sequence, we changed the threshold used to establish correspondence between tracking hypotheses and ground truth annotations from 0.5 m (the standard value, used for the LAB and HALLWAY experiments) to 0.8 m in the MOT evaluation tool.
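As an aside on step (i) of the reference preparation above, the sketch below shows one standard way to map the mid-point of a bounding-box base segment to ground coordinates through the ground-plane (z = 0) homography; the helper names and the way calibration is turned into H are assumptions of this sketch, not code from the actual annotation pipeline.

```python
import numpy as np

def ground_plane_homography(K, R, t):
    """World ground plane (z = 0) to image homography, H = K [r1 r2 t]."""
    return K @ np.column_stack((R[:, 0], R[:, 1], t))

def image_to_ground(H, u, v):
    """Back-project an image point lying on the ground plane to world (x, y)."""
    p = np.linalg.solve(H, np.array([u, v, 1.0]))   # apply H^{-1}
    return p[:2] / p[2]

def bbox_base_midpoint(x, y, w, h):
    """Mid-point of the base segment of a 2D bounding box (x, y, w, h)."""
    return x + 0.5 * w, y + h

# usage, with K, R, t taken from a view's calibration:
# H = ground_plane_homography(K, R, t)
# gx, gy = image_to_ground(H, *bbox_base_midpoint(*bbox))
```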

Table 3 reports our real time tracking results. We again observe a significant increase, of more than 11 points, in MOT Accuracy with the decentralized approach. Overall, there is an imbalance between miss and false-positive rates, also in the baseline. This behavior can be attributed in part to calibration issues and, in particular, to synchronization: although frames have been temporally aligned by copying frames to fill drops, this surely results in sub-optimal detection performance (detection is motion based, and there is no motion on copied frames). Further, the focus of the paper is not on absolute tracking performance itself, but on the scalability issue of multi-camera tracking, leveraging real-time performance through a task decomposition that leads to a decentralized version of an existing tracking approach. We also believe that, from a practical perspective, in a realistic real time scenario calibration and synchronization issues should be eliminated at deployment time, ensuring appropriate data quality ahead of tracking system installation. We took care of these aspects in the acquisition of the FBK HALLWAY dataset, on which we provide a detailed evaluation next.

5.4.4. Tracking performance on HALLWAY sequence

This sequence is more challenging than LAB due to the much larger tracking area and inter-camera distance (Table 1), the higher number of people in the scene, and the unconstrained daylight illumination. Table 4 shows the experimental results of our proposed task decomposition method with a comparison to the camera selection method in [13] under real time tracking conditions. Our proposed method achieves high precision (below 10 cm with an inter-camera distance of up to about 19 m, see Table 1) with well balanced miss and false-positive rates and an average MOT Accuracy above 85%, compared to 77% for centralized tracking, indicating that the tracker configured this way operates close to nominal performance. The average number of mismatches (instantaneous identity switches) is very low (25) compared to the total number (2139) of ground truth annotations used in the evaluation.

Table 4
Real time tracking results on HALLWAY dataset: centralized tracking (first row), decentralized tracking with the camera utility term from [13] (second row), and our proposed decentralized tracker (third row). For the comparison with [13] the cost function used for decomposition is Σ_{g_j} Card(g_j)² + α Σ_{t_i} (1 − AU(t_i|a_i)/log₂ 5), where AU is the Aggregated Uncertainty from [13] with the same parameters used by the authors in their paper; we used α = 5, found empirically to be a good value. We also report results for the second-best and third-best solution found by the proposed decomposition method. Every experiment, except for Baseline (in italics), is executed in real time on the same computing infrastructure with equal parameter setting (Table 1), and repeated 10 times. See Fig. 18 for tracking rate.

                                    Precision (cm)  Miss (%)    False-positive (%)  Mismatches  MOT accuracy (%)
Centralized                         9.0 ± 0.2       18.0 ± 2.9  2.6 ± 1.6           26.8 ± 4.9  77.6 ± 1.8
Camera selection based on [13]      9.8 ± 0.4       18.2 ± 4.0  4.0 ± 1.0           30.2 ± 4.5  75.9 ± 4.7
Proposed method                     9.4 ± 0.2       6.5 ± 1.8   6.4 ± 1.6           25.8 ± 2.4  85.5 ± 2.0
Proposed method 2nd-best            9.5 ± 0.3       9.3 ± 2.3   6.4 ± 1.8           25.6 ± 4.2  82.7 ± 2.0
Proposed method 3rd-best            9.7 ± 0.3       10.1 ± 2.5  6.0 ± 2.0           29.5 ± 3.2  82.0 ± 1.3
Baseline                            8.6 ± 0.2       5.8 ± 3.1   5.6 ± 0.9           23.5 ± 2.7  88.2 ± 0.7

This can be attributed to the strengths of the color based HJS base tracker, which features identity-preserving long term tracking even when the appearance of targets is ambiguous and under occlusion, and this strength is preserved by the proposed decentralized solution.

For the comparative analysis in Table 4 we replaced the camera utility term in Eq. (2) with the Dempster–Shafer (DS) based Camera Suitability Value proposed in [13]. We observed a drop in performance, combined with a higher variance over the repeated experiments (10 independent runs). The method in [13] requires a coarse subdivision of the ground into five cells, resulting in a significant quantization of the information available for view selection; our method, instead, is able to predict camera utility at a much finer spatial resolution. Another notable difference is the imaging model utilized: as with HJS tracking [22], our model considers both the visible and the occluded regions of a hypothesis in the ground occupancy computation (see Section 4.2), while in [13] only the visible part of the multi-target hypothesis is measured for camera utility prediction. Furthermore, we provide tracking results for the second-best and third-best decomposition solution found by our method.


While MOT Precision remains about the same, both MOT Accuracy and the ratio between miss and false-positive rates improve progressively with the solution rank; 'Centralized' behaves in this sense like the worst solution, confirming the trend. It can be concluded that the entropy based camera utility term captures essential information from the view geometry of the multi-camera environment, available via calibration, and that the efficient graph-based minimum cost flow method built on top of it consistently ranks task decomposition hypotheses.

In terms of real time behavior, Fig. 18 compares the tracking rate histograms of the different experiments in Table 4. From the histograms we can see that the decentralized tracker performs real time tracking, i.e., tracking at the nominal frame rate, for more than 75% of the sequence, while the centralized tracker does so for only about 20% of the sequence. This can be explained by observing that for a large part of the sequence there are 9 people in the scene, and the complexity of the HJS base tracker grows quadratically with the number of targets, which, although efficient, prevents real time tracking of all targets with a single HJS filter. By decomposing the joint task into several independent subtasks satisfying the independence assumptions (Section 3.2) while guaranteeing proper occlusion handling, and executing them on different processors, tracking of different groups of people can be performed in parallel, and the loss of frames is minimized. From the histogram we can also see that the decentralized tracker does not always track in real time; this happens when people are close together and the overall task cannot be decomposed into subtasks. Overall, the decentralized solution proposed in this paper achieves practical precision and accuracy with close-to-full frame rate tracking on this challenging sequence.

Fig. 18. Tracking rate on HALLWAY dataset for the experiments in Table 4.

6. Conclusion

Visual tracking is challenging in large complex scenes where many people occlude each other. Major issues include how to maintain reliable real-time tracking and how to balance system computational load and data flow, and thus enhance system scalability. In this paper, we proposed a distributed multi-camera multi-object tracking system with a three-layer architecture. With the proposed cost measure, the approach is able to dynamically decompose the overall task into a number of nearly independent subtasks, each of which tracks a subset of targets with an agency. The association of agencies and groups of people is based on sensing geometry and the estimated target positions. Experimental results demonstrate that the method reduces task complexity and boosts parallelization while maintaining overall tracking accuracy comparable to that of a centralized implementation. Consequently, the proposed decentralized tracking framework leads to a scalable solution with superior real-time performance. The paper also sets the basis for further research on resource allocation in a holistic sense, through novel cost functions and associated subtask reconfiguration algorithms that are sensitive to asymmetric processing (dynamic load balancing for distributed processing) and communication infrastructures (dynamic routing for distributed image transfer).

References


[1] P. Perez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proc. European Conference on Computer Vision (ECCV), 2002. [2] K. Nummiaro, E. Koller-Meier, L.J. Van Gool, An adaptive color-based particle filter, Image Vision Comput. 21 (1) (2003) 99–110. [3] S. Arulampalam, S. Maskell, N.J. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking, IEEE Trans. Signal Process. 50 (2) (2002) 174–188.


[4] Z. Khan, T. Balch, F. Dellaert, An MCMC-based particle filter for tracking multiple interacting targets, in: European Conference on Computer Vision (ECCV), 2004. [5] F. Rezaei, B.H. Khalaj, Distributed human tracking in smart camera networks by adaptive particle filtering and data fusion, in: ACM/IEEE Intl. Conf. on Distributed Smart Cameras (ICDSC), 2012. [6] B. Dieber, L. Esterle, B. Rinner, Distributed resource-aware task assignment for complex monitoring scenarios in Visual Sensor Networks, in: ACM/IEEE Intl. Conf. on Distributed Smart Cameras (ICDSC), 2012. [7] L. Esterle, P. Lewis, M. Bogdanski, B. Rinner, X. Yao, A socio-economic approach to online vision graph generation and handover in distributed smart camera networks, in: ACM/IEEE Intl. Conf. on Distributed Smart Cameras (ICDSC), 2011. [8] M. Bramberger, B. Rinner, H. Schwabach, A method for dynamic allocation of tasks in clusters of embedded smart cameras, in: IEEE Intl. Conf. on Systems, Man and Cybernetics, 2005. [9] Y. Li, B. Bhanu, Task-oriented camera assignment in a video network, in: IEEE Intl. Conf. on Image Processing (ICIP), 2009. [10] Y. Li, B. Bhanu, Utility-based dynamic camera assignment and hand-off in a video network, in: ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), 2008. [11] F.Z. Qureshi, D. Terzopoulos, Multi-camera control through constraint satisfaction for persistent surveillance, in: IEEE Intl. Conf. on Advanced Video and Signal Based Surveillance (AVSS), 2008. [12] Y. Li, B. Bhanu, A comparison of techniques for camera selection and handoff in a video network, in: ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), 2009. [13] L. Tessens, M. Morbee, H. Aghajan, W. Philips, Camera selection for tracking in distributed smart camera networks, ACM Trans. Sensor Networks 10 (2) (2014). [14] A. Cenedese, F. Cerruti, M. Fabbro, C. Masiero, L. Schenato, Decentralized task assignment in camera networks, in: IEEE Conference on Decision and Control, 2010. [15] D.R. Karuppiah, R.A. Grupen, Z. Zhu, A.R. Hanson, Automatic resource allocation in a distributed camera network, Mach. Vis. Appl. 21 (2010) 517– 528. [16] M. Kushwaha, X.D. Koutsoukos, 3D target tracking in distributed smart camera networks with in-network aggregation, in: ACM/IEEE Intl. Conf. on Distributed Smart Cameras (ICDSC), 2010. [17] C. Soto, B. Song, A.K. Roy-Chowhury, Distributed multi-target tracking in a selfconfiguring camera network, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009. [18] A. Kembhavi, W. Schwartz, L. Davis, Resource allocation for tracking multiple targets using particle filters, in: Intl. Workshop on Visual Surveillance (VS2008), 2008. [19] S. Spurlock, R. Souvenir, Dynamic subset selection for multi-camera tracking, in: ACM-SE Proceedings of the 50th Annual Southeast Regional Conference, 2012. [20] O. Lanz, R. Manduchi, Hybrid joint-separable multibody tracking, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005. [21] T. Hu, S. Mutlu, O. Lanz, Multicamera people tracking using a locus-based probabilistic occupancy map, in: Intl. Conf. Image Analysis and Processing (ICIAP), 2013. [22] O. Lanz, Approximate Bayesian multi-body tracking, IEEE Trans. Pattern Anal. Mach. Intell. 28 (9) (2006) 1436–1439. [23] H. Medeiros, J. Park, A.C. Kak, Distributed object tracking using a cluster-based kalman filter in wireless camera networks, IEEE J. Sel. Top. Signal Process. 2 (4) (2008) 448–463. [24] C. Song, J. Son, S. Kwak, B. 
Han, Dynamic resource allocation by ranking SVM for particle filter tracking, in: British Machine Vision Conference (BMVC), 2011. [25] L. Bazzani, M. Cristani, V. Murino, Decentralized particle filter for joint individual-group tracking, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.

[26] C.H. Chen, Y. Yao, D. Page, B. Abidi, A. Koschan, M. Abidi, Camera handoff with adaptive resource management for multi-camera multi-object tracking, Image Vis. Comput. 28 (6) (2010) 851–864. [27] T. Chen, T.B. Schon, H. Ohlsson, L. Ljung, Decentralized particle filter with arbitrary state decomposition, IEEE Trans. Signal Process. 59 (2) (2011) 465– 478. [28] O. Lanz, T. Hu, Dynamic resource allocation for probabilistic tracking via attentive sensing and sampling, in: IEEE Intl. Conf. on Advanced Video and Signal-Based Surveillance (AVSS), 2011. [29] M. Taj, A. Cavallaro, Distributed and decentralized multi-camera tracking, IEEE Signal Process. Mag. 28 (3) (2011) 46–58. [30] R.K. Ahuja, T.L. Magnanti, J.B. Orlin, Network Flows: Theory, Algorithms, and Applications, Prentice-Hall Inc., 1993. ISBN 0-13-617549-X. [31] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, J. Image Video Process. 3 (2008). [32] A. Mittal, L.S. Davis, M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene, Int. J. Comput. Vis. 51 (3) (2003). [33] D. Arsic, E. Hristov, N. Lehment, B. Hornler, B. Schuller, G. Rigoll, Applying multi layer homography for multi camera person tracking, in: ACM/IEEE Intl. Conf. on Distributed Smart Cameras (ICDSC), 2008. [34] Y. Zhou, H. Nicolas, J. Benois-Pineau, A multi-resolution particle filter tracking in a multicamera environment, in: IEEE Intl. Conf. on Image Processing (ICIP), 2009 Conference Publications, Cairo, Egypt, pp. 4065–4068. [35] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multi-camera people tracking with a probabilistic OccupancyMap, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2) (2008). [36] F. Cupillard, F. Bremond, M. Thonnat, Tracking groups of people for video surveillance, in: European Workshop on Advanced Video-based Surveillance Systems, 2001. [37] S. Zaidenberg, B. Boulay, C. Garate, D.P. Chau, E. Corvee, F. Bremond, Group interaction and group tracking for video-surveillance in underground railway stations, in: Intl. Workshop on Behaviour Analysis and Video Understanding, 2011. [38] R. Mazzon, F. Poiesi, A. Cavallaro, Detection and tracking of groups in crowd, in: IEEE Intl. Conf. on Advanced Video and Signal based Surveillance (AVSS), 2013. [39] B. Han, S.W. Joo, L.S. Davis, Multi-camera tracking with adaptive resource allocation, Int. J. Comput. Vis. 91 (1) (2011) 45–58. [40] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, NY, 1991. ISBN 0-471-06259-6. [41] O. Lanz, An information theoretic rule for sample size adaptation in particle filtering, in: Intl. Conf. Image Analysis and Processing (ICIAP), 2007. [42] R. Burkard, M. Dell’Amico, S. Martello, Assignment problems, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 2009, Revised reprint, ISBN: 978-1-61197-222-1. [43] O. Lanz, S. Messelodi, A sampling algorithm for occlusion robust multi target detection, in: IEEE Intl. Conf. on Advanced Video and Signal based Surveillance (AVSS), 2009. [44] A. Mavrinac, X. Chen, Optimizing load distribution in camera networks with a hypergraph model of coverage topology, in: ACM/IEEE Intl. Conf. on Distributed Smart Cameras (ICDSC), 2011. [45] A. Mavrinac, X. Chen, Y. Tan, Coverage quality and smoothness criteria for online view selection in a multi-camera network, ACM Trans. Sensor Networks 10 (2) (2014). [46] Á. Utasi, C. Benedek, A multi-view annotation tool for people detection evaluation, in: Intl. 
Workshop on Visual Interfaces for Ground Truth Collection in Computer Vision Applications, 2012. [47] T. Hu, S. Messelodi, O. Lanz, Dynamic task decomposition for probabilistic tracking in complex scenes, in: Intl. Conf. Pattern Recognition (ICPR), 2014. [48] T. Hu, S. Messelodi, O. Lanz, Wide-area multi-camera multi-object tracking with dynamic task decomposition, in: ACM/IEEE Intl. Conf. Distributed and Smart Cameras (ICDSC), 2014.