Visual tracking in video sequences based on biologically inspired mechanisms


Computer Vision and Image Understanding
DOI: https://doi.org/10.1016/j.cviu.2018.10.002
Received 22 February 2018; Revised 5 October 2018; Accepted 14 October 2018

Highlights

• Inspiring the function of biological vision to achieve a real-time tracking algorithm
• Coping with challenges such as background clutter, scale variation, and occlusion
• Employing a short-term-memory-like structure for handling complete occlusion
• Achieving high precision in target tracking while acting in a real-time manner


Visual Tracking in Video Sequences Based on Biologically Inspired Mechanisms

Alireza Sokhandan, Amirhassan Monadjemi∗∗

Department of Artificial Intelligence, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran

∗∗ Corresponding author. Tel.: +98-31-3793-4035; e-mail: [email protected] (Alireza Sokhandan), [email protected] (Amirhassan Monadjemi)

ABSTRACT

Visual tracking is the process of locating one or more objects based on their appearance. The high variation in the conditions and states of a moving object and the presence of challenges such as background clutter, illumination variation, and occlusion make this problem extremely complex and a robust algorithm hard to achieve. In biological vision, however, unlike machine vision, the task of visual tracking is performed almost ideally even under the worst conditions. Consequently, taking into account the superior performance of biological vision in visual tracking, this paper introduces a biologically inspired visual tracking algorithm. The proposed algorithm is inspired by the task-driven recognition procedure of the early layers of the ventral pathway and by visual cortex mechanisms including spatio-temporal processing, motion perception, attention, and saliency to track a single object in a video sequence. For this purpose, a set of low-level features including oriented edges, color, and motion information (inspired by layer V1) is extracted from the target area, and, based on the discrimination rate that each feature creates with respect to the background (inspired by the saliency mechanism), a subset of these features is employed to generate the appearance model and identify the target location. Moreover, by memorizing the shape and motion information (inspired by short-term memory), scale variation and occlusion are handled. The experimental results show that the proposed algorithm handles most visual tracking challenges well, achieves high precision in target localization, and runs in real time.

1. Introduction

The use of visual features to estimate the location of one or more objects in a video sequence is called visual tracking, and it is one of the fundamental problems of machine vision. Visual tracking is employed in a wide range of industries such as the movie industry (Moeslund et al., 2006; Barber et al., 2016), game development (Li et al., 2010; Huang et al., 2017), sports (Lun, 2016), medicine (Koorehdavoudi et al., 2017; Jackson et al., 2017), surveillance (Chen et al., 2013; Nazare et al., 2014), and autonomous vehicles (Fu et al., 2014; Yin et al., 2016). Although visual tracking has long attracted machine vision researchers and many studies have been conducted in this area, it is still recognized as an open problem because of the many challenges it faces, such as background clutter, target deformation, pose changes, illumination variation, and occlusion (Maggio and Cavallaro, 2011). These studies have attempted to overcome the existing challenges by improving the performance of the different parts


of the algorithms and increasing their precision and robustness. In this regard, some researchers have focused on the visual features extracted from the target region, aiming to improve recognition precision by identifying strong features. The use of dense appearance features such as multi-feature histograms (Adam et al., 2006) or responses of spatio-temporal filters at different scales and bands (Chao He et al., 2002), local features such as HOG (Ruan and Wei, 2016), transform-invariant features such as SIFT or SURF (Sakai et al., 2015), regional features such as MSER (Tran and Davis, 2007), and dense descriptors of super-pixels extracted from the target region (Wang et al., 2011) are among those attempts. Dense features are computationally efficient, which makes them suitable for fast tracking, but they do not contain enough spatial information, which makes them sensitive to appearance changes caused by situations such as occlusion and illumination variation. Local features, on the other hand, are less sensitive to appearance changes such as rotation, scale variation, and even occlusion; but since these methods rely on extracting feature points and high-energy areas, they are sensitive to noise and background clutter (Li et al., 2013b). Other studies concentrated on target shape representation to prevent background information from interfering with the target appearance

model and to overcome the drift problem through more precise recognition of the target region; examples include fluid (Kanhere and Birchfield, 2008), silhouette (Plaenkers and Fua, 2002), contour (Nguyen et al., 2002), point-distribution (Xiao et al., 2004), and articulated models (Sundaresan and Chellappa, 2009). Treating the target as a non-rigid object is the most important property of these representation methods; in addition to preventing background information from contaminating the target appearance model, it helps in coping with challenges such as target deformation or out-of-plane rotation. Furthermore, in articulated models, since each part has its own appearance model, if an occlusion is not recognized, the appearance information of the occluding agent only affects the appearance model of the occluded part and does not spread to the others. Another group of researchers focused on target recognition methods and the use of stochastic models, which resulted in generative, discriminative, and mixture recognition methods. In generative methods, the main goal is to precisely map the given information onto the target class, which is generated only from the appearance information of the target itself, ignoring the background. Mixture models such as GMM (Han et al., 2008) and WSLMM (Jepson et al., 2003), kernel models (Leichter et al., 2010, 2009), and subspace learning based on techniques such as PCA (Wen et al., 2012), LLE (Gao and Bi, 2010), KPCA (Liu et al., 2015), and sparse coding (Mei and Ling, 2011) are different types of generative models employed in visual tracking. Discriminative methods, in contrast, interpret the recognition problem as a binary classification with the aim of achieving maximum discrimination between the object and the background. Self-learning and co-learning boosting (Avidan, 2007; Liu et al., 2009; Zeisl et al., 2010), SVM (Bai and Tang, 2012; Tang et al., 2007), randomized learning (Jiang et al., 2012), and discriminant analysis such as LDA (Xu et al., 2008) and FDA (Zha et al., 2010) are the techniques categorized in this group. Finally, mixture methods combine generative and discriminative models in the form of a single-layered model (Yang et al., 2009), combining the results of both to increase the final accuracy, or a multi-layered model (Shen et al., 2010; Yang et al., 2009), using the output of one model as the input of the other, using one model to validate the other, or using the results of one model in the learning phase of the second, in order to obtain the benefits of both. Generative models have a higher processing speed due to the use of only target-area information; on the other hand, there is no criterion for assessing the accuracy of the trained model, and ignoring the background information increases the possibility of wrong detections in the face of background clutter, especially when the target has a large displacement between two frames because of sudden camera movement or high target speed. Discriminative models, because they utilize background information, are more resistant to the background clutter challenge, but their biggest limitation is a high dependence on the precision of the samples used in the training phase; in case of rapid change of the background or the object, they may lose precision in the detection phase. Finally, mixture models, while

trying to take advantage of both approaches, are less flexible than the other two due to the use of more parameters and assumptions (Li et al., 2013b). In all of these recognition methods, updating the target appearance model is an important strategy for dealing with target appearance changes, which occur for various reasons ranging from deformation to illumination variation. Basic approaches such as updating via a learning coefficient in generative models (Ross et al., 2008; Porikli et al., 2006) and self-learning and co-learning in discriminative models (Liu et al., 2009; Zeisl et al., 2010) are examples of such update strategies. The motion model is another part of tracking algorithms that has received attention. The motion model is responsible for modeling the target motion pattern to better estimate its probable location in future frames, and it also plays an important role in handling occlusion and keeping the tracking trajectory. The different approaches introduced for motion modeling can be categorized into three groups. Single-hypothesis models algebraically estimate the target location based on the relationship between the feature space and the state space; gradient-based methods like KLT (Chen et al., 2010) and mean-shift (Comaniciu et al., 2003), and Markov-based methods like the Kalman filter (Stefanov et al., 2007; Ndiour and Vela, 2010), belong to this group. Multi-hypothesis models estimate the target location based on several hypotheses, where in each frame observation, validation, and transmission of the stronger hypotheses to the next frame are performed; the particle filter is the most famous multi-hypothesis model (Del Bimbo and Dini, 2011). Hybrid models combine these two approaches, estimating the target location from the hypotheses of multi-hypothesis methods that have been improved by the strategies of single-hypothesis ones (Khan et al., 2005; Maggio and Cavallaro, 2005). Gradient methods are fast and computationally efficient, although their precision is very low under occlusion, and if the algorithm converges to a wrong location it is nearly impossible to recover and continue tracking. The Kalman filter assumes the linearity of the dynamic and observation functions and a Gaussian distribution of the system noise; if these assumptions are not satisfied in the environment at hand, the estimated location is likely to be wrong. Multi-hypothesis methods make the algorithm better suited to situations such as occlusion and background clutter; however, they have a higher computational complexity than single-hypothesis methods, and this complexity grows exponentially with the state-space dimension. In addition to all the mentioned efforts, researchers have also employed contextual information to facilitate the various stages of the tracking algorithm via basic information and hypotheses about the environment, its structure, and the existing objects. This information is defined as a set of constraints used in the object recognition phase to increase tracking precision (Stalder et al., 2010; Grabner et al., 2010). In the recent decade, with the advances of research in neuroscience and cognitive science and the discovery of new secrets of brain function, a new window has been opened towards artificial intelligence and especially

machine vision. Researchers, inspired by brain functions in different tasks, have provided new solutions to machine vision problems, which are often referred to as "biologically inspired algorithms." In this paper, inspired by the biological vision system and how it functions, a visual tracking algorithm is presented. The proposed algorithm uses the 4-layer BIT (Biologically Inspired Tracker) algorithm (Cai et al., 2016), which is implemented based on the HMAX recognition model (Serre et al., 2005), to identify the target location. In the proposed algorithm, in addition to this recognition core, several mechanisms inspired by the visual cortex are introduced to address the challenges of visual tracking. Inspired by the function of the retina and using the Michaelis-Menten model (Beaudot, 1996), a mechanism is introduced to cope with local and global illumination variation; inspired by the top-down saliency mechanism and the center-surround operator, the discriminative saliency model (Gao and Vasconcelos, 2009) is employed to tackle the background clutter problem and the similarity of target features to the surroundings. To cope with the occlusion challenge, a short-term-memory-like structure is used to memorize the general appearance and motion characteristics of the target, so that occluded areas can be identified and their visual information excluded from the model-updating process. In addition, a biologically inspired motion model is employed to handle complete occlusion by predicting the probable target location in future frames, making it possible to continue tracking after the occlusion. The proposed algorithm also pays attention to the scale variation challenge: any possible change in the target size is recognized by evaluating the target shape at different scales in each frame. The use of the HMAX model and the aforementioned mechanisms, all inspired by the visual cortex, yields a robust tracking algorithm. To evaluate the performance of the proposed algorithm in terms of precision, robustness, and speed, it was tested on the "visual tracking benchmark" (Wu et al., 2015) experimental set. The experimental results indicate that the proposed algorithm achieves high precision while keeping real-time execution, by adequately managing most of the visual tracking challenges using methods and mechanisms inspired by the biological vision system. Moreover, the proposed algorithm was compared with different visual tracking algorithms, and the results of this comparison indicate that it achieves better results than those algorithms. In a nutshell, in this paper the procedure of the visual tracking task in the biological visual system is studied from a modular point of view and divided into several executive blocks. By employing computational models inspired by the biological mechanisms that can take the place of these blocks, and by establishing the information flow between them, a biologically inspired visual tracking algorithm is introduced. Overcoming the background clutter, illumination and scale variation, and occlusion challenges is the main focus of the proposed algorithm, achieved by employing the biologically inspired mechanisms of visual search, top-down saliency, the retina adaptation system, and short-term memory. The rest of this paper is organized as follows: in the second

section, previous biologically inspired machine vision algorithms, especially visual tracking ones, are reviewed. In Section 3, the biological background of the proposed method and the way the visual tracking task is performed in the biological visual system are described. In Section 4, the procedure of the proposed algorithm, its various parts, and how it handles the various challenges are described in detail. The experiments, their results, and comparisons with other visual tracking algorithms are discussed in Section 5; finally, the conclusion is provided in the last section.

2. Literature Review

As mentioned, recent studies in the fields of neuroscience and cognitive science have led to a greater understanding of how the brain functions in different tasks. These new achievements have prompted researchers in artificial intelligence and machine vision to come up with new solutions to existing problems. In machine vision, biologically inspired algorithms can generally be classified into two groups. The first group covers the algorithms inspired by the processing of optical signals in the visual cortex pathways and by how features are extracted and combined. They are employed in problems such as object recognition (Kheradpisheh et al., 2016a; Zhang et al., 2016a), face recognition (Li et al., 2013a; Wang et al., 2013), scene classification (Han and Liu, 2013; Song and Tao, 2010), and image enhancement (Bai et al., 2016; Wang et al., 2016). Those algorithms are often inspired by the processes of the ventral pathway of the visual cortex, which is responsible for shape processing and recognition (Kruger et al., 2013). The second group consists of the algorithms inspired by the structure of the neural system, which are often referred to as deep neural networks (DNN): networks with high learning ability that imitate different functions of the visual cortex by being trained for a specific task (Kheradpisheh et al., 2016b). Those networks are mostly used in recognition applications, such as categorization (Rawat and Wang, 2017), recognition (Erhan et al., 2013), and segmentation (Shelhamer et al., 2017). Deep neural networks are also employed in applications such as coloring grayscale images (Varga and Szirnyi, 2016) or even automatic object or face image generation (Karras et al., 2017), which in some way can be considered as methods inspired by brain imagination. Most of the mentioned algorithms are presented for tasks defined in the spatial domain and applied to single images. However, with new research on temporal processing in the visual cortex, there is a better understanding of temporal processing and motion perception, which most likely occur in the dorsal pathway. These achievements made it possible for researchers to draw on brain function for problems in which temporal processing plays an essential role; one of the most important of those problems is visual tracking. Ellis and Ferryman (Ellis and Ferryman, 2014), inspired by the extraction of basic shape and motion information in the early stages of the biological visual system (retina and LGN), presented a segmentation algorithm for extracting moving

objects. Their algorithm executes a background subtraction operation and generates a binary mask of the moving objects in each frame. The high execution speed, due to the use of basic features, and the absence of parameters to set are the positive features of this algorithm. However, because the algorithm only performs background subtraction, no appearance model or visual descriptor is generated for the moving objects, so their identities cannot be detected and separated from each other. Mahadevan and Vasconcelos (Mahadevan and Vasconcelos, 2013) provided a framework for visual tracking based on a guided search model (Wolfe, 2007). Inspired by the way the tracking operation is executed in the visual cortex, which is based on the attention mechanism, the appearance features of the object, and the distinction between the spatio-temporal features of the object and the background, a two-step visual tracking algorithm is presented in which the algorithm tracks the target by applying bottom-up and top-down saliency respectively. Employing only the information of the last frame to train the statistical model, which reduces the generality of the algorithm, considering a single-point shape representation, ignoring the motion information of the object, and neglecting the problem of occlusion are the main disadvantages of Mahadevan's tracking algorithm. Zhang et al. (Zhang et al., 2017) used a 5-layer model to produce a set of features inspired by the early layers of the visual cortex to describe the target and track it using a particle filter. Zhang pointed out that the neurons in the visual cortex adapt themselves to the statistical characteristics of the visual stimulus to extract robust features, an adaptation which cannot be achieved by simply applying a Gabor filter bank to the image. By combining oriented-edge features with sparse coding, Zhang showed that it was possible to add this adaptation to the appearance model. Zhang also indicated that the introduced appearance model can properly model the target and, owing to mechanisms such as the removal of intensity dependency, local coding, and discriminant features, the introduced tracking algorithm can cope with various challenges such as illumination variation, background clutter, and partial occlusion. But since, in the generation of the appearance model, the located target area in each frame is considered as the positive sample and coding is applied accordingly, if for any reason the localization precision is reduced, the algorithm faces the drift problem. Cai et al. (Cai et al., 2016) introduced a multi-layered tracking algorithm inspired by the early layers of the visual cortex as well as short-term memory. In the early layers of their multi-layer model (S1 and C1), they extract the low-level features of the target area, and in the upper layers (S2 and C2), they simulate a high-level learning system to track these features. They also used the fast Fourier transform and a fast Gabor approximation to achieve high processing speed. The experiments indicated that this four-layer model has a high degree of precision in modeling the appearance features of the target, but the algorithm does not employ any mechanism for facing different challenges and relies only on the power of the extracted features and the detection process to cope with them. If there are several different challenges in the experimental

environment, the Cai algorithm is not able to handle them adequately. In addition to strategies inspired by the visual cortex and the dorsal pathway, tracking algorithms have also been proposed that take advantage of the power of deep neural networks (Fan et al., 2010; Wang and Yeung, 2013). In those algorithms, using an extensive training set, the network is trained to extract efficient features, which are then used to track the target. Insufficient attention to the appearance and geometric dependency of the information in successive frames, and the use of a massive network that reduces processing speed, are major problems of those algorithms. Another group of algorithms uses a set of patches extracted from the target area in the first frame to train the deep network instead of using an external dataset; but because raw intensity information of a small area extracted only from the first frame is used, these features are not robust against the appearance change challenge (Zhang et al., 2016b).

3. Visual Tracking, from Biological to Machine Vision

In this section, the biological background of the proposed tracking algorithm is discussed: first, the process of performing the visual tracking task in the visual cortex is examined, and then, from the modular point of view, the inspiration behind the proposed method is explained.

3.1. Visual Tracking in Biological Vision

In biological vision, the goal of visual tracking is to keep the object of interest in the fovea, which requires estimating the object's motion pattern (Kowler, 2011). This ideal tracker, which performs accurately even under the most challenging conditions, works based on a complicated motion perception system that uses not only the visual information received through the eyes but also the visual features and motion patterns of the object that are learned and stored in short-term memory (Masson and Perrinet, 2012). Understanding the pathway along which the optical signals received through the eye are converted into the motion signals transmitted to the eye muscles is the key to drawing inspiration from this biological procedure. Although plenty of research has been conducted in neuroscience and cognitive science in the last decade, how this dense neural network functions in tracking a moving object and handles the existing challenges is still unclear and has not been completely identified (Cox, 2014). Thus, some level of simplification is inevitable to obtain a computational view of this visual system. Carandini suggests that the solution for achieving this computational view is understanding the computations carried out at the level of individual neurons or neural populations, which helps divide a processing flow into computational blocks such as linear filtering, normalization, recurrent amplification, and pooling, and identifying the meaning of the signals and data passed between them (Carandini, 2012). Moreover, in contrast to the classic view of processing in biological vision, which was considered to be feedforward and organized into two separate ventral and dorsal pathways, in every task both the ventral and dorsal pathways are involved. Also, there is always top-down communication between different sections of a pathway, as well as parallel communication between the ventral and dorsal pathways (Medathati et al., 2016).


With this view, and according to studies on different parts of the biological vision system and the way they interact (Medathati et al., 2016; Kruger et al., 2013; Serre, 2014; Montagnini et al., 2015; Silvanto, 2015), one can model the visual tracking task in biological vision as shown in Fig. 1.

Fig. 1. The procedure of the tracking task performed in the visual cortex and the involved sections

Visual processing in biological vision starts with the retina receiving light signals and converting them into electrical signals, which are transmitted to the primary V1 layer of the visual cortex via the optic nerves through the LGN. However, the retina and the LGN themselves also play important roles in processing and preparing the visual signals and are not just simple transmitters of electrical signals (Gollisch and Meister, 2010). The mammalian retina includes five layers (photoreceptors, horizontal cells, bipolar cells, amacrine cells, and ganglion cells) and consists of a complicated and recurrent nervous system performing local and global processing on the optical signals in order to transmit the most useful information to higher levels (Marc et al., 2013). One of the remarkable capabilities of the retina is its adaptation to changes in intensity: it can adapt to the different illumination levels present in the scene and produce an output with high quality and excellent contrast (Shapley and Enroth-Cugell, 1984). The signals emitted from the retina are transmitted through the ganglion cells to the LGN. The LGN combines the two retinal signals and generates three types of signals, containing intensity and color information at different resolutions, which are transmitted to the V1 layer through the Magnocellular, Parvocellular, and Koniocellular pathways (Dacey, 2000). As mentioned, visual tracking is defined as fixing the object of interest in the fovea by synchronizing the speed of

eye movements with that of the object of interest, an operation performed by the motion perception units located in the dorsal pathway. Smooth pursuit and saccadic eye movements are the mechanisms employed by the visual system to achieve this synchronization (Rashbass, 1961). Smooth pursuit synchronizes the speed of eye movements with that of the moving object using the object's current movement information and its motion pattern learned and stored in short-term memory. The motion estimation occurs in the dorsal pathway, particularly in the V1 and MT layers. In the V1 layer, using two sets of simple and complex cells, the direction of movement is obtained, and in the MT layer, the speed or magnitude of movement is obtained, both by calculating the movement of edges through spatio-temporal filters applied to the intensity information provided by the Magnocellular pathway (Wallace, 2004). This motion information is transmitted to the MST, where the receptive fields cover a much larger portion of the visual field and encode basic optic-flow patterns such as rotation, translation, or expansion. In addition, in the MST and STS layers, more complex motion patterns such as biological movement are decoded by analyzing the local motion vectors of different regions of the object, employing shape information provided by the ventral pathway (Medathati et al., 2016; Kruger et al., 2013; Giese and Poggio, 2003). The saccadic eye movement corrects the estimation error using the visual search mechanism to achieve the exact motion vector of the object of interest. Visual search is the mechanism, performed in the ventral pathway, responsible for localizing the object or event of interest among the other objects existing in the scene (called distractors). This task is conducted based on saliency, the center-surround operator, and a mechanism called feature search, which identifies and utilizes the features (such as color, intensity, oriented edges, etc.) that distinguish the target from the distractors. There can be only one such feature or a collection of several features, and the tracking task can be continued as long as the object of interest has at least one spatio-temporal feature separable from its surroundings (Pylyshyn and Storm, 1988; Sekuler and Sekuler, 1999). In this visual search mechanism, saliency is calculated by combining the bottom-up saliency map with top-down information, including the appearance features of the object of interest. In other words, the bottom-up saliency model, generated in the early layers of the ventral pathway, is guided by top-down information in order to identify and distinguish the object of interest from the distractors. Applying this top-down information is carried out in the higher layers of the ventral pathway (AIT and PIT), based on appearance features of the object of interest that were generated and memorized in an earlier time interval (Sekuler and Sekuler, 1999; Treisman and Gelade, 1980; Wolfe, 2007). Finally, after localizing the stimulus and estimating its movement, the obtained information is sent to the Frontal Eye Fields (FEF) and Supplementary Eye Fields (SEF), which are responsible for controlling eye movements (Krauzlis, 2003). Occlusion, where part or all of the stimulus is covered by another object, is a situation that can disrupt this whole procedure by interrupting the reception of visual information from the stimulus. In the visual cortex, occlusion is detected based on the shape and motion information stored in short-term memory.


An occlusion is signaled when an inconsistency appears between the visual information received from the stimulus and the information stored in short-term memory. In addition to the visual and motion information of the stimulus, the visual system uses the visual and motion information of its surroundings and of the occluding objects to detect this contradiction. In this case, biological vision estimates the location of the stimulus by performing an extrapolation; however, it is not yet clear how this extrapolation is done efficiently, nor exactly which part of the biological visual system and which mechanism are responsible for it. In the case of total occlusion, after its detection (due to the loss of all visual information), the visual system uses a scheduling mechanism to estimate the object's exit time from the occlusion and continues the smooth pursuit eye movements at the same velocity during the initial period of occlusion, so that tracking can continue when the object comes out of the occlusion. This timing is calculated based on various factors such as the speed and motion pattern of the target object and of the occluding agent (Khoei et al., 2013).

3.2. Biological Background of the Proposed Algorithm

The proposed algorithm undergoes a processing flow similar to that of the biological visual system to detect and track the target. In an overview, the procedure of the proposed algorithm starts by improving the intensity information of the input image (retina self-adaptation function). Then, according to the motion history of the target and the local motion information received from the input signals, an initial estimation of the target area is made (smooth pursuit mechanism). Next, shape, color, and motion information are extracted from the predicted area (low-level feature extraction in the V1 layer), and the visual search mechanism is employed so that the region with the highest degree of similarity can be chosen as the candidate target location, using these features and the appearance model produced from the target (top-down saliency mechanism). Different parts of the identified region are then evaluated in order to detect occlusions based on their degree of similarity to the target appearance model (inconsistency between the received visual information and that stored in short-term memory). In this stage, complete or partial occlusion is also detected based on the size of the occluded area. Finally, the identified region is considered as the target location, and the target appearance and motion models are updated (memorizing the general characteristics of the object). Based on the described process, from the modular point of view, the proposed approach has the schema shown in Fig. 2. In this schema, in addition to the operational units, the computational algorithms used in each operational unit, as well as the biologically inspired mechanism behind it, are specified.

Fig. 2. The proposed visual tracking framework in the modular view. The rectangles, ellipses, and clouds respectively represent the processing units, the employed computational models, and the stored information.

4. Proposed Algorithm

The proposed tracking algorithm operates based on the BIT model, which is inspired by the biological visual system. BIT is an algorithm that performs the tracking operation only by relying on the similarity of the dense appearance model of the target area; it has no mechanism for managing challenges such as background clutter, illumination variation, changes in target size, and occlusion. In the proposed method, the BIT algorithm is modified to deal with these challenges and to increase its precision and robustness by employing several biologically inspired mechanisms. In the following sections, the BIT algorithm is briefly introduced, and then the employed mechanisms, the way they handle the challenges, and how they are integrated with the BIT algorithm are described.

4.1. BIT Model

The BIT tracking algorithm (Cai et al., 2016) operates based on the HMAX recognition model (Serre et al., 2005). This algorithm has four processing layers: layers S1 and C1 create the target appearance model, and layers S2 and C2 estimate the target location. In the S1 and C1 layers, a dense appearance model of the target is generated using oriented-edge and color features. In the S1 layer, features are extracted from the target region. To extract the oriented-edge features, a Gabor filter bank of 60 filters is used. Those filters are defined at 5 scales and, at each scale, in 8 orientations for the odd filters (G_odd) and 4 orientations for the even filters (G_even). This filter bank is calculated by Equation 1 based on the parameters shown in Equation 2. Those parameters are chosen so as to be as similar as possible to the way edge features are extracted in the visual cortex (Serre and Riesenhuber, 2004).

G_even(x, y; λ, θ, σ) = exp( −(X² + Y²) / (2σ²) ) × cos( (2π/λ) X )
G_odd(x, y; λ, θ, σ) = exp( −(X² + Y²) / (2σ²) ) × sin( (2π/λ) X )        (1)
where X = x cos θ + y sin θ,  Y = −x sin θ + y cos θ,  x, y ∈ [−(ξ−1)/2, (ξ−1)/2]

s = {1, 2, 3, 4, 5}
ξ = 5 + 2s
σ = 0.0036 ξ² + 0.35 ξ + 0.18
λ = σ / 0.8        (2)
θ_even = {0, π/4, π/2, 3π/4}
θ_odd = {0, ±π/4, ±π/2, ±3π/4, π}

In Equations 1 and 2, λ, θ, σ, and ξ respectively represent the wavelength of the sinusoidal factor, the orientation of the normal to the parallel stripes of the Gabor function, the standard deviation of the Gaussian envelope, and the size of the filter in pixels. The oriented-edge features are produced via convolution of the intensity channel of the target area (I(x, y)) with these 60 Gabor filters:

S1_gabor(x, y; λ, θ, σ) = I(x, y) ∗ G_even/odd(x, y; λ, θ, σ)        (3)
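For illustration, the filter bank of Equations 1–2 and the convolution of Equation 3 can be sketched in NumPy/SciPy as follows. The function names (gabor_bank, s1_gabor) and the use of scipy.signal.convolve2d are our own choices for the sketch, not part of the BIT implementation; the code only mirrors the formulas above.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_bank():
    """Build the 60-filter Gabor bank of Equations 1-2 (5 scales x (4 even + 8 odd) orientations)."""
    filters = []
    for s in range(1, 6):                         # scales s = 1..5
        xi = 5 + 2 * s                            # filter size in pixels
        sigma = 0.0036 * xi**2 + 0.35 * xi + 0.18
        lam = sigma / 0.8
        half = (xi - 1) / 2.0
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        thetas_even = [0, np.pi/4, np.pi/2, 3*np.pi/4]
        thetas_odd = [0, np.pi/4, np.pi/2, 3*np.pi/4, np.pi, -np.pi/4, -np.pi/2, -3*np.pi/4]
        for theta in thetas_even:                 # G_even: cosine carrier
            X = x*np.cos(theta) + y*np.sin(theta)
            Y = -x*np.sin(theta) + y*np.cos(theta)
            env = np.exp(-(X**2 + Y**2) / (2*sigma**2))
            filters.append(env * np.cos(2*np.pi/lam * X))
        for theta in thetas_odd:                  # G_odd: sine carrier
            X = x*np.cos(theta) + y*np.sin(theta)
            Y = -x*np.sin(theta) + y*np.cos(theta)
            env = np.exp(-(X**2 + Y**2) / (2*sigma**2))
            filters.append(env * np.sin(2*np.pi/lam * X))
        # 4 even + 8 odd = 12 filters per scale, 60 filters in total
    return filters

def s1_gabor(intensity, filters):
    """Equation 3: convolve the intensity channel with every Gabor filter."""
    return [convolve2d(intensity, g, mode='same', boundary='symm') for g in filters]
```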

The BIT model uses the CN (Color Naming) description system to obtain color features from the target area (Weijer et al., 2009). In this system, there are 11 basic colors that humans use to describe the color of objects in the environment (white, black, gray, yellow, green, blue, purple, red, pink, orange, and brown). Image pixels are classified stochastically into these 11 classes (denoted by c) based on their color information (in RGB format), so that 11 color feature maps are obtained. The classification is done via a mapping matrix trained on images retrieved from a Google Images search.

S1_color(x, y; c) = Map(RGB(x, y), c),   c ∈ [1, 11]        (4)

To obtain scale-invariant features, a max-pooling operator is applied to the edge features in layer C1 within a 4×4 window (denoted by ∆). To conduct the max pooling, an improved STD algorithm (Guo et al., 2009) is employed: first, the features are normalized by a scale factor (N_s(x, y)), and then the max pooling is applied within the selected window. In each feature map, the scale factor corresponds to the scale (s) of the Gabor filter applied to it.

C1_gabor(x, y) = max_{(x,y)∈∆} [ S1_gabor(x, y) / N_s(x, y) ]
N_s(x, y) = sqrt( Σ_{δx, δy ∈ {−s, 0, s}} S1_gabor²(x + δx, y + δy) )        (5)

Moreover, for the purpose of noise reduction, average pooling is performed on the color features within a 4×4 window (denoted by ∆):

C1_color(x, y) = (1/|∆|) Σ_{(x,y)∈∆} S1_color(x, y)        (6)
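A possible NumPy sketch of the pooling of Equations 5–6 is given below. The non-overlapping 4×4 windows (i.e., a stride equal to the window size) and the small constant added to the normalization factor are assumptions made only for the sketch.

```python
import numpy as np

def c1_pool(s1_gabor_map, s, window=4):
    """Equation 5: scale-normalized max pooling of one S1 edge map produced at Gabor scale s."""
    H, W = s1_gabor_map.shape
    # local energy over the {-s, 0, s} neighborhood, used as the normalization factor N_s
    norm = np.zeros_like(s1_gabor_map)
    for dy in (-s, 0, s):
        for dx in (-s, 0, s):
            shifted = np.roll(np.roll(s1_gabor_map, dy, axis=0), dx, axis=1)
            norm += shifted**2
    norm = np.sqrt(norm) + 1e-8                    # avoid division by zero (assumption)
    normalized = s1_gabor_map / norm
    # max pooling over non-overlapping 4x4 windows (stride = window size is an assumption)
    Hc, Wc = H // window, W // window
    blocks = normalized[:Hc*window, :Wc*window].reshape(Hc, window, Wc, window)
    return blocks.max(axis=(1, 3))

def c1_avg_pool(s1_map, window=4):
    """Equation 6: average pooling of one S1 color (or motion) map over 4x4 windows."""
    H, W = s1_map.shape
    Hc, Wc = H // window, W // window
    blocks = s1_map[:Hc*window, :Wc*window].reshape(Hc, window, Wc, window)
    return blocks.mean(axis=(1, 3))
```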

So far, the obtained features comprise 60 oriented-edge maps and 11 color maps. To create the final appearance model, a zero channel is added to the list of color channels, and the resulting 12 channels are repeated 5 times (the number of Gabor filter scales) to obtain 60 color channels. By mixing the 60 oriented-edge channels and the 60 color channels in complex form, the final appearance model is obtained:

C1(x, y, k) = C1_gabor(x, y, k) + C1_color(x, y, k) · i,   k ∈ [1, 60]        (7)
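The channel replication and complex combination of Equation 7 can be sketched as follows, assuming the pooled edge and color maps are stacked as NumPy arrays of shape (60, H, W) and (11, H, W); the function name is hypothetical.

```python
import numpy as np

def build_appearance_model(c1_gabor, c1_color):
    """Equation 7: combine 60 edge maps and 60 replicated color maps into one complex model.

    c1_gabor: array of shape (60, H, W) -- pooled oriented-edge maps
    c1_color: array of shape (11, H, W) -- pooled color-name maps
    """
    zero = np.zeros_like(c1_color[:1])                  # the extra zero channel
    color12 = np.concatenate([c1_color, zero], axis=0)  # 12 color channels
    color60 = np.tile(color12, (5, 1, 1))               # repeated for the 5 Gabor scales
    return c1_gabor + 1j * color60                      # complex-valued C1 model, shape (60, H, W)
```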

In the next two layers, the target location in a new frame is estimated based on the target appearance model generated in previous frames. S2 is a localization layer based on observational information; it aims to find the location in the image frame that, according to this model, is most similar to the target appearance. For this purpose, in the new frame, a region around the previous target location is selected and its C1 features are extracted (C1_frame). A convolution is performed between the C1 features of the selected area and the appearance model of the target (the C1 features of the target, denoted C1_target), so that the S2 map, illustrating the probability of the target being present at each location based on feature similarity, is obtained:

S2(x, y) = (1/K) Σ_{k=1}^{K} C1_target(x, y, k) ∗ C1_frame(x, y, k)        (8)

Layer C2 contains a task-driven process, employing a convolutional network to determine the final location of the target in the selected window. A convolution is applied to the S2 map using a weight matrix W to obtain the C2 map; the location corresponding to the maximum value of this map represents the location of the target in the current frame:

C2(x, y) = S2(x, y) ∗ W(x, y)
Target location = argmax_{(x,y)} C2(x, y)        (9)

W is a weight matrix defined by the tracking task; it must encode the target location so that, after its convolution with the S2 map, each location of the output indicates the probability of the target being present there. So, in the first frame, where

the target location is known, the desired C2 map (denoted C̃2) is generated as a Gaussian function (Equation 10), which is the optimal output of the convolutional network for this task. From this C̃2 map and the S2 map, the initial W matrix can be obtained (Equation 11). Then, in each frame, the value of W is updated after finding the target location.

C̃2(x, y) = exp( −((x − x^t)² + (y − y^t)²) / (2σ²) )        (10)
where (x^t, y^t) = target location of the t-th frame

W(x, y) = F⁻¹[ F[C̃2(x, y)] / F[S2(x, y)] ]        (11)

In Equation 11, F is the transformation operator from the spatial domain to the frequency domain. In the BIT algorithm, in order to increase the processing speed, all convolution operations are moved to the frequency domain (using the two-dimensional fast Fourier transform), where they are replaced by multiplications. As a result, Equations 8, 9, and 11 change into Equations 12, 13, and 14, respectively:

F[S2] = (1/K) Σ_{k=1}^{K} F[C1_target(k)] × F[C1_frame(k)]        (12)

F[C2(x, y)] = F[S2(x, y)] × F[W(x, y)]        (13)

F[W(x, y)] = F[C̃2(x, y)] / F[S2(x, y)]        (14)
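A minimal sketch of the frequency-domain localization of Equations 12–14 (and the peak search of Equation 9) is shown below. It assumes the complex C1 stacks defined earlier, takes the Fourier-domain weight matrix F[W] as an input, and adds a small regularization constant to the division of Equation 14, which is our own choice rather than part of BIT.

```python
import numpy as np

def localize(c1_target, c1_frame, W_hat):
    """Equations 12-13: correlate the appearance model with the search region in the Fourier domain."""
    # Equation 12: average the per-channel products of the 2D Fourier transforms
    S2_hat = np.mean(np.fft.fft2(c1_target) * np.fft.fft2(c1_frame), axis=0)
    # Equation 13: apply the learned weight matrix, then return to the spatial domain
    C2 = np.real(np.fft.ifft2(S2_hat * W_hat))
    y, x = np.unravel_index(np.argmax(C2), C2.shape)   # Equation 9: peak of the C2 map
    return (x, y), S2_hat

def init_weights(S2_hat, target_xy, sigma=2.0):
    """Equations 10-11 (in the form of Equation 14): Gaussian desired output and initial F[W]."""
    H, W = S2_hat.shape
    yy, xx = np.mgrid[0:H, 0:W]
    xt, yt = target_xy
    C2_desired = np.exp(-((xx - xt)**2 + (yy - yt)**2) / (2 * sigma**2))
    return np.fft.fft2(C2_desired) / (S2_hat + 1e-8)   # regularized division (assumption)
```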

Finally, having recognized the target location in each frame, the target appearance model and the weight matrix W are updated via a learning factor ρ in the classical form:

F[C1_target] = ρ F[C1^t_target] + (1 − ρ) F[C1_target]
F[W] = ρ F[C̃2] / F[S2^t] + (1 − ρ) F[W]        (15)

In Equation 15, C1_target is the appearance model of the target created from the C1 features of the target areas in previous frames, and C1^t_target contains the C1 features of the target area in the t-th (current) frame. The schematic of the BIT algorithm procedure is shown in Fig. 3.

Fig. 3. The overall procedure of the BIT algorithm

4.2. Challenges and Coping Mechanisms

As mentioned, the proposed algorithm uses the BIT recognition model as the target localization core and employs several mechanisms to handle the various challenges of the tracking problem. In this subsection, these mechanisms and how they cooperate with the recognition core are described.

4.2.1. Local and Global Illumination Variations

In machine vision algorithms, it is necessary to reduce the severity of problems such as the presence of different illumination levels, an inappropriate contrast ratio, and variations in illumination conditions before starting the main process, by applying a set of preprocessing operations. In the proposed algorithm, the Michaelis-Menten model (Beaudot, 1996)


and histogram matching are employed to overcome the mentioned problems. The Michaelis-Menten model locally improves illumination levels, inspired by the retina's self-regulation function (Shapley and Enroth-Cugell, 1984). In this model, as shown in Equation 16, the value of each pixel is regulated based on its own value and a density parameter, which linearly correlates with the values of the neighboring pixels, in order to improve the image contrast and balance the illumination levels:

RGB_{x,y} = RGB_{x,y} × (V_max + RGB_{x,y}) / (R_{x,y} + RGB_{x,y})        (16)

In Equation 16, V_max represents the maximum pixel value in the image (assuming a normalized image, it is equal to one, V_max = 1), and R represents the density parameter map (R_{x,y} is the density parameter of the image pixel at position (x, y)), obtained by applying a low-pass filter to a neighborhood. For flexibility, a regulation parameter V_0 ∈ [0, 1] is included in the density parameter calculation to control the degree of influence of the neighboring pixel information; the value suggested in the Michaelis-Menten model is 0.9 (V_0 = 0.9):

R_{x,y} = ⟨RGB⟩_{x,y} × V_0 + (V_max × (1 − V_0))        (17)

In Equation 17, ⟨RGB⟩_{x,y} represents the average value of the image pixels around position (x, y) in an n×n neighborhood. A small neighborhood size (n) does not have an appropriate effect on the illumination and contrast enhancement; on the other hand, too large a neighborhood corrupts the image information due to excessive interference from the neighborhood. Based on the experiments, a proper value for this neighborhood is 5 pixels (n = 5). Furthermore, in addition to local variations, global illumination variations must also be managed (similar to the

retina, which manages global illumination variations by controlling the amount of light reaching the photoreceptors through opening and closing the iris), so that the illumination conditions of the image are as similar as possible to those of the previous frames. To achieve this goal, the image density parameter of the current frame (R^t) should be as similar as possible to that of the previous frames, which can be achieved by performing histogram matching on the density image obtained from Equation 17. Histogram matching requires a reference image or histogram. To have an appropriate reference, the histogram of the first frame's density image is selected as the reference (denoted R̄) and is updated from the final density image of each frame using the learning factor ρ:

R^t = HistogramMatch(R^t, R̄)
R̄ = ρ R^t + (1 − ρ) R̄   if t > 1;   R̄ = R^t   if t = 1        (18)
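The whole illumination-enhancement phase (Equations 16–18) can be sketched as follows. The CDF-based histogram matching, the OpenCV box blur used for the n×n average, and the learning-factor value in the sketch are assumptions made for illustration, not prescriptions of the paper.

```python
import numpy as np
import cv2

def enhance_illumination(frame, ref_hist, v0=0.9, n=5, rho=0.02):
    """Equations 16-18: local Michaelis-Menten adaptation plus global histogram matching.

    frame: float image normalized to [0, 1]; ref_hist: 256-bin reference histogram of the
    first frame's density image. rho is an assumed value for the learning factor.
    """
    v_max = 1.0
    # Equation 17: density map from an n x n local average, mixed via V0
    local_mean = cv2.blur(frame, (n, n))
    density = local_mean * v0 + v_max * (1.0 - v0)

    # Equation 18: histogram-match the density map to the stored reference (CDF mapping)
    hist, _ = np.histogram(density.ravel(), 256, range=(0.0, 1.0))
    cdf = np.cumsum(hist) / hist.sum()
    ref_cdf = np.cumsum(ref_hist) / ref_hist.sum()
    mapping = np.interp(cdf, ref_cdf, np.linspace(0.0, 1.0, 256))
    density = mapping[np.clip((density * 255).astype(int), 0, 255)]

    # Equation 16: Michaelis-Menten regulation of every pixel
    enhanced = frame * (v_max + frame) / (density + frame + 1e-8)

    # update the reference histogram with the current density image
    new_hist, _ = np.histogram(density.ravel(), 256, range=(0.0, 1.0))
    ref_hist = rho * new_hist + (1.0 - rho) * ref_hist
    return enhanced, ref_hist
```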

In summary, in this phase, in order to improve the illumination conditions of the input image, the density image of the current frame is first obtained from Equation 17. Then, according to Equation 18, histogram matching is applied to this density image so that the illumination conditions of the current frame become similar to those of the previous frames. Finally, by applying Equation 16, the density parameter is applied to the input frame and the enhanced image is obtained.

4.2.2. Background Clutter

In tracking tasks, the appearance of the target (in terms of intensity, color, texture, and other features) can be similar to that of other objects in the scene or of the background, which makes it difficult to discriminate the target from them. It is therefore necessary to rely on the features that generate the highest degree of discrimination from the background. Consequently, in the proposed algorithm, unlike the BIT algorithm, only the available features with a high degree of discrimination are employed. The BIT algorithm extracts 60 oriented-edge and 11 color maps (similar to the features extracted by the early layers of the ventral pathway of the visual cortex). In the proposed algorithm, in addition to those features, motion information similar to that extracted by the early layers of the dorsal pathway is also employed; in a background clutter situation, where the appearance of the target is quite similar to background objects, it is a good discriminative feature. To generate the motion features, a method similar to the fast Gabor approximation introduced in the BIT model is used. First, using the optical flow algorithm (explained in Section 4.2.3), the horizontal (V_x) and vertical (V_y) motion maps are obtained. Then, the motion direction and magnitude maps are calculated according to Equation 19:

Θ(x, y) = tan⁻¹( V_y(x, y) / V_x(x, y) )
A(x, y) = sqrt( V_x²(x, y) + V_y²(x, y) )        (19)

Based on these two maps, the S1 motion features are obtained by quantizing the motion vectors into eight different directions:

S1_motion(x, y, θ) = A(x, y)   if Θ(x, y) ∈ [θ − π/8, θ + π/8];   0 otherwise        (20)
where θ ∈ {0, ±π/4, ±π/2, ±3π/4, π}

Similarly to the color features, to eliminate possible noise, the C1 motion features are obtained by applying an average-pooling operator in a 4×4 window:

C1_motion(x, y) = (1/|∆|) Σ_{(x,y)∈∆} S1_motion(x, y)        (21)

Finally, by combining the oriented-edge, color, and motion features, a feature bank of 79 feature maps is formed:

C1 = [ C1_gabor  C1_color  C1_motion ]        (22)
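A sketch of the motion-feature construction of Equations 19–21 is given below, assuming dense horizontal and vertical flow fields V_x and V_y are already available (Section 4.2.3); the average pooling of Equation 21 can reuse the c1_avg_pool helper sketched earlier.

```python
import numpy as np

def s1_motion_maps(vx, vy):
    """Equations 19-20: bin the dense flow field into 8 direction-selective magnitude maps."""
    theta_map = np.arctan2(vy, vx)                 # direction in (-pi, pi]
    amp_map = np.sqrt(vx**2 + vy**2)               # magnitude
    directions = [0, np.pi/4, np.pi/2, 3*np.pi/4, np.pi, -np.pi/4, -np.pi/2, -3*np.pi/4]
    maps = []
    for theta in directions:
        diff = np.angle(np.exp(1j * (theta_map - theta)))   # wrapped angular difference
        mask = np.abs(diff) <= np.pi / 8
        maps.append(np.where(mask, amp_map, 0.0))
    return maps                                     # eight S1_motion maps

# Equation 21: each map is then average-pooled over 4x4 windows, e.g.
# c1_motion = [c1_avg_pool(m) for m in s1_motion_maps(vx, vy)]
```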

Now, based on the degree of discrimination that each of those features creates between the target area and its surroundings, each feature is assigned a weight that controls its participation in recognizing the target area in future frames. To assign this degree of discrimination, the discriminative saliency model (Gao and Vasconcelos, 2009), inspired by the top-down saliency mechanism of the visual cortex, is used. The discriminative saliency of a feature for a given area is defined as the degree of discrimination that the feature generates between this area and its surroundings. Therefore, to calculate the discriminative saliency of the features, a center window (the target region) and a surround window (a border around the target region with a width of half the target size) are considered, and for each feature its distribution in these two windows is estimated. The discriminative saliency of a feature (denoted DS) is the discrimination rate between these two distributions, calculated via mutual information. The 9-dimensional descriptors obtained by applying a 3×3 sliding window to the feature map inside the two defined windows are used to estimate the feature distributions. After calculating the discriminative saliency of all features, these values are used as the weights of the feature maps in calculating the S2 map, as shown in Equation 23 (the notation is the same as in Equation 12). In the calculation of the S2 map, features with a very low degree of discrimination can be ignored because of their very low influence; to this end, a threshold TH_ds is defined, and features with a lower degree of discrimination than this threshold are marked as ineffective. Besides helping to resolve the background clutter problem, this also increases the processing speed by reducing the number of features involved in the detection procedure.

F[S2] = (1/|L|) Σ_{k∈L} DS(k) × F[C1_target(k)] × F[C1_frame(k)],   where L = {k | DS(k) > TH_ds}        (23)
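A sketch of the saliency-weighted S2 computation of Equation 23 follows. The discriminative saliency values DS(k) are assumed to be given (their mutual-information-based computation, described in Appendix A, is not reproduced here), and the threshold value used in the sketch is only illustrative.

```python
import numpy as np

def weighted_s2(c1_target, c1_frame, ds, th_ds=0.05):
    """Equation 23: S2 map built only from features whose discriminative saliency exceeds TH_ds.

    c1_target, c1_frame: stacks of shape (K, H, W); ds: length-K array of saliency scores.
    th_ds is an assumed threshold value.
    """
    selected = np.where(ds > th_ds)[0]
    acc = np.zeros(c1_target.shape[1:], dtype=complex)
    for k in selected:
        acc += ds[k] * np.fft.fft2(c1_target[k]) * np.fft.fft2(c1_frame[k])
    return acc / max(len(selected), 1)   # F[S2]; invert with np.fft.ifft2 for the spatial map
```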

It should be noted that, in the process of calculating the discriminative saliency, in addition to the degree of discrimination of each feature, the presence or absence of the feature in the target area can also be determined.


As a result, only the features that are significantly present in the target area but negligibly present in the surround are employed; consequently, features whose presence in the background area is much stronger than in the target area are excluded from the processing. How the discriminative saliency is calculated and how the ineffective features are recognized are explained in Appendix A.

4.2.3. Occlusion

The most important challenge facing tracking algorithms is occlusion, where part or all of the target is covered by an occluding object, consequently disrupting the recognition process due to changes in the target appearance. In a partial occlusion, the occluding object can be identified based on the difference in appearance, and thus the features of the occluded area can be prevented from interfering with the target's appearance model. However, for complete occlusion, visual information alone cannot provide a solution; in this case, similar to the operation of the visual cortex, the use of high-level information such as the object's motion behavior can be a remedy (Weiss et al., 2002). Thus, using a motion model in the tracking algorithm is mandatory. In the following subsections, first the motion model used in the proposed method is introduced, and then the occlusion handling method is described.

Motion Model. The motion model is a mechanism providing an estimation of the target location in the next frame based on the motion vector of the current frame and the target's motion behavior. In the visual cortex, motion perception is performed in the dorsal pathway, specifically in the V1,

MT, and MST layers. The input of these layers is the intensity information provided by the magnocellular pathway, from which local motion vectors are extracted by applying spatio-temporal filters. The extracted motion vectors are similar to optical flow vectors (Wallace, 2004). In addition to the local motion information received from the visual stimuli, the motion behavior of an object stored in short-term memory acts as a regularization factor in the object's motion perception (Weiss et al., 2002).

In the proposed algorithm, a two-part motion model with the same function as the motion perception of the visual cortex (Bogadhi et al., 2013) is used. This model consists of two recurrent networks, a retinal and an extra-retinal one. In the retinal network, similar to the function of the V1, MT, and MST layers, the local motion vectors of the stimulus are calculated. The estimate produced by this network is a Gaussian distribution obtained via MAP estimation from the optical flow vectors and a prior distribution; the prior is the Gaussian distribution estimated by this network in the previous frame. The optical flow vectors used in the MAP estimation are two dense local motion maps along the horizontal and vertical directions, computed from the intensity information of the target area in the previous frame and of the same area in the current frame. The extra-retinal network has a function similar to the short-term memory. Its output is also a Gaussian distribution, whose value is updated based on the final motion vector of the target (after its position has been precisely estimated by the subsequent operational units). Finally, the motion vectors estimated by the two networks (the mean values of the output Gaussian distributions) are combined according to their trustworthiness (which correlates inversely with the variance of the distributions) to arrive at the final estimated motion vector. Considering $N(\mu^t_R, \sigma^t_R)$ as the output of the retinal network and $N(\mu^t_E, \sigma^t_E)$ as the output of the extra-retinal network in the $t$-th frame, the final output of the motion model, i.e. the predicted displacement vector in the current frame ($mv^t$), is calculated via Equation 24. Fig. 4 illustrates the overall schematic of this model.

$$mv^t = \frac{\frac{1}{1+\sigma^t_R}}{\frac{1}{1+\sigma^t_R} + \frac{1}{1+\sigma^t_E}} \times \mu^t_R \; + \; \frac{\frac{1}{1+\sigma^t_E}}{\frac{1}{1+\sigma^t_R} + \frac{1}{1+\sigma^t_E}} \times \mu^t_E \tag{24}$$
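As a minimal illustration, the reliability-weighted combination of Equation 24 can be written directly in NumPy. The inputs below (mu_R, sigma_R, mu_E, sigma_E) are placeholders for the means and variances produced by the retinal and extra-retinal networks; this is a sketch of the fusion rule only, not of the networks themselves.

```python
import numpy as np

def combine_motion_estimates(mu_R, sigma_R, mu_E, sigma_E):
    """Fuse the retinal and extra-retinal motion estimates (Equation 24).

    Each estimate is a 2-D mean displacement vector with a scalar variance;
    the weight of each network is 1 / (1 + variance), so an unreliable
    (high-variance) estimate contributes little to the final vector.
    """
    w_R = 1.0 / (1.0 + sigma_R)
    w_E = 1.0 / (1.0 + sigma_E)
    mu_R = np.asarray(mu_R, dtype=float)
    mu_E = np.asarray(mu_E, dtype=float)
    return (w_R * mu_R + w_E * mu_E) / (w_R + w_E)

# Example: a confident retinal estimate dominates a vague extra-retinal one.
mv = combine_motion_estimates(mu_R=[4.0, -1.0], sigma_R=0.2,
                              mu_E=[2.0,  0.0], sigma_E=3.0)
print(mv)  # closer to [4, -1] than to [2, 0]
```

During complete occlusion the retinal variance grows without bound, so its weight vanishes and the prediction falls back to the extra-retinal mean, as discussed next.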

Under conditions such as occlusion, where no local motion information is available, the extra-retinal network shows its benefit. In this case, the output distribution of the retinal network gradually converges to a zero-mean, infinite-variance distribution, which is completely unreliable. The extra-retinal network, on the other hand, keeps estimating the movement of the object during the occlusion based on its prior motion pattern. Therefore, when the object emerges from the occlusion, its location can still be estimated and tracking continues.

Extracting Optical Flow Vectors. Various optical flow algorithms inspired by the extraction of local motion vectors in biological vision have been introduced (Solari et al., 2015; Chessa et al., 2015), but these algorithms employ a large number of convolution operators and spatio-temporal filters.

As a result, they have a high computational cost and a low speed, making them unsuitable for real-time algorithms, especially where the optical flow estimation is not the main process but is used as a sub-process of another algorithm (such as visual tracking). Since the calculation of local motion vectors is a sub-stage of the tracking algorithm and must be executed very quickly, and since the motion vectors of all the pixels in the examined area are needed (a fairly uniform distribution of motion vectors rather than sparse features, used both to extract motion features and to detect occlusion), the local vectors should be obtained densely in the least possible time. Considering these prerequisites, the DIS-Fast method (Kroeger et al., 2016) is used in the proposed tracking algorithm. DIS-Fast is a dense optical flow algorithm focused on maintaining high processing speed together with precision, so that it can be employed in real-time applications. It calculates the optical flow in three stages: inverse search for fast patch correspondences, multi-scale aggregation for fast dense flow estimation, and variational refinement to improve the final result.

The optical flow vectors are used in three different parts of the proposed algorithm: predicting the probable target location in the new frame, the visual search for the precise identification of the target position, and updating the target appearance model. Each of these parts requires the local motion vectors of a different region of the image plane; in the three cases the vectors are extracted, respectively, from the area around the target position in the previous frame, from the probable position of the target in the current frame, and from the exact position of the target in the current frame. These areas may overlap, but the time that could be saved by identifying the overlaps and avoiding the repeated computation of optical flow for them is so small that it can be ignored.

Occlusion Detection. Similar to the way the biological visual system handles occlusion, as explained in Section 3, a mechanism is needed that captures the general characteristics of the target so that, after target recognition, the visual information can be examined and occluded areas detected by spotting contradictions. For this purpose, a neural network is employed, trained on the color and motion features extracted in layer C1. In the first frame, which is assumed to be occlusion-free, a training set consisting of the C1 motion and color information (8 motion features and 11 color features, forming a 19-dimensional descriptor) is gathered from the target area and from a neighboring object around the target, used as the positive and negative classes respectively, and the neural network is trained. In the first frame there is no motion information, but in subsequent frames, once this information becomes available, the neural network is updated. It should be noted that the biological visual system also uses the shape information and geometric structure of the target in the occlusion detection process; however, unlike biological vision, which performs high-level shape perception, the only shape information available in the proposed algorithm is the low-level local edge features, which are not uniformly distributed over the target area (unlike the color and motion information); therefore, using them to train the neural network is not appropriate.
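The two ingredients of this step can be sketched in a few lines of Python. Recent OpenCV builds expose an implementation of DIS optical flow (cv2.DISOpticalFlow_create), and a small scikit-learn classifier is used here as a stand-in for the paper's neural network; the array names and the 19-feature layout are illustrative assumptions, not the actual implementation.

```python
import cv2
import numpy as np
from sklearn.neural_network import MLPClassifier

# Dense local motion vectors via DIS optical flow (Kroeger et al., 2016).
dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_ULTRAFAST)

def dense_flow(prev_gray, cur_gray):
    """Return an H x W x 2 array of per-pixel displacement vectors."""
    return dis.calc(prev_gray, cur_gray, None)

# Per-pixel occlusion classifier trained on C1 colour + motion features.
# `target_feats` / `background_feats` stand for the 19-dimensional C1
# descriptors sampled inside the target box and from a neighbouring object.
occlusion_net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)

def train_occlusion_model(target_feats, background_feats):
    X = np.vstack([target_feats, background_feats])
    y = np.concatenate([np.ones(len(target_feats)),        # 1 = target pixel
                        np.zeros(len(background_feats))])  # 0 = occluder / background
    occlusion_net.fit(X, y)

def occlusion_ratio(candidate_feats):
    """Fraction of pixels in the located region classified as non-target."""
    labels = occlusion_net.predict(candidate_feats)
    return 1.0 - labels.mean()
```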

After estimating the target location (the output of the C2 layer), the C1 motion and color features of the located area are fed into the neural network, which yields the occluded areas and their percentage. Two thresholds, one for partial occlusion ($TH_{po}$) and one for complete occlusion ($TH_{co}$), are employed in the occlusion detection. If the percentage of occluded pixels is lower than $TH_{po}$, no occlusion is assumed and the update process of the algorithm is performed without any changes. If the percentage lies between the two thresholds, a partial occlusion is assumed: the identified position is accepted as the target position, but in the model updating stage the information of the occluded area is filtered out. Finally, if the percentage of occluded pixels exceeds $TH_{co}$, a complete occlusion is assumed, the location estimated by the motion model is taken as the target location, and the update stage is not performed. In addition, in the absence of complete occlusion, the neural network is updated with the new C1 information. An analysis of the video sequences containing the occlusion challenge shows that 0.2 and 0.8 are appropriate values for the partial and complete occlusion thresholds, respectively.

4.2.4. Target size variation

In real-time visual tracking applications, target size variation is one of the most common target state changes, occurring because of a change in the target's distance from the camera. If this variation is not properly recognized, either background information is included in the target area or, vice versa, part of the target is treated as background. In both cases the recognition is disturbed (in particular if the tracking algorithm uses a discriminative recognition method) and a drift problem arises. Shape information is the most suitable feature for detecting size variations; accordingly, in the proposed method, the oriented-edge features are used for this purpose. After recognizing the target location in the current frame, the shape feature maps of the detected area (60 oriented-edge features) are resized over a discrete space $s \in [s_{min}, s_{max}]$ and their multiplication (dot product) with the target's shape feature maps is calculated. The optimal target size in the current frame is the one with the maximum multiplication value. Since the size variations between frames are negligible and searching a large resize space is time-consuming, size variations are examined within an interval of one pixel:

$$\delta s^{*} = \arg\max_{(\delta s_x,\, \delta s_y)} \; scale_{(s_x, s_y)}\!\left(C1^{t}_{target/g}\right) \cdot C1_{target/g} \tag{25}$$

where

$$\left(s_x, s_y\right) = Size(M) + \left(\delta s_x, \delta s_y\right), \qquad \delta s_x, \delta s_y \in \{-1, 0, 1\}$$
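A minimal sketch of this one-pixel scale search, written over a single edge-magnitude map to keep the example short (the algorithm itself operates on the 60 oriented-edge channels); how the resized candidate is aligned with the stored model is an assumption of this sketch.

```python
import cv2
import numpy as np

def scale_step(model_edges, current_edges):
    """One-pixel scale search of Equation 25 on a single edge-magnitude map.

    model_edges   : H x W map stored in the appearance model.
    current_edges : edge map of the region located in the current frame.
    Returns the (dx, dy) offset in {-1, 0, 1}^2 with the largest response.
    """
    h, w = model_edges.shape
    best, best_score = (0, 0), -np.inf
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            cand = cv2.resize(current_edges, (w + dx, h + dy),
                              interpolation=cv2.INTER_LINEAR)
            # Bring the candidate back to the model size so the maps align
            # before the dot product (an assumption; the paper's appearance
            # model defines the exact alignment).
            cand = cv2.resize(cand, (w, h), interpolation=cv2.INTER_LINEAR)
            score = float((cand * model_edges).sum())
            if score > best_score:
                best_score, best = score, (dx, dy)
    return best
```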

In Equation 25, $C1_{target/g}$ and $C1^{t}_{target/g}$ respectively represent the oriented-edge features of the target appearance model and the C1 features extracted from the estimated target region in the $t$-th frame. $scale_{(s_x, s_y)}(M)$ is the resize function applied to the feature map $M$ (a multi-channel image), where $s_x, s_y$ are the dimensions to which $M$ is resized. The final output of Equation 25 is the size variation between the $t$-th and $(t+1)$-th frames.

Because of the dense appearance model, resizing the target would cause a problem in the updating process. To avoid this, instead of resizing the target area, the image size is changed so that the target size remains constant from the algorithm's point of view: before each frame is processed, the image size is multiplied by the inverse of the accumulated target size variation (Equation 26), so that the target size (represented by $S_{target}$) can always be considered constant throughout the algorithm. This strategy also makes it reasonable to limit the resizing search to one pixel.

$$S^{t}_{frame} = S^{1}_{frame} \times \frac{1}{Scale^{t}}, \qquad \text{where} \quad Scale^{t} = \prod_{\tau=1}^{t}\left(1 + \frac{\delta s^{\tau}}{S_{target}}\right) \tag{26}$$

In Equation 26, $S^{t}_{frame}$, $\delta s^{t}$, $S_{target}$, and $Scale^{t}$ respectively represent the size of the $t$-th input frame, the target size variation in the $t$-th frame calculated from Equation 25, the (constant) target size, and the overall target size variation in the $t$-th frame with respect to the first one. In estimating the target size variations, a model similar to the motion model is employed to reduce possible errors by taking the target size variations of the prior frames into account. This is analogous to the two-part motion model, in which the retinal and extra-retinal networks respectively report the target size variation based on the received visual stimulus and on the history of target size variations; by combining the outputs of these two networks, the optimal target size variation in the current frame is estimated.

Fig. 5. The overall procedure of the proposed algorithm. In comparison with the BIT algorithm (Fig. 3), the dashed block, dotted blocks, and the blocks with solid borders respectively represent the unchanged operations, the operations with modifications, and the newly added operations.

4.3. Overall procedure of the proposed algorithm

In each frame, the proposed algorithm predicts the probable location of the target using the motion model, selects a neighborhood around it, and reverses the target size variation over that area to keep the target size constant from the algorithm's point of view. It then improves the illumination and contrast of the image and balances the illumination variations. After that, the appearance model of the selected area is extracted (the C1 features), including the color, shape, and motion features that are not marked as inefficient (i.e., features whose discriminative rate is lower than the $TH_{ds}$ threshold) according to the discriminative saliency calculated in the previous frames. By applying the S2 and C2 layers, the algorithm recognizes the precise location of the target on the image plane. Using the trained neural network (based on the target color and motion information of the previous frames), the target area is then examined and the occluded area and its percentage are detected; the neural network serves as a binary classifier mapping each pixel into the foreground or background class based on its motion and color features. If the occlusion percentage is higher than the $TH_{co}$ threshold, a complete occlusion has occurred: the recognized location is invalid, the location predicted by the motion model is taken as the target location in the current frame, and the algorithm proceeds to the next frame. If this threshold is not exceeded, the target area is examined over the one-pixel resize range and the optimal target size in the current frame is determined.

Once the target location and its size are determined, the algorithm enters the update phase. In the update phase, a procedure similar to the recognition phase is first applied around the target location in the current frame: selecting a neighborhood, applying the inverse size variation, and improving its illumination conditions. After that, the appearance model of the selected area is calculated.

Fig. 6. A sample of the qualitative results of the 13 algorithms used in the evaluation process over the Visual Tracking Benchmark dataset.

This model consists of all of the C1 feature maps; if a partial occlusion occurs (the occlusion percentage is higher than the $TH_{po}$ threshold), the information of the occluded areas is removed from the C1 features. Discriminative saliency is then applied to the C1 features to calculate the impact factor of each feature in the current frame. Next, the S2 map (Equation 23) is calculated from the C1 features, and the W matrix (Equation 14) is calculated from the S2 map. Finally, with the obtained information, the impact factors of the features, the target appearance model, the weights of W, and the density parameter of the Michaelis-Menten model are updated. The motion model is updated based on the final location of the target, and the neural network used for occlusion detection is updated based on the color and motion features. The schematic of the proposed tracking algorithm's procedure is illustrated in Fig. 5.

It should be noted that the proposed algorithm expects three-channel color input images. If the input images are grayscale (single channel), the only difference in the procedure is in the color feature map production, in which only the white, black, and gray color channels are generated as the color features.
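The per-frame procedure described above can be summarized in a condensed, hypothetical outline. Every helper invoked on the tracker object `t` below is a placeholder for the corresponding operational unit of Section 4, not the actual implementation; only the thresholds follow Table 2.

```python
# Thresholds from Table 2.
TH_DS, TH_PO, TH_CO = 0.02, 0.2, 0.8

def track_frame(frame, t):
    """One tracking iteration; `t` bundles the models and helper routines."""
    # ----- recognition stage -----
    predicted = t.motion_model.predict()                    # probable target location
    region = t.select_neighborhood(frame, predicted)
    region = t.apply_inverse_scale(region)                  # keep target size constant
    region = t.enhance_illumination(region)
    c1 = t.extract_c1(region, keep=t.saliency > TH_DS)      # drop inefficient features

    location = t.locate(c1)                                 # S2 / C2 layers
    occluded = t.occlusion_ratio(c1, location)
    if occluded > TH_CO:                                    # complete occlusion:
        return predicted                                    #   trust the motion model,
                                                            #   skip the update stage
    t.update_scale(c1)                                      # one-pixel scale search

    # ----- update stage -----
    upd = t.extract_c1(t.enhance_illumination(
        t.apply_inverse_scale(t.select_neighborhood(frame, location))))
    if occluded > TH_PO:                                    # partial occlusion:
        upd = t.mask_occluded(upd)                          #   filter occluded pixels
    t.saliency = t.discriminative_saliency(upd)
    t.update_appearance(upd)
    t.motion_model.update(location)
    t.update_occlusion_net(upd)
    return location
```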

5. Evaluations and results

In this section, the performance of the proposed algorithm is evaluated on the Visual Tracking Benchmark (VTB) dataset (Wu et al., 2015), and its results are compared against four classes of tracking algorithms: classical algorithms (IVT (Ross et al., 2008), MIL (Babenko et al., 2009), and CPF (Pérez et al., 2002)), state-of-the-art algorithms (RPT (Li et al., 2015), KCF (Henriques et al., 2015), and ECO-HC (Danelljan et al., 2016)), the three algorithms with the best results on the VTB dataset (Struck (Hare et al., 2016), SCM (Zhong et al., 2012), and ASLA (Jia et al., 2012)), and biologically-inspired tracking algorithms (MUSTer (Jia et al., 2012), BIT (Cai et al., 2016), and DLT (Wang and Yeung, 2013)). To test the proposed algorithm and compare it with the other tracking algorithms, a 3.2 GHz Intel Core i7-4790S processor is employed.

5.1. Experimental dataset

The Visual Tracking Benchmark (VTB) dataset was introduced in 2013 by Wu et al. and includes 50 video sequences comprising the most widely used visual tracking test videos. These sequences have been chosen so as to cover all the visual tracking challenges. The challenges are categorized into 11 classes.

Table 1. Challenge labels, their descriptions, and the number of videos available for each one in the Visual Tracking Benchmark dataset

Label | Description | #
IV | Illumination Variation: the illumination in the target region is significantly changed. | 25
SV | Scale Variation: the ratio of the bounding boxes of the first frame and the current frame is out of the range [1/ts, ts] (ts = 2). | 28
OCC | Occlusion: the target is partially or fully occluded. | 31
DEF | Deformation: non-rigid object deformation. | 18
MB | Motion Blur: the target region is blurred due to the motion of the target or camera. | 12
FM | Fast Motion: the motion of the ground truth is larger than tm pixels (tm = 20). | 18
IPR | In-Plane Rotation: the target rotates in the image plane. | 31
OPR | Out-of-Plane Rotation: the target rotates out of the image plane. | 39
OV | Out-of-View: some portion of the target leaves the view. | 6
BC | Background Clutters: the background near the target has a similar color or texture to the target. | 22
LR | Low Resolution: the number of pixels inside the ground-truth bounding box is less than tr (tr = 400). | 4

Table 2. The values of the parameters of the proposed algorithm used in the evaluation section

Parameter | Value
Learning Factor | ρ = 0.02
Michaelis-Menten Regulation Parameter | V0 = 0.7
Features Impact Factor Threshold | THds = 0.02
Partial Occlusion Threshold | THpo = 0.2
Complete Occlusion Threshold | THco = 0.8

The defined labels and the number of videos available for each challenge are listed in Table 1.

5.2. Evaluation criteria

Different criteria and benchmarks have been defined for evaluating the performance of tracking algorithms, each with its own strengths and weaknesses. Generally speaking, the criteria fall into two classes: those based on the distance between the centers of the estimated and actual target boxes, and those based on the overlap between the estimated and actual target boxes. Following these two classes of criteria, two plots, the success plot and the precision plot, are used to examine the robustness of tracking algorithms (Wu et al., 2015). The precision plot represents the percentage of frames in which the center location error (CLE) is within a given threshold, and the success plot shows the percentage of successfully tracked frames as measured by the intersection-over-union (IOU) metric. Different criteria exist for computing distance differences and area overlap, among which are Deviation (Sanin et al., 2012) and PBM (Karasulu and Korukoglu, 2011) for the center location error, and the F1-score (Karasulu and Korukoglu, 2011) and OTP (Kasturi et al., 2009) for area overlap. In both plots, the larger the area under the curve (AUC), the greater the robustness of the algorithm. Since the focus of the success plot is mainly on the target area size, with the precision of the target location only of secondary importance, and since some of the compared algorithms do not consider target size variations, the precision plot is employed in the present study. In the comparison tables and the numerical comparisons between the precision of the mentioned algorithms, the percentage of frames in which the target location is correctly detected is used: in a frame, the detection is considered correct if the center location error is lower than a threshold (20 px in this paper, e = 20 px).
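The precision score used in the comparison tables can be computed directly from the per-frame center errors; a short sketch:

```python
import numpy as np

def precision_score(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center location error (CLE) is below `threshold` px."""
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    cle = np.linalg.norm(pred - gt, axis=1)
    return float((cle < threshold).mean())

def precision_curve(pred_centers, gt_centers, thresholds=np.arange(0, 51)):
    """Values of the precision plot over a range of CLE thresholds."""
    return [precision_score(pred_centers, gt_centers, t) for t in thresholds]
```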

5.3. The parameters of the proposed algorithm

In the experiments, the values listed in Table 2 are assigned to the free parameters of the proposed algorithm. The parameters of the BIT algorithm are set to its suggested values. The partial and complete occlusion thresholds are derived from an analysis of the video sequences of the experimental set in which the occlusion challenge occurs. Regarding the threshold for ineffective features, a zero value could be used to ignore only the features that are entirely absent from the target area, but our investigation indicates that features with a normalized discriminative saliency lower than 0.02 have little influence on tracking precision, while their removal noticeably improves the algorithm's speed.

5.4. Results

The 50 video sequences of the Visual Tracking Benchmark dataset were evaluated using a One-Pass Evaluation (OPE) strategy with the proposed algorithm and the 12 mentioned ones. A visual sample of how these 50 video sequences are tracked by the 13 tracking algorithms is shown in Fig. 6. In addition, the precision of these algorithms on the 50 video sequences, in terms of the precision score described above (the percentage of frames where the CLE is less than 20 px), is reported in Table 3. As Table 3 illustrates, the proposed algorithm achieves better results than all 12 other algorithms, scoring 0.7% higher than the best algorithm in the test set (ECO-HC). According to the t-test evaluation method (Karasulu and Korukoglu, 2011), the proposed algorithm is more accurate than the ECO-HC algorithm with a probability of 61%. The proposed algorithm achieves the best result in 21 video sequences and the worst result in none of them. In terms of the number of best results, it falls behind the BIT (22 best results), MUSter (22 best results), and ECO-HC (33 best results) algorithms.

In addition to the precision on individual video sequences, the performance of the algorithms is reported for the 11 mentioned challenges in Table 4, and the precision plots of the 13 algorithms for each challenge are shown in Fig. 7. As indicated in Table 4, the proposed algorithm achieves the best result in 7 of the challenges. Based on the experimental results on the VTB set, the operation of the various mechanisms of the proposed algorithm can be summarized as follows.

Table 3. The tracking precision of the proposed algorithm and the other 12 mentioned ones for each of the 50 video sequences of the Visual Tracking Benchmark dataset. The precision is defined as the percentage of frames where the CLE is less than 20 px. In each row, the best and worst results are indicated by bold and underlined fonts respectively.

Ours

BIT

DLT

MUSter

ECO-HC

RPT

KCF

Struck

SCM

ASLA

IVT

CPF

MIL

Basketball Bolt Boy Car4 Cardark Carscale Coke Couple Crossing David David2 David3 Deer Dog1 Doll Dudek Faceocc1 Faceocc2 Fish Fleetface Football Football1 Freeman1 Freeman3 Freeman4 Girl Ironman Jogging-1 Jogging-2 Jumping Lemming Liquor Matrix Mhyang Motorrolling Mountainbike Shaking Singer1 Singer2 Skating1 Skiing Soccer Subway Suv Sylvester Tiger1 Tiger2 Trellis Walking Walking2 Woman

0.985 1.000 1.000 0.985 1.000 0.817 0.921 0.843 1.000 0.992 1.000 1.000 1.000 1.000 0.993 0.856 0.911 0.950 1.000 0.760 0.956 0.824 0.960 0.987 0.940 1.000 0.578 0.974 0.993 0.601 0.705 0.914 0.440 1.000 0.409 0.974 0.893 1.000 0.825 0.805 0.148 0.676 1.000 0.979 1.000 0.910 0.611 0.902 1.000 1.000 0.930

1.000 1.000 1.000 0.973 1.000 0.718 0.931 0.607 1.000 1.000 1.000 1.000 0.831 1.000 0.986 0.862 0.877 0.934 1.000 0.581 0.798 0.973 1.000 0.817 0.993 1.000 0.157 0.977 1.000 0.093 0.491 0.986 0.360 1.000 0.049 0.987 0.970 1.000 0.036 1.000 0.136 0.949 1.000 0.979 0.839 0.897 0.449 1.000 1.000 0.440 0.940

0.884 0.066 1.000 1.000 0.710 0.722 0.770 0.321 1.000 0.926 1.000 0.321 0.380 0.994 0.965 0.928 0.533 0.839 0.468 0.403 0.304 0.959 0.402 1.000 0.254 0.778 0.169 0.231 0.186 0.383 0.307 0.191 0.020 1.000 0.037 0.886 0.015 1.000 0.036 0.763 0.123 0.166 0.971 0.824 0.839 0.599 0.244 0.346 0.767 1.000 0.941

1.000 1.000 1.000 1.000 1.000 0.762 0.832 0.843 1.000 1.000 1.000 1.000 0.944 1.000 0.993 0.833 0.871 1.000 1.000 0.608 0.798 0.946 0.893 1.000 0.912 0.990 0.145 0.954 0.980 0.907 0.857 0.980 0.350 1.000 0.238 1.000 0.984 1.000 0.937 1.000 0.086 0.462 1.000 0.977 0.959 0.653 0.460 1.000 1.000 1.000 0.940

0.979 1.000 1.000 1.000 1.000 0.841 0.904 1.000 1.000 0.953 1.000 1.000 1.000 1.000 0.993 0.913 0.950 0.840 1.000 0.621 0.997 0.770 0.819 0.904 0.986 1.000 0.620 0.984 1.000 1.000 0.898 0.990 0.620 1.000 0.043 1.000 0.964 1.000 0.036 0.950 0.111 0.245 1.000 0.971 0.833 0.948 0.904 1.000 1.000 1.000 1.000

0.924 0.017 1.000 0.980 1.000 0.806 0.962 0.679 1.000 1.000 1.000 1.000 1.000 1.000 0.987 0.846 0.583 0.990 1.000 0.583 0.801 0.932 0.972 1.000 0.880 0.924 0.181 0.228 0.179 1.000 0.537 0.979 0.440 0.986 0.055 1.000 0.995 0.558 0.913 1.000 0.136 0.944 1.000 0.979 0.979 0.943 0.814 1.000 1.000 0.684 0.938

0.923 0.989 1.000 0.950 1.000 0.806 0.838 0.257 1.000 1.000 1.000 1.000 0.817 1.000 0.967 0.877 0.730 0.972 1.000 0.460 0.796 0.959 0.393 0.911 0.530 0.864 0.217 0.235 0.163 0.339 0.495 0.976 0.170 1.000 0.049 1.000 0.025 0.815 0.945 1.000 0.074 0.793 1.000 0.979 0.843 0.851 0.356 1.000 1.000 0.440 0.938

0.120 0.020 1.000 0.992 1.000 0.647 0.948 0.736 1.000 0.329 1.000 0.337 1.000 0.996 0.919 0.897 0.575 1.000 1.000 0.639 0.751 1.000 0.801 0.789 0.375 1.000 0.114 0.241 0.254 1.000 0.628 0.390 0.120 1.000 0.085 0.921 0.192 0.641 0.036 0.465 0.037 0.253 0.983 0.572 0.995 0.175 0.630 0.877 1.000 0.982 1.000

0.661 0.031 0.440 0.974 1.000 0.647 0.430 0.114 1.000 1.000 1.000 0.496 0.028 0.976 0.978 0.883 0.933 0.860 0.863 0.529 0.765 0.568 0.982 1.000 0.509 1.000 0.157 0.228 1.000 0.153 0.166 0.276 0.350 1.000 0.037 0.969 0.814 1.000 0.112 0.768 0.136 0.268 1.000 0.978 0.946 0.126 0.112 0.873 1.000 1.000 0.940

0.599 0.017 0.440 1.000 1.000 0.742 0.165 0.086 1.000 1.000 1.000 0.548 0.028 0.997 0.923 0.755 0.180 0.792 1.000 0.301 0.735 0.797 0.390 1.000 0.219 1.000 0.133 0.231 0.182 0.450 0.168 0.226 0.050 1.000 0.061 0.904 0.485 1.000 0.036 0.765 0.136 0.122 0.229 0.575 0.821 0.226 0.142 0.861 1.000 0.404 0.203

0.497 0.014 0.332 1.000 0.807 0.782 0.131 0.086 1.000 1.000 1.000 0.754 0.028 0.980 0.757 0.886 0.645 0.993 1.000 0.265 0.793 0.811 0.807 0.761 0.346 0.444 0.054 0.225 0.199 0.208 0.167 0.207 0.020 1.000 0.030 0.996 0.011 0.963 0.036 0.108 0.111 0.173 0.223 0.447 0.680 0.080 0.082 0.332 1.000 1.000 0.201

0.737 0.906 0.997 0.135 0.165 0.671 0.392 0.871 0.892 0.191 1.000 0.567 0.042 0.906 0.943 0.565 0.322 0.398 0.107 0.161 0.970 1.000 0.764 0.167 0.117 0.744 0.054 0.544 0.844 0.160 0.875 0.518 0.090 0.786 0.061 0.154 0.170 0.994 0.123 0.230 0.062 0.263 0.223 0.777 0.857 0.387 0.107 0.304 1.000 0.364 0.196

0.284 0.014 0.846 0.354 0.379 0.627 0.151 0.679 1.000 0.699 0.978 0.738 0.127 0.919 0.732 0.688 0.221 0.740 0.387 0.358 0.790 1.000 0.939 0.048 0.201 0.714 0.108 0.231 0.186 0.997 0.823 0.199 0.180 0.460 0.043 0.667 0.282 0.501 0.404 0.130 0.074 0.191 0.994 0.123 0.651 0.095 0.414 0.230 1.000 0.406 0.206

Algorithm | Mean | Best | Worst
Ours | 0.881 | 21 | 0
BIT | 0.816 | 22 | 2
DLT | 0.588 | 9 | 5
MUSter | 0.865 | 22 | 0
ECO-HC | 0.874 | 33 | 1
RPT | 0.810 | 19 | 0
KCF | 0.740 | 16 | 1
Struck | 0.656 | 13 | 3
SCM | 0.649 | 12 | 3
ASLA | 0.532 | 11 | 6
IVT | 0.499 | 8 | 14
CPF | 0.488 | 3 | 15
MIL | 0.475 | 3 | 10

Fig. 7. Precision plots of the average tracking results of all 13 evaluated algorithms on all video sequences and for each of the 11 defined tracking challenges.

The mechanism inspired by the self-regulation function of the retina manages local illumination variations well and improves the illumination and contrast information of the images. In addition, in video sequences with significant illumination variations over time, the regulation of the density parameter balances these variations as well as possible. For example, in the Ironman and Matrix videos there are significant variations in the illumination conditions, both locally and globally; this mechanism attenuates those changes and consequently increases the precision of the algorithm. Fig. 8 illustrates an example of the behavior of this mechanism on the Matrix video sequence.

By using the discriminative saliency mechanism to weight the features according to their presence in the target and surrounding areas, the algorithm succeeds in dealing with the background clutter problem. For example, in video sequences such as Deer, where the background and target are similar in terms of color and texture, the proposed algorithm can pursue the target more precisely by giving higher weights to the other features.

Table 4. The average tracking precision of the proposed algorithm and the other 12 mentioned ones for the 11 defined tracking challenges. The precision is defined as the percentage of frames where the CLE is less than 20 px. In each row, the best and worst results are indicated by bold and underlined fonts respectively.

Challenge | Ours | BIT | DLT | MUSter | ECO-HC | RPT | KCF | Struck | SCM | ASLA | IVT | CPF | MIL
IV | 0.823 | 0.763 | 0.534 | 0.795 | 0.793 | 0.810 | 0.728 | 0.558 | 0.594 | 0.517 | 0.418 | 0.366 | 0.349
SV | 0.846 | 0.794 | 0.595 | 0.823 | 0.844 | 0.797 | 0.690 | 0.652 | 0.683 | 0.568 | 0.504 | 0.464 | 0.467
OCC | 0.894 | 0.857 | 0.589 | 0.858 | 0.916 | 0.770 | 0.757 | 0.579 | 0.652 | 0.478 | 0.473 | 0.526 | 0.428
DEF | 0.845 | 0.781 | 0.512 | 0.823 | 0.843 | 0.718 | 0.697 | 0.475 | 0.552 | 0.395 | 0.357 | 0.458 | 0.441
MB | 0.781 | 0.661 | 0.453 | 0.695 | 0.777 | 0.786 | 0.650 | 0.551 | 0.339 | 0.278 | 0.222 | 0.261 | 0.357
FM | 0.796 | 0.679 | 0.504 | 0.733 | 0.819 | 0.780 | 0.651 | 0.655 | 0.371 | 0.309 | 0.277 | 0.424 | 0.444
IPR | 0.853 | 0.788 | 0.541 | 0.805 | 0.807 | 0.815 | 0.734 | 0.608 | 0.594 | 0.512 | 0.467 | 0.466 | 0.462
OPR | 0.876 | 0.834 | 0.572 | 0.854 | 0.866 | 0.801 | 0.736 | 0.607 | 0.628 | 0.530 | 0.477 | 0.538 | 0.479
OV | 0.774 | 0.654 | 0.444 | 0.709 | 0.883 | 0.723 | 0.650 | 0.539 | 0.429 | 0.333 | 0.307 | 0.483 | 0.393
BC | 0.840 | 0.775 | 0.507 | 0.815 | 0.802 | 0.832 | 0.732 | 0.594 | 0.554 | 0.452 | 0.371 | 0.422 | 0.430
LR | 0.747 | 0.369 | 0.396 | 0.582 | 0.666 | 0.480 | 0.381 | 0.545 | 0.305 | 0.156 | 0.278 | 0.130 | 0.171

Mean | 0.825 | 0.723 | 0.513 | 0.772 | 0.820 | 0.756 | 0.673 | 0.578 | 0.518 | 0.412 | 0.377 | 0.413 | 0.402
Best | 7 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Worst | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2 | 3

The size estimation mechanism has generally been able to estimate target size variations; as shown in Table 4, the best result is achieved in the scale variation challenge. Of course, because the resizing mechanism is based on the shape and edge features, if the edge information in the target area is weaker than in its surroundings (as in the Shaking sequence), or if the target undergoes deformation or in-plane rotation (as in the MountainBike and FaceOcc2 sequences), the edge information differs significantly from the overall target model, making the size estimation imprecise and reducing the precision of the algorithm. In these cases the target is still well localized, but the low precision of the size estimation causes only a part of the target to be recognized as the target area, resulting in a relatively constant offset between the true location of the target and the estimated one. In the rare cases where the target size changes suddenly (for example, in the Woman video sequence, a sudden increase in the camera magnification instantaneously multiplies the target size), the size cannot be measured precisely, because a model similar to the motion model, which records the history of target size variations, is used to estimate the target size in each frame. In general, the proposed method assumes that target size variations occur smoothly.

In video sequences with the occlusion challenge, the occlusion mechanism together with the motion model adequately increases the precision of the algorithm by detecting the occurrence of partial or complete occlusion. Furthermore, when part of the target leaves the image plane, the proposed algorithm is also able to estimate the target location more precisely in collaboration with the size estimation mechanism. The low numerical precision in the OV challenge is due to the fact that, for frames in which the target is partly out of the image plane, the ground truth reports only the part of the target area that remains inside the image plane. Fig. 9 illustrates an example of this issue.

Fig. 8. A sample qualitative result of the illumination enhancement mechanism on the Matrix sequence. The first column shows the original frames 77, 79, and 81, and the second column shows the same frames after the illumination enhancement stage.

Fig. 9. A sample qualitative result of handling the out-of-view challenge on the SUV video. The dashed and solid regions indicate the ground truth and the proposed algorithm's result respectively.



Fig. 10. Precision plots of the average tracking results of the proposed algorithm for different values of the regulation parameter of the Michaelis-Menten model in the range V0 ∈ [0, 1].

In the case of complete occlusion, the proposed algorithm follows the location predicted by the motion model. However, if the motion behavior of the target changes during the complete occlusion, the algorithm will not be able to detect the target after the occlusion, because the target is no longer inside the algorithm's search space. A mechanism could be added to enlarge the search region after a complete occlusion, but this would increase the computational cost and consequently reduce the algorithm's speed. In grayscale video sequences, occlusion detection is based solely on motion information, due to the lack of color information; in this situation, if the motion features of the target area are sufficiently discriminative with respect to its surroundings, the occlusion detection mechanism still performs acceptably.

5.5. The effect of parameters

As described in Section 5.3, apart from the free parameters of the BIT algorithm, the proposed algorithm has four additional parameters. The precision and speed of the algorithm can be affected by adjusting them. Therefore, the experimental set is evaluated with different values of these four parameters, and the effect of each value on the precision of the algorithm is studied. The default values of the four parameters are the ones indicated in Table 2, and while examining each parameter, the values of the other three are kept constant.

5.5.1. Regulation parameter of the Michaelis-Menten model

In the Michaelis-Menten model, the regulation parameter controls the degree to which neighborhood information interferes in adjusting pixel values, and can take a value in the range [0, 1]. The results of examining different values of this parameter and their effect on the precision of the algorithm are shown in Fig. 10.
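For illustration only, a Michaelis-Menten (Naka-Rushton) style normalization of the kind regulated by V0 can be sketched as below. The exact formulation of the retina-inspired enhancement is defined in Section 4; the way the semi-saturation level mixes local and global statistics here is purely an assumption of this sketch.

```python
import cv2
import numpy as np

def michaelis_menten_enhance(gray, v0=0.7, neighborhood=15):
    """Illustrative Naka-Rushton style intensity normalization.

    gray : single-channel image scaled to [0, 1].
    v0   : regulation parameter in [0, 1]; larger values let the local
           neighborhood mean interfere more with each pixel (this mixing
           rule is an assumption; see Section 4 for the model actually used).
    """
    gray = gray.astype(np.float32)
    local_mean = cv2.blur(gray, (neighborhood, neighborhood))
    global_mean = float(gray.mean())
    # Semi-saturation level: a V0-weighted mix of local and global statistics.
    sigma = v0 * local_mean + (1.0 - v0) * global_mean + 1e-6
    return gray / (gray + sigma)
```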


Fig. 11. Precision plots of the average tracking results of the proposed algorithm for different values of the impact factor threshold in the range THds ∈ [0, 0.2].

Table 5. Comparison of precision, speed, and number of effective features for different values of the impact factor threshold

THds | Precision | Speed (fps) | #Effective Features
0.00 | 0.883 | 23.1 | 68.0
0.02 | 0.881 | 30.9 | 39.9
0.04 | 0.821 | 32.2 | 34.5
0.06 | 0.807 | 33.7 | 29.8
0.08 | 0.807 | 34.8 | 26.5
0.10 | 0.772 | 35.2 | 23.7
0.12 | 0.764 | 35.6 | 21.1
0.14 | 0.754 | 36.1 | 19.0
0.16 | 0.752 | 36.8 | 17.1
0.18 | 0.712 | 37.4 | 15.3
0.20 | 0.655 | 38.0 | 13.7

With low values of this parameter, the algorithm cannot properly improve the illumination and local contrast levels or balance the illumination conditions over time. On the other hand, a large value causes excessive interference of the neighborhood information in the pixel values and has a negative effect on the precision of the algorithm. If the illumination conditions of the video sequences are known, a suitable value can be selected for this parameter.

5.5.2. Impact factor threshold

This parameter classifies the features into efficient and inefficient categories based on their normalized discrimination degree, so that the algorithm avoids using ineffective features. To examine its effect, its value is varied in the range [0, 0.2]. The average precision, the number of efficient features, and the execution speed of the algorithm for each value are reported in Table 5; the corresponding precision plots are shown in Fig. 11.
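Gating the feature maps by this threshold amounts to a few lines; a sketch, with `saliency` holding one discriminative saliency value per feature map (the sum normalization used here is one possible choice):

```python
import numpy as np

def effective_features(saliency, th_ds=0.02):
    """Indices of feature maps kept for recognition (Table 5 counts these)."""
    s = np.asarray(saliency, dtype=float)
    s = s / (s.sum() + 1e-12)          # normalize the discriminative saliency
    return np.flatnonzero(s > th_ds)   # features below TH_ds are ignored
```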


Fig. 12. Precision plots of the average tracking results of the proposed algorithm for different values of the partial occlusion threshold in the range THpo ∈ [0, 0.5].

The greater the value of this parameter, the more features are placed in the ineffective category and the fewer features the algorithm works with. The algorithm's speed therefore increases, since less information has to be processed, but its precision decreases. It should be mentioned that, by weighting features based on their discrimination rate, the algorithm can already handle the background clutter challenge well, and this threshold is only a trade-off between the execution speed and the precision of the algorithm; it is not sensible to assign a high value to this parameter and buy speed at the expense of a loss of precision. In tests conducted using a 3.4 GHz processor, assigning a value of 0.02 to this parameter yields an execution speed of 31 FPS, and this threshold is low enough not to significantly affect the precision of the algorithm.

5.5.3. Partial and complete occlusion thresholds

Considering that in real applications recognition and classification algorithms are not 100% precise, and that the occlusion detection of the proposed algorithm uses a neural network to categorize pixels into target and occlusion groups, one should not expect the algorithm to recognize the occluded areas perfectly. As a result, the values assigned to the partial and complete occlusion thresholds should be selected so that the precision of the algorithm is not reduced by classification errors. To examine this issue, various values in the ranges [0, 0.5] and [0.5, 1] are assigned to the partial and complete thresholds respectively, and the precision of the algorithm is investigated. The results of this study are shown in Fig. 12 and Fig. 13 for the partial and complete occlusion thresholds respectively.

A low value of $TH_{po}$ causes any possible change in the target appearance (due to reasons such as illumination variation or target deformation) to be identified as an occluded area and removed from the model updating process. Thus, the algorithm is unable to learn the variations of the target appearance, and consequently the target cannot be detected in future frames.


Fig. 13. Precision plots of the average tracking results of the proposed algorithm for different values of the complete occlusion threshold in the range THco ∈ [0.5, 1].

Conversely, assigning a high value to this parameter causes the features of the areas covered by occluding objects to be included in the target appearance model, weakening it; if those occluding areas remain inside the target area long enough, they completely disrupt the appearance model.

For the complete occlusion threshold, a high value of $TH_{co}$ can never be reached in practice because of possible errors in the classification, which results in accepting an incorrect region as the target. On the one hand, this corrupts the motion model, which gets updated with wrong locations due to the error in the location estimation; in this case, if the occluding object is larger than the search area, there is no chance of recognizing the target after it exits the occlusion. (In these conditions, the features of the occluding objects still do not enter the target appearance model, thanks to the partial occlusion threshold, since it is always assumed that $TH_{po} < TH_{co}$.) On the other hand, with low values of this threshold, the algorithm ignores the small visible parts of the target on the image plane and relies only on the motion model to predict the probable target location. Because of the limited precision of the motion model, the predicted position is error-prone and, as noted earlier, if the target motion behavior changes in the meantime, the target may be missed after the occlusion ends.

5.6. Runtime analysis

From the perspective of execution time, applications of tracking algorithms can be divided into two classes. The first class consists of applications such as medical studies and the movie industry, in which the tracking algorithm operates offline and does not require high-speed processing. The second class consists of applications such as monitoring and surveillance, the gaming industry, and virtual reality, which require real-time processing.


Fig. 14. Execution speed of the 13 mentioned algorithms in relation to their precision.

In this category of applications, algorithms are usually applied to a video stream with a typical rate of 25 fps, so the algorithm must be able to process frames at least at this rate in order not to retard the video sequence. The proposed algorithm, like most previously introduced tracking algorithms, is intended for real-time applications. Certainly, if execution time and processing power are not taken into account, higher-precision algorithms can be achieved; for example, the offline version of the ECO-HC algorithm, known as ECO (Danelljan et al., 2016), reaches an average precision of 0.930 (based on the mentioned criterion) on the experimental set, but its processing speed is low, close to 5 frames per second. In the tests conducted in this section, which use an Intel Core i7-4790S processor running at 3.2 GHz, the proposed algorithm performs the tracking operation at an average of 31 FPS, which is higher than the usual video sequence rate.

In Fig. 14, the execution speed of the 13 mentioned algorithms is shown in relation to their precision. The dashed-line curve represents the Pareto front; in different applications, one of the algorithms on this front can be selected based on the required speed or precision. Within this set of algorithms, the proposed algorithm has the highest precision and the KCF algorithm has the highest execution speed.
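Frame rates of the kind reported in Fig. 14 can be measured with a simple wall-clock loop; `track_frame` below stands for the per-frame routine sketched in Section 4.3 and is passed in as a parameter.

```python
import time

def measure_fps(frames, tracker_state, track_frame):
    """Average processing rate over a sequence, in frames per second."""
    start = time.perf_counter()
    for frame in frames:
        track_frame(frame, tracker_state)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```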

5.7. Employing other computational models

As illustrated in Fig. 2, a visual tracking algorithm inspired by biological vision and the various mechanisms of the visual cortex has been proposed; its procedure can be divided into seven operational units (explained in detail in Section 4) that employ eight computational models to perform the task. In choosing these computational models, an attempt has been made to use methods that are as similar as possible to the mechanisms of the biological visual system. However, this modular structure also allows other computational models and algorithms to be employed in order to achieve higher precision or execution speed, or to pay more attention to a specific challenge (depending on the application at hand). Using another intensity dependency removal method (Kessy et al., 2017), employing other motion models or considering more complex motion patterns (Khoei et al., 2013; Faria et al., 2018), using other feature extraction methods, or employing other classification algorithms for discriminating features or for occlusion detection are some of these alternatives. It should be kept in mind, however, that in most cases higher precision is accompanied by higher computational cost and lower processing speed.

Moreover, instead of the mentioned computational and statistical methods, which are mostly hand-crafted, offline/online learning methods can be employed in any of these operational units if properly trained. Among such methods, deep neural networks, especially convolutional neural networks (CNNs), are a good choice. Employing CNNs and auto-encoder networks to extract robust and transform-invariant features (Simonyan and Zisserman, 2014; Mahdi and Qin, 2017) in the feature extraction unit, to select discriminative features (Han and Vasconcelos, 2014) in the update and location estimation units, or to extract local motion vectors (optical flow) (Fischer et al., 2015) in the motion model and in the prediction of the probable target location are some of the available choices. It should be noted that using deep neural networks in problems such as tracking, which involve a variety of challenges and can occur in environments with very different conditions and structures, faces difficulties arising from the training approach and from the similarity of the training set to the environment at hand. Two strategies are commonly used for network training: the network is trained on a pre-defined dataset and the trained network is then employed in the algorithm's procedure, or it is trained on the limited number of observations available during the algorithm's run. Pre-trained networks work well only in environments that have the same conditions as the training data; with online training, on the other hand, the very small number of samples means the network is usually not trained properly, and the training cost must also be considered. One solution to this problem is a middle way, which employs a pre-trained network and then adjusts and updates the middle layers or some of the network parameters, based on the available observations, to adapt the network to the environment at hand. For example, Wang et al. use a pre-trained network to detect objects in their tracking algorithm (Wang et al., 2015) and, based on the fact that the features generated in the primary layers of a CNN are discriminative in nature, apply the updating only to those layers. It should be noted that such mechanisms still suffer from being hand-crafted.

6. Conclusion

In the present paper, a visual tracking algorithm inspired by the recognition process in the ventral pathway of the visual cortex and the various mechanisms involved in it is introduced. The biological vision system performs the visual tracking task almost ideally by employing various mechanisms, including the extraction of robust visual features, bottom-up and top-down saliency for detecting discriminative features, a motion model for

understanding the target behavior, and occlusion management, and it achieves a very accurate and robust performance in the most challenging conditions. Considering these points, the proposed algorithm draws on those mechanisms to inject the power of biological systems into traditional computational ones and thereby achieve a high degree of precision. Extracting low-level features similar to those extracted in the primary layers of the visual cortex, selecting the most appropriate subset of those features based on their discrimination degree, improving the illumination and local contrast conditions and modulating illumination variations over time by taking inspiration from the operation of the ganglion layer of the retina, memorizing shape, color, and motion information for detecting size variations and occlusion in a manner similar to short-term memory, and managing complete occlusion using the target's motion behavior are the set of mechanisms employed in the proposed algorithm. The experiments conducted on the algorithm, as well as the comparison with other well-known tracking algorithms, indicate that the proposed algorithm copes well with the challenges of the tracking problem in terms of precision and robustness as well as execution time.

It is worth mentioning that, because of the dependence on low-level shape, color, and motion information, the appearance and motion pattern of the target are not learned ideally, and in circumstances such as deformation or sudden changes in the target appearance, the detection may fail and the precision of the algorithm drops. Adding a processing layer on top of the low-level features to interpret this information and obtain high-level semantic information and features (similar to the function of layers V2 and V4 of the visual cortex, which operate on the features extracted by layer V1 (Serre, 2014)) could be a solution to this problem. One could also add an identity management phase and parallelize some processing blocks to extend the proposed algorithm to multi-object tracking.

Appendix A. Discriminative Saliency

The discriminative saliency of an area is defined as the degree of discrimination that this area generates with respect to its surroundings. From a computational point of view, it is related to the probability that an error occurs when discriminating that area from its surroundings. Assuming $Y$ is the visual stimulus (the image) and $l$ is the area under study, two windows (similar to receptive fields) of different sizes are defined around the $l$ region: the center window and the surround window, denoted $w^1_l$ and $w^0_l$ respectively. The combination of these two windows, which covers the entire area under processing, is denoted $w_l = w^0_l \cup w^1_l$. To calculate the saliency of a visual feature, the descriptors of the center window, the surround window, and the entire area are extracted as $X^1_l$, $X^0_l$, and $X_l$, and from these descriptors the three distributions $p_{X|C}(X^1_l \mid 1)$, $p_{X|C}(X^0_l \mid 0)$, and $p_X(X_l)$ are estimated. The discrimination degree of the feature is defined as the discrimination rate it can create between the center and surround classes, and the mutual information is the natural way to calculate this rate:

$$S_l = I(X_l; C) = \sum_{i=0}^{1} \int p_{X,C}\!\left(X^i_l, i\right) \log \frac{p_{X,C}\!\left(X^i_l, i\right)}{p_X(X_l)\, p_C(i)} \, dx \tag{A.1}$$

Considering the definition of the distance between two probability distributions, $KL(p \,\|\, q) = \int p \log \frac{p}{q}\, dx$, the saliency can be written, according to Equation A.2, as the distance between the distributions of the features:

$$S_l = I(X_l; C) = \sum_{i=0}^{1} p_C(i)\, KL\!\left( p_{X|C}\!\left(X^i_l \mid i\right) \,\big\|\, p_X(X_l) \right) \tag{A.2}$$

To calculate the mutual information, and consequently the saliency value, the probability distribution of the feature descriptor must be estimated. Studies of the statistical properties of natural images have shown that the band-pass features of natural images can be adequately modeled by the generalized Gaussian distribution (GGD) (Do and Vetterli, 2002):

$$p_X(x; \alpha, \beta) = \frac{\beta}{2\alpha\, \Gamma(1/\beta)} \exp\!\left( -\left(\frac{|x|}{\alpha}\right)^{\beta} \right) \tag{A.3}$$

where $\Gamma(z) = \int_0^{\infty} e^{-t} t^{z-1}\, dt$ is the gamma function, $\beta$ is the shape parameter, and $\alpha$ is the scale parameter. With GGD models for the $p_{X|C}(X^i_l \mid i)$ and $p_X(X_l)$ distributions, the calculation of the saliency requires the KL distance between two GGDs. For two GGDs $p_X(X)$ and $q_X(X)$ with parameters $\{\alpha_p, \beta_p\}$ and $\{\alpha_q, \beta_q\}$, this distance is (Do and Vetterli, 2002):

$$KL\!\left(p_X \,\|\, q_X\right) = \log\!\left( \frac{\beta_p\, \alpha_q\, \Gamma(1/\beta_q)}{\beta_q\, \alpha_p\, \Gamma(1/\beta_p)} \right) + \left(\frac{\alpha_p}{\alpha_q}\right)^{\beta_q} \frac{\Gamma\!\left( (\beta_q + 1)/\beta_p \right)}{\Gamma(1/\beta_p)} - \frac{1}{\beta_p} \tag{A.4}$$

Knowing that $p_C(i)$ denotes the prior probability of class $i$, it is obtained from the window sizes as shown in Equation A.5:

$$p_C(i) = \pi_i = \frac{\left|w^i_l\right|}{\left|w_l\right|} \tag{A.5}$$

By combining Equations A.2, A.4, and A.5, the saliency value is calculated via Equation A.6:

$$S_l = \sum_{i=0}^{1} \pi_i \left[ \log\!\left( \frac{\beta_i\, \alpha\, \Gamma(1/\beta)}{\beta\, \alpha_i\, \Gamma(1/\beta_i)} \right) + \left(\frac{\alpha_i}{\alpha}\right)^{\beta} \frac{\Gamma\!\left( (\beta + 1)/\beta_i \right)}{\Gamma(1/\beta_i)} - \frac{1}{\beta_i} \right] \tag{A.6}$$

where $\{\alpha_i, \beta_i\}$ and $\{\alpha, \beta\}$ are the distribution parameters of $p_{X|C}(X^i_l \mid i)$ and $p_X(X_l)$ respectively. The shape parameter $\beta$ cannot be computed in closed form, and its estimation carries a large computational cost. However, since the shape of the band-pass feature distributions of natural images is relatively constant, the shape value is taken as a constant and estimated from a collection of natural images. Tests have indicated that a value in the interval [0.5, 0.8] is appropriate for this parameter (Gao and Vasconcelos, 2009); in the proposed algorithm the value 0.7 is used. With $\beta$ known, the scale parameter $\alpha$ is estimated according to Equation A.7, where $n$ is the number of observations and $\mu$ is their mean value:

$$\alpha = \left( \frac{\beta}{n} \sum_{i=1}^{n} \left| x_i - \mu \right|^{\beta} \right)^{1/\beta} \tag{A.7}$$
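Equations A.4, A.6, and A.7 translate into a few lines of NumPy/SciPy. In the sketch below, β is held fixed (0.7, as in the proposed algorithm) and the window descriptors are assumed to be 1-D arrays of band-pass feature responses; the class prior is approximated by the relative sample counts of the two windows.

```python
import numpy as np
from scipy.special import gamma

BETA = 0.7  # fixed GGD shape parameter (Appendix A)

def ggd_alpha(x, beta=BETA):
    """Scale parameter of a GGD fitted to samples x with known beta (Eq. A.7)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    return ((beta / len(x)) * np.sum(np.abs(x - mu) ** beta)) ** (1.0 / beta)

def ggd_kl(alpha_p, beta_p, alpha_q, beta_q):
    """KL divergence between two GGDs (Do and Vetterli, 2002; Eq. A.4)."""
    term1 = np.log((beta_p * alpha_q * gamma(1.0 / beta_q)) /
                   (beta_q * alpha_p * gamma(1.0 / beta_p)))
    term2 = ((alpha_p / alpha_q) ** beta_q *
             gamma((beta_q + 1.0) / beta_p) / gamma(1.0 / beta_p))
    return term1 + term2 - 1.0 / beta_p

def discriminative_saliency(x_center, x_surround, beta=BETA):
    """Saliency of one feature (Eq. A.6); pi_i approximated by sample counts."""
    x_all = np.concatenate([x_center, x_surround])
    alpha = ggd_alpha(x_all, beta)
    n = len(x_all)
    s = 0.0
    for x_i, pi_i in ((x_surround, len(x_surround) / n),   # i = 0 : surround
                      (x_center,  len(x_center)  / n)):    # i = 1 : center
        alpha_i = ggd_alpha(x_i, beta)
        s += pi_i * ggd_kl(alpha_i, beta, alpha, beta)
    return s
```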

The discrimination rate between the center and surround windows is high when the investigated feature is boldly present in one window while its presence in the other is negligible, which can happen in two cases: the feature is prominent in the center window, or prominent in the surround window. Since in the proposed algorithm the presence of the desired feature in the target class is essential, the selected features should not only generate a high degree of discrimination but also be significantly present in the target class rather than in the surround. This can be identified from the GGD shape: the GGD has a different shape and entropy in the presence or absence of a feature, with a higher entropy and a narrower shape when the feature is present, and a lower entropy and a wider shape when it is absent (Gao et al., 2009). Consequently, if $H(X \mid i)$ denotes the entropy of feature $X$ in class $i$ (where $i$ takes the values 0 and 1, representing the surround and target classes respectively), the selected features are those that satisfy the condition $H(X \mid 1) > H(X \mid 0)$, i.e., features whose GGD has a higher entropy in the target window than in the surround window. The entropy of a GGD is calculated as shown in Equation A.8 (Nadarajah, 2005):

$$H(X \mid i) = \frac{1}{\beta_i} - \log\!\left( \frac{\beta_i}{2\alpha_i\, \Gamma(1/\beta_i)} \right) \tag{A.8}$$

Since the shape parameter $\beta$ is assumed constant for all distributions, the entropy condition reduces to $\log(\alpha_0/\alpha_1) < 0$. The discrimination value of features that do not satisfy this condition is set to zero.

References

Adam, A., Rivlin, E., Shimshoni, I., 2006. Robust Fragments-based Tracking using the Integral Histogram, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pp. 798–805. doi:10.1109/CVPR.2006.256.
Avidan, S., 2007. Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 261–271.
Babenko, B., Yang, M.H., Belongie, S., 2009. Visual tracking with online Multiple Instance Learning, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990. doi:10.1109/CVPR.2009.5206737.
Bai, X., Priyanka, S.A., Tung, H.J., Wang, Y., 2016. Bio-Inspired Night Image Enhancement Based on Contrast Enhancement and Denoising, in: Cognitive Systems and Signal Processing, Springer, Singapore. pp. 82–90. doi:10.1007/978-981-10-5230-9_9.
Bai, Y., Tang, M., 2012. Robust tracking via weakly supervised ranking SVM, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1854–1861. doi:10.1109/CVPR.2012.6247884.
