Toward analyzing mutual interference on infrared-enabled depth cameras





Lucas Adams Seewald a, Vinicius Facco Rodrigues a, Malte Ollenschläger a,b, Rodolfo Stoffel Antunes a, Cristiano André da Costa a,∗, Rodrigo da Rosa Righi a, Luiz Gonzaga da Silveira Jr. a, Andreas Maier b, Björn Eskofier b, Rebecca Fahrig c

a Software Innovation Laboratory - SOFTWARELAB, Universidade do Vale do Rio dos Sinos - Unisinos, São Leopoldo, Brazil
b Friedrich-Alexander-Universität Erlangen-Nürnberg - FAU, Erlangen, Germany
c Siemens Healthcare GmbH, Forchheim, Germany

ARTICLE INFO

Communicated by N. Paragios

MSC: 68U05; 65D18; 68T45; 13C15

Keywords: Depth camera; Multi-camera; Interference; Evaluation; Computer vision

∗ Correspondence to: Universidade do Vale do Rio dos Sinos. Av. Unisinos, 950. 93022-750. São Leopoldo, RS, Brazil. E-mail address: [email protected] (C.A.d. Costa).

Article history: Received 25 January 2018; Received in revised form 26 September 2018; Accepted 29 September 2018.
https://doi.org/10.1016/j.cviu.2018.09.010
1077-3142/© 2018 Elsevier Inc. All rights reserved.

ABSTRACT Camera setups with multiple devices are a key aspect of ambient monitoring applications. These types of setups can result in data corruption when applied to recent RGB-D camera models because of mutual interference of the infrared light emitters employed by such devices. Consequently, the behavior of such interference must be appropriately evaluated to provide data that will allow monitoring systems to handle possible errors introduced into the data captured by depth sensors. However, multi-device setups have been explored in few studies in the current literature, especially in terms of the detailed measurements of the interference’s effect during long-term usage of RGB-D cameras. In this context, a methodology to evaluate the effect of mutual interference on the accuracy and precision of measured depth values is proposed in this article. The results of a series of experiments with different setups based on multiple depth cameras are explored. These setups include three devices that were widely used in studies in the computer vision literature related to depth imaging: the Microsoft Kinect v2 and two Intel RealSense models: R200 and D415. The experimental results indicate that the Kinect v2 yields considerably more stable depth readings than the RealSense R200 in single-camera scenarios, even considering the influence of the warm-up time that is characteristic of time-of-flight devices such as the Kinect v2. In multi-device setups, the Kinect v2 displays periodic peaks of mutual interference that increase in intensity depending on the distance between the cameras, with short-range setups yielding higher interference peaks. Further, the addition of more devices can potentially increase the duration of some interference peaks, albeit their intensity is not greatly affected. In long-range setups, the measured interference is small considering the experiments’ length, with the proportion of bad pixels among all captured frames ranging from 3.74% to 3.97% in a setup comprising three depth cameras. In turn, multi-device setups comprising the RealSense models are not affected by prejudicial interference peaks. In long-range setups, the instability of the R200 leads to its results being less accurate and precise than those of the Kinect v2 under mutual interference. However, in close range multi-device setups, the high interference peaks observed with the Kinect v2 render the RealSense models a more stable alternative.

1. Introduction

In recent work on computer vision technologies, the capabilities of depth imaging cameras for purposes such as gesture recognition (Yao et al., 2014), activity recognition (Kerola et al., 2017; Zhang and Tian, 2015), face recognition (Kim et al., 2017), and 3D modeling of indoor environments (Henry et al., 2012) were explored. For devices that capture both RGB and depth data, also known as RGB-D cameras, different techniques were explored for estimating the distance between the image sensors and observed surfaces. The obtained depth data can be extrapolated to specialized data such as skeletons or point clouds, which are the basis for advanced computer vision applications.

Most RGB-D cameras are based on one of the following three different technologies: stereo vision (SV) (Hussmann et al., 2008), structured light (SL) (Scharstein and Szeliski, 2003), and time-of-flight (ToF) (Gokturk et al., 2004). SV cameras evaluate the disparity between two RGB images to approximate the distances between the cameras and objects. SL devices emit infrared (IR) patterns, the distortion of which from the cameras' perspective allows the distance to be estimated. Finally, ToF cameras calculate the time delay or phase difference between emitted and captured IR signals to measure the distance from their targets.



SV cameras have the advantage of being passive, i.e., they can provide depth information without introducing IR light into the scene. However, they require additional computational power to extract depth data from the disparity of the stereo images and are very susceptible to ambient conditions such as poor lighting (Scharstein and Szeliski, 2003). SL cameras are more resilient to poor lighting, but still require additional computational power, because they employ stereo algorithms for depth evaluation. In turn, ToF cameras retain the resilience to poor lighting and are capable of deriving depth information without employing additional stereo algorithms (Hussmann et al., 2008). It is important to note that the use of IR emitters renders SL and ToF cameras susceptible to interference from other IR-based sources. In particular, placing these devices in external environments with direct sunlight incidence is known to cause considerable deviations in depth readings (Zennaro et al., 2015).

IR light interference is particularly relevant in scenarios where multiple depth cameras are employed. For example, a set of two or more devices may be used to monitor a room from multiple perspectives, thus providing robustness against occlusion. In such scenarios, it is likely that the frustums of the IR light emitters and sensors will overlap, resulting in mutual interference. As a result, the collected depth data could include incorrect readings; algorithms depending on these readings would then produce unreliable output. To avoid such an outcome, it is important to understand the behavior of depth camera sensors under mutual IR light interference. However, few studies in the current literature have examined multi-device setups, especially in terms of detailed measurements of the interference's effect during long-term usage of RGB-D cameras. For instance, Geiselhart et al. (2016) and Otto et al. (2015) stated that interference between several equivalent cameras is negligible; however, they did not experimentally measure the actual effect.

Therefore, the main goal of this article is to present an in-depth study of the behavior of well-known depth cameras in multi-device setups susceptible to mutual IR light interference. We focus specifically on three widely studied RGB-D devices: the Microsoft Kinect v2 (Lachat et al., 2015), which employs ToF technology; the Intel RealSense R200 (Culbertson, 2015), which is based on SL; and the Intel RealSense D415, which employs a combination of SL and SV. It is important to note that the results obtained with these camera models cannot be generalized to all devices in which the respective depth sensing method is implemented, because there are technical aspects specific to these camera models (e.g., the IR emitter potency or optical system quality) that are not directly related to the respective technology they implement.

Our article offers the following three contributions:

• A methodology for measuring the impact of mutual interference of RGB-D cameras (Section 3);
• An analysis of three different camera models, two based on SL and one based on ToF, focused on the stability of camera measurements (Section 5);
• A detailed analysis of depth camera behavior in multi-device setups focused on the influence of IR light interference on measured depth values (Section 6).

The remainder of the article is organized as follows. In Section 2, previous work related to our study is reviewed. In Section 3, our proposed methodology to measure mutual interference among depth cameras is introduced. Section 4 describes the experimental setups employed in our evaluation. Section 5 presents the results of individual camera evaluations. In Section 6, we analyze the results of multi-device interference experiments. Finally, Section 7 concludes the article with final remarks.

2. Related work

Table 1
Overview of articles addressing one or several Kinect v2 devices.

Paper                            | Multi-device
Capecci et al. (2016)            | 1
Fankhauser et al. (2015)         | 1
Breuer et al. (2014)             | 1
Corti et al. (2016)              | 1
Wasenmüller and Stricker (2017)  | 1
Geiselhart et al. (2016)         | >1
Otto et al. (2015)               | >1
Kowalski et al. (2015)           | 4
Kunz et al. (2016)               | 2
Sarbolandi et al. (2015)         | 2
Carfagni et al. (2017)           | 1
Gonzalez-Jorge et al. (2015)     | 1
Zennaro et al. (2015)            | 1

Investigated features across these works: accuracy, precision, object color, multiple path, temperature, angular resolution, object material, distance to object, skeletal tracking, motion tracking, markerless tracking, 3D scan, and other aspects (multi-device interference, ambient background light, intensity, dynamic scenery, 3D scanning, outdoor tracking).

Table 1 summarizes the work related to our study, which is further analyzed in the remainder of this section. The Kinect v2 (Microsoft, 2014) was released in 2014 and is used for various applications. These include mobile robot navigation, as described by Fankhauser et al. (2015), and 3D modeling, as described by Kowalski et al. (2015). Alabbasi et al. (2015) showed that it can also be used for human motion tracking. Furthermore, the Kinect v2 is used in medical scenarios, such as gait detection and analysis (Geerse et al., 2015; Liu and Mehrotra, 2016), sleep monitoring (Lee et al., 2015), and surgery analysis (Schreiter et al., 2016). In the evaluation of the captured data, the camera characteristics (e.g., accuracy) are relevant aspects. Although several of these applications that use the Kinect v1 have been extensively investigated, Capecci et al. (2016) and Moon et al. (2016) indicated that almost no studies have been conducted to investigate the use of the Kinect v2 in such areas.

The Kinect v2 is a ToF-based device and differs greatly from the Kinect v1, which is based on SL. For this reason, a number of studies were focused on a comparison of the depth reading results of the two generations of the device. Gonzalez-Jorge et al. (2015) presented a detailed analysis of the accuracy and precision of the two Kinect generations. They focused on evaluating the extent to which their angle and their distance from the subject influence the devices' depth readings. Their results indicate that both devices lose accuracy when their distance from the subject increases. However, the Kinect v2 is influenced less, with an offset variation of less than 10 mm between the minimum and maximum evaluated distances. The Kinect v2 was thus shown to be more accurate overall than its predecessor. Zennaro et al. (2015) also compared the two generations of the device, focusing on use cases involving 3D reconstruction and people tracking. The authors verified that the Kinect v2 offers better results, being two to six times more accurate than the Kinect v1. They also concluded that the newer device is more robust to artificial illumination and sunlight. Wasenmüller and Stricker (2017) presented an in-depth analysis of both Kinect generations considering aspects such as warm-up, subject distance, and subject color. Their results also indicate that the Kinect v2 is more accurate overall. However, the newer device is sensitive to a warm-up effect: during the warm-up period, there is a deviation of approximately 20 mm in its depth readings.

Additional studies exist on the Kinect v2's behavior in single-camera setups. High-level investigations were undertaken by Capecci et al. (2016). They calculated the distances between body parts using a Kinect v2, utilizing the results of a motion capture system as the gold standard. They stated that the Kinect v2 performs better for upper than for lower limbs.





Toward a low-level exploration of the Kinect v2, Breuer et al. (2014) analyzed the camera's accuracy, precision, and angular resolution. The authors reported a warm-up behavior similar to that observed in the results presented in Wasenmüller and Stricker (2017). For accuracy, they captured a marker board and stated that the accuracy changes during the first 1000 s after power-up because of the warm-up interval of the sensor. During this interval, a characteristic fluctuation of distance estimation occurs because of the activation of the Kinect v2's internal cooling fan. Also using a single-camera setup, Corti et al. (2016) showed that multiple path errors exist when the camera's IR light is reflected from one surface to a second one before returning to the camera instead of being reflected back by a single surface. An algorithm to mitigate these multi-path errors was presented by Freedman et al. (2014).

The aforementioned articles presented results similar to ours with respect to single-camera setups. In particular, Wasenmüller and Stricker (2017) reached similar conclusions regarding the warm-up behavior of the Kinect v2. However, in the studies described above, the long-term behavior of multi-device setups including IR-enabled cameras was not extensively investigated. Thus, there are relevant aspects related to interference that have not been comprehensively addressed. For example, it is not known whether multi-path errors are increased by inter-camera interference. Some authors have stated that the interference between several Kinect v2 cameras is negligible; however, they did not describe experiments regarding this topic, as, for example, did Geiselhart et al. (2016) and Otto et al. (2015). Kowalski et al. (2015) used four Kinect v2 cameras to produce 3D scans of humans. The cameras were positioned in a circular setup and tilted such that they faced downwards. The authors argued that the tilt reduces interference, but they did not provide experimental results to verify this. Experiments regarding a low-level feature were performed by Kunz et al. (2016). They investigated the influence of interference between two cameras on accuracy. Two groups of experiments were conducted. During the first, the cameras were facing the same target, whereas in the second they were facing each other. The authors advised that the angle between two cameras should not be in the range of 10° to 40° in order to avoid randomly occurring errors, although they did not describe these errors in great detail. Data captured when only one camera was turned on were used as ground truth. Various experiments were conducted by Sarbolandi et al. (2015). Among other results, they found periodic interference between two Kinect v2 cameras during 25% of the time during which data were captured in a recording approximately 3 min in length.

Work related to RealSense devices has also been presented in the literature. Carfagni et al. (2017) explored various aspects of the RealSense SR300, a device similar to the RealSense R200 model but designed for close-range use cases. The authors focused on the use case of 3D scanning based on RGB-D devices. Their results indicate that the RealSense SR300 cannot be compared to a professional 3D scanner. However, the results of other studies on RGB-D-based 3D scanning showed that the RealSense SR300 presents a quality equivalent to that of other camera models. In the specific case of the RealSense R200, Fanello et al. (2017) investigated the accuracy and precision of the device in short-duration experiments. Our results, which include an analysis of the long-term behavior of the RealSense R200 model, complement those of Fanello et al. (2017).

Although a considerable amount of information about warm-up behavior and multiple path interference is available for the Kinect v2, no in-depth analysis of multi-device setups has thus far been conducted. This is especially true for low-level features, such as accuracy and precision. The following gaps can be found in the research presented in the literature:

• Individual Camera Performance
  – Long-term evaluation (8-h stress test)
  – Warm-up behavior
• Mutual Device Setups
  – Interference between two cameras
  – Interference between three cameras

To address these gaps, techniques suggested in the literature can be used, such as using the time-averaged depth values for a pixel and assuming a 1-h warm-up interval in the case of experiments not related to warm-up, as suggested by Wasenmüller and Stricker (2017). Furthermore, ground truth data may be captured from single-camera setups, as suggested by Kunz et al. (2016).

3. Proposed methodology

This section describes the experimental methodology used to evaluate the selected cameras. In summary, the goal was to measure the relative accuracy and relative precision of their depth sensor readings. These metrics are defined as relative, because our goal was to compare individual camera readings with a reference value obtained from the camera itself. They are calculated using the mean and standard deviation values of sets of depth frames against the reference values. Consequently, the error measured in our experiments is also relative, because it is a measure of the extent to which the readings of an individual camera deviate from its average behavior. The methodology used to calculate these metrics, summarized in Fig. 1, is described in detail in the remainder of this section.

In general, the captured data are compressed into chunks of 10 s, as proposed by Wasenmüller and Stricker (2017). Therefore, several frames are combined into a chunk frame, as illustrated in Fig. 2. The chunk c shown there is represented by three frames f(t), where t is the time. Each frame in Fig. 2 has one depth value d_f^p for each of the nine pixels p. To summarize the behavior of a pixel p within a chunk frame c, we calculate its time-averaged depth value μ_c^p, as described in Eq. (1). The depth values' standard deviation for each pixel, σ_c^p, is calculated using Eq. (2). In both of these equations, F is the number of frames in a considered chunk. The values μ_c^p and σ_c^p are averaged over all pixels to obtain μ_c and σ_c, which describe the averaged mean value and averaged standard deviation for the chunk.

μ_c^p = (1/F) ∑_{f=0}^{F−1} d_f^p    (1)

σ_c^p = √( (1/F) ∑_{f=0}^{F−1} (d_f^p − μ_c^p)² )    (2)
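The chunk statistics in Eqs. (1) and (2) map directly onto simple array operations. The following minimal sketch, written under the assumption that the depth frames of a chunk are stacked in a NumPy array of shape (F, H, W) with invalid pixels already handled, computes μ_c^p, σ_c^p, and the chunk-level averages μ_c and σ_c; all names are illustrative.

```python
import numpy as np

def chunk_statistics(chunk):
    """Per-pixel and chunk-level statistics for one 10-s chunk.

    chunk: float array of shape (F, H, W) holding the F depth frames of the
    chunk, in millimeters, with invalid pixels already removed or masked.
    Returns (mu_p, sigma_p, mu_c, sigma_c) following Eqs. (1) and (2).
    """
    # Eq. (1): time-averaged depth value of each pixel over the F frames.
    mu_p = chunk.mean(axis=0)
    # Eq. (2): per-pixel standard deviation over the F frames.
    sigma_p = chunk.std(axis=0)
    # Chunk-level values: average the per-pixel statistics over all pixels.
    mu_c = float(mu_p.mean())
    sigma_c = float(sigma_p.mean())
    return mu_p, sigma_p, mu_c, sigma_c

# Example: a chunk of 300 frames (10 s at 30 fps) from a 424 x 512 sensor.
# chunk = np.stack(depth_frames).astype(np.float64)
# mu_p, sigma_p, mu_c, sigma_c = chunk_statistics(chunk)
```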

For each experiment, we generated a reference value for μ and σ. To achieve this, the cameras were positioned according to the specific setup, but only the camera that would be evaluated in the following experiment was activated for 2 h. As shown in Fig. 3, the first hour was regarded as the warm-up interval, as suggested by Wasenmüller and Stricker (2017). The second hour was handled as a single chunk, and Eqs. (1) and (2), as well as consecutive averaging, were used to calculate the reference values μ_R and σ_R. In our analysis, the comparative values of μ_c and μ_R provide the relative accuracy of a device, because they identify the difference between its reference readings (μ_R) and those contained in a chunk's frames (μ_c). In turn, the comparative values of σ_c and σ_R provide the relative precision of a device, because they identify the distance of the variation in a chunk's depth readings (σ_c) from the obtained reference variation (σ_R). To compare μ_c and σ_c of a chunk c with the reference values μ_R and σ_R, we use the absolute distance (Eq. (3)) and the Mahalanobis distance (McLachlan, 1999). The latter is defined for two vectors x⃗ and y⃗ from a set Z, as described by Eq. (4). Here, Σ is the covariance matrix of the vectors in the set Z. Since in our case μ and σ are one-dimensional, the pixel-wise Mahalanobis distance can be calculated by the fraction in Eq. (5).


ε_c^x = |x_c − x_R|,  x ∈ {μ, σ}    (3)


MD(x⃗, y⃗) = √( (x⃗ − y⃗)ᵀ Σ⁻¹ (x⃗ − y⃗) )    (4)
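As an illustration of how the reference values and the absolute errors of Eq. (3) can be obtained, the sketch below assumes the second hour of the single-camera capture is available as one large array of shape (F, H, W); function and variable names are illustrative, not taken from the article.

```python
import numpy as np

def reference_values(second_hour):
    """Reference statistics from the second hour of a single-camera capture.

    The whole hour is treated as one chunk (array of shape (F, H, W)),
    yielding per-pixel references and their pixel-averaged counterparts.
    """
    mu_R_p = second_hour.mean(axis=0)
    sigma_R_p = second_hour.std(axis=0)
    return mu_R_p, sigma_R_p, float(mu_R_p.mean()), float(sigma_R_p.mean())

def absolute_errors(mu_c, sigma_c, mu_R, sigma_R):
    """Eq. (3): relative accuracy and relative precision errors of a chunk."""
    return abs(mu_c - mu_R), abs(sigma_c - sigma_R)
```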





Fig. 1. Flowchart with an overview of the camera evaluation methodology.

Fig. 2. Description of chunks. A chunk c contains F frames. Each frame has P pixels. For each pixel p in a frame f(t), one depth measurement d_f^p is available. A chunk frame represents the pixels' behavior for all frames in the chunk.

Fig. 3. Evaluation of warm-up behavior and generation of reference values.

MD_c^p = |μ_c^p − μ_R^p| / σ_R^p    (5)
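A minimal sketch of the pixel-wise comparison in Eq. (5) follows, assuming the per-pixel chunk statistics and per-pixel reference values are available as 2-D arrays; the epsilon guard and the names are illustrative additions.

```python
import numpy as np

def mahalanobis_per_pixel(mu_p, mu_R_p, sigma_R_p, eps=1e-6):
    """Eq. (5): pixel-wise Mahalanobis distance of a chunk frame.

    mu_p      : per-pixel time-averaged depth of the evaluated chunk (H, W)
    mu_R_p    : per-pixel reference mean from the single-camera capture (H, W)
    sigma_R_p : per-pixel reference standard deviation (H, W)
    eps       : small guard against division by zero (an implementation
                choice, not part of the original formulation).
    """
    return np.abs(mu_p - mu_R_p) / np.maximum(sigma_R_p, eps)
```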

The acceptable error margin for the Mahalanobis distance MD_c^p may vary depending on the application. Therefore, we employ a thresholding approach and count the number of pixels within a chunk frame for which the Mahalanobis distance exceeds different thresholds θ. Eq. (6) describes this calculation. The thresholding function th() is represented by Eq. (7). For the threshold θ, we use values between 0.1 and 5.0, incrementing by 0.1.

N_c^θ = (100% / P) ∑_{p=1}^{P} th(MD_c^p, θ)    (6)

th(MD_c^p, θ) = { 0 if MD_c^p < θ;  1 if MD_c^p > θ }    (7)
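The threshold sweep of Eqs. (6) and (7) then reduces to counting pixels above each threshold. The sketch below assumes md is the (H, W) array produced by the previous sketch; the dictionary layout and names are illustrative.

```python
import numpy as np

def bad_pixel_percentages(md, thresholds=np.arange(0.1, 5.01, 0.1)):
    """Eqs. (6) and (7): percentage of pixels whose Mahalanobis distance
    exceeds each threshold theta (the indicator th() is simply md > theta)."""
    return {round(float(t), 1): 100.0 * np.count_nonzero(md > t) / md.size
            for t in thresholds}

# Example: bad_pixel_percentages(md)[1.0] gives the value plotted for
# threshold theta = 1.0 in Section 6.
```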

To further evaluate and compare the quality of the depth data obtained from the evaluated devices, we measure the ratio of invalid pixels. These invalid pixels are the portions of the depth frame that do not contain useful distance information. Their occurrence is due to various limitations derived from the depth sensing technology and its implemented hardware, in particular the sensors. These pixels are explicitly indicated in the results obtained from the cameras and, therefore, can be counted. Consider that a frame f has a total number of pixels |p_f|, of which a given number |p_f^i| are known to be invalid. The invalid pixel ratio H for frame f is then obtained with Eq. (8). Furthermore, this metric can be extended to the concept of chunks defined above. More specifically, we define the invalid pixel ratio H of a chunk c as the average value of invalid pixels present in all the F frames that compose the chunk, as presented in Eq. (9).

H_f = |p_f^i| / |p_f|    (8)

H_c = (1/F) ∑_{f=0}^{F−1} |p_f^i| / |p_f|    (9)
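A minimal sketch of Eqs. (8) and (9), under the assumption that each depth frame marks invalid pixels with the value 0 (a common convention; the exact marker depends on the device SDK); names are illustrative.

```python
import numpy as np

def invalid_ratio_frame(frame, invalid_value=0):
    """Eq. (8): fraction of invalid pixels in a single depth frame."""
    return np.count_nonzero(frame == invalid_value) / frame.size

def invalid_ratio_chunk(chunk, invalid_value=0):
    """Eq. (9): mean invalid-pixel ratio over the F frames of a chunk,
    where chunk has shape (F, H, W)."""
    per_frame = np.count_nonzero(chunk == invalid_value, axis=(1, 2))
    return float(np.mean(per_frame / (chunk.shape[1] * chunk.shape[2])))
```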

4. Experimental setup

This section addresses the cameras and the experiments we conducted. In Section 4.1, we describe the target devices. In the following sections, we define the set of experimental scenarios used in the evaluation. The first type of experiment was related to single-camera setups, where warm-up and long-time behavior were analyzed; see Section 4.2. The multi-camera setups are then explained in Section 4.3.

Our evaluations of the single- and multi-device setups had different objectives. The single-camera evaluation was designed to explore the accuracy and precision of camera models based on different technologies while considering warm-up and long-term usage. In turn, in the multi-camera evaluation a metric (defined in Section 3) was used to compare the behavior of the camera with and without interference. Consequently, the results presented in Section 6 are a self-contained comparison of the devices' behavior with and without multi-device interference. In all the scenarios, the room temperature was kept stable at 26 °C.




Table 2
Technical details for the camera devices. The frame rate is denoted in frames per second (fps). Values refer to depth information; RGB data are not included.

Device                                          | Depth estimation strategy        | Resolution (width × height)      | Range      | Field-of-view (horizontal × vertical) | fps
Kinect v2 (Kinect Developers, 2017)             | Time-of-flight                   | 512 × 424                        | 0.50–4.5 m | 70° × 60°                             | 30
RealSense R200 (Intel, 2015; Culbertson, 2015)  | Structured light                 | 320 × 240, 480 × 360, 640 × 480  | 0.6–3.5 m  | 59° × 46°                             | 30/60
RealSense D415 (Intel, 2018)                    | Structured light + stereo vision | up to 1280 × 720                 | 0.16–10 m  | 69° × 42°                             | 30/60/90
Regarding light, the cameras employ IR emitters to measure image depth and are not influenced by lighting conditions (Fürsattel et al., 2016). Because IR-based depth cameras can be influenced by sunlight (Zennaro et al., 2015), the experiments were conducted in an isolated room, without any sunlight incidence. Additionally, the room lights were turned off during the experiments.

4.1. Target devices

The cameras' technical specifications are shown in Table 2. The Kinect v2 employs ToF technology to estimate the distance from objects to the camera. This means that it emits IR light and senses the reflection, which is phase delayed. From this delay, the distance is calculated (Conde, 2017; Hansard, 2013). The Kinect v2 records 30 depth frames per second (fps) at a resolution of 512 × 424 pixels.

In contrast to the Kinect v2, the RealSense R200 employs SL technology. An IR light pattern is projected onto the scene. This is captured by two cameras to construct a depth image from SV (Corke, 2017). The resolution of the IR cameras is 640 × 480 pixels. From the IR image, a depth image is calculated. The user can choose between a resolution of 320 × 240 or 480 × 360 pixels for the depth image. Additionally, internal hardware upsampling to a resolution of 640 × 480 is available. Depending on the chosen resolution for the included RGB camera, the depth image can be recorded at either 30 or 60 fps. In all the experiments in this study, a resolution of 640 × 480 and a frame rate of 30 fps were used.
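For reference, a depth stream comparable to the one used here can be captured from the RealSense D415 with the pyrealsense2 bindings; the sketch below is a minimal example under the assumption that a 640 × 480 depth stream at 30 fps is wanted (the R200 and the Kinect v2 use their own, older SDKs and are not covered by this snippet).

```python
import numpy as np
import pyrealsense2 as rs

# Request a 640 x 480 depth stream at 30 fps (cf. Table 2).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    depth_frame = frames.get_depth_frame()
    # Raw 16-bit depth values; 0 marks pixels without a valid measurement.
    depth_image = np.asanyarray(depth_frame.get_data())
    print(depth_image.shape, depth_image.dtype)
finally:
    pipeline.stop()
```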

Fig. 4. Camera position for warm-up evaluation.

4.2. Single-camera

To obtain results that could be used as a basis for our evaluation, we conducted experiments with single-camera setups. First, we explored the warm-up behavior of the devices and then we extended our experiment to an 8-h stress test. The warm-up time is a characteristic that is particularly pronounced in ToF devices, but may also occur with other technologies. When the camera is activated, the IR emitter and sensor heat up until they reach a stable operating temperature, as shown by Fürsattel et al. (2016). Because of the changing temperature, the measured distance values vary even if a still scene is captured. Thus, the variation in measured distances for a given pixel provides information about the warm-up effect.

We used a room corner without adding objects to the scene, a setup similar to that of Wasenmüller and Stricker (2017). The details of the camera positioning are shown in Fig. 4. The Kinect v2 was positioned 2.5 m away from the room walls, which intersected at one corner. As a result, the camera was positioned at approximately 3.5 m from the corner. Thus, we ensured that the measured distances were less than the maximum range of 4.5 m provided by the manufacturer (Table 2). For the RealSense R200, a smaller distance from the corner was chosen, taking into account its shorter range of 3.5 m. It was positioned 2.1 m away from the walls, resulting in an approximate distance of 3 m from the room corner. Both cameras were placed at a height of 1.45 m above ground level. The vertical angle of both cameras was set parallel to the room floor considering the center of their fields-of-view (FoVs). To evaluate the warm-up effect on the measured depth data, we analyzed 2 h of footage, as suggested by Fürsattel et al. (2016). To evaluate the cameras' long-term behavior, we conducted an 8-h capture. The experimental setup was the same as for the warm-up experiment described above.

4.3. Multi-camera

The multi-camera setups were divided into setups including two and three devices. These setups were further divided into two scenarios. In one scenario, all the cameras were facing a wall and their view direction with respect to the center of the FoV was parallel to the ground. In the second scenario, the cameras were facing a cardboard ''box'' placed on the ground, and therefore, their vertical view angle was tilted downwards. According to Intel documentation (Intel, 2016), multiple RealSense R200 cameras can be placed in the same environment and no interference that degrades the depth images' quality will occur. Therefore, only the Kinect v2 was evaluated in the multi-camera setups.

In the first two-camera setup, we followed the recommendations of Wasenmüller and Stricker (2017) and positioned the cameras to face a room corner, as illustrated in Fig. 5. The camera whose data were evaluated was at the same position as during the warm-up experiments. A second camera was positioned at a distance of 2 m. Their vertical view angle was parallel to the ground. In the second two-camera setup, the cameras were positioned as depicted in Fig. 6. Although three cameras are shown, only the evaluated camera ''E'' and one interference camera ''I_l'' were activated during this experiment. Additionally, a box with dimensions 38 cm × 26 cm × 34 cm was added to the scene at a distance of 1.8 m from each camera to simulate an object under surveillance. The cameras were focused on the box and therefore were tilted down by approximately 35°.

Now, we describe the setups employing three Kinect v2 cameras. In the first scenario, the three cameras were positioned in front of a wall, as depicted in Fig. 7. Their distance from the wall was 3.0 m. The distance of the interference cameras from the evaluated camera was 1.45 m and all cameras were focused on the same spot on the wall. Their vertical view angle was parallel to the ground. In turn, the second three-camera scenario employed the complete setup depicted in Fig. 6. As previously described, a box was positioned in the scene at a distance of 1.8 m from each camera. The cameras were focused on the box and therefore tilted down by approximately 35°. Additionally, we employed a modified version of the wall setup to evaluate the effect of the camera distance on the behavior of mutual interference. To achieve this goal, we reduced the distance between the cameras and the reference box to approximately 1.5 m. Fig. 8 illustrates the modified setup with cameras at short range.

Fig. 5. Wall setup for two cameras. Evaluated data were recorded by camera ''E''. ''I'' denotes the interference camera.

Fig. 6. Ring setup for both two and three cameras. The evaluated data were recorded by camera ''E''. ''I'' denotes the interference cameras. A box is located in the middle of the scene.

Fig. 7. Wall setup for three cameras. The evaluated data were recorded by camera ''E''. ''I'' denotes the interference cameras. A box is located in the middle of the scene.

Fig. 8. Wall setup for three cameras modified for short-range experimentation. The evaluated data were recorded by camera ''E''. ''I'' denotes the interference cameras. A box is located in the middle of the scene.

5. Single-camera evaluation results

This section presents the evaluation results of the warm-up and stress tests for the single-camera setup. Section 5.1 addresses the effect of the warm-up period on the accuracy and precision measurements for the Kinect v2, RealSense R200, and RealSense D415 cameras. Section 5.2 presents an analysis of these metrics in long data capture periods for the three camera devices.

5.1. Warm-up

We evaluated both the accuracy and precision errors for the Kinect v2, RealSense R200, and RealSense D415 (as described in Section 4.2) in the 1-h period immediately after the sensor was activated, with the objective of demonstrating the effects of the warm-up on the measured depth values. In this section, we present the results of both metrics for the three camera sensors in Figs. 9 and 10. The results indicate that the devices suffer from different variations as a function of time, as shown in the figures.

In particular, the main variations for the Kinect v2 are concentrated in the first 20 min. The error during this interval decreases from 42.67 mm at time 00:10 (mm:ss) to 12.67 mm at time 19:50 (mm:ss). After this period, the error level suffers short variations and after time 29:30 (mm:ss) the values remain below the 10 mm error margin. This behavior demonstrates that, when the cooling fan is not activated, the changing temperature of the sensor affects the accuracy of the depth data readings. When the fan starts operating, some fluctuations occur during a short period of time, and then the values stabilize and further significant changes do not occur.

In contrast to those of the Kinect v2, the RealSense R200 results do not present a varying behavior in different time periods. Fig. 9(b) shows that variations in accuracy error occur over 60 min, ranging from 359.1 mm to 414.8 mm. These results show neither an initial trend nor time variations that could indicate a warm-up behavior. More importantly, they demonstrate that the depth values captured by the RealSense R200 tend to present a high variation that is not correlated to the sensor warm-up.

Finally, the results for the RealSense D415 also indicate a warm-up behavior in terms of the accuracy and precision of depth readings. For the first 20 min, the reading accuracy (presented in Fig. 9(c)) shows a higher error that peaks at 55 mm. This error gradually falls to approximately 14 mm and stabilizes throughout the remaining 40 min of the experiment. A similar behavior is observed in the precision results in Fig. 10(c). The precision error peaks at 48 mm at the beginning of the experiment and gradually recedes to 38 mm during the first 20 min of the experiment. The precision error then remains stable throughout the final 40 min.


Fig. 9. Accuracy error (𝜀𝜇𝑐 ) of the first 1-h capture for (a) the Kinect v2, (b) the RealSense R200, and (c) the RealSense D415. The setup for these data captures is presented in Fig. 4.




Fig. 10. Precision error (𝜀𝜎𝑐 ) of the first 1-h data capture period for (a) the Kinect v2, (b) the RealSense R200, and (c) the RealSense D415. The setup for these data captures is presented in Fig. 4.

To summarize, the Kinect v2 and RealSense D415 show more stable readings than the RealSense R200, even during the warm-up period. The mean accuracy error over the 60 min for the Kinect v2 is 14.5 mm. The RealSense D415 presents a slightly higher mean accuracy error of 15.2 mm. In turn, the depth readings of the RealSense R200 show considerable instability, with a mean error of 376.1 mm. Although the Kinect v2 and RealSense D415 present some oscillation in the first 20 min, their errors for the entire data capture period are considerably lower than those of the RealSense R200.

An examination of Fig. 10, which illustrates precision error, reveals that the results for all the devices indicate a behavior similar to that shown in Fig. 9. The Kinect v2 shows significant variations in the first 20 min and stable values thereafter. As for the accuracy results, this behavior is due to the sensor heating and the cooling system that starts to operate some minutes after the camera is activated. In contrast, the results for the RealSense R200 reveal significant variations during the entire data capture period, and do not show a warm-up behavior. Fig. 10(b) demonstrates values ranging from 484.8 mm to 542.9 mm during the data capture period. These values are considerably higher than those for the Kinect v2, which range from 9.3 mm to 34.2 mm. Finally, the RealSense D415 also presents a higher variation in the first 20 min of usage and stable values thereafter.




The results indicate that the Kinect v2 and RealSense D415 present very similar behavior in terms of accuracy. In turn, the RealSense D415 presents a higher precision error than the Kinect v2. As can be seen in Figs. 9 and 10, the Kinect v2 shows a warm-up period of about 20 min in which the precision and accuracy errors are accentuated. This is an expected behavior of cameras that use ToF technology. The results presented in the figures are similar to others reported in the literature, such as Fürsattel et al. (2016), where the authors analyzed the errors in ToF sensor data. The results for the RealSense R200 do not indicate warm-up behavior, which is expected from cameras based on SL technology. Finally, the results for the RealSense D415 present a warm-up behavior similar to that of the Kinect v2. Although this behavior is expected primarily in ToF devices, the RealSense D415 contains an IR emitter that may be subject to the same physical conditions as the Kinect v2. More importantly, the presented results demonstrate that the accuracy and precision errors of the RealSense R200 are considerably higher, while the Kinect v2 and RealSense D415 present similar behaviors.

5.2. Stress test

The objective of the stress test was to evaluate accuracy and precision deviations in long data capture periods, as described in Section 4.2. Figs. 11 and 12 show the results for these metrics for the Kinect v2, RealSense R200, and RealSense D415 cameras. The accuracy errors shown in Fig. 11(a) demonstrate the Kinect v2's warm-up interval as the only time period in which the errors visibly change. Subsequently, they stabilize between 12 mm and 15 mm for the remaining data capture period. Similarly to the accuracy errors, the precision errors of the Kinect v2 in Fig. 12(a) show, after the warm-up period, stable results in which the errors remain between 22.5 mm and 25.5 mm. In contrast, the results for the RealSense R200 shown in Figs. 11(b) and 12(b) do not suggest clear patterns as a function of time. Further, as expected, there is no warm-up behavior. Finally, the RealSense D415 presents an increasing trend in its accuracy error between 2 h and 7 h of experimentation. It is important to note that, in this 5-h period, the accuracy error varies by 10 mm, that is, a variation of 2 mm per hour. This behavior is not observed in the RealSense D415 precision error, which remains almost constant at 12 mm.

The results show that the Kinect v2 and RealSense D415 present low variations in precision and accuracy errors after their warm-up period, whereas the RealSense R200 presents considerably higher errors than does the Kinect v2. Figs. 11 and 12 have different scales on the error axis of parts (a) and (b), in which the RealSense R200 axis scale is six times that of the Kinect v2 and RealSense D415. This also shows the considerable difference between the three camera devices, indicating that the accuracy and precision errors of the RealSense R200 are higher than those of the other two devices.

5.3. Invalid pixel analysis

Invalid pixels occur when the depth sensor is not able to correctly evaluate the distances of a portion of its FoV. They are explicitly identified in the frame data obtained from RGB-D cameras, allowing their removal from depth data processing to avoid, for example, inconsistent 3D mapping results. Because they do not provide useful information for monitoring activities, the ratio of invalid pixels provides an insight into the usefulness of captured frames.
The Kinect v2 presents a low invalid pixel ratio of approximately 1.8%, which remains nearly constant throughout the duration of the stress test. The same result is observed for the RealSense D415, albeit with a slightly higher invalid pixel ratio of 3%. In turn, the RealSense R200 displays an average invalid pixel ratio of 50%, with variations between 40% and 60%. A visual inspection of the depth frames obtained from the cameras confirmed these results.

Fig. 11. Accuracy error (𝜀𝜇𝑐 ) of the 8-h stress test for (a) the Kinect v2, (b) the RealSense R200, and (c) the RealSense D415. The setup for these data captures is presented in Fig. 7 with cameras ‘‘𝐼’’ not activated.

Frames obtained from the Kinect v2 are very stable, with invalid pixels appearing close to the object's edges, where shadows are expected to occur in the IR projection. The RealSense D415 also presents very stable readings, but random invalid pixel groups can be identified in flat surfaces. In turn, the RealSense R200 presents considerably more artifacts in the depth image, even for plain surfaces. These results serve as a baseline for further evaluation of the mutual interference among cameras, as discussed in the next section. Mutual interference is expected to increase the invalid pixel ratio in some scenarios. However, the so-called ''constructive interference'' of the RealSense R200 may reduce the ratio when multiple devices are used in combination.




Fig. 12. Precision error (𝜀𝜎𝑐) of the 8-h stress test for (a) the Kinect v2, (b) the RealSense R200, and (c) the RealSense D415. The setup for these data captures is presented in Fig. 7 with cameras ''I'' turned off.

6. Multi-camera evaluation results

In this section, we evaluate the mutual interference caused by overlapping FoVs using the threshold methodology presented in Section 3. This section is divided into three main parts. The first two are focused on multi-camera scenarios in which the Kinect v2 is employed and the third on the multi-camera scenarios based on the RealSense R200. We explore the Kinect v2 in more depth because its mutual interference behavior changes significantly among the different experimental scenarios. In turn, the behavior of the RealSense R200 under mutual interference is more predictable. We also conducted mutual interference experiments with the RealSense D415 and verified that its behavior is very similar to that of the RealSense R200, albeit with slightly less instability. Thus, without loss of generality, we restrict the results discussed in this section to those obtained with the Kinect v2 and the RealSense R200.

6.1. Field overlapping with two Kinect v2 devices

In these experiments, we evaluated the effect of the interference introduced by a single Kinect v2 on depth data readings, as described in Section 4.3. Fig. 13 presents the Mahalanobis distance results for two different scenarios. Part (a) shows the results for the wall setup with two cameras aimed at a single corner of the room and part (b) for the ring setup with two active cameras. In the following figures, the term bad pixels refers to pixels for which the value surpassed a given threshold; the thresholds are explicitly indicated in the figures and the respective analysis.

Fig. 13(a) presents high rates of bad pixels for thresholds lower than θ = 1. However, above threshold θ = 1, all rates of bad pixels remain below 7.5% and there is no significant change with time. This shows that, in the case of the corner setup, there is no periodic interference generated by a second device. Fig. 13(b) shows the results for the ring setup where, in contrast, the two cameras present behavioral changes as a function of time. To provide a better visualization of the results for the two setups, Fig. 14 shows the timeline of values considering the threshold θ = 1 from both parts of Fig. 13.

Fig. 13. The Mahalanobis distance results considering the threshold analysis (𝑁𝑐𝜃) from data captures with two Kinect v2 cameras. The setup for (a) is presented in Fig. 5 and for (b) in Fig. 6, in which only cameras ''E'' and ''I_l'' were activated.


Fig. 14. The Mahalanobis distance threshold 𝜃 = 1 (𝑁𝑐1.0 ) with two Kinect v2s. The setup for (a) is presented in Fig. 5 and for (b) in Fig. 6, in which only cameras ‘‘𝐸’’ and ‘‘𝐼𝑙 ’’ were activated.




Fig. 15. The Mahalanobis distance results considering the threshold analysis (𝑁𝑐𝜃 ) from data captures with three Kinect v2 cameras. The setup for (a) is presented in Fig. 7 and for (b) in Fig. 6.

Fig. 14(a) does not show any variation in the results, as can be seen in Fig. 13(a), whereas Fig. 14(b) shows in detail periodic variations that occur at specific moments. The high values seen in the first minutes are due to the warm-up effect. After this period, a periodic behavior can be noted, in which a peak of bad pixels starts at intervals of 21 min. The average duration of these peaks is 4 min and 20 s. Additionally, the ratio of bad pixels within a peak ranges from 1.68% to 10.03% (average of all peaks). Although this setup presents visible levels of interference, the values in these peaks represent less than 11% of the pixels of an entire chunk.

6.2. Field overlapping with three Kinect v2 devices

Here, we evaluate the effects of the mutual interference caused by two devices on the depth data readings of a third one, where all the devices are the same camera model (see Section 4.3). Fig. 15 shows the Mahalanobis distance results. Each part of the figure refers to a different scenario. Part (a) represents the results for the wall setup, where the devices are aimed at the same side of a room's wall. The results of the ring setup are presented in part (b) of the figure. Despite the high rates of bad pixels during the warm-up period, most values remain between 0% and 10% for thresholds higher than θ = 1. As occurred in the two-device ring setup (see Fig. 13(b)), a periodic increase in the rate of bad pixels appears for both scenarios with three devices. However, the additional device causes these periodic peaks to reach higher values at some points. A comparison of the results in the two graphs reveals that the mean of the bad pixel rate from all points and thresholds is 7.24% for the ring setup (Fig. 15(a)) and 5.12% for the wall setup (Fig. 15(b)).

To facilitate the analysis of the variations from the time perspective, Fig. 16 depicts the results for threshold θ = 1 as a function of time. Fig. 16(a) refers to the wall setup and Fig. 16(b) refers to the ring setup. The resulting mean of the bad pixel rate for the wall setup in Fig. 16(a) is 3.74%, whereas this value is 3.97% for the ring setup (Fig. 16(b)). By comparing these graphs, it is possible to visualize that both have periodic peaks that are similar to the results with two cameras previously presented in Fig. 14(b). However, with three Kinect v2 cameras, the results show some peaks that are higher than those previously demonstrated. These peaks, in particular, have a duration and frequency of appearance that differ from those of the lower peaks in the same data capture period. More specifically, in Fig. 16(b), the duration of the lower peaks is 4 min and 40 s (average) and they occur every 20 min and 40 s (average), whereas the durations of the two peaks around times 3:45 (h:mm) and 6:50 (h:mm) are 42 min and 40 min, respectively. Additionally, the time difference between their starting points is 3 h and 7 min.

Although presenting periodic peaks, the ring setup results have stable values and all peaks have a similar rate of bad pixels not exceeding 20%, except those during the warm-up period. In contrast, the wall setup exhibits results with various intensities and in some cases the rate of bad pixels is over 30%. The data capture time affected by interference amounts to 42.88% and 37.19% of the total time for the ring and wall setups, respectively. For each occurrence of interference, the highest value lasts only 10 s, which is a short period of time compared to the length of the data capture period.
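Peak intervals and durations such as those reported above can be extracted from the per-chunk bad-pixel series in a few lines. The sketch below is one simple way to do so, assuming ratios is a 1-D NumPy array with the percentage of bad pixels per 10-s chunk (e.g., N_c^θ for θ = 1.0) and that a peak is any run of chunks above a chosen level; the level and the names are illustrative choices, not part of the article.

```python
import numpy as np

def interference_peaks(ratios, level=5.0, chunk_seconds=10):
    """Locate runs of chunks whose bad-pixel percentage exceeds `level`.

    Returns a list of (start_s, duration_s, mean_ratio) tuples, one per run.
    """
    above = ratios > level
    peaks = []
    start = None
    for i, flag in enumerate(np.append(above, False)):  # sentinel closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            peaks.append((start * chunk_seconds,
                          (i - start) * chunk_seconds,
                          float(ratios[start:i].mean())))
            start = None
    return peaks

# Example: durations and spacing of peaks in an 8-h capture.
# for start_s, dur_s, mean_r in interference_peaks(ratios):
#     print(f"peak at {start_s / 60:.1f} min, {dur_s / 60:.1f} min long, {mean_r:.2f}% bad pixels")
```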




Fig. 17. The Mahalanobis distance with three Kinect v2 cameras for the modified wall setup (Fig. 8), with cameras closer to the reference object. (a) Results for thresholds θ = 0.1, …, 5 and (b) results for θ = 1.0.

Fig. 16. The Mahalanobis distance threshold θ = 1 (𝑁𝑐1.0) with three Kinect v2 cameras. The setup for (a) is presented in Fig. 7 and for (b) in Fig. 6.

Additional experiments were conducted with a modified version of the wall setup, as presented in Fig. 8, to evaluate the effect of the camera distance on the behavior of the Kinect v2 camera's mutual interference. More specifically, the distance between the cameras and the reference object was reduced from the original 3 m to 1.5 m. The results of these experiments are illustrated in Fig. 17. Fig. 17(a) shows the overall results for thresholds θ = 0.1, …, 5 and Fig. 17(b) depicts the results for θ = 1.0. In turn, Fig. 18 illustrates the comparison of the two evaluated distances of the wall setup for threshold θ = 1.0.

The results indicate that a reduction in the distance between the cameras does not change the periodic behavior of the interference peaks. However, the intensity of the interference peaks is higher than that observed in Figs. 14(b) and 16(b). When threshold θ = 1.0 is considered, the interference peaks observed after the warm-up period do not affect more than 20% of the pixels. In turn, when the cameras are closer to the reference object, the ratio of affected pixels is considerably higher during interference, reaching 60% in one of the peaks.

Our analysis of the results indicates that the greater effect can be attributed to two main factors. First, the shorter distance between the cameras and the subject increases the intensity of the IR light reflected from the surfaces. Because of the higher intensity of the reflected IR light, there is a greater possibility that it will influence other devices. As a result, the intensity of the mutual interference phenomena is increased. Second, when the cameras are close to each other, the intersecting area of their frustums is increased. Consequently, there is a greater possibility that the IR light beams emitted by different devices will influence each other.

Fig. 18. Comparison of the Mahalanobis distance results in the two evaluated camera distance scenarios for the wall setup (Figs. 7 and 8). The graph presents results for threshold 𝜃 = 1.0.

In summary, these results indicate that the distance between ToF cameras affects the occurrence of mutual interference, with the intensity increasing as the distance between the cameras is reduced.

To demonstrate the practical effect of interference on the frames' pixels, Fig. 19 shows three specific frames obtained from the ring setup (Fig. 16(b)). The frame illustrated in Fig. 19(a) was captured 1 h after the experiment's beginning and does not present considerable interference. In contrast, Figs. 19(b) and 19(c) illustrate two distinct frames from the interference peaks observed in Fig. 16(b).




Fig. 19. Visualization of depth image frames generated in the experiment shown in Fig. 16. The black pixel clusters represent the areas with high interference.

The frame illustrated in Fig. 19(b) was captured 51 min after the experiment's beginning and that in Fig. 19(c), 3 h and 45 min after the beginning. By comparing the two images showing interference, it is possible to observe that Fig. 19(c) has a higher number of ''black pixels'' than Fig. 19(b). Nevertheless, in both cases, the interference occurred in the same region of the image. By analyzing each peak obtained from other data captures, we identified this region as the one that suffered the most interference. Moreover, although affecting the same region, the interference size and duration differ from peak to peak.

6.3. Field overlapping with three RealSense R200 devices

The RealSense R200 presents behavior that differs from that of the Kinect v2. Fig. 20 illustrates the Mahalanobis distance results observed with three-device setups for different threshold values and Fig. 21 shows the values for the specific threshold θ = 1.0. In the footage obtained from the RealSense R200, the interference peaks observed in the experiments with the Kinect v2 are not registered. The Mahalanobis distance presents little variation throughout the experiments for all thresholds in both the wall and ring setups. This indicates the feasibility of a multi-device setup based on this specific SL device. However, the results again indicate that the instability of the RealSense R200 is higher, because a higher percentage of pixels exceeds the defined thresholds for the Mahalanobis metric. This can be attributed to the instability discussed in Section 5: the readings of the cameras under normal conditions already present a high variation, and this is propagated to the results with multiple devices.

We also investigated the claims in the RealSense R200 documentation concerning ''constructive interference''. We captured footage using the experimental setup when a single camera was activated (without interference) and then when all three devices were activated (with interference). A comparison of the results from the two footages indicates a nearly constant reduction of approximately 3% in the ratio of invalid pixels in the interference scenario. However, the error observed in these results is still greater than that observed in the Kinect v2 footage, even under low levels of mutual interference. This indicates the limited nature of the ''constructive interference'' benefits.

We conducted an additional set of experiments to investigate the effect of the reference object's distance on the RealSense R200's depth readings. The results are illustrated in Fig. 22 for different threshold values and Fig. 23 presents the values for the specific threshold θ = 1.0. They indicate that no interference peaks occur in multi-device RealSense R200 setups, as observed in previous results. However, a reduction in the distance between the camera and the reference object reduces the number of invalid pixels by 35%. This is an expected result, as the reduced distance increases the intensity of the IR light emitted to the scene.

Fig. 20. The Mahalanobis distance results considering the threshold analysis (𝑁𝑐𝜃 ) from data capture periods with three RealSense R200 cameras. The setup for (a) is presented in Fig. 7 and for (b) in Fig. 6.

Fig. 21. The Mahalanobis distance with three RealSense R200 cameras for the ring (Fig. 6) and wall (Fig. 7) setups when threshold 𝜃 = 1.0.

6.4. Discussion

This section presents insights related to the experiments described previously, together with some additional results obtained for the evaluated devices.

The experiments revealed that the interference phenomena show a periodic behavior whose intensity and duration are described by the results. Neither the origin of this disturbance nor its aggravating and mitigating factors are known. For two cameras, there is interference in the ring setup but not in the wall setup. This could be due to the different angles between the cameras in the two setups, a notion supported by the results of Kunz et al. (2016). Nonetheless, this does not explain the periodicity of the interference. A possible origin may lie in the elaborate processing performed by the Kinect v2. For example, the system uses two different shutter times, and multiple readings of a pixel are executed to generate a single frame. After these readings, the best result for each pixel is selected and used in the frame of interest (Sell and O’Connor, 2014). Further dynamic behavior may be implemented in the device, which could explain the periodicity of the interference. In preliminary tests, we found that re-connecting the USB cable of one Kinect v2 eliminated the interference in some situations. This observation is in line with the hypothesis that the interference is due to dynamic behavior of the Kinect v2.

The difference between the interference time intervals reported by Sarbolandi et al. (2015) and those observed in our experiments is noteworthy. In our case, the interval was approximately 20 min, whereas in theirs it was approximately 50 s. Moreover, they observed interference in a setup where two cameras were focused on the same flat surface, in contrast to our Two-Kinect-Corner setup, in which we did not observe interference. In our experiments, the distance to the wall was twice as large as in theirs, and the distance between the devices was 1.45 m in our case versus 0.5 m in theirs. The short-range experiment results presented in Fig. 17 are further evidence that the inter-device distance may be a key factor in the intensity of interference peaks. These results indicate that Kinect v2 cameras should be kept at a given distance from each other to reduce the intensity of the interference peaks.

The results of the experiments with the RealSense R200 did not show mutual interference in multi-device setups. Furthermore, the ‘‘constructive interference’’ was observed in the device’s results: the invalid pixel ratio of the depth frames was reduced by approximately 3% in multi-device setups. Nevertheless, we observed that the results for the RealSense R200 show considerable instability, even with the effect of ‘‘constructive interference’’. This instability results in a higher ratio of pixels exceeding a tighter threshold than that observed for the Kinect v2, even in the presence of light mutual interference. However, when the close-range setup results for the Kinect v2 are considered, the RealSense R200 performs better in such scenarios. In our additional experiments with the RealSense R200, the wall setup with the cameras positioned at short range was used. The obtained results indicate that the cameras’ short distance from the subject further reduces the invalid pixel ratio without creating prejudicial mutual interference phenomena. These results indicate that, at close range, the RealSense R200 tends to offer more stability than the Kinect v2 in multi-device setups. Nevertheless, as the other results testify, the Kinect v2 still offers better results in single-device and long-range multi-device setups.

The experimental results indicate that the camera arrangement influences the behavior of mutual interference. The three-camera wall setup presents short interference peaks of higher intensity, whereas the three-camera ring setup results in interference peaks of lower intensity but longer duration. Further analysis indicates that the intensity of the interference peaks is related to the proportion of overlap among the cameras’ frustums. Consequently, one goal for reducing mutual interference in multi-device setups is to minimize the frustums’ overlapping area; a rough way to estimate this overlap is sketched after Fig. 23.

The tolerance of 3D applications to mutual interference depends on the application and its capacity to mitigate or correct it. For example, an application that captures data from a still scene can compensate for errors by processing central tendencies from a series of depth frames, as sketched below. In this case, the error rate indicates the number of frames that are required to mitigate interference errors. However, frames from a dynamic scene cannot be passed through such a correction process. Consequently, the error rate acts as a measure of the confidence in the data extracted from such frames. It is possible to evaluate the tolerance of such applications quantitatively, but such an analysis requires a dedicated experimental methodology. Such an analysis does not fit the scope of this article, which focuses on evaluating the behavior of depth cameras in mutual interference scenarios, and is left as future work.

The results presented in Section 6 are for multi-device setups comprising cameras of the same model. We conducted an additional experiment using a multi-device setup that included a mixture of camera models to verify the existence of mutual interference. In this experiment, one device from each technology was employed in the wall setup presented in Fig. 5. In summary, the results for the Kinect v2 do not show any meaningful interference when it is used in combination with the RealSense R200.
However, the RealSense R200 was deeply affected by the IR light of the Kinect v2. We observed that the device suffered from interference during nearly the entirety of the experiment and showed no normal operation periods. We recorded interference peaks where more than 80% of pixels exceeded the thresholds, even for loose threshold values. Additionally, we noticed cases where an entire frame returned only invalid pixels, rendering it unusable for practical purposes.
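The central-tendency compensation mentioned in the discussion above can be realized, for a static scene, with a simple per-pixel temporal median. The sketch below (Python with NumPy) is illustrative only and assumes that depth frames are arrays in which a value of zero marks an invalid pixel; the function name and the minimum-sample parameter are hypothetical choices.

```python
import numpy as np

def median_composite(frames, min_valid=3):
    # Combine depth frames of a static scene into a single frame using the
    # per-pixel median, ignoring invalid (zero) readings. Pixels with fewer
    # than `min_valid` valid samples are returned as invalid (zero).
    stack = np.asarray(frames, dtype=np.float64)
    valid = stack > 0
    stack = np.where(valid, stack, np.nan)   # invalid readings ignored below
    composite = np.nanmedian(stack, axis=0)
    enough = valid.sum(axis=0) >= min_valid
    return np.where(enough, composite, 0.0)

# Example: 30 frames of a flat surface at roughly 1.5 m, one of which
# contains an interference-corrupted (invalid) region.
frames = np.random.default_rng(0).normal(1500.0, 4.0, size=(30, 424, 512))
frames[5, 100:180, 200:300] = 0.0
clean = median_composite(frames)
```

The higher the error rate measured for a given setup, the more frames such a composite needs before the median becomes reliable, which is precisely why the error rate is a useful planning figure for static-scene applications.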

Fig. 22. The Mahalanobis distance with three RealSense R200 cameras for the modified wall setup (Fig. 8), with cameras closer to the reference object. The graph shows the results for the threshold 𝜃 = 0.1, … , 5.

Fig. 23. The Mahalanobis distance with three RealSense R200 cameras for the modified wall setup (Fig. 8), with the cameras closer to the reference object. The graph shows the results for the threshold 𝜃 = 1.0.
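As noted in the discussion, the intensity of the interference peaks appears related to the proportion of overlap among the cameras’ frustums. The following sketch (Python with NumPy) gives a rough Monte-Carlo estimate of how much of one camera’s frustum lies inside another’s. It is illustrative only: the poses, field-of-view values, and depth range in the example are hypothetical, and depth is sampled uniformly rather than volume-uniformly, which is adequate for a coarse comparison of candidate arrangements.

```python
import numpy as np

def in_frustum(points_cam, hfov, vfov, near, far):
    # Points are in the camera frame (z forward); the frustum is symmetric
    # around the optical axis with half-angles hfov/2 and vfov/2.
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return ((z > near) & (z < far)
            & (np.abs(x) < z * np.tan(hfov / 2))
            & (np.abs(y) < z * np.tan(vfov / 2)))

def overlap_fraction(pose_a, pose_b, hfov, vfov, near, far, n=200_000, seed=0):
    # Fraction of samples drawn from camera A's frustum that also fall inside
    # camera B's frustum. Each pose is (R, t): camera-to-world rotation matrix
    # and camera position in world coordinates.
    rng = np.random.default_rng(seed)
    z = rng.uniform(near, far, n)
    x = rng.uniform(-1.0, 1.0, n) * z * np.tan(hfov / 2)
    y = rng.uniform(-1.0, 1.0, n) * z * np.tan(vfov / 2)
    pts_a = np.stack([x, y, z], axis=1)
    (R_a, t_a), (R_b, t_b) = pose_a, pose_b
    pts_world = pts_a @ R_a.T + t_a           # camera A frame -> world
    pts_b = (pts_world - t_b) @ R_b           # world -> camera B frame
    return in_frustum(pts_b, hfov, vfov, near, far).mean()

# Example: two parallel cameras 1.45 m apart, a 70 x 60 degree field of view,
# and a usable depth range of 0.5-4.5 m (hypothetical values).
identity = np.eye(3)
frac = overlap_fraction((identity, np.zeros(3)),
                        (identity, np.array([1.45, 0.0, 0.0])),
                        hfov=np.radians(70), vfov=np.radians(60),
                        near=0.5, far=4.5)
print(frac)
```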


6.5. Limitations

The analysis of our results indicates a few limitations regarding the experimental setup, especially those related to the selected camera models and their positioning. Concerning the selected cameras, as mentioned in Section 4, the Kinect v2 is the most popular and most recently developed device used in studies that explore depth imaging based on ToF.


However, some specific aspects and parameters cannot be comprehensively examined because of its proprietary firmware. Thus, it is not possible to control some aspects relevant to mutual interference analysis, such as the periodicity of the changes in the IR patterns. These changes can directly affect the number and duration of the observed interference peaks. Access to such parameters would enable further evaluation of the effects of mutual interference on ToF devices.

The camera setups selected for our experiments were those most frequently employed in other studies related to mutual interference among depth cameras. Nevertheless, there are positioning parameters (e.g., the angles between cameras) that can influence interference behavior and that have not been exhaustively explored. Such parameters will be the subject of future evaluation.

Finally, the goal of this article is to present an experimental methodology to evaluate the effect of mutual interference on RGB-D devices. This methodology allows experiments that can lead to a mathematical formulation that models the behavior of mutual interference with various devices and technologies. However, such a formulation requires a specific set of experiments with different setups, which will be conducted in future work.

7. Final remarks

Camera setups with multiple devices are a key aspect of ambient monitoring applications. These types of setups can result in data corruption when applied to recently developed RGB-D camera models because of the mutual interference of the IR light emitters employed by such devices. Consequently, the behavior of such interference must be appropriately evaluated to provide data that allow monitoring systems to handle possible errors introduced into the data captured by depth sensors. This article contributes to closing this gap by proposing a methodology to measure deviations in the precision and accuracy of readings from depth camera sensors. In contrast to related studies, we employed the proposed methodology to (i) individually analyze cameras based on SL and ToF technologies with a focus on the stability of camera measurements and (ii) investigate in detail the effect of IR interference on the performance of depth cameras when used in multi-device setups. Our experiments were focused on two well-known cameras: the Microsoft Kinect v2, which employs ToF technology, and the Intel RealSense R200, which is based on SL. We expect our methodology to be applied to further evaluations of multi-device setups of depth cameras. Furthermore, it can be used to develop mechanisms to monitor IR interference during the deployment of such setups.

The results from the individual camera evaluations show that the Kinect v2 and RealSense D415 have a warm-up period that influences their depth readings for 20–30 min after they are activated. However, the warm-up influence was measured as less than 10 cm. After the warm-up period, the depth readings become very stable, even after 8 h of constant sensor usage. The RealSense D415 presented a slight increase in its accuracy error throughout the 8 h of the experiment, but the error remained within the 10 mm range. In turn, the RealSense R200 presented very unstable depth readings, with an average error greater than 40 cm. A comparison of the Kinect v2 and the RealSense R200 shows that the former yields considerably more stable depth readings, even when the warm-up time is considered.

In multi-device setups, the Kinect v2 displays periodic peaks of mutual interference that increase in intensity depending on the distance between cameras, with short-range setups yielding higher interference peaks. Further, the addition of more devices can potentially increase the duration of some interference peaks, although their intensity is not greatly affected. In long-range setups, the measured interference is small considering the experiment’s length, with between 3.74% and 3.97% of pixels exceeding the threshold among all captured frames in a setup with three depth cameras. In turn, multi-device setups consisting of the RealSense R200 are not affected by prejudicial interference peaks. In long-range setups, the instability of the RealSense R200 yields lower accuracy and precision than those of the Kinect v2 under mutual interference. However, in close-range multi-device setups, the high interference peaks observed with the Kinect v2 render the RealSense R200 a more stable alternative.

Our results demonstrate that mutual interference should be carefully considered in the design of a multi-device setup with IR-enabled depth cameras. Furthermore, we assert that our proposed methodology can be used as a basis for a monitoring mechanism for detecting instances where mutual interference affects depth data. A computer vision-based system with such a mechanism would be able to dynamically adapt to interference conditions, thereby increasing its resilience to erroneous depth information.
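A minimal sketch of such a monitoring mechanism is given below (Python with NumPy), assuming per-pixel baseline statistics obtained from an interference-free capture. The class name, the margin, and the history length are hypothetical choices rather than values derived from our experiments.

```python
import collections
import numpy as np

class InterferenceMonitor:
    # Flags frames whose ratio of 'bad' pixels spikes above recent behavior.
    # A pixel is 'bad' if it is invalid (zero depth) or deviates from the
    # per-pixel baseline by more than `theta` standard deviations.
    def __init__(self, baseline_mean, baseline_std, theta=1.0,
                 margin=0.05, history=300):
        self.mean = baseline_mean
        self.std = baseline_std
        self.theta = theta
        self.margin = margin
        self.ratios = collections.deque(maxlen=history)

    def update(self, frame):
        invalid = frame <= 0
        deviating = np.abs(frame - self.mean) > self.theta * (self.std + 1e-6)
        ratio = float(np.mean(invalid | deviating))
        reference = float(np.median(self.ratios)) if self.ratios else ratio
        self.ratios.append(ratio)
        is_peak = ratio > reference + self.margin
        return ratio, is_peak

# Usage (baseline_mean and baseline_std estimated beforehand):
#   monitor = InterferenceMonitor(baseline_mean, baseline_std)
#   ratio, is_peak = monitor.update(depth_frame)
# A vision system could drop or down-weight frames for which is_peak is True.
```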

Disclaimer

The concepts and information presented in this paper are based on research and are not commercially available.

Acknowledgments

The authors would like to acknowledge Siemens Healthcare GmbH, Germany, for providing the funding for this research project. They also would like to thank Julia Schottenhamml, Jennifer Maier, and Peter Fürsattel for their support during the experimental evaluation conducted in this work.


References

Alabbasi, H., Gradinaru, A., Moldoveanu, F., Moldoveanu, A., 2015. Human motion tracking evaluation using kinect v2 sensor. In: 2015 E-Health and Bioengineering Conference, EHB, pp. 1–4.
Breuer, T., Bodensteiner, C., Arens, M., 2014. Low-cost commodity depth sensor comparison and accuracy analysis. In: Proc. SPIE, p. 92500G.
Capecci, M., Ceravolo, M.G., Ferracuti, F., Iarlori, S., Longhi, S., Romeo, L., Russi, S.N., Verdini, F., 2016. Accuracy evaluation of the kinect v2 sensor during dynamic movements in a rehabilitation scenario. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC, pp. 5409–5412.
Carfagni, M., Furferi, R., Governi, L., Servi, M., Uccheddu, F., Volpe, Y., 2017. On the performance of the intel sr300 depth camera: Metrological and critical characterization. IEEE Sens. J. 17, 4508–4519.
Conde, M., 2017. Compressive Sensing for the Photonic Mixer Device: Fundamentals, Methods and Results. Springer Fachmedien Wiesbaden.
Corke, P., 2017. Robotics, Vision and Control, second ed. Springer.
Corti, A., Giancola, S., Mainetti, G., Sala, R., 2016. A metrological characterization of the kinect v2 time-of-flight camera. Robot. Auton. Syst. 75, 584–594.
Culbertson, C., 2015. Introducing the intel realsense™ r200 camera (world facing).
Fanello, S., Valentin, J., Rhemann, C., Kowdle, A., Tankovich, V., Davidson, P., Izadi, S., 2017. Ultrastereo: Efficient learning-based matching for active stereo systems.
Fankhauser, P., Bloesch, M., Rodriguez, D., Kaestner, R., Hutter, M., Siegwart, R., 2015. Kinect v2 for mobile robot navigation: Evaluation and modeling. In: International Conference on Advanced Robotics, ICAR, pp. 388–394.
Freedman, D., Smolin, Y., Krupka, E., Leichter, I., Schmidt, M., 2014. SRA: Fast Removal of General Multipath for ToF Sensors. Springer International Publishing, Cham, pp. 234–249. http://dx.doi.org/10.1007/978-3-319-10590-1_16.
Fürsattel, P., Placht, S., Balda, M., Schaller, C., Hofmann, H., Maier, A., Riess, C., 2016. A comparative error analysis of current time-of-flight sensors. IEEE Trans. Comput. Imaging 2, 27–41.
Geerse, D.J., Coolen, B.H., Roerdink, M., 2015. Kinematic validation of a multi-kinect v2 instrumented 10-meter walkway for quantitative gait assessments. PLoS One 10, 1–15.
Geiselhart, F., Otto, M., Rukzio, E., 2016. On the use of multi-depth-camera based motion tracking systems in production planning environments. In: Procedia CIRP, Vol. 41, pp. 759–764. Research and Innovation in Manufacturing: Key Enabling Technologies for the Factories of the Future - Proceedings of the 48th CIRP Conference on Manufacturing Systems.
Gokturk, S.B., Yalcin, H., Bamji, C., 2004. A time-of-flight depth sensor - system description, issues and solutions. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 35–35.
Gonzalez-Jorge, H., Rodríguez-Gonzálvez, P., Martínez-Sánchez, J., González-Aguilera, D., Arias, P., Gesto, M., Díaz-Vilariño, L., 2015. Metrological comparison between kinect i and kinect ii sensors. Measurement 70, 21–26.
Hansard, M., 2013. Time-of-Flight Cameras. Springer.
Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D., 2012. Rgb-d mapping: Using kinect-style depth cameras for dense 3d modeling of indoor environments. Int. J. Robot. Res. 31, 647–663.
Hussmann, S., Hagebeuker, B., Ringbeck, T., 2008. A Performance Review of 3D TOF Vision Systems in Comparison to Stereo Vision Systems. INTECH Open Access Publisher.


Intel, 2015. SDK Design Guidelines. Technical Report v 1.1.
Intel, 2016. Robotics Development Kit R200 Depth-Data Interpretation. Technical Report.
Intel, 2018. Intel RealSense D400 Depth Camera Series. Technical Report.
Kerola, T., Inoue, N., Shinoda, K., 2017. Cross-view human action recognition from depth maps using spectral graph sequences. Comput. Vision Image Understanding 154, 108–126.
Kim, D., Comandur, B., Medeiros, H., Elfiky, N.M., Kak, A.C., 2017. Multi-view face recognition from single rgbd models of the faces. Comput. Vision Image Understanding 160, 114–132.
Kinect Developers, 2017. Kinect hardware.
Kowalski, M., Naruniec, J., Daniluk, M., 2015. Live scan3d: A fast and inexpensive 3d data acquisition system for multiple kinect v2 sensors. In: International Conference on 3D Vision.
Kunz, A., Brogli, L., Alavi, A., 2016. Interference measurement of kinect for xbox one. In: Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology. ACM, New York, NY, USA, pp. 345–346.
Lachat, E., Macher, H., Landes, T., Grussenmeyer, P., 2015. Assessment and calibration of a rgb-d camera (kinect v2 sensor) towards a potential use for close-range 3d modeling. Remote Sensing.
Lee, J., Hong, M., Ryu, S., 2015. Sleep monitoring system using kinect sensor. Int. J. Distrib. Sens. Netw. 11, 875371.
Liu, L., Mehrotra, S., 2016. Patient walk detection in hospital room using microsoft kinect v2. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC, pp. 4395–4398.
McLachlan, G.J., 1999. Mahalanobis distance. Reson.–J. Sci. Educ. 4, 20–26.
Microsoft, 2014. The kinect for windows v2 sensor and free sdk 2.0 public preview are here.

Moon, S., Park, Y., Ko, D.W., Suh, I.H., 2016. Multiple kinect sensor fusion for human skeleton tracking using kalman filtering. Int. J. Adv. Robot. Syst. 13, 65.
Otto, M.M., Agethen, P., Geiselhart, F., Rukzio, E., 2015. Towards ubiquitous tracking: Presenting a scalable.
Sarbolandi, H., Lefloch, D., Kolb, A., 2015. Kinect range sensing: Structured-light versus time-of-flight kinect.
Scharstein, D., Szeliski, R., 2003. High-accuracy stereo depth maps using structured light. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I–195–I–202.
Schreiter, L., Wörtwein, T., Jöchner, M., Hussain, J., Wörn, H., Raczkowsky, J., 2016. Konzeptionierung und Implementierung eines Interpretationssystem zur Situationserkennung in OP:Sense. In: Jahrestagung der Deutschen Gesellschaft für Computer- und Roboterassistierte Chirurgie (CURAC), Vol. 15, pp. 268–271.
Sell, J., O’Connor, P., 2014. The xbox one system on a chip and kinect sensor. IEEE Micro 34, 44–53.
Wasenmüller, O., Stricker, D., 2017. Comparison of Kinect V1 and V2 Depth Images in Terms of Accuracy and Precision. Springer International Publishing, Cham, pp. 34–45.
Yao, Y., Zhang, F., Fu, Y., 2014. Real-Time Hand Gesture Recognition using RGB-D Sensor. Springer International Publishing, Cham, pp. 289–313.
Zennaro, S., Munaro, M., Milani, S., Zanuttigh, P., Bernardi, A., Ghidoni, S., Menegatti, E., 2015. Performance evaluation of the 1st and 2nd generation kinect for multimedia applications. In: 2015 IEEE International Conference on Multimedia and Expo, ICME, pp. 1–6.
Zhang, C., Tian, Y., 2015. Histogram of 3d facets: A depth descriptor for human action and hand gesture recognition. Comput. Vision Image Understanding 139, 29–39.
