A flexible heterogeneous real-time digital image correlation system



Optics and Lasers in Engineering 110 (2018) 7–17

Contents lists available at ScienceDirect

Optics and Lasers in Engineering journal homepage: www.elsevier.com/locate/optlaseng

Tianyi Wang, Qian Kemao∗, Hock Soon Seah, Feng Lin

School of Computer Science and Engineering, Nanyang Technological University, 639798, Singapore

Article info

Keywords: Digital image correlation; Real-time processing; Parallel computing; Pipelined system; GPU; Multicore CPU

Abstract

An accurate and flexible real-time digital image correlation (RT-DIC) system utilizing a pipelined CPU and GPU parallel computing framework is proposed. First, the respective advantages of CPU and GPU in performing the fast Fourier transform-based cross-correlation (FFT-CC) algorithm and the inverse-compositional Gauss-Newton (IC-GN) algorithm of the employed path-independent DIC (PI-DIC) method are elucidated. Second, based on the different properties and performances of CPU and GPU, a pipelined system framework unifying five Variants of combinations of CPU and GPU is proposed, which can be flexibly applied to various practical applications with different requirements of measurement scales and speeds. Last, both the accuracy and speed of the entire pipelined framework are verified by a PC implementation of the RT-DIC system integrating Variants 2–5. Variants 2 and 5 are also implemented on an iPhone 5S for the feasibility investigation of realizing a portable RT-DIC system on mobile devices using the same framework.

∗ Corresponding author. E-mail address: [email protected] (Q. Kemao).

https://doi.org/10.1016/j.optlaseng.2018.05.010 Received 28 February 2018; Received in revised form 5 April 2018; Accepted 11 May 2018 0143-8166/© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Digital image correlation (DIC) [1–6] is an important optical technique for measuring the surface deformation of an object, material, or structure. DIC estimates the deformation between a reference (or undeformed) image and a target (or deformed) image, assuming that the deformation is small. When this assumption is violated, i.e. the deformation becomes large, decorrelation occurs and the estimated result may be unreliable. A high-performance DIC system should nevertheless be able to handle this problem. In the literature, two categories of methods for handling large deformation have been attempted: the direct methods [7–10] and the incremental methods [11–13].

The direct methods work on one reference image and one target image with a large deformation between them. To converge to satisfactory measurement accuracy, the direct methods always require highly reliable initial guesses. The feature matching method with scale-invariant feature transform (SIFT) features, which originated in computer vision, was first applied to the initialization stage [7–9]. As long as more than three corresponding points are matched between the reference and the target subsets, the initial guesses can be effectively and reliably calculated. Another recent method transformed the rotation in Cartesian coordinates to a translation in polar coordinates, after which the initial guesses were calculated based on the gradient orientation error between the gradient orientation at the seed points and those at the search points [10]. Theoretically, this method can be applied to arbitrary rotation angles. These direct methods, however, are only applicable to large rigid rotational deformation and are not effective in analyzing significant tension or compression, in which case SIFT may fail to generate a set of corresponding feature points and the deformation cannot be effectively characterized in polar coordinates.

The incremental methods resolve the large deformation problem by inserting intermediate images between the reference image and the target image, so that the deformation between each two consecutive images is small enough for any existing DIC method. DIC is thus preferably performed dynamically on these images, and the final deformation is calculated by accumulating the per-frame results. To avoid decorrelation, the reference image is updated if the deformation becomes too large [11] or the correlation coefficients fall below a certain threshold [12,13]. The incremental methods not only solve the large deformation problem but also provide a means of tracking the evolution of deformation during the dynamic process, and thus are more generally applicable and useful than the direct methods. The open-source DIC application Ncorr [13] already employs an incremental method to handle large deformation. A similar idea has also been applied to dynamic measurement [14], such as fracture [15], strain under dynamic loading [16,17], and velocity fields [18], by incorporating even more intermediate frames. Nonetheless, due to the high computation burden of conventional DIC methods [4], the growth of the number of images during this dynamic process makes DIC even more time-consuming. Therefore, to use the incremental methods in practical DIC applications, the computation efficiency of DIC methods should be improved.

In terms of increasing computation efficiency, the inverse compositional Gauss-Newton (IC-GN) algorithm [19,20] was proposed as an alternative to the successful forward additive Newton-Raphson (FA-NR) algorithm. It eliminated redundant calculations on reference subsets without compromising accuracy [21,22]. In addition, a reliability-guided initial-guess transferring scheme and an interpolation coefficients look-up table (LUT) approach [20] were proposed to further reduce the computation cost of IC-GN. IC-GN was subsequently applied to develop a video-rate non-contact extensometer [23]. Besides purely algorithmic optimizations of DIC, CPU multithread computing strategies have also been attempted to accelerate DIC to even higher frame rates [17,23–25], which is crucial for the growing need for on-line monitoring and measurement of dynamically evolving displacements and strains [15,26,27]. Pan and Tian proposed a parallel implementation of their reliability-guided DIC algorithm by starting from multiple seed points of interest (POIs) and achieved a 7 times speedup compared with its sequential implementation [24]. Similarly, Shao et al. designed a real-time 3D DIC system [25,28], which reached a computation speed of more than 50,000 POIs/s when the subset size was set at 15 × 15. Wu et al. proposed a real-time DIC method [17] employing an efficient integer-pixel estimation method that combines an improved particle swarm optimization (PSO) algorithm with a block-based gradient descent search (BBGDS) algorithm. To achieve a high frame rate, they then applied OpenMP [29] to parallelize both the integer-pixel search algorithm and the IC-GN algorithm for sub-pixel registration. This system could run at 60 Hz when four POIs were interrogated simultaneously in one frame. VIC-2D, one of the well-known commercial DIC systems, was also reported to achieve a processing speed of up to 1.35 × 10⁵ POIs/s using a PC with a single quad-core CPU [30].

A general-purpose graphics processing unit (GP-GPU), as a more powerful parallel computing device than a multicore CPU, has also been used to further accelerate DIC in the past few years. To fully utilize the computation power of GPU, path-independent DIC schemes have been developed. Jiang et al. proposed a novel path-independent DIC (PI-DIC) algorithm to process each POI independently [31], and accelerated it on GPU (paDIC), reaching a computation speed of 1.66 × 10⁵ POIs/s [32]. Afterwards, paDIC was extended and applied to digital volume correlation (paDVC) with a computation speed of 1750 POIs/s [33]. A super-fast, path-independent DIC algorithm called FOLKI-D was proposed by Besnerais et al. for dense 2D and 3D displacement field estimation [34]. GPU was applied to pixelwise operations including solving the 2 × 2 linear systems and performing the sub-pixel interpolation in GPU texture memory [35]. Their method achieved a computation speed close to 2.5 × 10⁶ points/s for a 75% overlap among subsets in 2D DIC. As the zero-order shape function was employed for matching, the performance with higher-order shape functions needs further investigation. Very recently, Huang et al. proposed to use heterogeneous CPU and GPU parallel computing to accelerate the PI-DIC algorithm, which achieved higher speed than the GPU-based paDIC [36]. As will be shown in Section 4.2, this method can be simply integrated into the proposed DIC system framework. Although the computation efficiency of DIC algorithms has been greatly improved, current DIC methods cannot be directly applied to dynamic analysis without investigating an appropriate scheduling scheme for the entire measurement procedure, including data acquisition (or data capture), data analysis, and result saving and visualization. To the best of our knowledge, such an investigation has not been conducted in the literature, even though commercial DIC systems are available on the market [13,30,37].

For convenience, from now on, we refer to the correlation analysis as the DIC algorithm and to the process from data acquisition to result saving and visualization as the DIC (software) system. In this study, an accurate and flexible real-time DIC (RT-DIC) system is proposed with the following considerations:

(i) Since both multicore CPU and GPU have been used to accelerate DIC algorithms, appropriate computing platforms must be chosen for different measurement requirements. Thus, a performance comparison between CPU and GPU in implementing FFT-CC and IC-GN in the employed PI-DIC method is first performed.

(ii) Guided by the comparison result, it is found that a pipelined heterogeneous system framework that unifies the computing strengths of CPU and GPU is necessary to accelerate the entire DIC system. Based on different measurement scales and the different properties and performances of CPU and GPU, five Variants of the pipelined framework are explained and discussed. These Variants can be unified into a single RT-DIC system, which gives it the flexibility to fulfill different requirements in practical DIC applications.

(iii) Both the accuracy and speed of the entire framework are verified by a PC implementation of RT-DIC integrating Variants 2–5. Also, as the computation power of mobile devices keeps increasing, the feasibility of porting RT-DIC to mobile devices is investigated. We successfully implement Variants 2 and 5 on the CPU and GPU of an iPhone 5S, which opens the door to portable high-performance DIC systems using the proposed pipelined framework.

The rest of the paper is organized as follows. Section 2 briefly reviews the principle of the PI-DIC method used in the proposed system. Section 3 compares the speed of multicore CPU-based and GPU-based implementations of PI-DIC. Section 4 explains the design of the five Variants of the proposed RT-DIC framework, whose implementation and validation are given in Section 5 by two applications on a PC and an iPhone 5S. Section 6 concludes the paper.

2. Principle of the PI-DIC method

For completeness and for the convenience of the following sections, the PI-DIC method is briefly described here; more details can be found in [31]. Given a reference subset R centered at a POI in the reference image, DIC searches the target image for the subset T with the highest correlation coefficient, from which the displacement vector of this POI is determined. The PI-DIC method is able to process different POIs independently [31], and the principle of processing one POI-centric subset is schematically shown in Fig. 1. This method includes two parts: the fast Fourier transform-based cross-correlation (FFT-CC) for the initial guess of the displacement vector, and IC-GN for its sub-pixel refinement.

For FFT-CC, the zero-order shape function is assumed for a subset. The u- and v-displacements with integer-pixel accuracy can be easily and quickly estimated by Fourier transforms as

$$C_{ZNCC} = \mathrm{FFT}^{-1}\left\{ \mathrm{FFT}^{*}\left(\left[\bar{R}\right]\right) \cdot \mathrm{FFT}\left(\left[\bar{T}\right]\right) \right\}, \tag{1}$$

where $\bar{R}$ and $\bar{T}$ are two matrices containing the zero-normalized values in the reference and the target subsets, respectively; FFT and FFT$^{-1}$ represent the forward and inverse Fourier transforms, respectively; and $*$ indicates the complex conjugate. The results are augmented into an integer-pixel deformation parameter vector $\mathbf{p}_0 = [u, 0, 0, v, 0, 0]^T$ as the initial guess for IC-GN. The assumption of a zero-order shape function in FFT-CC requires small in-plane strain and small rigid body rotation [19] and may yield unreliable estimates otherwise. This problem can be mitigated by capturing intermediate frames between the reference and target states.

In IC-GN, the first-order shape function is adopted in this study, although higher-order shape functions can also be used. The initial guess $\mathbf{p}_0$ calculated from FFT-CC is passed to IC-GN to obtain the sub-pixel deformation parameter vector $\mathbf{p} = [u, u_x, u_y, v, v_x, v_y]^T$, where $u_x$, $u_y$, $v_x$, and $v_y$ are the gradients of $u$ and $v$. IC-GN minimizes the zero-normalized sum of squared differences (ZNSSD) criterion iteratively. In each iteration, the incremental deformation parameter $\Delta\mathbf{p} = [\Delta u, \Delta u_x, \Delta u_y, \Delta v, \Delta v_x, \Delta v_y]^T$ is calculated as

$$\Delta\mathbf{p} = \mathbf{H}^{-1} \sum_{\boldsymbol{\xi}} \left\{ \left[ \nabla R(\mathbf{P}+\boldsymbol{\xi}) \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^{T} \left[ \frac{\bar{R}_n}{\bar{T}_n} \bar{T}\big(\mathbf{P}+\mathbf{W}(\boldsymbol{\xi};\mathbf{p})\big) - \bar{R}(\mathbf{P}+\boldsymbol{\xi}) \right] \right\}, \tag{2}$$

where $\mathbf{P}(x_0, y_0)$ is the position of the POI; $\boldsymbol{\xi} = (\Delta x, \Delta y)$ represents the local coordinates of pixels within the reference subset R; $\nabla R(\mathbf{P}+\boldsymbol{\xi})$ is the gradient within the reference subset; $\mathbf{W}(\boldsymbol{\xi};\mathbf{p})$ is the warp function,

$$\mathbf{W}(\boldsymbol{\xi};\mathbf{p}) = \begin{bmatrix} 1+u_x & u_y & u \\ v_x & 1+v_y & v \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \\ 1 \end{bmatrix}; \tag{3}$$

$\partial\mathbf{W}/\partial\mathbf{p}$ is the Jacobian matrix of $\mathbf{W}(\boldsymbol{\xi};\mathbf{p})$,

$$\frac{\partial \mathbf{W}}{\partial \mathbf{p}} = \begin{bmatrix} 1 & \Delta x & \Delta y & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & \Delta x & \Delta y \end{bmatrix}; \tag{4}$$

$\mathbf{W}(\boldsymbol{\xi};\Delta\mathbf{p})$ is the incremental warp function used to adjust the shape of the reference subset,

$$\mathbf{W}(\boldsymbol{\xi};\Delta\mathbf{p}) = \begin{bmatrix} 1+\Delta u_x & \Delta u_y & \Delta u \\ \Delta v_x & 1+\Delta v_y & \Delta v \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \\ 1 \end{bmatrix}; \tag{5}$$

and $\mathbf{H}^{-1}$ denotes the inverse of the 6 × 6 Hessian matrix $\mathbf{H}$,

$$\mathbf{H} = \sum_{\boldsymbol{\xi}} \left\{ \left[ \left( \nabla R(\mathbf{P}+\boldsymbol{\xi}) \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right)^{T}_{6\times 1} \left( \nabla R(\mathbf{P}+\boldsymbol{\xi}) \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right)_{1\times 6} \right] \right\}. \tag{6}$$

In each iteration, $\mathbf{p}$ is updated as

$$\mathbf{W}(\boldsymbol{\xi};\mathbf{p}) \leftarrow \mathbf{W}\left[ \mathbf{W}^{-1}(\boldsymbol{\xi};\Delta\mathbf{p}), \mathbf{p} \right] = \mathbf{W}(\boldsymbol{\xi};\mathbf{p})\, \mathbf{W}^{-1}(\boldsymbol{\xi};\Delta\mathbf{p}). \tag{7}$$

The stopping criteria for the iterations are: (i) $\sqrt{(\Delta u)^2 + (\Delta v)^2} \le 0.001$, or (ii) a maximum iteration number of 20 is reached.

To reduce the redundant calculations among the IC-GN iterations, quantities such as $\nabla R(\mathbf{P}+\boldsymbol{\xi})$, $\mathbf{H}^{-1}$, $\partial\mathbf{W}/\partial\mathbf{p}$, the zero-normalized reference subset $\bar{R}_n = \sqrt{\sum_{\boldsymbol{\xi}} \{ R[\mathbf{P}+\mathbf{W}(\boldsymbol{\xi};\mathbf{0})] - R_m \}^2}$ (where $R_m = \frac{1}{N}\sum_{i=0}^{N-1} R_i$), and the bi-cubic interpolation coefficients look-up table (LUT) are precomputed, as shown in Fig. 1.

PI-DIC's path independence makes each POI immune to possible errors from neighboring POIs, and thus achieves an accuracy comparable to or even higher than the existing path-dependent DIC algorithms. Moreover, the following characteristics make PI-DIC well suited to high-speed DIC implementations:

(i) Calculations at individual POIs are independent of each other, so they have been well accelerated by parallel computing on GPU [32,33]. A higher processing rate can be expected when more POIs are processed simultaneously on a more advanced GPU or on multiple GPUs.

(ii) The precomputation, except the calculation of the interpolation coefficients LUT, only has to be performed once if the reference image remains unchanged.

(iii) As observed from Fig. 1, FFT-CC and IC-GN are two separate stages and can be performed asynchronously. In other words, while IC-GN is performing sub-pixel registration after receiving the initial guesses from FFT-CC, FFT-CC can immediately continue providing the initial guesses for subsequent image frames without interrupting the execution of IC-GN.

The last feature, which has not yet been mentioned in the literature, is very useful for a flexible configuration of the proposed RT-DIC system. As will be shown in Section 3.2, RT-DIC enables one to invoke the two algorithms either on CPU or on GPU, and to decide how often each of them is performed based on the specific demands of a practical application.

Fig. 1. Schematic illustration of the PI-DIC method combining the FFT-CC algorithm and the IC-GN algorithm.

3. Multicore CPU vs. GPU

3.1. Multicore CPU and GPU in general

Multicore CPU and GPU are two widely used general-purpose parallel computing platforms due to their cost effectiveness, short development cycle, and excellent portability and scalability. However, the different design philosophies of CPU and GPU result in different roles in parallel computing applications and systems.

A modern CPU contains up to 72 physical cores. However, each core is optimized for fast serial code execution by integrating low-latency arithmetic and logic units (ALUs), multi-level caches, branch prediction units, and so on, which consume large chip area and power and limit the total number of cores within a CPU. Thus, CPU is good at task-parallel problems, in which different tasks and logic can be allocated to different CPU cores and executed asynchronously. GPU, on the other hand, is designed with a philosophy of making computation “wider than faster”. A GPU contains many more cores than a CPU. However, each core is only good at performing simple arithmetic operations, and the latency is very high. Therefore, GPU is mainly used for data-parallel problems, in which each GPU core can be allocated different data but all the cores execute the same instructions. Also, due to the high latency, GPU only achieves its peak performance when its resources are fully occupied.

3.2. Experimental verification of parallelizing PI-DIC on CPU and GPU

Regarding the application of CPU and GPU parallel computing in DIC, PI-DIC provides an easy data-parallel model for accelerating DIC towards real-time performance. The GPU-based PI-DIC method has already been realized in [32], where a real-time processing rate of 30 fps (frames per second) was achieved on a laptop when the number of POIs was less than 4700. The details of the GPU-based PI-DIC implementation can be found in [32,33].
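Because each POI is handled independently, the per-POI work of Section 2 is what gets replicated across GPU threads or CPU cores. As a rough illustration only, the sketch below shows the integer-pixel FFT-CC step of Eq. (1) for a single subset in plain Python. The function names (`zero_normalize`, `dft2`, `fftcc_integer_shift`) are hypothetical, the naive DFT stands in for a real FFT library, and the normalization by the root sum of squared deviations is an assumption about how the zero-normalized subsets are formed; it is not the authors' implementation.

```python
import cmath
import math


def zero_normalize(subset):
    """Subtract the mean and divide by the root sum of squared deviations."""
    values = [v for row in subset for v in row]
    mean = sum(values) / len(values)
    norm = math.sqrt(sum((v - mean) ** 2 for v in values))
    return [[(v - mean) / norm for v in row] for row in subset]


def dft2(a, inverse=False):
    """Naive 2D DFT (O(n^4)); a real system would call an FFT library."""
    n_rows, n_cols = len(a), len(a[0])
    sign = 1 if inverse else -1
    out = [[0j] * n_cols for _ in range(n_rows)]
    for k in range(n_rows):
        for l in range(n_cols):
            acc = 0j
            for y in range(n_rows):
                for x in range(n_cols):
                    angle = 2 * math.pi * (k * y / n_rows + l * x / n_cols)
                    acc += a[y][x] * cmath.exp(sign * 1j * angle)
            out[k][l] = acc / (n_rows * n_cols) if inverse else acc
    return out


def fftcc_integer_shift(ref, tgt):
    """Return (u, v, peak): integer x/y shift maximizing the circular ZNCC."""
    f_ref = dft2(zero_normalize(ref))
    f_tgt = dft2(zero_normalize(tgt))
    # Eq. (1): conjugate of the reference spectrum times the target spectrum.
    product = [[fr.conjugate() * ft for fr, ft in zip(row_r, row_t)]
               for row_r, row_t in zip(f_ref, f_tgt)]
    corr = dft2(product, inverse=True)
    n_rows, n_cols = len(corr), len(corr[0])
    peak, (dy, dx) = max((corr[y][x].real, (y, x))
                         for y in range(n_rows) for x in range(n_cols))
    # Map wrapped peak indices to signed shifts.
    v = dy - n_rows if dy > n_rows // 2 else dy
    u = dx - n_cols if dx > n_cols // 2 else dx
    return u, v, peak
```

For a target subset that is a circular shift of the reference by (u, v) pixels, the peak of the correlation surface should sit at that shift with a value close to 1. In a parallel implementation, this function body would be applied to every POI independently.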


As multicore CPU is another powerful and widely used parallel computing platform for DIC acceleration [17,24,25], it is important to understand its performance in executing PI-DIC, especially in comparison with the GPU platform. The multicore-based implementation of PI-DIC is more straightforward than its GPU counterpart. Assume that we have M POIs and N CPU threads. The M POIs are evenly distributed to the N threads, each of which receives M/N POIs. If M is not divisible by N, the remaining M % N POIs (where % indicates the modulo operation) are assigned to the first M % N threads. Within each thread, each POI is processed according to the procedure shown in Fig. 1. The experiment is performed on twenty-one frames of simulated speckle images, each of which contains 514 × 514 pixels. The first image is generated according to

$$R(x, y) = \mathrm{round}\left\{ \sum_{i=1}^{s} I_i \exp\left[ -\frac{(x - x_i)^2 + (y - y_i)^2}{r^2} \right] \right\}, \tag{8}$$

where s represents the total number of speckles; $I_i$ and $(x_i, y_i)$ represent the random peak intensity value of the i-th speckle and its center position, respectively; r is the radius of the speckle; and the function round(x) returns the nearest integer to x. The subsequent twenty images are generated by translating R(x, y) along the x-axis according to the Fourier domain shifting theorem [38], with a pre-set displacement of 0.05 pixels between each two consecutive frames. The subset size is set at 21 × 21 pixels.

All the programs are written in C++ with the Microsoft Visual C++ 2013 x64 compiler and CUDA 8.0, and tested on a Dell® Precision workstation equipped with an Intel® Xeon® CPU E5-1650 (6-core, 3.20 GHz), 16.0 GB RAM, and an NVIDIA GeForce GTX 680 GPU (8 SMs with 1536 CUDA cores and 2 GB 256-bit RAM). On the CPU side, OpenMP [29] is chosen as the threading library, and the SSE3 compiler optimization is enabled. The number of active CPU threads N for performing FFT-CC and/or IC-GN is set to the number of physical cores, i.e. N = 6. The frame rates of performing only FFT-CC or only IC-GN on CPU and GPU are shown in Figs. 2(a) and 2(b), respectively, and the frame rates of performing both FFT-CC and IC-GN on different combinations of the two parallel computing platforms (i.e. CPU and GPU) are shown in Fig. 2(c), from which the following results can be observed:

(i) As seen from Fig. 2(a), when only FFT-CC is performed, GPU shows its superiority. However, CPU can run faster than GPU when the number of POIs is less than 36, since FFT can be executed very efficiently on CPU and the problem size is too small to fully utilize the computation resources of GPU.

(ii) On the other hand, as seen from Fig. 2(b), when only IC-GN is performed, GPU always outperforms CPU, since IC-GN is more complex than FFT-CC and the finer-grained parallelization achieved on GPU thus gains a higher speed.

(iii) As a consequence of (i) and (ii), and as shown in Fig. 2(c), when FFT-CC + IC-GN is performed, GPU is consistently and clearly superior to CPU. Thus, GPU should be considered the first choice for DIC acceleration.

(iv) However, a special and important case appears in Fig. 2(c): multicore CPU outperforms GPU when the number of POIs is 4. This number can be even larger if a more advanced CPU containing more cores with higher clock frequencies is used, which indicates that multicore CPU can be a more convenient choice for applications in which the required number of POIs is very small; for example, the video extensometer in [23] used 4 POIs.

Fig. 2. Frame rates of using different combinations of CPU and GPU in performing (a) only FFT-CC, (b) only IC-GN, and (c) FFT-CC + IC-GN.

These observations serve as the foundation of the proposed flexible pipelined real-time DIC system framework, which will be detailed in the next section.

4. Flexible and heterogeneous real-time DIC system framework

4.1. A general pipelined framework

A practical real-time DIC system should include the entire procedure of data acquisition, DIC data processing, and result saving and visualization, and its processing rate is expected to match the data acquisition rate. Therefore, a pipelined system framework [39] that overlaps data acquisition and processing is a strong candidate solution. Based on the different strengths of CPU and GPU mentioned in Section 3.1, one possible design of a general pipelined DIC system framework is shown in Fig. 3, in which Threads 0–2 refer to CPU threads. Thread 0 takes charge of the response from the Graphical User Interface (GUI) and result visualization, and Thread 1 and Thread 2 are responsible for data acquisition and result saving, respectively. Also note that Thread 0 acts as the “master thread” responsible for the allocation and scheduling of the other worker threads, i.e. Threads 1 and 2. In this framework, processes within each thread depend on each other, but they can be executed concurrently. For example, as soon as Thread 2 has obtained the calculated results from FFT-CC and IC-GN and begins to output the results to the hard disk, the FFT-CC and IC-GN calculations of the next frame are immediately started without affecting the output process. Such a pipelined structure for a DIC system has not been found in the literature.

Fig. 3. Schematic of a general pipelined DIC system framework based on the PI-DIC method.

The black box containing “FFT-CC + IC-GN” performs image correlation and is the most computationally heavy process in the pipeline. However, it can be assigned to CPU threads and/or GPU with much flexibility. According to where FFT-CC and IC-GN are executed, five Variants of the design of the black box are highlighted in Table 1 and explained in the next subsections. The other combinations are not considered in this study since they are either obviously inappropriate or their performances are lower than those of the proposed five Variants, making them less practical.

Table 1. Five Variants of the general pipelined DIC system framework, where “V” indicates Variant. Rows give the platform for FFT-CC; columns give the platform for IC-GN.

| FFT-CC \ IC-GN | CPU + GPU | GPU | CPU |
| --- | --- | --- | --- |
| CPU + GPU | V1: large-scale | – | – |
| GPU | – | V2: large-scale | V3: extremely large-scale |
| CPU | – | V4: small-scale | V5: extremely small-scale |

4.2. Variant 1: FFT-CC + IC-GN on CPU + GPU if high-end CPUs are available

If both CPU and GPU resources are available, it is reasonable to involve both of them in the correlation computation. Indeed, Huang et al. recently accelerated PI-DIC using both CPU and GPU [36] and found that heterogeneous CPU and GPU parallel computing can lead to further speed improvement compared with GPU alone. Their speedup was achieved by allocating POIs to both CPU and GPU and processing them on the two platforms simultaneously. This method is abstracted and illustrated in Fig. 4, in which Thread 3 is used to schedule and control the GPU implementations while Threads 4 to N are responsible for the multicore implementations executed on the CPU side. It is worth mentioning that, since the data allocated to GPU is always much larger than that allocated to CPU, the results calculated on GPU can be directly mapped to OpenGL buffers and rendered as OpenGL textures [41] based on the CUDA and OpenGL interoperability [40], which eliminates the memory transfer latency between GPU and CPU. However, as indicated by the blue and red arrowed lines in Fig. 4, the results calculated on GPU should be transferred back to CPU if they need to be saved, and the results calculated on CPU should be transferred to GPU for display. In this Variant, in order to make the CPU's contribution significant, a higher-end CPU containing more cores should be used (e.g. the 12-core CPU used in [36]). Furthermore, a good distribution of POIs to CPU and GPU is not straightforward and needs a series of trial tests [39].
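Two allocation steps are involved when CPU threads take part in the correlation: the even split of M POIs over N CPU threads described in Section 3.2, and the split of POIs between CPU and GPU. The sketch below illustrates both in Python. `distribute_pois` follows the M % N rule stated in the text; `split_cpu_gpu` is only a hypothetical starting heuristic (shares proportional to the POIs/s measured on each device), assumed here for illustration, since the text notes that a good split in practice requires trial tests.

```python
def distribute_pois(num_pois, num_threads):
    """Split POI indices evenly; the first (M % N) threads take one extra POI."""
    base, extra = divmod(num_pois, num_threads)
    chunks, start = [], 0
    for thread_id in range(num_threads):
        size = base + (1 if thread_id < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def split_cpu_gpu(num_pois, cpu_rate, gpu_rate):
    """Hypothetical heuristic: share POIs in proportion to measured POIs/s."""
    n_gpu = round(num_pois * gpu_rate / (cpu_rate + gpu_rate))
    return num_pois - n_gpu, n_gpu


if __name__ == "__main__":
    # 10 POIs over 4 threads: two threads get 3 POIs, two get 2.
    for thread_id, chunk in enumerate(distribute_pois(10, 4)):
        print(thread_id, list(chunk))
```

The measured rates passed to `split_cpu_gpu` would come from a calibration run on the actual hardware; the result is a first guess to be refined by the trial tests mentioned above.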

Fig. 4. Schematic of Variant 1 of the general real-time DIC system framework.

4.3. Variant 2: FFT-CC + IC-GN on GPU if high-end CPUs are unavailable

When a common multicore CPU is used, the CPU threads that can be assigned to the FFT-CC + IC-GN computation are very few and their contribution is limited. In this case, GPU alone can be used for FFT-CC + IC-GN, which can be seen as a special case of Variant 1 obtained by discarding the CPU threads, as shown in Fig. 5. This neat approach has been adopted and realized in [32]. Although this design may be a little slower than Variant 1, a real-time frame rate (i.e. 30+ fps) can be reached with up to 10,000 POIs processed within each image frame.

Fig. 5. Schematic of Variant 2 of the general real-time DIC system framework.

4.4. Variant 3: FFT-CC on GPU and IC-GN on CPU for ultra-fast large-scale deformation monitoring with sub-pixel compensation

A large-scale deformation monitoring system can be used to monitor deformation on the surface of extremely large structures such as bridges [42], which requires real-time computation and visualization of displacement or strain fields within a large region of interest (ROI). The system framework shown in Fig. 6 demonstrates a possible candidate solution for extremely large-scale deformation monitoring, in which FFT-CC is performed on GPU and the obtained integer-pixel results are directly used for displacement visualization. IC-GN, on the other hand, is executed in several CPU threads in the background, and only when sub-pixel accuracy is demanded. In this Variant, the GPU-accelerated FFT-CC is extremely fast even when tens of thousands of POIs are contained in each image frame. For example, as shown in Fig. 2(a), 30+ fps is achieved even when 40,000 POIs are processed in every frame; in other words, over 1.2 million POIs are computed within one second. This ultra-high-speed process could be useful for monitoring a structure with a large surface area and rapid deformation, for example, a bridge that vibrates intensively when a heavily loaded truck passes by. Although IC-GN is not the focus, it can still be performed in the background using the spare CPU threads without affecting the FFT-CC computation, and the sub-pixel accuracy may be useful for compensation whenever the FFT-CC measurement accuracy degrades to a certain level.
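One way to picture this Variant is as a producer/consumer pair: the fast FFT-CC stage produces an integer-pixel result for every frame and hands it to a background worker that refines it with IC-GN when capacity allows. The Python sketch below is an illustration under stated assumptions, not the authors' code: `fftcc` and `icgn` are placeholder callables, the bounded queue size `max_pending` is hypothetical, and skipping refinement when the queue is full is just one possible policy.

```python
import queue
import threading


def run_variant3(frames, fftcc, icgn, max_pending=8):
    """Show integer-pixel results at once; refine them in the background."""
    pending = queue.Queue(maxsize=max_pending)  # bounded hand-off buffer
    displayed, refined, dropped = [], [], []

    def refine_worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more frames
                break
            index, frame, guess = item
            refined.append((index, icgn(frame, guess)))

    worker = threading.Thread(target=refine_worker)
    worker.start()
    for index, frame in enumerate(frames):
        guess = fftcc(frame)              # fast stage (on GPU in the text)
        displayed.append((index, guess))  # available for display immediately
        try:
            pending.put_nowait((index, frame, guess))
        except queue.Full:
            dropped.append(index)         # refinement skipped for this frame
    pending.put(None)
    worker.join()
    return displayed, refined, dropped
```

With a single worker the refined results come back in frame order; a real system would replace the placeholder callables with the GPU-based FFT-CC and the multicore IC-GN, and would choose its own policy for a full buffer.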

inspires us to propose Variant 4 as illustrated in Fig. 7, i.e., FFT-CC is performed on CPU and the subsequent IC-GN is off-loaded to the GPU side. This Variant is practically useful when a small-scale deformation measurement system is developed to monitor the in situ deformation of small parts on an object’s surface [17] or develop a video-based extensometer [23] where POIs are sparsely distributed on an object’s surface. As illustrated in Fig. 7, on the CPU side, several CPU threads are spawned to perform FFT-CC together, after which the results are passed to Thread 3 which is responsible for the communication between CPU and GPU as well as launching GPU kernels to perform IC-GN. To use this framework, the speed difference between CPU and GPU implementations of FFT-CC should be pre-examined first to know the critical point where CPU is faster than GPU so that a proper scheduling of CPU threads and GPU can be obtained to reach the desired performance. In this study, as explained in Section 3.2, CPU is preferred to GPU when the number of POIs is less than 36. This number is subject to the computation powers of the employed CPU and GPU. Generally, a more powerful CPU makes this number larger. 4.6. Variant 5: FFT-CC + IC-GN on CPU for extremely small-scale measurement

4.5. Variant 4: FFT-CC on CPU and IC-GN on GPU for small-scale measurement

As the number of POIs keeps decreasing, as indicated by observation (iv) in Section 3.2, the superiority of performing IC-GN on GPU will also reduce since the problem size allocated to GPU is not large enough to

According to observation (i) in Section 3.2, CPU is more efficient in performing FFT-CC than GPU when the number of POIs is small, which 12

T. Wang et al.

Optics and Lasers in Engineering 110 (2018) 7–17

Fig. 6. Schematic of Variant 3 of the general real-time DIC system framework.

Fig. 7. Schematic of Variant 4 of the general real-time DIC system framework.

Fig. 8. Schematic of Variant 5 of the general real-time DIC system framework.

fully occupy its computation resources and thus cannot hide the high latency of GPU threads. If the required number of POIs is extremely small (i.e.,≤4 as shown in Fig. 2(c)), the pure CPU-based FFT-CC + IC-GN Variant 4, as shown in Fig. 8, is expected to outperform all the three Variants explained above, in which case, several CPU threads are scheduled to execute the multicore implementations of FFT-CC and IC-GN. This Variant does not require GPU involvement. Although it is considered novel in the context of PI-DIC including FFT-CC + IC-GN processes, pure multicore CPU-accelerated methods have already been attempted for path-

dependent DIC in [17,24,25,28]. Also, this Variant can be considered if GPU is not available, and at least a real-time 30+ fps can be achieved when less than 100 POIs are analyzed per frame. 4.7. Reference updating scheme Fast image acquisition and processing rates result in an issue of decorrelation when the scale deformation keeps growing while the reference image remains unchanged. Thus, a reference image updating 13


Fig. 9. Schematic illustration of the reference image updating scheme every m frames.

Fig. 10. Parameters setting dialog of the proposed RT-DIC system. Variants 2–5 can be selected in the right panel.

scheme is necessary. A simple updating scheme is shown in Fig. 9, where the reference image is updated every m frames (m is set manually). This scheme is generally applicable to all Variants, but special consideration is required for Variant 3, in which the integer-pixel results calculated by the GPU-based FFT-CC algorithm must be transferred back to the CPU side and saved into a concurrent buffer before they can be further refined by the CPU-based IC-GN algorithm. Since the GPU-based FFT-CC algorithm is much faster than the CPU-based IC-GN, the concurrent buffer used to cache the integer-pixel results may keep growing, and an overflow will happen if m is not carefully tuned. Therefore, m is recommended to be set as the ratio of the per-frame running time of the CPU-based IC-GN to that of the GPU-based FFT-CC. Otherwise, either a frame dropping mechanism or temporarily writing the results to hard disk can be considered. However, these two solutions suffer from an accuracy loss and a decrease in computation efficiency, respectively. To avoid the manual setting of m, a more adaptive alternative is to update the reference image only when the correlation becomes unreliable [12], i.e., if the average of the $C_{ZNCC}$ values among POIs is less than 0.8, the reference image is updated to the newly captured frame. It is also worth mentioning that, when updating occurs, the precomputation shown in Fig. 1 must also be recalculated, which may impose an additional computation burden. However, as verified in [32], the time consumed by precomputation is negligible compared with the more complex FFT-CC and IC-GN algorithms, so it has very little influence on the overall performance of the pipeline.

5. Implementation and validation

In this section, based on Variants 2–5, a flexible 2D RT-DIC system is implemented on a PC. Furthermore, porting the proposed DIC framework to both the multicore CPU and the GPU of a mobile phone is also attempted.

5.1. PC implementation and validation

To date, the PC is almost the sole platform for realizing high-performance DIC systems. The entire pipeline shown in Fig. 3, which includes Variants 2–5, is implemented on a PC. Fig. 10 shows the parameter configuration dialog of the RT-DIC system. After selecting a ROI (shown by the blue rectangle), users choose which Variant to use from the four options shown in the right panel. For example, Variant 2 is enabled in Fig. 10. Note that Variants 3–5 involve the multicore CPU-based implementations of FFT-CC or IC-GN. Since each core


Fig. 11. Comparison of the pre-set and calculated strain fields: (a) pre-set strain field of “Sample 14 L1 Amp0.1.tif”; (b) calculated strain field of “Sample 14 L1 Amp0.1.tif”; (c) pre-set and calculated strain fields of the first row of “Sample 14 L1 Amp0.1.tif”; (d) pre-set strain field of “Sample 14 L3 Amp0.1.tif”; (e) calculated strain field of “Sample 14 L3 Amp0.1.tif”; (f) pre-set and calculated strain fields of the first row of “Sample 14 L3 Amp0.1.tif”.

corresponds to two threads if hyperthreading technology [43] is supported, N physical cores can support 2N threads. Among these threads, the default setting in the system uses N threads for the CPU-based FFT-CC/IC-GN computation, and the other N threads are responsible for all the other functions in the pipeline or simply remain idle. For Variant 3, the number of threads involved in the calculation of IC-GN in the background can also be customized.

5.1.1. Uniform deformation

The accuracy of the PC implementation is first examined by running Variants 2–5 on the same set of simulated images as used in Section 3.2. Within each image, 95 × 95 = 9025 POIs are evenly distributed every 5 pixels and the subset size is equal to 21 × 21. All four Variants achieve the same accuracy and precision: the mean bias error falls into a range between $-3.2 \times 10^{-3}$ and $3.2 \times 10^{-3}$, and the standard deviation is less than $2.1 \times 10^{-3}$. The processing rate is examined by inputting the simulated images to the system from beginning to end and then from end to beginning repeatedly. The reference image is updated every frame (i.e., m = 1). The highest average processing rate is around 34 fps when Variant 2 is selected. With the same configuration, Variant 3 achieves 2 fps. However, if m = 100 (i.e., the speed ratio between the GPU-based FFT-CC and the CPU-based IC-GN shown in Fig. 2(a) and (b)), Variant 3 can run at a very high frame rate of 223 fps. Variant 4 is tested on a series of small-scale measurements, in which the number of POIs ranges from 4 to 3600. The achieved frame rates fall into a range from 34 fps to 2288 fps. Finally, Variant 5 is tested on 4 POIs and a processing rate of 2320 fps is achieved. It is found that the performance of the RT-DIC system coincides with the experimental verification explained in Section 3 (see Fig. 2), since the DIC data processing stage is the most time-consuming part in the entire pipeline. Moreover, the number of POIs that can be processed by each of Variants 2–5 in real-time (i.e., ≈30 fps) is given in Table 2.

Table 2 Number of POIs that can be processed in real-time (≈30 fps) by Variants 2–5.

Variant      Number of POIs    Frame rate (fps)
Variant 2    10,000            29.8
Variant 3    48,400            34.0
Variant 4    5625              30.1
Variant 5    100               32.3
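The two reference-updating options described in Section 4.7, namely periodic updating every m frames (m = 1 and m = 100 in the tests above) and adaptive updating when the average $C_{ZNCC}$ among POIs drops below 0.8, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in the RT-DIC system: the class name `ReferenceUpdater` and its fields are hypothetical, and the ZNCC values are assumed to be supplied by the DIC stage.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional, Sequence


@dataclass
class ReferenceUpdater:
    """Hypothetical helper that decides when to replace the reference image."""

    every_m_frames: Optional[int] = None  # periodic mode if set (m >= 1)
    zncc_threshold: float = 0.8           # adaptive mode: threshold on mean ZNCC
    frames_since_update: int = 0

    def should_update(self, zncc_per_poi: Sequence[float]) -> bool:
        """Return True if the newly captured frame should become the reference.

        zncc_per_poi holds one ZNCC value per POI and must be non-empty.
        """
        self.frames_since_update += 1
        if self.every_m_frames is not None:
            # Periodic scheme: update once every m frames.
            update = self.frames_since_update >= self.every_m_frames
        else:
            # Adaptive scheme: update when correlation becomes unreliable.
            update = fmean(zncc_per_poi) < self.zncc_threshold
        if update:
            self.frames_since_update = 0
        return update


# Periodic updating with m = 3 (an arbitrary value chosen for the example).
periodic = ReferenceUpdater(every_m_frames=3)
periodic_decisions = [periodic.should_update([0.99]) for _ in range(6)]

# Adaptive updating with the 0.8 threshold quoted in the text.
adaptive = ReferenceUpdater()
adaptive_decisions = [adaptive.should_update(c) for c in ([0.95, 0.90], [0.70, 0.85])]
```

In a pipelined setting, a decision of `True` would also trigger the precomputation for the new reference image, which the text notes is cheap relative to FFT-CC and IC-GN.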

5.1.2. Non-uniform deformation

The accuracy of the PC implementation is further examined on Sample 14 downloaded from the SEM 2D DIC challenge website [44], in which varying strains are non-uniformly introduced into the deformed speckle patterns using the FFT method [45]. The reference image “Sample14 Reference.tif” as well as the two deformed images “Sample 14 L1 Amp0.1.tif” and “Sample 14 L3 Amp0.1.tif” are imported into the proposed RT-DIC system for analysis. Within each image, 326 × 34 = 11,084 POIs are evenly distributed every 5 pixels and the subset size is equal to 21 × 21. Since the number of POIs is large and the deformation is complex, Variant 2 is selected. Based on the obtained displacement fields, the strain fields are further calculated using the method proposed in [46].
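As a rough illustration of how POIs can be laid out on a regular grid with a 5-pixel spacing and 21 × 21 subsets, consider the sketch below. The function name `poi_grid`, the ROI size of 1646 × 186 pixels, and the convention that every subset must lie fully inside the ROI are assumptions made for this example; the text states only the grid spacing, the subset size, and the resulting POI count.

```python
from typing import List, Tuple


def poi_grid(roi_x: int, roi_y: int, roi_w: int, roi_h: int,
             spacing: int = 5, subset: int = 21) -> List[Tuple[int, int]]:
    """Return POI centres (x, y) on a regular grid inside a rectangular ROI.

    Assumed convention: a POI is kept only if its subset, centred on the POI,
    fits entirely within the ROI.
    """
    half = subset // 2  # 10 pixels on each side for a 21 x 21 subset
    xs = range(roi_x + half, roi_x + roi_w - half, spacing)
    ys = range(roi_y + half, roi_y + roi_h - half, spacing)
    # Row-major order: all POIs of the first row, then the second row, etc.
    return [(x, y) for y in ys for x in xs]


# Hypothetical ROI chosen so that the grid matches the 326 x 34 layout.
pois = poi_grid(0, 0, 1646, 186)
num_pois = len(pois)
```

Because each POI is independent in the path-independent method, a flat list like this can be split across CPU threads or mapped to GPU threads without any ordering constraint.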


Fig. 12. Visualization of the horizontal and vertical displacement fields calculated from the flexible real-time 2D RT-DIC system, where a blue-green-red color map is used. The green color indicates zero displacement, and the color changes towards red if the displacement is along the positive x or y axis.

Fig. 11 compares the pre-set and the calculated strain fields of Sample 14. The root-mean-square errors (RMSEs) between the calculated strain fields and the pre-set ones are $4.4 \times 10^{-5}$ and $4.3 \times 10^{-4}$ for “Sample 14 L1 Amp0.1.tif” and “Sample 14 L3 Amp0.1.tif”, respectively, which indicates that RT-DIC can achieve a high accuracy in the measurement of non-uniform deformation. The processing rate is examined by repeatedly performing DIC analysis on the two deformed images “Sample 14 L1 Amp0.1.tif” and “Sample 14 L3 Amp0.1.tif” alternately with the same reference image “Sample14 Reference.tif” in the RT-DIC system. The average frame rate reaches 23 fps, which is similar to the speed performance of the uniform deformation example shown in Section 5.1.1.
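For reference, the RMSE between a calculated and a pre-set strain field can be computed as in the following sketch. The function name `rmse` and the toy values are illustrative assumptions; the strain fields are assumed to be flattened into equal-length sequences with one value per POI.

```python
import math
from typing import Sequence


def rmse(calculated: Sequence[float], preset: Sequence[float]) -> float:
    """Root-mean-square error between two equal-length sequences of strains."""
    if len(calculated) != len(preset) or len(calculated) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((c - p) ** 2 for c, p in zip(calculated, preset))
    return math.sqrt(squared_error / len(calculated))


# Toy data (not taken from Sample 14): errors of +1e-4 and -1e-4 at two POIs.
toy_rmse = rmse([1.0e-4, -1.0e-4], [0.0, 0.0])
```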

5.1.3. Real experiment

A real experiment is then carried out as shown in Fig. 12, in which the specimen is made by pasting a speckle pattern image captured in a real experiment onto a flat board. A horizontal force is exerted on the specimen to continuously slide it towards the right, and the speckle pattern images are captured at a frame rate of 30 fps. The white box represents the ROI, in which the displacement field is visualized using a blue-green-red colormap, where the green color indicates zero displacement. Even though the image quality is not high, RT-DIC works successfully for all Variants. A ROI containing 9,440 POIs is first selected. Based on these configurations, Variant 2 is used, and the horizontal and vertical displacement fields of the POIs at frames 0, 24, 40, and 101 are shown in Fig. 12. Note that, as shown in Fig. 2(c), the DIC data processing rate should be larger than 30 fps. However, it is capped at 30 fps by the fixed acquisition rate of the employed camera. Variant 3 also achieves 30 fps if the reference image updating interval m = 100 is used. In fact, as mentioned in the experiment performed on simulated images, it can potentially be even faster than Variant 2 if a more advanced camera with a higher acquisition rate is used. Finally, for the same number of POIs, Variants 4 and 5 only run at 14 fps and 1.4 fps, respectively, which cannot match the data acquisition rate. Nevertheless, a high frame rate can be easily obtained by decreasing the number of POIs.

5.2. Mobile device implementation and validation

Due to the increasing popularity of mobile devices, we have also attempted to port Variants 2 and 5 of the proposed RT-DIC system framework to an iPhone 5S equipped with an A7 64-bit dual-core CPU and a PowerVR (Series 6) G6430 GPU.

The CPU-based mobile version, named iDIC, has been published in the Apple App Store. It employs the same implementation explained in Section 3.2, and pThread [47] is used as the threading library since it is inherently supported by Unix-based operating systems. Although it requires about 3 s to process one frame containing only 210 POIs and thus is only applicable to static DIC analysis, it demonstrates the feasibility of DIC on mobile devices. The GPU-based version is then developed using the GPU computing library Metal 2 recently released by Apple. The time cost is reduced to 0.19 s to process the same frame, which is 16 times faster than the CPU-based version, and a frame rate of 5.2 fps is achieved. It is worth noting that, although the implementations of the algorithms have been ported to a mobile device, the integration of the entire pipeline is currently under investigation due to the limited computation capabilities of mobile CPU and GPU. For example, the CPU of an iPhone 5S only contains 2 cores, whereas, as shown in Fig. 5, at least 4 cores are required to implement Variant 2. Recently, Apple has released the iPhone X, which contains a powerful 6-core CPU and can be used to test the portability of our proposed pipelined framework.

6. Conclusion

In order to design a real-time DIC (RT-DIC) system, the speed performances of the multicore CPU-based and GPU-based implementations of the fast Fourier transform-based cross-correlation (FFT-CC) algorithm and the inverse-compositional Gauss-Newton (IC-GN) algorithm employed in the path-independent DIC (PI-DIC) method are first carefully studied. It is found that multicore CPU outperforms GPU in small-scale measurement, while GPU shows its prominence as the measurement scale increases. Based on the different performances and characteristics of CPU and GPU, an accurate and flexible RT-DIC system is designed and realized using a pipelined framework with five Variants of combinations of CPU and GPU, which can fulfill specific requirements in practical applications. The first two, i.e., Variants 1 and 2, already exist in the literature, while the others are newly proposed in this paper. Thus, the contributions of this paper are threefold: it (i) elucidates the respective advantages of CPU and GPU in DIC analysis; (ii) unifies different Variants by a pipelined structure; and (iii) provides various choices of DIC implementations to cater to different application requirements, including measurement scales and computation speeds. Both the accuracy and speed of the entire pipelined framework are then verified by a PC implementation of the RT-DIC system integrating Variants 2–5. Variants 2 and 5 are also implemented on an iPhone 5S to explore the feasibility of a portable RT-DIC system using the same framework on mobile


devices. With the achieved high processing speed, the incremental DIC method can be foreseen to be actively used in dynamic measurement applications.

Acknowledgment

The authors acknowledge the contributions from Cai Lianjiang, Lin Jun and Le Khac Minh Tue in developing the DIC mobile version.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.optlaseng.2018.05.010.

References

[1] Bruck H, McNeill S, Sutton MA, Peters III W. Digital image correlation using Newton-Raphson method of partial differential correction. Exp Mech 1989;29:261–7.
[2] Chen D, Chiang F-P, Tan Y, Don H. Digital speckle-displacement measurement using a complex spectrum method. Appl Opt 1993;32:1839–49.
[3] Sutton MA, McNeill SR, Helm JD, Chao YJ. Advances in two-dimensional and three-dimensional computer vision. In: Photomechanics. Springer; 2000. p. 323–72.
[4] Pan B, Qian K, Xie H, Asundi A. Two-dimensional digital image correlation for in-plane displacement and strain measurement: a review. Meas Sci Technol 2009;20:062001.
[5] Sutton M, Hild F. Recent advances and perspectives in digital image correlation. Exp Mech 2015;55:1–8.
[6] Sutton MA. Computer vision-based, noncontacting deformation measurements in mechanics: a generational transformation. Appl Mech Rev 2013;65:43–61.
[7] Zhou Y, Pan B, Chen YQ. Large deformation measurement using digital image correlation: a fully automated approach. Appl Opt 2012;51:7674–83.
[8] Zhou Y, Chen YQ. Feature matching for automated and reliable initialization in three-dimensional digital image correlation. Opt Lasers Eng 2013;51:213–23.
[9] Wu R, Qian H, Zhang D. Robust full-field measurement considering rotation using digital image correlation. Meas Sci Technol 2016;27:105002.
[10] Zhong F, Quan C. Digital image correlation in polar coordinate robust to a large rotation. Opt Lasers Eng 2017;98:153–8.
[11] Tang Z, Liang J, Xiao Z, Guo C. Large deformation measurement scheme for 3D digital image correlation method. Opt Lasers Eng 2012;50:122–30.
[12] Pan B, Dafang W, Yong X. Incremental calculation for large deformation measurement using reliability-guided digital image correlation. Opt Lasers Eng 2012;50:586–92.
[13] Blaber J, Adair B, Antoniou A. Ncorr: open-source 2D digital image correlation matlab software. Exp Mech 2015;55:1105–22.
[14] Hild F, Bouterf A, Forquin P, Roux S. On the use of digital image correlation for the analysis of the dynamic behavior of materials. In: The micro-world observed by ultra high-speed cameras. Springer; 2018. p. 185–206.
[15] Kirugulige MS, Tippur HV, Denney TS. Measurement of transient deformations using digital image correlation method and high-speed photography: application to dynamic fracture. Appl Opt 2007;46:5083–96.
[16] Tarigopula V, Hopperstad OS, Langseth M, Clausen AH, Hild F, Lademo O-G, Eriksson M. A study of large plastic deformations in dual phase steel using digital image correlation and FE analysis. Exp Mech 2008;48:181–96.
[17] Wu R, Kong C, Li K, Zhang D. Real-time digital image correlation for dynamic strain measurement. Exp Mech 2016;56:833–43.
[18] Li E, Tieu A, Yuen W. Application of digital image correlation technique to dynamic measurement of the velocity field in the deformation zone in cold rolling. Opt Lasers Eng 2003;39:479–88.
[19] Sutton MA, Orteu JJ, Schreier H. Image correlation for shape, motion and deformation measurements: basic concepts, theory and applications. Springer Science & Business Media; 2009.
[20] Pan B, Li K, Tong W. Fast, robust and accurate digital image correlation calculation without redundant computations. Exp Mech 2013;53:1277–89.
[21] Baker S, Matthews I. Equivalence and efficiency of image alignment algorithms. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). IEEE; 2001. p. I-1090–I-1097.
[22] Pan B, Xie H, Wang Z. Equivalence of digital image correlation criteria for pattern matching. Appl Opt 2010;49:5501–9.
[23] Pan B, Tian L. Advanced video extensometer for non-contact, real-time, high-accuracy strain measurement. Opt Express 2016;24:19082–93.
[24] Pan B, Tian L. Superfast robust digital image correlation analysis with parallel computing. Opt Eng 2015;54:034106.
[25] Shao X, Dai X, Chen Z, He X. Real-time 3D digital image correlation method and its application in human pulse monitoring. Appl Opt 2016;55:696–704.
[26] Carr J, Baqersad J, Niezrecki C, Avitabile P, Slattery M. Dynamic stress–strain on turbine blades using digital image correlation techniques part 2: dynamic measurements. In: Topics in experimental dynamics substructuring and wind turbine dynamics, 2. Springer; 2012. p. 221–6.
[27] Gao G, Huang S, Xia K, Li Z. Application of digital image correlation (DIC) in dynamic notched semi-circular bend (NSCB) tests. Exp Mech 2015;55:95–104.
[28] Shao X, Dai X, He X. Noise robustness and parallel computation of the inverse compositional Gauss–Newton algorithm in digital image correlation. Opt Lasers Eng 2015;71:9–19.
[29] Dagum L, Menon R. OpenMP: an industry standard API for shared-memory programming. Comput Sci Eng IEEE 1998;5:46–55.
[30] VIC-2D Data Sheet, Correlated solutions, http://www.correlatedsolutions.com/wpcontent/uploads/2013/10/VIC-2D-Datasheet.pdf.
[31] Jiang Z, Kemao Q, Miao H, Yang J, Tang L. Path-independent digital image correlation with high accuracy, speed and robustness. Opt Lasers Eng 2015;65:93–102.
[32] Zhang L, Wang T, Jiang Z, Kemao Q, Liu Y, Liu Z, Tang L, Dong S. High accuracy digital image correlation powered by GPU-based parallel computing. Opt Lasers Eng 2015;69:7–12.
[33] Wang T, Jiang Z, Kemao Q, Lin F, Soon S. GPU accelerated digital volume correlation. Exp Mech 2016;56:297–309.
[34] Le Besnerais G, Le Sant Y, Lévêque D. Fast and dense 2D and 3D displacement field estimation by a highly parallel image correlation algorithm. Strain 2016;52:286–306.
[35] Ruijters D, Thévenaz P. GPU prefilter for accurate cubic B-spline interpolation. Comput J 2010:bxq086.
[36] Huang J, Zhang L, Jiang Z, Dong S, Chen W, Liu Y, Liu Z, Zhou L, Tang L. Heterogeneous parallel computing accelerated iterative subpixel digital image correlation. Sci China Technol Sci 2018:1–12.
[37] LaVision, StrainMaster, http://www.lavision.de/en/products/strainmaster/strainmaster-dic.php.
[38] Schreier HW, Braasch JR, Sutton MA. Systematic errors in digital image correlation caused by intensity interpolation. Opt Eng 2000;39:2915–21.
[39] Gao W, Kemao Q, Wang H, Lin F, Seah HS. Parallel computing for fringe pattern processing: a multicore CPU approach in MATLAB® environment. Opt Lasers Eng 2009;47:1286–92.
[40] Stam J. What every CUDA programmer should know about OpenGL. GPU Technology Conference, San Jose, CA; 2009.
[41] Shreiner D, The Khronos OpenGL ARB Working Group. OpenGL programming guide: the official guide to learning OpenGL, versions 3.0 and 3.1. Pearson Education; 2009.
[42] Lee J-J, Shinozuka M. Real-time displacement measurement of a flexible bridge using digital image processing techniques. Exp Mech 2006;46:105–14.
[43] Marr D, Binns F, Hill D, Hinton G, Koufaty D. Hyper-threading technology in the netburst® microarchitecture. 14th Hot Chips 2002.
[44] DIC Challenge, SEM, http://www.sem.org/dic-challenge/.
[45] Reu PL, Toussaint E, Jones E, Bruck HA, Iadicola M, Balcaen R, Turner DZ, Siebert T, Lava P, Simonsen M. DIC challenge: developing images and guidelines for evaluating accuracy and resolution of 2D analyses. Exp Mech 2017:1–33.
[46] Pan B, Xie H, Guo Z, Hua T. Full-field strain measurement using a two-dimensional Savitzky-Golay digital differentiator in digital image correlation. Opt Eng 2007;46:033601.
[47] McCracken D. POSIX threads and the Linux kernel. In: Ottawa Linux Symposium; 2002. p. 330.
