Parallel Performance of Typical Algorithms in Remote Sensing-based Mapping on a Multi-Core Computer Jinghui Yang and Jixian Zhang
Abstract
widely available. Hence, parallel computing methods, which Typical algorithms in remote sensing-based mapping, such as not only fully leverage the state-of-the-art hardware platforms geometric correction, image fusion, image mosaic, and autobut also promote processing speed for mapping, are required. matic DEM extractions, are data- and computation-intensive; There are some related works in the field of parallel proprocessing on multi-core computers can improve their perforcessing for remote sensing. Lee et al. (2011) reviewed recent mance. Therefore, parallel computing methods that can fully developments in high performance computing (HPC) for leverage state-of-the-art hardware platforms and that can be easily adapted to these algorithms are required. In this paper, remote sensing. Plaza et al. (2011) reviewed recent developa method with high parallelism is adopted. The method ments in the application of HPC techniques to hyperspectral integrates a recursive procedure with a parallel mechanism imaging problems. Plaza et al. (2006 and 2008) have develthat is capable of concurrently processing multiple blocks on oped several highly innovative parallel algorithms for stanmultiple cores. The parallel experiments of five categories of dard data processing and information extraction of hyperspectypical algorithms on two multi-core computers with Windows tral images in heterogeneous networks of workstations and and Linux operating systems, respectively, were fulfilled. The homogeneous clusters. Luo et al. (2012) presented a parallel experimental results show that although the gains of parallel implementation of N-FINDR (Winter, 1999) (a widely used performance vary for different algorithms, the processing perendmember extraction algorithm) which was run on a cluster formance achieved on multi-core computers is significantly connected by the Gigabit Ethernet. Generally, these methods improved. The best case on a computer with two CPUs is able mainly concentrate to information extraction of hyperspectral images. These experiments were performed on clusters in to perform the DEM extractions up to 13.6 times faster than which are largely different from the counterparts serial execution. According toDelivered these experiments, the factors by Publishing Technology to: components McMaster University in multi-cores computers. influencing parallel performance onIP: a multi-core computer 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 Althoughand there are fewSensing high performance software systems, are discussed. Copyright: American Society for Photogrammetry Remote e.g., PHOTOMOD® (Adrov et al., 2012), PIXEL FACTORYTM (ASTRIUM, 2013), Correlator3DTM (Rotenberg et al., 2013), Introduction and PCI GXL (PCI Geomatics, 2009) in photogrammetric and Typical algorithms in remote sensing-based mapping are remote sensing community; they are based on either clusters data- and computation-intensive, such as image fusion, im(PHOTOMOD® and PIXEL FACTORYTM) or graphic processage mosaic, geometric correction, and image matching. In ing unit (GPU) (Correlator3DTM and PCI GXL). The systems practice, these algorithms usually require not only reading based on clusters mostly apply distributed computing, which and writing a large amount of data but also several sophistionly dispatches multiple independent tasks to different cated computational steps, including interpolating, filtering, computers (or cores) and usually cannot subdivide a task into orthogonal transformation, convolution, geometric transformany smaller tasks. Similar to automatic tiling and stitching mation, solving systems of linear equations, optimization, and methods in eCognition® server software, the parallel method multi-resolution analysis. The large amount of data and the adopted in this paper can split a large image into more blocks immense computational costs of these algorithms are a realisto enable concurrent processing of multiple blocks. The tic concern. Normally, the entire process of serial execution of systems based on GPU mostly need an add-on GPU card. This complicated algorithms is rather time-consuming. paper does not compare these HPC methods used in remote However, the current computing resources are typically sensing field, but aims to provide another insight that the not utilized efficiently for serial algorithms. A serial algorithm commonplace desktop computing platforms can be employed is unaware of the existence of multiple CPU cores, and the to improve processing performance. performance of such an algorithm on a multi-core computer In a narrower field, i.e., the multi-core based parallel will be the same as its performance on a single core computer. computing for remote sensing, several researchers have done The current serial algorithms are not matched to the developsome valuable works in recent years. Christophe et al. (2011) ments of computer hardware in which multi-core CPUs are compared the relative performance of a multithreaded program run on a CPU and the corresponding program running on a GPU. Remon et al. (2011) presented parallel experiments of Jinghui Yang is with the Chinese Academy of Surveying hyperspectral endmember extraction algorithms on multi-core and Mapping (CASM), No. 28 Lianhuachi West Road, Beijing, 100830, P. R. China, and formerly with the School of Resource and Environmental Sciences, Wuhan University, No. 129 Luoyu Road, Wuhan, 430079, P. R. China (
[email protected]). Jixian Zhang is with the Chinese Academy of Surveying and Mapping (CASM), No. 28 Lianhuachi West Road, Beijing, 100830, P. R. China.
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
05-15 May Peer Reviewed.indd 373
Photogrammetric Engineering & Remote Sensing Vol. 81, No. 5, May 2015, pp. 373–385. 0099-1112/15/373–385 © 2015 American Society for Photogrammetry and Remote Sensing doi: 10.14358/PERS.81.5.373
M a y 2015
373
4/15/2015 10:40:38 AM
(Zhang G. et al., 2010; Zhang L. et al., 2011) as an alternative processors. Bernabe et al. (2013) tested the parallel unmixing sensor orientation. In this experiment, geometric correction processing chain on a multi-core system made up of two Intel was applied using the RPCs based model on a RADARSAT-2 Xeon CPUs at 2.53 GHz with 12 physical cores. Yang and Zhang image. Moreover, the original SLC image was transformed to (2012) presented a parallel framework for remotely sensed image fusion and gave experimental results on an ordinary PC the magnitude-phase version. with four cores. These studies show that the parallel computImage Fusion ing on multicore computers can increase the computing perThe block-regression-based (BR) fusion algorithm (Zhang formance for remotely sensed image to some extent. However, J.X. et al., 2010) was selected as an example. The algorithm because of the limited cores used in these above studies and derives a synthetic block as a linear function of blocks of the relatively low scalability of the adopted multithreading multispectral bands that has the maximum correlation with and OpenMP (an API for shared-memory parallel programthe relevant block of the panchromatic band for every block of ming) libraries, the achieved acceleration is about three times. images. The maximum correlation with the block of the panWith the rising number of CPU cores available in a single comchromatic image results in the maximum enhancement of the puter, the parallel experiments with more cores and a more spatial details at the expense of the minimum spectral distorscalable parallel library are needed. With the parallel library, tion derived from the fusion operations. The block-regression such as message passing interface (MPI), the flexible and optifusion algorithm can be described by Equation 1: mal parallel strategies can be easily embodied in the course of remote sensing image processing. xs(Lk ,i , j) Even though users can make use of resources of remote pan(i , j ) − ∑ ck xs(Lk ,i , j ) (1) xs(Hk ,i , j ) = xs(Lk ,i , j ) + L clusters such as Amazon™ computing cluster, remote sensing k ∑ k (ck xs(k ,i , j) ) images are usually stored in local devices and the transmission of a large amount of data is a tedious and time-consumIn Equation 1, xsl(k,i,j) and xsH(k,i,j) are the pixel values before ing task. Multi-core computers which are commonplace and and after fusion, respectively, pan(i, j) is the pixel value of the popular make parallel methods on this type of computer more panchromatic image, k is the band number, (i, j) represents accessible. We define a multi-core computer as a computer the pixel location, and ck is the linear regression coefficient in that contains single or multiple CPUs in one machine with the multiple linear regression of the block region containing each CPU including multiple cores. To the best of the authors’ the pixel located at (i, j). knowledge, there are no extensive parallel computing results available for typical algorithms in remote sensing based mapImage Mosaics ping (i.e., geometric correction, image fusion, image mosaics, An image mosaic combines two or more images with overlapand digital elevation model extractions) using multi-core ping or connected areas for an overview of a large image scene computers. The objective of this research is to quantitatively (Du et al., 2008). The prerequisite of a mosaic is that the images identify what levels the processing performance of these algoare georeferenced or registered. Generally, radiometric normalrithms coupled with an optimizedDelivered parallel mechanism can be by Publishing Technology University ization to: andMcMaster blending processes can be employed for this purpromoted to, and to discuss what factorsIP: determine parallelOn: Sat, 41.189.254.79 23InJan 23:43:15 pose. this2016 paper, the input images are already geo-referenced, performance on a multi-coreCopyright: computer. American Society for Photogrammetry Remote Sensing and the adoptedand method blends pixels from different input This paper applies a straightforward and high parallelized bands into the average value in overlapping areas, whereas in mechanism to investigate parallel performance of five categothe areas that have only one image, the values are just copied. ries of typical algorithms on two multi-core computers with DEM Extractions Windows™ and Linux operating systems, respectively. The We present a modified version of the matching algorithm next section briefly describes the five categories of algorithms, (Sohn et al., 2005). A flowchart of the proposed DEM/DSM followed by a parallel method optimized for a multi-core extraction algorithm is shown in Figure 1. First, the points of computer. Extensive tests of parallel computing on multi-core interest extracted from the left image by the Förstner interest computers and discussions are presented in the Parallel Peroperator (Förstner, 1986) are matched through cross-correlation formance and Analysis section, followed by Conclusions. against points along the piecewise matching line (PML) in the right image, which is determined by image-to-object and The Tested Algorithms object-to-image transformations (Sohn et al., 2005). The maximum and minimum height values were calculated according Preprocessing to the height offset and scale factor in the RPC file. Second, A filtering algorithm and the de-correlation stretch algorithm the average height is determined according to those matched were selected to verify the parallel performance of pre-propoints in a certain area. The value was used to decide the new cessing. A smoothing filter with a 5 × 5 size filter kernel in height range, which is required in the subsequent matching. which each element is 0.04 is used. De-correlation stretch is a Third, the interest and grid points from the left image are again process that is used to stretch the color differences found in a matched against points along the PML in the right image with color image. It is described as a technique based on a printhe new height range. Last, the 3D points are generated using cipal component transformation in which the transformed RPC-based space intersection (Grodecki et al., 2004). The whole variables are contrast-stretched and then retransformed to process reveals that the computational cost is relatively large. the original coordinates in display (Gillespie et al., 1986). Campbell (1996) found that de-correlation stretch essentially produces an alternative linear transformation of the spectral The Method of Parallel Processing bands to that produced by principal component analysis. In In the course of rectification, one can pursue either a direct this paper, we implement the technique following the method or an indirect approach to transform a digital image by an proposed by Campbell. analytical function (Novak, 1992). The indirect method takes each pixel location of the result (e.g., orthophoto), determines Geometric Correction its position in the original image by certain transformation, The rational polynomial coefficients (RPCs) based model has and interpolates the gray value by a given resampling method been extensively used for high-resolution satellite optical im(Figure 2a); On the other hand, the direct method starts agery (Fraser et al., 2006; Tao and Hu, 2001) and SAR imagery from the pixel location in the original image, transforms its
(
374
M a y 2 015
05-15 May Peer Reviewed.indd 374
)
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
4/15/2015 10:40:41 AM
Delivered by Publishing Technology to: McMaster University IP: 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 Copyright: American Society for Photogrammetry and Remote Sensing Figure 1. Flowchart of the proposed image matching algorithm.
(a)
(b)
Figure 2. Transformations of a pixel: (a) indirect, and (b) direct (Novak, 1992). coordinates into the result, and places the gray value to the nearest integer pixel (Figure 2b) (Novak, 1992). Our method is similar to the abovementioned indirect method. After determining the bounds used to save the final results, the method divides the coverage of the bounds into blocks according to spatial domain. Thus, a block is a certain area covered by output bounds. To generate the final results for each block, the idea of indirect method is used. Our method begins each time taking a block instead of a pixel, determines the region in the original image for this block by a recursive manner, reads the data in the region, carries out a sequence of steps, and writes the final results to the block. This method has two improvements above the original
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
05-15 May Peer Reviewed.indd 375
indirect method (Novak, 1992). It can process one block at a time, and multiple steps can be incorporated into the indirect procedure. Therefore, our method is not limited to geometric correction. It can be used in many other algorithms, e.g., image fusion and mosaic. Furthermore, the processing tasks of multiple blocks can be distributed to multiple CPU cores with the help of parallel computing techniques.
Block Partition in Output Bounds Used to Save Final Results
The coverage in output bounds, which is related to either an image or another form (e.g., 3D points used for DEM extraction), is partitioned into blocks. The final results of the algorithm are generated after certain processing steps, and written into
M a y 2015
375
4/15/2015 10:40:46 AM
of the (i, j) pixel but also the global statistical information or neighboring pixels of location (i, j) are used for location (i, j) in certain cases; (i, j) represents the pixel location. In general, three cases exist in which the necessary data are required to enable the entire processing of a block (Nicolescu and Jonker, 2002). The case in which only the corresponding pixels are used and the case in which both the corresponding and neighboring pixels are used are shown in Figure 5a and Figure 5b, respectively. For instance, the former includes geometric transformation, and the latter includes interpolating, convolution, etc. For the case in which the corresponding pixels and the global statistical information are used, the global information is generated at the beginning of a procedure. The principal component analysis (PCA) projection and the decorrelation stretch are examples where the global information is the mean value, the standard deviation value for each band and covariance matrix, etc.
blocks one by one. The advantages of block partition rest on two points. One is that a large task for a whole image is reduced to many smaller pieces, which are processed by multiple CPU cores simultaneously by parallel computing. The other aspect concerns the temporal file. Because data volume corresponding to a block is not large, the temporal processing results for a block can be saved in the random-access memory (RAM) instead of in a temporal file, although complicated calculations are required to determine the results. Thus, a temporal file is not needed. The strategies of block partition are shown in Figure 3. An entire coverage in output bounds can be divided into strips (Figure 3a), tiles (Figure 3b), or bands (Figure 3c). One of these three strategies can be chosen according to the requirements. In practice, a tile is the form that is most commonly used in data decomposition for parallel computing. In this paper, the tested algorithms adopt different configurations of block partition. De-correlation stretch employs two strategies, i.e., tile (size: 64 × 64) and band based strategies, in its different steps. The others adopt tile based strategy, but the sizes of tiles are different: 64 × 64 for filtering, 128 × 128 for DEM extractions, and 512 × 128 for geometric correction, image fusion, and image mosaics. The configurations of block partition in this paper are based on the characteristics of each algorithm, block size of file used to store final results, and parallel performance achieved by us. These configurations are relatively optimal selections. The rationale of choosing partition strategy is worth discussing, but is out of the scope of this paper.
The Parallel Processing Mechanism
Parallel processing is the concept of using multiple computers or CPU cores to reduce the time needed to solve a heavy computational problem, operating on the principle that large problems can often be divided into smaller ones and then solved concurrently (Han et al., 2009). In the course of parallelization of a procedure, high parallelism is a primary goal, which is usually evaluated by the index, i.e., speedup. Because the final results of these tested algorithms can be generated and saved in the form of blocks, the parallel computing technique makes it possible for multiple blocks to be processed concurrently on multiple CPU cores. A Recursive Procedure to Generate Final Results In this paper, the message passing interface (MPI) is used as If several steps are required to generate the final results, the recursive procedure shown in Figure 4 for fetching and the parallel environment. The MPI is a message-passing appliprocessing data will be used, where each step is the same as cation-programmer interface packaged together with protocol described in the following cases. The geometric correction, and semantic specifications for how its features must behave Publishing Technology to: McMaster University which includes two steps, namelyDelivered geometricby transformation in any implementation (Gropp et al., 1996). The MPI features IP: 41.189.254.79 23performance, Jan 2016 23:43:15 and interpolating, is an example. The strategy is similar to On: Sat, high scalability, and portability and is the most Copyright: American Society for Photogrammetry that described in the literature (Christophe et al., 2011). common libraryand usedRemote in high Sensing performance computing. According to the characteristics of typical algorithms in In a multi-core computer, one CPU core (or virtual core usremote sensing based mapping, not only the intensity value ing hyper-threading technology widely used by an Intel® CPU)
(a)
(b)
(c)
Figure 3. Three types of strategies of block partition: (a) Strip-based decomposition, (b) Tile-based decomposition, and (c) Band-based decomposition.
Figure 4. The recursive procedure for fetching and exporting.
376
M a y 2 015
05-15 May Peer Reviewed.indd 376
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
4/15/2015 10:40:49 AM
(a)
(b)
Figure 5. Data fetching for a block: (a) Fetching the corresponding pixels, and (b) Fetching the corresponding and neighboring pixels.
Delivered by Publishing Technology to: McMaster University IP: 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 Copyright: American Society for Photogrammetry and Remote Sensing
Figure 6. The parallel processing mechanism. Input files whose numbers are different for different algorithms, i.e., (a) one, (b) two, or (c) more than two are processed to yield (d) 3D points, or (e) image.
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
05-15 May Peer Reviewed.indd 377
M a y 2015
377
4/15/2015 10:40:50 AM
corresponds to one processor in the MPI environment. ProcesLinux operating systems, respectively. Several datasets are sors are identified by non-negative integer, ranks. Those proselected to demonstrate the performance of the implemented cessors are divided into two types, i.e., master and slave(s), algorithms. Parallel performance using different numbers of in the adopted parallel mechanism. One processor commonly processors set by us is given in The Performance of Paralacts as a master and the others as slaves. Assuming there are k lel Processing Section. Based on the achieved performance, processors, the rank of master is 0 and the rank range of slaves factors determining parallel performance are discussed in the is 1 − (k − 1). The processing tasks of blocks into which the Factors Determining Parallel Performance Section. output bounds are partitioned are also labeled as numbers. Datasets The numbering order is row-first. There are (n × m) processing tasks whose numbers range from 0 to (n × m − 1) if a row and Preprocessing a column are decomposed into m and n blocks, respectively. The tested dataset for filtering is a pan-sharpened SPOT5 imThese processing tasks are only distributed to and processed age of which the image size is 28,820 × 28,155 pixels and the by slaves. In the course of parallel processing, task i is asfile size is 3.02 GB. The image has four bands. The file size of signed to processor (i%(k − 1) + 1). In other words, given tile the filtered image is the same as that of the input. A hypersize and bounds of output, the rank of slave determines which spectral image is used in the experiment for de-correlation tasks are dispatched to the slave. As a result, the processing stretch. The dataset is described as follows: (a) Sensor: OMIS, tasks of blocks are equally distributed to the slaves. (b) Image size: 512 × 4,000 pixels, (c) Bands: 128, and (d) File The procedure and interactions between the master and size: 500 MB. A display of the images before and after de-corslaves is shown in Figure 6. One slave corresponds to one relation stretch is shown in Plate 1. block at a time. After one block task is completed by a slave, Geometric Correction the next block corresponding to the slave will be processed The input dataset is as follows: (a) Radarsat-2 strip model imby the slave to continue the procedure. The slave processors age, (b) SLC image with two channels, (c) Image size: 9,712 × execute the recursive procedure to generate final results for 11,349, (d) Data type: signed 16-bit, and (e) File size: 422 MB. each block, and then send the results to the master. The recurThe corrected image is a 556 MB-sized file with two channels sive procedure consists of several operations of fetching and corresponding to the magnitude and phase value, respecexporting, reading the necessary data from the input file(s), tively. The images before and after geometric correction are and a series of steps. The master receives the results sent by shown in Figure 7. multiple slave processors and writes the results to the output file until all blocks in the result image are finished. In these Image Fusion parallel tests, the numbers of input files used in different algo- To show the fusion effect, the high-resolution QuickBird rithms are not the same. Filtering, de-correlation stretch, and images in a scene are fused by the BR algorithm. Information geometric correction only need one input file (Figure 6a). The about the images is shown in Table 1. The two segments of number of input files for image fusion and DEM extractions is the panchromatic image with 0.6-meter resolution, the multwo (Figure 6b), while the numberDelivered for image by mosaics can beTechnology Publishing to:image McMaster Universityresolution; the fusion results tispectral with 2.4-meter more than two (Figure 6c). In general, there is one output file IP: 41.189.254.79 On: Sat, Jan in 2016 23:43:15 are 23 shown Plate 2. for these algorithms, whose Copyright: form is either 3D pointsSociety (Figurefor Photogrammetry and Remote Sensing American 6d) or image (Figure 6e). Image Mosaics This parallel strategy is similar to those used in the litIn this experiment, nine TM images from different times erature (Nicolescu and Jonker, 2002; Plaza et al., 2006); the around Beijing and Tianjin, China were selected as input only difference is that parallelism is applied to computation images. The file sizes of each image, which contain only 3 as well as file reading in this paper, while the parallelism in bands, range from 177 MB to 202 MB. The result is shown in the literature (Nicolescu and Jonker, 2002; Plaza et al., 2006) Plate 3, where the file size is 1.11 GB, and the image size is is only applied to computing. The idea behind the adopted 20,678 × 19,151 pixels. mechanism is driven by high parallelism which is achieved DEM Extractions using two types of effort. First, many slaves read and process A 2.5-meter Cartosat-1 dataset provided by the program blocks simultaneously. Therefore, the consumed processing “Benchmarking and quality analysis of DEM generated from times of each slave are able to overlap one another. Second, high and very high resolution optical stereo satellite data (Rebecause of simultaneous execution of master and slaves, inartz et al., 2010)” organized by the ISPRS Working Group writing by master and computing by slaves are concurrent, I/4 (2008 to 2012) is used in this experiment. The selected which is shown in Figure 6. As a result, the computational region is an area in Catalonia, Spain that includes medium time overlaps the disk I/O time such that the entire runtime undulated terrain as well as steep mountainous terrain. The decreases. In short, the mechanism is capable of concurrently stereo cover in the dataset is approximately an image size of processing multiple blocks. The improvements on processing 2,000 pixels × 2,000 pixels. The 3D point clouds automaticalperformance are demonstrated in the next Section. ly extracted by the tested algorithm are displayed in Plate 4. Another advantage is that memory consumption is small in the adopted parallel mechanism. The parallel procedure The Performance of Parallel Processing only reads blocks, whose number is equal to the number of One workstation on which Windows-7™ operating system slaves, into the memory and writes a block into the result file (OS) is installed is used in the experiments, and the configuat a time. Compared to the situation where the entire image rations of the computer are shown in Table 2. The other, a is imported or exported with one step, the consumption of customized workstation with Linux OS, was also used in the memory tremendously decreases in our parallel mechanism. experiments. The configurations of this computer are shown Sometimes, because of the large file, it is impossible to load or in Table 3. The workstation has lower capabilities than the write a whole image by one step. above Windows OS computer. The index used to evaluate the performance of parallel computing, i.e., speedup S(p), is used in the following experiParallel Performance and Analysis ments and subsequent analysis (Wilkinson and Allen, 1999). Each tested algorithm coupled with the above parallel mechaThe index demonstrates the increase in execution speed of nism is implemented using C++ programming language, the parallel vs. serial algorithms by Equation 2: and runs on two multi-core computers with Windows™ and
378
M a y 2 015
05-15 May Peer Reviewed.indd 378
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
4/15/2015 10:40:51 AM
(a)
(b)
(a)
(b)
Delivered by Publishing Technology to: McMaster University omis23:43:15 image, and (b) The image after de-correlation. Plate 1. Display of images before and after de-correlation stretch: (a) The IP: 41.189.254.79 On: Sat, 23 original Jan 2016 Copyright: American Society for Photogrammetry and Remote Sensing
Figure 7. Images before and after rpc-based geometric correction: (a) The original image, 9,712 × 11,349 pixels, display of real channel in slc, and (b) The image after correction, 12,913 × 11,239 pixels, display of magnitude channel. Table 1. A Scene of QuickBird Panchromatic and Multispectral Images Image type
File size
Image size
Panchromatic
346 MB
Multispectral
86.7 MB
Fusion results
1.35 GB
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
05-15 May Peer Reviewed.indd 379
Image File Format
Data type
Bands
24784 × 14532
ERDAS IMAGINE
Unsigned 8-bit
1
6196 × 3633
ERDAS IMAGINE
Unsigned 8-bit
4
24784 × 14532
ERDAS IMAGINE
Unsigned 8-bit
4
M a y 2015
379
4/15/2015 10:41:00 AM
Delivered by Publishing Technology to: McMaster University IP: 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 Copyright: American Society for Photogrammetry and Remote Sensing
Plate 2. Fusion results by block-regression-based algorithm. Fusion results and the multispectral images are displayed in the form of false infrared (ir) color composites of 4, 3, and 2 bands: (a) and (b) Two segments of the QuickBird panchromatic image; (c) and (d) Two segments of the QuickBird multispectral image; and (e) and (f) Two segments of the fusion result of the BR algorithm. Table 2. Configurations of the Workstation Computer with Windows-7 OS Dell Precision T7500 workstation CPU: Two Intel Xeon X5675 CPUs, 3.06 GHz, 6 cores per CPU, 24 virtual CPU cores using hyper-threading technology RAM: 48 GB Disk and file system: A Solid-State Drive (SSD) with 120 GB, NTFS file system OS: Windows-7 Professional, Service Pack 1, 64-bit Compiler: Microsoft Visual C++ 10.0 MPI: Intel MPI 4.0 Update 3, IA32
Table 3. Configurations of the Workstation Computer with Linux OS A customized workstation CPU: Two Intel Xeon E5520 CPU, 2.26 GHz, 4 cores per CPU, 16 virtual CPU cores using hyper-threading technology RAM: 12 GB Disk and file system: Areca 1222 RAID card, raid 0, 6 disks, Strip Size: 4 KB, SAS Disk: Seagate Savvio (ST9300603SS), 300G, 10K, ext3 file system OS: Redhat Enterprise Linux Server, Release 5.4, X86_64 Compiler: GNU GCC 4.1.2 MPI: LAM-MPI 7.1.1
380
M a y 2 015
05-15 May Peer Reviewed.indd 380
S ( p) =
tserial t parallel
(2)
where, tparallel is the entire time needed to finish the full procedure by a parallel algorithm using p processors, and tserial is the corresponding time required by serial processing. In these experiments, serial processing only using one processor is a controlled case where block partition and the recursive procedure are embodied, but parallel mechanism is exempted. The single processor executes the recursive procedure (as previously described) to generate final results for each block, and then writes the results to the output file instead of sending them to another processor, block-by-block. The setting of block partition and the recursive procedure in serial processing are the same as parallel processing. The execution times used in Equation 2 represent the entire time required to finish the full procedure of an algorithm, including file reading, writing time, and computing time. The experiments using different numbers of processors, which can be specified by user but cannot be more than the total number of cores (or virtual cores) available in the employed computer, are carried out. For both computers shown in Table 2 and Table 3, even numbers under the total number are used, which are as indicated in Table 4 and Table 5. The
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
4/15/2015 10:41:04 AM
Delivered by Publishing Technology to: McMaster University IP: 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 Copyright: American Society for Photogrammetry and Remote Sensing Plate 3. The result after mosaics of nine tm images.
Plate 4. Result of the extracted point clouds. PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
05-15 May Peer Reviewed.indd 381
M a y 2015
381
4/15/2015 10:41:15 AM
Table 4. Execution Times (Including Disk I/O and Computation, in Seconds) of Different Numbers of Processors on the Windows Computer and Speedup Number of Processors Filtering De-correlation RPC-based correction BR fusion Mosaics DEM extractions
Times Speedup Times Speedup Times Speedup Times Speedup Times Speedup Times Speedup
1
2
4
6
8
10
12
14
16
18
20
22
24
317 1.0 277 1.0 48 1.0 683 1.0 228 1.0 5434 1.0
274 1.2 328 0.8 42 1.1 660 1.0 210 1.1 5410 1.0
113 2.8 118 2.3 18 2.7 419 1.6 78 2.9 1843 2.9
78 4.1 75 3.7 13 3.7 191 3.6 51 4.5 1153 4.7
65 4.9 56 4.9 11 4.4 119 5.7 41 5.6 822 6.6
59 5.4 46 6.0 11 4.4 136 5.0 35 6.5 637 8.5
61 5.2 42 6.6 10 4.8 116 5.9 31 7.4 519 10.5
68 4.7 56 4.9 11 4.4 173 3.9 40 5.7 657 8.3
80 4.0 53 5.2 12 4.0 153 4.5 37 6.2 614 8.9
81 3.9 49 5.7 12 4.0 119 5.7 35 6.5 527 10.3
83 3.8 45 6.2 11 4.4 126 5.4 32 7.1 475 11.4
83 3.8 43 6.4 12 4.0 118 5.8 31 7.4 440 12.4
93 3.4 39 7.1 12 4.0 112 6.1 31 7.4 401 13.6
Table 5. Execution Times (Including Disk I/O and Computation, in Seconds) of Different Numbers of Processors on the Linux Computer and Speedup Number of Processors
1
2
4
6
8
10
12
14
Times 408 360 134 142 110 95 81 76 Speedup 1.0 1.1 3.0 2.9 3.7 4.3 5.0 5.4 Times 964 882 304 189 197 185 152 137 De-correlation Speedup 1.0 1.1 3.2 5.1 4.9 5.2 6.3 7.0 Times 58 55 20 12 13 11 9 11 RPC-based correction Speedup 1.0 1.1 2.9 4.8 4.5 5.3 6.4 5.3 Times 1608 1558 799 376 359 354 288 280 BR fusion Speedup 1.0 1.0 2.0 4.3 4.5 4.5 5.6 5.8 Times 213 189 68 52 38 54 43 44 Mosaics Speedup 1.0 1.1 3.1 4.1 5.6 3.9 5.0 4.8 Times 14081 14066 4767 4863 3643 2804 2347 2024 DEM extractions Delivered by Publishing Technology to: McMaster University Speedup 1.0 1.0 On:3.0 2.9Jan 2016 3.9 23:43:15 5.0 6.0 7.0 IP: 41.189.254.79 Sat, 23 Filtering
16 67 6.1 122 7.9 8 7.3 241 6.7 47 4.5 1752 8.0
Copyright: American Society for Photogrammetry and Remote Sensing
Figure 8. Speedups of different numbers of processors on the Windows computer. ability of controlling the number of used processors makes it possible for users to select the optimal number which can achieve the maximum speedup. In addition, users can flexibly allocate the computational resources in a computer for a specific parallel procedure by setting the number of processors. The execution times and speedups of the six algorithms on the computers with Windows OS and Linux OS are indicated in Table 4 and Table 5, respectively. In the two tables,
382
M a y 2 015
05-15 May Peer Reviewed.indd 382
Figure 9. Speedups of different numbers of processors on the Linux computer. the shortest times and the maximum speedups are bold. The speedups achieved by different numbers of processors are also shown in Figure 8 and Figure 9, which correspond to the Windows computer and the Linux computer, respectively. These results show that the maximum speedups vary for the different algorithms and computing platforms. The characteristics of parallel performance attained by different numbers of processors are summarized in the following.
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
4/15/2015 10:41:19 AM
operations and the disk I/O operations occur simultaneously. In other words, overlap between the computational time and the disk I/O time results in the greatest decrease of the overall runtime. The strategies described in The Method of Parallel Processing Section are underscored by the fact that the shortest times indicated in Table 4 and Table 5 are closer to the theoretical limit. Thus, the parallel strategy is optimal. Third, the computing platform’s performance also determines the execution times and the effectiveness of parallel computing. The performance of a multi-core computer depends on three factors, i.e., the computing power of one CPU core, the number of cores, and disk I/O performance. Generally speaking, if a specific number of cores which is less than the number achieving the maximum speedup is used, the more powerful a single CPU core, the shorter the execution time. For instance, when four cores are used, the time required for BR fusion on the Windows computer is shorter than that on the Linux computer. The reason is that the power of a single core in the Windows computer is stronger than one in the Linux computer. Thus, a smaller number of powerful cores are equal to a larger number of weak cores in terms of Factors Determining Parallel Performance the execution performance. If the amount of required compuThe differences in speedup values for the different algorithms tation is large enough, for example, in the DEM extractions, the are mainly determined by the following factors. larger number of cores and higher performance of disk I/O are First, the amount of required computation per image size preferable. The speedup is 13.6× for DEM extractions on the is one of the main factors that determined the effectiveness of Windows computer when 24 processors are used. In any case, parallel computing. In other words, under conditions of the high performance of disk I/O means that less time is required same amount of disk input/output (I/O), the larger the amount for reading and writing. of required computation, the higher the achieved maximum In these experiments, once the size of the available RAM speedup. This is straightforward and evidenced by the fact that in a computer is above a threshold, RAM is not a bottleneck the maximum speedup values are different when the amount affecting the parallel performance. Assuming there are k cores of disk I/O is the same. For instance, in these experiments, the in a computer, the threshold of required memory is the size amount of disk I/O for de-correlation is the same as that for the supporting the concurrent processing of (k - 1) blocks. From RPC-based correction, but the maximum speedups are different: the to: Parallel Processing Mechanism Section, we know that Delivered by Publishing Technology McMaster University 7.1× and 7.9× on the computers with Windows OS and Linux memory consumption in the adopted parallel mechanism is IP: 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 OS, respectively, for the de-correlation as opposed to 4.8× small. Usually ordinary personal computer can meet the Copyright: American Society Photogrammetry andthe Remote Sensing and 7.3× on the computers with Windows OS and Linuxfor OS, requirements. Although the available RAM in the employed respectively, for the RPC-based correction. This result occurs computers is relatively large (48 GB and 12 GB, respectively), because the computational cost of de-correlation is larger than which is assembled in the course of purchase, the amount of that of the RPC-based correction in these experiments. Another RAM is not a factor that impacts parallel performance when an example is that the DEM extraction algorithm achieves as high achievable multi-core computer is used. as 13.6× speedup because of its very large required amount of computation. Hence, the conclusion can be drawn that under Comparison with Other Multi-core Parallel Methods conditions of the same amount of disk I/O, the greatest speedup Compared with the experimental results in the literature is determined by the amount of computation required. (Remon et al., 2011), in which no significant improvements Second, the amount of data in these algorithms is relatively is observed for the multi-core version over the optimized large. Because the disk I/O operation is not very efficient due to OSP (orthogonal subspace projection) using one core, and the the limits of the hardware, the reading and writing processes multi-core version for N-FINDR with eight cores achieves a are time-consuming. Thus, the amount of required disk I/O per 3.1× speedup (15.001 sec using one core, 4.879 sec using eight cores). The speedup values in this paper range from 3.7× to image size is also one of the main factors that determine the parallel performance. In this regard, the experiments in this pa- 6.6× if eight processors are used. The results on a multi-core system with 12 physical cores presented by Bernabe et al. per record the entire time required to finish the full procedure (2013) show that there is no significant difference between usof an algorithm, including the time for reading and writing and ing 4 or 12 cores (Figure 11 in this referenced paper), because the time for computing, instead of only the time for computing. the parallel implementation is not optimized to increase its For those cases (i.e., RPC-based correction and mosaics on scalability to a high number of cores. Our method can achieve the Windows computer and mosaics on the Linux computer), high scalability evidenced by the results in Table 4 and Table 5. the maximum speedup is not improved very much, even The higher performance is achieved using two improvethough more processors are used. These experimental results ments. First, the flexible and optimal parallel strategies imply that the large amount of disk I/O is the main determinadapted for remote sensing image processing can be embeding factor that prevents greater improvement of the speedup ded in the method by means of the MPI. However, the parallel for those algorithms. Theoretically, the time required to directly read the source images and to write the resulting imagcomputing in the literature (Bernabe et al., 2013; Remon et es without computation is a hard limit, which is most closely al., 2011), which is built on the OpenMP, basic linear algeapproached when unlimited processors are utilized. Through bra subprograms (BLAS) and linear algebra package (LAPACK) analysis, the best case is that the entire time required to finish libraries supported by the compilers, exploits multi-threaded the complete operation is equal or close to the limit. Because linear algebra subprograms and parallelized loops. Second, the algorithms in this paper are all data-intensive operathe latest computers containing more cores are used in these tions, the best parallel computing strategy applied to achieve experiments. Therefore, the scalability of the adopted parallel high speed and efficiency is that in which the computation mechanism properly matches the newest hardware. For the case of two processors, one processor is used as a master for receiving and writing the resulting data, and the other is used as a slave to read and process the image data. The execution time for two processors is slightly shorter than that for one processor, except for the case of de-correlation on the Windows computer. With an increase in the number of processors, the execution times decrease. When the number of processors is increased from two to eight, the decreases in processing times are significant. For example, in the case of the BR fusion algorithm on the Windows computer, the time for two processors is 660 sec, while the time for eight processors is 119 sec. In this range, parallel computing achieves nearly linear speedup for the selected algorithms. When the number of processors is greater than eight, the decrease in processing time is gradual, or a decrease in processing time does not occur. For instance, for de-correlation on the Linux computer, the time for ten processors is 185 sec, while the time for sixteen processors is 122 sec. The decrease in time is 63 sec. In the cases of the filtering and RPC-based correction on the Windows computer, further decreases in time do not occur.
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
05-15 May Peer Reviewed.indd 383
M a y 2015
383
4/15/2015 10:41:19 AM
Conclusions
ASTRIUM, 2013. PIXEL FACTORY™, The Power of an Industrial
Solution in your Hands, URL: http://www2.astrium-geo.com/ This paper applied a straightforward and highly parallelized files/pmedia/public/r367_9_geo_014-pixelfactory_en_4p mechanism to investigate parallel performance of five categop.pdf (last date accessed: 18 March 2015). ries of typical algorithms in remote sensing based mapping on Bernabe, S., S. Sanchez, A. Plaza, S. Lopez, J.A. Benediktsson, and R. two multi-core computers. According to these experimental Sarmiento, 2013. Hyperspectral unmixing on GPUs and multiresults, conclusions can be drawn that parallel computing on core processors: A comparison, IEEE Journal of Selected Topics in a multi-core computer for these algorithms not only yields Applied Earth Observations and Remote Sensing, 6(3):1386–1398. high speedups but also efficiently leverages the computational Campbell, N., 1996. The decorrelation stretch transformation, resources of the state-of-the-art hardware platforms. The International Journal of Remote Sensing, 17(10):1939–1949. achieved performance of each algorithm on two multi-core Christophe, E., J. Michel, and J. Inglada, 2011. Remote sensing computers both with two CPUs is improved significantly. The processing: From multicore to GPU, IEEE Journal of Selected highest speedups of these algorithms range from 4.8× to 13.6×. Topics in Applied Earth Observations and Remote Sensing, The parallel method adopted in this paper is quite easy to 4(3):643–652. implement and is seamlessly integrated with the five categoDu, Q., N. Raksuntorn, A. Orduyilmaz, and L.M. Bruce, 2008. ries of typical algorithms. The method has a broad applicabilAutomatic registration and mosaicking for airborne multispectral ity. Because these algorithms require reading and writing a image sequences, Photogrammetric Engineering & Remote large amount of data and the component dealing with reading Sensing, 74(2):169 -181. and writing is less efficient than other components in a multiFörstner, W., 1986. A feature based correspondence algorithm for core computer, a good parallel strategy is that the computaimage matching, International Archives of Photogrammetry and Remote Sensing, 26(3):150–166. tion and disk I/O operations occur simultaneously resulting in decrease of overall runtime. The adopted parallel method not Fraser, C., G. Dial, and J. Grodecki, 2006. Sensor orientation via RPCs, ISPRS Journal of Photogrammetry and Remote Sensing, only enables multiple blocks to be processed concurrently on 60(3):182–194. multiple cores but also implements the overlap between the Gillespie, A.R., A.B. Kahle, and R.E. Walker, 1986. Color computational time and the time of reading and writing. Thus, enhancement of highly correlated images - I. Decorrelation the parallel method is optimized for a multi-core computer. and HSI contrast stretches, Remote Sensing of Environment, Parallel performance on a multi-core computer is deter20(3):209–235. mined by these factors: the amount of required computation Grodecki, J., G. Dial, and J. Lutes, 2004. Mathematical model for per image size, the amount of required disk I/O per image 3D feature extraction from multiple satellite images described size, and the performance of the computing platform. The by RPCs, Proceedings of the 2004 ASPRS Annual Conference, larger the amount of required computation, the higher the Denver, Colorado. achieved maximum speedup. Hence, an algorithm requirGropp, W., E. Lusk, N. Doss, and A. Skjellum, 1996. A highing a large amount of computation can be scaled. The large performance, portable implementation of the MPI message amount of required disk I/O per image size is a key factor that passing interface standard, Parallel Computing, 22(6):789–828. prevents improving the speedup to an even higher level for Technology to: McMaster University Delivered by Publishing Han, S.H., J. Heo, H.G. Sohn, and K. Yu, 2009. Parallel processing some algorithms. In order to decrease the of reading and IP:time 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 method for airborne laser scanning data using a PC cluster and a writing, high performance disk should be in high demand Copyright: American Societyin for Photogrammetry and Remote Sensing virtual grid, Sensors, 9(4):2555–2573. a computing platform. The number of CPU cores in the latest Lee, C.A., S.D. Gasster, A. Plaza, C.-I. Chang, and B. Huang, 2011. computers with symmetrical multi-processing (SMP) and hyRecent developments in high performance computing for remote per-threading technology can meet the requirements of most sensing: A review, IEEE Journal of Selected Topics in Applied algorithms in remote sensing based mapping to some extent. Earth Observations and Remote Sensing, 4(3):508–527. The testing results and contributions can guide ordinary Luo, W., B. Zhang, and X. Jia, 2012. New Improvements in parallel users in remote sensing community to promote processing implementation of N-FINDR algorithm, IEEE Transactions on Geoscience and Remote Sensing, 50(10):3648–3659. speed using desktop computing platforms. In the course of parallel processing of typical algorithms in remote sensing Nicolescu, C. and P. Jonker, 2002. A data and task parallel image processing environment, Parallel Computing, 28(7):945–965. based mapping on a multi-core computer, optimal selections are recommended as follows. The number of processors rangNovak, K., 1992. Rectification of digital imagery, Photogrammetric Engineering & Remote Sensing, 58(3):339–344. es from 8 to 12 for computers with high-frequency CPU(s) or from 12 to 16 for computers with low-frequency CPU(s); high PCI Geomatics, 2009. GeoImaging Accelerator Ortho Performance Test Results, A PCI Geomatics White Paper, URL: http://www. performance disk, e.g., solid-state drive disk is in demand as pcigeomatics.com/pdf/GXL_Ortho_Whitepaper.pdf, (last date high as possible; and either Windows 7 OS or Linux OS is accessed: 18 March 2015). appropriate.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 40901229, 863 Program under Grant No. 2011AA120401, and the Basic Research Fund of Chinese Academy of Surveying and Mapping under Grant No. G7771413. Special thanks are given to the data providers for the provision of the stereo data sets, especially Euromap for the Cartosat-1 data. The authors thank the anonymous reviewers for their valuable comments.
References
Adrov, V.N., M.A. Drakin, and A.Y. Sechin, 2012. High Performance photogrammetric processing on computer clusters, International Archieves for Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXIX-B4:109–112.
384
M a y 2 015
05-15 May Peer Reviewed.indd 384
Plaza, A., D. Valencia, J. Plaza, and P. Martinez, 2006. Commodity cluster-based parallel processing of hyperspectral imagery, Journal of Parallel and Distributed Computing, 66(3):345–358. Plaza, A., Q. Du, Y.-L. Chang, and R.L. King, 2011. High performance computing for hyperspectral remote sensing, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 4(3):528–544. Plaza, A.J., 2008. Parallel techniques for information extraction from hyperspectral imagery using heterogeneous networks of workstations, Journal of Parallel and Distributed Computing, 68(1):93–111. Reinartz, P., P. D’Angelo, T. Krauß, D. Poli, K. Jacobsen, and G. Buyuksalih, 2010. Benchmarking and quality analysis of DEM generated from high and very high resolution optical stereo satellite data, Proceedings of the 2010 Canadian Geomatics Conference and Symposium of Commission I, ISPRS Convergence in Geomatics - Shaping Canada’s Competitive Landscape, Calgary, Alberta, Canada.
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
4/15/2015 10:41:19 AM
Remon, A., S. Sanchez, A. Paz, E.S. Quintana-Orti, and A. Plaza, 2011. Real-time endmember extraction on multicore processors, IEEE Geoscience and Remote Sensing Letters, 8(5):924–928. Rotenberg, K., L. Simard, and P. Simard, 2013. Dense DSM generation using the GPU, Proceedings of Photogrammetric Week 2013 (D. Fritsch, editor), Stuttgart, Germany, pp. 285–295. Sohn, H.G., C.H. Park, and H. Chang, 2005. Rational function model-based image matching for digital elevation models, The Photogrammetric Record, 20(112):366–383. Tao, C.V., and Y. Hu, 2001. A comprehensive study of the rational function model for photogrammetric processing, Photogrammetric Engineering & Remote Sensing, 67(12):1347–1358. Wilkinson, B., and C.M. Allen, 1999. Parallel Programming, Prentice Hall, New Jersey. Winter, M.E., 1999. N-FINDR: An algorithm for fast autonomous spectral end-member determination in hyperspectral data, SPIE’s Proceedings of the International Symposium on Optical Science, Engineering, and Instrumentation (M.R. Descour and S.S. Shen, editors), International Society for Optics and Photonics, pp. 266–275.
Yang, J.H., and J.X. Zhang, 2012. A parallel implementation framework for remotely sensed image fusion, ISPRS Annuals for Photogrammmetry, Remote Sensing, and Spatial Information Sciences, Copernicus Publications, pp. 329–334. Zhang, G., W. Fei, Z. Li, X. Zhu, and D.R. Li, 2010. Evaluation of the RPC model for spaceborne SAR imagery, Photogrammetric Engineering & Remote Sensing, 76(6):727–733. Zhang, J.X., J.H. Yang, Z. Zhao, H.T. Li, and Y.H. Zhang, 2010. Block-regression based fusion of optical and SAR imagery for feature enhancement, International Journal of Remote Sensing, 31(9):2325–2345. Zhang, L., X. He, T. Balz, X. Wei, and M. Liao, 2011. Rational function modeling for spaceborne SAR datasets, ISPRS journal of Photogrammetry and Remote Sensing, 66(1):133–145.
(Received 13 March 2014; accepted 06 November 2014; final version 17 December 2014)
Delivered by Publishing Technology to: McMaster University IP: 41.189.254.79 On: Sat, 23 Jan 2016 23:43:15 Copyright: American Society for Photogrammetry and Remote Sensing
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
05-15 May Peer Reviewed.indd 385
M a y 2015
385
4/15/2015 10:41:19 AM