
To appear in: Signal Processing: Image Communication
DOI: https://doi.org/10.1016/j.image.2018.10.002
Received 14 February 2017; revised 14 April 2018; accepted 5 October 2018

Fast Inter-frame Prediction in Multi-View Video Coding Based on Perceptual Distortion Threshold Model

Gangyi Jiang 1, Baozhen Du 1,2, Shuqing Fang 1, Mei Yu 1, Feng Shao 1, Zongju Peng 1, Fen Chen 1

1 Faculty of Information Science and Engineering, Ningbo University, Ningbo, China.
2 Electronic & Information College, Ningbo Polytechnic, Ningbo, China.

Abstract: Multi-view video coding (MVC) utilizes a hierarchical B picture prediction structure and adopts many coding techniques to remove spatiotemporal and inter-view redundancies at the cost of high computational complexity. In this paper, a novel perceptual distortion threshold model (PDTM) is proposed to reveal the relationship between the mode selection of inter-frame prediction and the coding distortion threshold. Based on the proposed PDTM, a new fast inter-frame prediction algorithm for MVC is developed, aimed at minimizing the computational complexity of dependent view coding. The fast MVC algorithm is then incorporated into the multi-view High Efficiency Video Coding (MV-HEVC) software to improve MVC coding efficiency. In practical coding, the mode selection for inter-frame prediction of dependent views may be terminated early based on the thresholds derived from the PDTM, thereby reducing the coding time complexity. Experimental results demonstrate that the proposed algorithm can reduce the computational complexity of the dependent views by 52.9% as compared with the HTM14.1 algorithm under the coding structure of hierarchical B pictures. Moreover, the bitrate is increased by 0.9% under the same subjective quality and only increased by 1.0% under the same objective quality (peak signal-to-noise ratio, PSNR). Compared with the state-of-the-art fast algorithm, the proposed algorithm can save more coding time, while the bitrate under the same PSNR increases slightly.

Keywords: Multi-view video coding; Perceptual distortion threshold model; Binocular just noticeable difference; Fast mode decision

1. Introduction

With the rapid advances in multimedia and network technologies, the traditional two-dimensional (2D) videos have

become unable to meet people's visual requirements. Compared with 2D videos, three-dimensional (3D) videos can better represent the genuine and natural visual experience [1-3]. Hence, the Video Coding Experts Group of ITU-T and the Moving Picture Experts Group of ISO/IEC jointly set up the JCT-VC to establish the high efficiency video coding (HEVC) standard [4]. Moreover, its 3D extensions were further implemented, and thus a new generation of 3D video coding standards has been established, namely 3D-HEVC and MV-HEVC [5]. However, in addition to the underlying techniques including the single-view coding unit and quad-tree coding structure [6], some techniques, such as inter-view prediction with disparity estimation (DE), are used in 3D video coding. Moreover, multi-view color video, which is one of the most common 3D video formats for scene representation, leads to a huge amount of data. Thus, the resulting computational complexity of multi-view video coding (MVC) is markedly increased. Therefore, how to reduce the computational complexity of MVC without affecting its rate-distortion (RD) performance has become a hot research issue in this field.

Recently, a number of fast MVC algorithms have been proposed. By early predicting the Merge modes with the inter-view relationship, Zhang et al. [7] reduced the codec complexity by suitably terminating the selection of other prediction modes under the 3D-HEVC and MV-HEVC standards. Silva et al. [8] extracted edge information of the prediction units (PUs) in the color image and accelerated the intra-frame rough mode decision for multi-view color and depth video using edge directions. Although the method could achieve relatively good coding performance, it did not take into account the highly complicated RD optimization during intra-frame coding, thus leading to limited complexity reduction. Zhang et al. [9] proposed a fast INTRA coding unit (CU) depth decision method based on statistical modeling and correlation

analyses, in which the most probable depth range was predicted based on the spatial correlation among CUs. Moreover, they presented a statistical model based CU decision approach in which adaptive early termination thresholds were determined and updated based on the RD cost (RDC) distribution, video content, and quantization parameter (QP), since the spatial correlation might fail for image boundaries and transitional areas between textural and smooth areas. By analyzing the inter-view correlation and hierarchical depth correlation of coding modes, Song et al. [10] proposed an early Merge mode decision method for complexity reduction of dependent views to terminate the selection of other unnecessary prediction modes. However, these methods could not sufficiently reduce the coding complexity since they did not fully consider the spatiotemporal redundancy in MVC. Utilizing the inter-view correlations and the particularity of the quad-tree structure, Chi et al. [11] proposed a fast CU depth decision method and directly converted the CU depth of independent views into that of dependent views based on the disparity; however, this method leads to poor RD performance. Zhang et al. [12] employed inter-view scene similarity, the temporal correlations of video and the spatial correlations in flat regions to accelerate MVC, thereby further lowering the coding complexity. By analyzing the statistical correlation between the discrete cosine transform (DCT) coefficients of the current coding blocks (CBs) and the optimal prediction mode, Pan et al. [13] proposed an early DIRECT mode decision algorithm to reduce the coding complexity to some extent. However, since about 30% of CBs do not adopt the DIRECT mode as the optimal mode during the coding process, the coding complexity reduction resulting from this algorithm is limited. Tohidypour et al. [14] proposed a fast adaptive mode prediction scheme based on a Bayes classifier, and predicted the prediction mode of the current CU among the dependent views using the coding information of the encoded neighboring CUs of independent and dependent views, by which the coding complexity was further reduced. Jung et al. [15] proposed a fast mode decision method, in which adaptive ordering of modes was adopted, achieving 47.27%-60.73% time reduction on average with a little degradation in encoding efficiency. The above research on fast encoding algorithms [7-15] was mainly characterised by predicting the mode and depth of the current CU based on analyses of spatiotemporal and inter-view relationships. However, these methods rarely explored the human visual perceptual characteristics, and did not fully consider the inter-view and spatiotemporal differences. As a result, the inter-view and spatiotemporal correlations could not be accurately exploited, so better coding RD performance or substantial savings in computational time failed to be achieved.

Since the human eye is the ultimate receptor of the video, the characteristics of the human visual system (HVS) [16] have to be taken into consideration. For example, due to the binocular fusion and suppression effect of the HVS, human vision is unable to perceive some slight distortions in the image. Therefore, how to make full use of the HVS characteristics to directly or indirectly improve video processing without compromising the subjective perceptual quality of 3D images has become an important issue for 3D video coding [17]. In the past decades, some perceptual threshold models for video coding have been established.
Yang et al. [18] proposed a spatial and temporal perception just noticeable distortion (JND) model constructed in the pixel domain. Jia et al. [19] put forward a spatial and temporal perception JND model in the DCT domain. Wei et al. [20] integrated the contrast sensitivity function (CSF) into the JND model, which further represents the sensitivity of the HVS with respect to frequency. However, these JND models were constructed on the 2D plane without sufficiently considering the binocular perception of stereo images. Hence, they were only applicable to encoding in the planar image domain rather than to the encoding of stereo images. As for 3D video applications, Chen et al. [21] established the foveated just-noticeable-distortion (FJND) model by combining the HVS characteristics, wherein the quantization step in the H.264 encoder was varied locally and adaptively, thereby improving the subjective quality of the compressed video at the same bitrate. Silva et al. [22] proposed a just noticeable distortion in depth (JNDD) model for depth video coding. Zhao et al. [23] proposed a binocular just noticeable difference (BJND) model to reflect the differences between the left and right views that can be recognized by the human eye. As demonstrated by their experiments, if the distortion of an image in a stereo image pair was smaller than the BJND, the human eye could not perceive the distortion in the stereo image pair due to the binocular fusion and suppression effect. However, this model did not take into account the disparity between the

left and right views. Based on the consideration of stereo matching, Jung et al. [24] presented the BJND model with disparity information, thus achieving good sharpness enhancements in stereo images. Wang et al. [25] designed visual perceptual experiments using paired comparison methods, and modeled the visible threshold as a linear function of the quantization parameter of the left (dominant) view for asymmetrical video coding.

At present, the main idea of the research on perception-based fast coding algorithms is to establish a statistical correlation between the JND values and the optimal prediction modes. According to the HVS characteristics, Zhang et al. [26] established a statistical correlation between JND and the partition mode of the CU to speed up intra-frame coding to a certain degree. Wu et al. [27] constructed a threshold function according to the spatiotemporal JND model, and obtained the thresholds of various sized CUs to early terminate the recursive CU depth partitioning, thereby speeding up video coding. For MVC, Shang et al. [28] adopted the JND model in mode prediction, and designed thresholds based on JND statistics for early decision-making on motion estimation (ME) and DE to achieve good coding performance. However, the JND models in their work were derived by applying simple weighting to a single-view JND model, without considering true 3D perception. Zhu et al. [29] optimized the mode selection process with respect to multiple reference frames using the BJND model. Furthermore, they analyzed the statistical correlation between multiple reference frames / bidirectional search in MVC and BJND, and identified that when the BJND of the coded macro-block was larger than the threshold, the mode selection process with respect to multiple reference frames could be optimized to reduce the computational complexity of MVC. The above perception-based fast coding algorithms [26-29] mainly focused on establishing a certain threshold criterion using the JND model to speed up the coding, but did not take full advantage of the essential properties of the JND model, which mainly reflects the degree to which the human eye perceives the just noticeable distortion of images and videos; thus, the coding efficiency of these algorithms is rather limited. As distortion directly affects the visual perception quality of image/video, how to guide the CU division and the selection of the PU prediction mode using a perceptual distortion criterion has become an important direction for breakthroughs in fast video coding algorithms.

In this paper, a perceptual distortion threshold model (PDTM) based fast MVC algorithm is proposed. Combining the 3D-Sobel model and the BJND model, the proposed PDTM can be used to estimate the sum of squared errors (SSE) of the current CU in different regions of an image, and to terminate the dependent view's unnecessary inter-frame prediction mode selection by comparing the estimated SSE with a threshold, thereby reducing the coding complexity of the dependent views in MVC. Experimental results show that the proposed algorithm outperforms other representative algorithms. The proposed algorithm is a fast inter-frame prediction method, and if it is combined with other fast intra-frame prediction methods, the computational complexity of MVC could be reduced further.

This paper is organized as follows. Section 2 describes the proposed PDTM based fast inter-frame algorithm in detail.
The experimental results and discussions are given in Section 3. Finally, we conclude this paper in Section 4.

2. Proposed PDTM Based Fast Inter-frame Prediction Algorithm for MVC

The proposed PDTM based fast inter-frame prediction algorithm is designed for dependent view coding on the

MV-HEVC platform. When encoding an image of a dependent view, firstly, the 3D-Sobel operators are used to obtain the edge strength of the different blocks of the image. Then, the image is classified into salient and non-salient regions according to the edge strength of the blocks. Furthermore, the SSE of the current CU is estimated by using the improved distortion quantization (D-Q) model for the salient blocks, whereas for the non-salient blocks, considering the characteristics of the spatiotemporal and inter-view correlations, the SSE of the current CU is estimated by applying a linear weighting to the distortions of the reference CUs. Finally, the BJND model is combined to derive the PDTM. In practical coding, for each CU, the SSE and the PDTM threshold are calculated after encoding with the current prediction mode, and the unnecessary mode selections in inter-frame coding can be terminated in advance, thereby reducing the coding complexity.

2.1 3D-Sobel Operators

Based on 2D-Sobel operators, 3D-Sobel operators [30] add information in the temporal direction, denoted as the t direction, taking into account the effects of image edges and video motion on perceptual quality in the spatial and temporal domains, respectively. Fig. 1 shows the 3D-Sobel operator in the t direction; the operators in the x and y directions can be obtained by rotating the t-direction operator by 90°. As shown in Fig. 1, this step not only utilizes the information of other pixels inside the square region centered on the central pixel within the same frame, but also takes advantage of pixel information inside the regions at the same spatial position in the previous frame and the next frame.

Fig. 1 3D-Sobel operator in the t direction

The 3D-Sobel operators are used to compute the local gradients in three directions, namely x, y, and t. As indicated by Fig. 1, the operator is a 3×3×3 matrix; thus, for a frame to be coded, its two adjacent frames are required to calculate the gradient values. The local gradient value G is defined by

G(i, j) = α·(gx(i, j)² + gy(i, j)²) + β·gt(i, j)²    (1)

where gx, gy and gt denote the gradient values of the pixel at position (i, j) of the image in the x, y and t directions, and are respectively defined in Eqs. (2), (3) and (4). Acur, Apre and Anext denote the current frame, previous frame and next frame, respectively. The symbol "*" in Eqs. (2), (3) and (4) denotes the convolution operation. α and β are adjustment parameters: the larger α is, the more gradient information of the image is considered, whereas the larger β is, the more motion information of the video is considered. Let T be a threshold; when G(i, j) > T, the pixel at (i, j) is regarded as a salient pixel, otherwise as a non-salient pixel. The salient pixels constitute the salient region.

gx(i, j) = [-1 0 1; -3 0 3; -1 0 1] * Apre + [-3 0 3; -6 0 6; -3 0 3] * Acur + [-1 0 1; -3 0 3; -1 0 1] * Anext    (2)

3 6 3 1 3 1  1 3 1     g y (i, j )   0 0 0   A pre   0 0 0   Acur   0 0 0   Anext  3 6 3  1 3 1  1 3 1 

(3))

1 3 1   1 3 1 gt (i, j )  3 6 3  Apre   3 6 3  Anext  1 3 1 1 3 1

(4))
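To make the gradient computation concrete, the following Python sketch (an illustration, not the authors' implementation) builds the three 3×3×3 kernels from Eqs. (2)-(4) and evaluates Eq. (1) with SciPy's n-dimensional convolution. The values of alpha, beta and the threshold T are placeholders, since the paper treats them as adjustable parameters.

```python
import numpy as np
from scipy.ndimage import convolve

# Per-frame planes of the 3x3x3 kernels reconstructed from Eqs. (2)-(4).
SX1 = np.array([[-1, 0, 1], [-3, 0, 3], [-1, 0, 1]], dtype=float)   # previous / next frame, x
SX2 = np.array([[-3, 0, 3], [-6, 0, 6], [-3, 0, 3]], dtype=float)   # current frame, x
SY1, SY2 = SX1.T.copy(), SX2.T.copy()                               # y kernels are the transposes
W = np.array([[1, 3, 1], [3, 6, 3], [1, 3, 1]], dtype=float)        # temporal weighting plane

def sobel3d_saliency(prev, cur, nxt, alpha=1.0, beta=1.0, T=100.0):
    """Gradient map G of Eq. (1) and the salient-pixel mask G > T.

    prev, cur, nxt: three consecutive grayscale frames as 2-D arrays.
    alpha, beta, T: adjustment parameters and threshold (placeholder values).
    """
    vol = np.stack([prev, cur, nxt]).astype(float)               # axis 0 is the t direction
    gx = convolve(vol, np.stack([SX1, SX2, SX1]))[1]              # Eq. (2), centre frame kept
    gy = convolve(vol, np.stack([SY1, SY2, SY1]))[1]              # Eq. (3)
    gt = convolve(vol, np.stack([-W, np.zeros_like(W), W]))[1]    # Eq. (4)
    G = alpha * (gx ** 2 + gy ** 2) + beta * gt ** 2              # Eq. (1)
    return G, G > T
```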

2.2 BJND Model

The BJND model describes the minimum distortion in one view that evokes perceptual differences in a stereo image. Experiments have shown that the BJND model is related to the luminance adaptation and contrast masking characteristics of the HVS [23]. Given the left and right views of a stereo image, the BJND value of the right view is defined as follows [24]

BJNDR(bg(i+d), eh(i+d), Al(i+d)) = AC,limit(bg(i+d), eh(i+d)) · (1 - (Al(i+d) / AC,limit(bg(i+d), eh(i+d)))^λ)^(1/λ)    (5)

where BJNDR denotes the BJND value of the right view image, d is the disparity of the right view relative to the left view, and AC,limit represents the upper limit of the random noise for the binocular perceptual distortion from the right view when the contrast masking effect is considered and the random noise of the left view is zero. bg(i) denotes the average luminance of pixels in region i. The noise of the left view is controlled by λ in the range of 1.0 to 1.5, which is set to 1.25 in [23]. eh(i) represents the edge gradient value obtained by using 5×5 Sobel operators in region i, and Al(i+d) denotes the maximum tolerable random noise in the corresponding region i of the left view image. In [24], specific methods are presented to solve AC,limit, eh(i) and Al(i+d). The larger the BJND value is, the smaller the tolerable binocular perceptual distortion is.
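A minimal sketch of Eq. (5) is given below; the limit AC,limit(bg, eh) and the left-view noise Al are assumed to be available from the procedures of [23, 24], so they enter the function as precomputed inputs.

```python
def bjnd_right(A_C_limit, A_l, lam=1.25):
    """Eq. (5): BJND of the right view at one pixel.

    A_C_limit: AC,limit(bg(i+d), eh(i+d)), the noise-amplitude limit from [23]
               (its dependence on background luminance and edge strength is
               assumed to be precomputed and passed in).
    A_l:       Al(i+d), the noise already tolerated in the left view.
    lam:       the exponent lambda, set to 1.25 in [23].
    """
    ratio = min(max(A_l / A_C_limit, 0.0), 1.0)   # clamp so the base stays in [0, 1]
    return A_C_limit * (1.0 - ratio ** lam) ** (1.0 / lam)
```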

2.3 The Proposed PDTM

MVC utilizes multiple views of color video to jointly represent the real scene based on single-view HEVC. The independent view is coded directly using HEVC. For inter-view frame coding, apart from applying the HEVC technologies, some coding techniques, such as DE, DCP, and inter-view residual prediction, are added while searching the prediction modes for each CU [31]. All these multi-view decision processes require a large amount of RDC calculation, and the coding complexity of the dependent views is greatly increased.

The quantization process in video coding is one of the main factors leading to coding distortion, while the variation of the quantization parameter (QP) directly determines the degree of distortion [32]. Fig. 2 illustrates the statistical correlation between QP and the mean squared error (MSE) of a dependent view for four test sequences, in which the x-coordinate is the QP of the dependent view and the y-coordinate denotes the coding distortion distribution, that is, the MSE between the original and reconstructed images after encoding. It can be seen that the MSE varies when encoding the test sequences with different QPs. The larger the QP is, the wider the MSE distribution range is. However, the majority of MSE values are still concentrated in a certain range. Therefore, if it is possible to build a model to estimate the distortion threshold of the current CB in advance to guide the prediction mode selection of the current CU, unnecessary mode predictions can be reduced, and thereby the coding speed can be improved.

Fig. 2 Statistical correlation maps between QP and MSE for four test sequences

According to the distortion quantization (D-Q) model in video coding [33], the relationship between the quantization step and SSE can be described by

SSE = a · Qstep^b    (6)

where a and b are video-content-dependent coefficients, and Qstep is the quantization step.

Taking the logarithm of both sides and converting the quantization step to QP, Eq. (6) can be represented by

ln(SSE) = ln(a) + b · ln(2) · (QP - 4) / 6    (7)

As indicated by Eq. (7), there exists a linear relationship between ln(SSE) and QP.
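For illustration, the hypothetical helper below fits a and b of Eq. (7) for one block from a few (QP, SSE) samples by simple linear regression; it also returns the correlation coefficient r that is used later to judge whether a block follows the D-Q model.

```python
import numpy as np

def fit_dq_model(qps, sses):
    """Least-squares fit of Eq. (7): ln(SSE) = ln(a) + b * ln(2) * (QP - 4) / 6.

    qps, sses: QPs and measured SSEs of one block over several encodings.
    Returns (a, b, r) where r is the correlation coefficient of the fit.
    """
    x = np.log(2.0) * (np.asarray(qps, dtype=float) - 4.0) / 6.0   # ln(Qstep)
    y = np.log(np.asarray(sses, dtype=float))
    b, ln_a = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return float(np.exp(ln_a)), float(b), float(r)
```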

In the case of intra-frame coding, all the CUs' residuals will be quantized, so there exists a certain correlation between the resulting distortion and the QP; in the case of inter-frame coding, since the SKIP mode does not transmit residuals in practical coding, its resulting distortion and QP may not conform to Eq. (7). To further explore the relationship between SSE and QP, the dependent views are encoded using QPs of 12, 16, 20, 24, 28, 32, 36, 40, and 44, respectively, and the relationship between the SSE of the smallest 8×8 CU block and QP is then studied by statistical analysis. Fig. 3 depicts the relation maps between ln(SSE) and QP. Figs. 3(a) and 3(d) show the original images of two sequences. Figs. 3(b) and 3(e) present the 3D-Sobel gradient images of the two sequences, in which the white regions represent the salient regions (abundant texture, edge regions or movement regions) and the black regions denote the non-salient regions (flat and static regions). Figs. 3(c) and 3(f) plot the relation maps between ln(SSE) and QP in the dependent view images, where the white blocks indicate that there exists a linear relation between ln(SSE) and QP in that region (the correlation coefficient is larger than 0.97) and the black blocks indicate that there is no significant linear relation between ln(SSE) and QP. From Fig. 3, the proportion of blocks showing a linear relationship is high in the salient regions. In the non-salient regions of the image, due to the existence of the SKIP mode, the linear relationship is only satisfied in a small proportion of blocks and the overall proportion is not high. Thus, if the current CU is in the salient region, given the parameters a and b in Eq. (7), the distortion threshold SSE of the current CU can be estimated to further guide the prediction mode decision of the current CU.

Generally, the parameters a and b in Eq. (7) depend on the video content, and thus the corresponding parameter values vary with different video contents. To simplify the model and identify the relationship between a and b, a statistical analysis is conducted on 100,000 samples satisfying Eq. (7). Fig. 4 shows the correlation map between a and b. As shown in Fig. 4, a and b are mainly concentrated in one area, which exhibits a certain statistical pattern. Therefore, one of the key points lies in how to obtain statistical values of a and b.

Fig. 3 Relation maps between ln(SSE) and QP: (a) original Bookarrival image; (b) 3D-Sobel gradient image of (a); (c) relation map between ln(SSE) and QP; (d) original Kendo image; (e) 3D-Sobel gradient image of (d); (f) relation map between ln(SSE) and QP

Fig. 4 The correlation map between a and b

Fig. 5 The correlation maps between SSE and QP: (a) original Bookarrival image; (b) 3D-Sobel gradient image; (c) relation map between SSE and QP; (d) original Kendo image; (e) 3D-Sobel gradient image; (f) relation map between SSE and QP; (g) original Newspaper image; (h) 3D-Sobel gradient image; (i) relation map between SSE and QP

According to the research in [34], the D-Q model is further improved as follows

SSE = a · QP^b · Qstep    (8)

Similarly, the same method was applied to collect statistics for the regions in the video that satisfy this formula. Figs. 5(a), 5(d), and 5(g) show the original images of the Bookarrival, Kendo, and Newspaper sequences, respectively; Figs. 5(b), 5(e) and 5(h) show the 3D-Sobel gradient images of the three sequences. The white regions in Figs. 5(c), 5(f) and 5(i) are blocks that satisfy Eq. (8), and the black regions represent the opposite. As shown in Fig. 5, the proportion of CUs that meet the above formula remains very high in the salient regions of the image. By contrast, in the flat and non-salient regions, due to the presence of the SKIP mode, though a small number of CUs satisfy the relationship, the overall proportion is not high. To further analyze the proportion of CUs in the salient regions that satisfy the correlation in Eq. (8), the edge strength (ES) of the 8×8 CB is computed as follows

ES = Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} x(i, j) / (N × N)    (9)

where (i, j) denotes the position inside the 8×8 block of the CU, and N stands for the size of the smallest CU, namely N = 8. x(i, j) indicates whether the pixel at the current position is an edge pixel or not: if the pixel is an edge pixel, then x(i, j) equals 1; otherwise, x(i, j) is 0.
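A small sketch of Eq. (9) and the salient/non-salient test follows, assuming the per-pixel edge map produced by the 3D-Sobel stage is available as a Boolean 8×8 block:

```python
import numpy as np

def edge_strength(edge_mask, Tes=0.5):
    """Eq. (9): fraction of edge pixels in an 8x8 block, plus the salient test ES > Tes.

    edge_mask: 8x8 boolean array, True where the 3D-Sobel stage marks an edge pixel.
    Tes:       edge-strength threshold (0.5 in the paper's experiments).
    """
    es = float(np.asarray(edge_mask, dtype=float).mean())   # sum of x(i, j) / (N * N)
    return es, es > Tes
```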

Table 1 Proportion of CBs in the salient regions which satisfy Eq. (8)

Sequences      r      Tes: 0.0   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Bookarrival    0.97   0.642  0.692  0.758  0.834  0.892  0.937  0.941  0.949  0.949  0.778
               0.95   0.815  0.851  0.899  0.936  0.966  0.985  0.990  0.988  1.000  1.000
               0.93   0.901  0.924  0.958  0.973  0.990  0.997  1.000  1.000  1.000  1.000
               0.91   0.936  0.955  0.980  0.986  0.994  1.000  1.000  1.000  1.000  1.000
Kendo          0.97   0.643  0.689  0.728  0.752  0.771  0.777  0.776  0.773  0.766  0.711
               0.95   0.818  0.855  0.882  0.897  0.904  0.902  0.890  0.909  0.896  0.880
               0.93   0.896  0.917  0.933  0.946  0.949  0.945  0.944  0.950  0.948  0.937
               0.91   0.933  0.949  0.961  0.969  0.973  0.970  0.967  0.970  0.974  0.972
Newspaper      0.97   0.767  0.804  0.825  0.851  0.873  0.902  0.905  0.887  0.891  0.857
               0.95   0.908  0.926  0.939  0.954  0.962  0.967  0.965  0.956  0.953  0.857
               0.93   0.959  0.968  0.977  0.983  0.988  0.989  0.981  0.972  0.984  0.929
               0.91   0.976  0.982  0.989  0.993  0.995  0.994  0.990  0.984  1.000  1.000
Poznanstreet   0.97   0.695  0.702  0.705  0.698  0.675  0.616  0.568  0.518  0.462  0.434
               0.95   0.842  0.839  0.836  0.819  0.797  0.750  0.714  0.675  0.636  0.614
               0.93   0.901  0.895  0.890  0.873  0.852  0.814  0.784  0.755  0.722  0.702
               0.91   0.933  0.929  0.922  0.909  0.894  0.866  0.845  0.825  0.799  0.787
Average        0.97   0.687  0.722  0.754  0.784  0.803  0.808  0.798  0.782  0.767  0.695
               0.95   0.846  0.868  0.889  0.902  0.901  0.907  0.890  0.882  0.871  0.838
               0.93   0.914  0.926  0.940  0.944  0.945  0.936  0.927  0.919  0.914  0.892
               0.91   0.945  0.954  0.963  0.964  0.964  0.958  0.951  0.945  0.943  0.940

Fig. 6 The correlation between lna and b

In addition, let Tes be the edge strength threshold. For a CB, if ES > Tes, then the current CU belongs to the salient region; otherwise, it belongs to the non-salient region. Table 1 shows the proportion of CBs in the salient regions which meet Eq. (8), where r denotes the correlation coefficient between the actual data and the fitted data obtained by Eq. (8). In this case, 10 values of Tes in the range from 0 to 1 are selected for statistical analysis.

From Table 1, as r decreases, the proportion of the CBs meeting this relationship in the salient regions increases. Moreover, as Tes increases, the proportion of the CBs meeting this relationship in the salient regions first increases gradually and then decreases gradually. Hence, selecting an appropriate threshold will have an obvious influence on the actual coding results. Considering that the confidence coefficient of data fitting is usually required to be 0.95 or greater, and that the higher the proportion of CBs in the salient regions that meet Eq. (8), the better the performance will be, in the experiments r is set to 0.95 and Tes is set to 0.5. Under this condition, 90.7% of CBs in the salient regions satisfy Eq. (8). Therefore, when the current CB is in the salient region, Eq. (8) can be used to estimate the SSE of the current CB. Due to the presence of two parameters in the formula which are dependent on the video content, to simplify the formula and further explore the relationship between a and b, only the correlation between a and b in the CBs with correlation coefficient r above 0.95 is explored. Fig. 6 shows the statistical results of 100,000 samples of a and b, where the x-coordinate and y-coordinate represent lna and b, respectively, and the black line in Fig. 6 denotes the fitted curve.

As indicated by Fig. 6, there is a linear relation between lna and b, with a correlation coefficient R² of 0.9672, the slope and intercept being -0.356 and 1.123, respectively. Thus, a and b may be related as follows:

b = m·lna + n    (10)

where m and n denote the slope and intercept, the values of which are -0.356 and 1.123, respectively.

For salient regions, after the correlation between a and b is obtained, Eq. (8) can be rewritten as follows:

SSEe = a · QP^(m·lna + n) · Qstep    (11)

where SSEe denotes the SSE of the current CU which belongs to the salient region.

Due to the temporal correlation between video frames, the parameter a can be estimated from the QP and SSE of the CU at the corresponding position in the previous frame.

l = (ln SSEe_pre - n·ln QPpre - ln Qstep_pre) / (1 + m·ln QPpre)    (12)

where l stands for lna and is correlated with the coding information of the previous frame. SSEe_pre denotes the SSE of the corresponding position in the previously encoded frame, QPpre is the quantization parameter of the previously encoded frame, and Qstep_pre represents the quantization step of the previously encoded frame.

Thus, the SSEe of the current CU in the salient regions can be rewritten as

SSEe = e^l · QP^(m·l + n) · Qstep    (13)
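The following sketch combines Eqs. (12) and (13): the parameter l = lna is recovered from the co-located CU of the previously coded frame and then reused to predict the SSE of the current salient CU; the fitted constants m = -0.356 and n = 1.123 are those reported for Fig. 6.

```python
import numpy as np

M, N_FIT = -0.356, 1.123   # slope m and intercept n of Eq. (10), fitted in Fig. 6

def qstep(qp):
    """HEVC quantization step Qstep = 2^((QP - 4) / 6)."""
    return 2.0 ** ((qp - 4.0) / 6.0)

def estimate_sse_salient(sse_pre, qp_pre, qp_cur):
    """Eqs. (12)-(13): SSE estimate of a salient CU.

    sse_pre, qp_pre: SSE and QP of the co-located CU in the previously coded frame.
    qp_cur:          QP of the current CU.
    """
    l = (np.log(sse_pre) - N_FIT * np.log(qp_pre) - np.log(qstep(qp_pre))) \
        / (1.0 + M * np.log(qp_pre))                                     # Eq. (12), l = ln(a)
    return float(np.exp(l) * qp_cur ** (M * l + N_FIT) * qstep(qp_cur))  # Eq. (13)
```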

For non-salient regions, the previous statistical experiments have shown that the vast majority of CUs in this type of region do not satisfy the relationship of Eq. (8). Therefore, the above method cannot be used to predict the SSE values of CUs in non-salient regions. Studies have shown that there exists a strong correlation between the views and the spatiotemporal characteristics [35-37]. As multi-view video is taken of the same scene from different angles, the inherent correlations include temporal relationships between temporally successive frames of each view and inter-view relationships between adjacent views. Therefore, the SSE value of a CU in a non-salient region can be estimated by exploiting this spatiotemporal correlation. In this paper, we select four CUs which have strong correlations with the current CU and are most commonly used as reference CUs for inter-frame prediction of the current CU. Fig. 7 shows these four reference CUs, where the three patterns from left to right denote the CU of the independent view, the dependent view, and the previous frame of the dependent view, respectively. The orange block denotes the current CU of the dependent view, and the gray blocks represent the reference units of the current CU.

Fig. 7 Four reference CUs used for estimating the SSE of the current CU in non-salient regions

Therefore, the SSE of the current non-salient CU (SSEn) can be estimated by applying a linear weighting to the distortions of the reference CUs at the corresponding positions in the spatiotemporal domain and across views, and it is defined by

SSEn = Σ_{i=0}^{N} wi · SSEi    (14)

where N = 3, and i ∈ {0, 1, 2, 3} indexes the CU corresponding to the current CU in the independent view, the CU corresponding to the current CU in the previous frame of the dependent view, the CU on the left of the current CU, and the CU above the current CU, respectively. SSEi denotes the SSE of the CU at the corresponding position, and wi is the weight assigned to the i-th CU. The weights sum to 1, and their values are 0.3, 0.3, 0.2 and 0.2, respectively.

After the SSE of the current CU in the salient region or non-salient region is estimated, the SSE derived from the BJND value is combined with it to form the PDTM as follows

JTC = α·SSEn + γ·SSEbjnd,   if ES ≤ Tes
JTC = β·SSEe + γ·SSEbjnd,   if ES > Tes    (15)

where JTC denotes the distortion threshold in the PDTM, α, β, and γ are the adjustment factors, and SSEbjnd denotes the SSE generated by the BJND values in the current CU, defined by

SSEbjnd = Σ_{x=0}^{N-1} Σ_{y=0}^{N-1} BJND(x, y)²    (16)

where (x, y) denotes the position inside the current CU, BJND(x, y) represents the BJND value of the current position, and N is the size of the current CU.
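Putting the pieces together, a compact sketch of the PDTM threshold of Eqs. (14)-(16) might look as follows; the adjustment factors alpha, beta and gamma are placeholders, since the paper does not list their values here.

```python
import numpy as np

def pdtm_threshold(es, sse_e, sse_refs, bjnd_block,
                   weights=(0.3, 0.3, 0.2, 0.2),
                   alpha=1.0, beta=1.0, gamma=1.0, Tes=0.5):
    """Distortion threshold JTC of Eq. (15) for one CU.

    es:         edge strength of the CU (Eq. (9)).
    sse_e:      D-Q based SSE estimate for a salient CU (Eq. (13)).
    sse_refs:   SSEs of the four reference CUs of Fig. 7, combined by Eq. (14).
    bjnd_block: per-pixel BJND values of the CU, squared and summed in Eq. (16).
    alpha, beta, gamma: adjustment factors (placeholder values).
    """
    sse_bjnd = float(np.sum(np.square(bjnd_block)))        # Eq. (16)
    if es > Tes:                                           # salient CU
        return beta * sse_e + gamma * sse_bjnd
    sse_n = float(np.dot(weights, sse_refs))               # Eq. (14)
    return alpha * sse_n + gamma * sse_bjnd
```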

2.4 Proposed PDTM Based Fast Inter-frame Prediction Algorithm in MVC

During the process of selecting the inter-frame prediction modes in the dependent views, the RDC is calculated for each prediction mode, and the optimal prediction mode is finally determined by comparing the RDCs. A reconstructed value is obtained each time a mode is coded. Then, the SSE between the reconstructed value and the original value under the current prediction mode, denoted as SSEmode, is calculated. Whether to terminate the prediction mode selection early is determined by comparing the values of SSEmode and the JTC obtained by the PDTM. If SSEmode ≤ JTC, the search of the other prediction modes can be terminated early, and the already searched prediction modes are added to the candidate mode list. Otherwise, the search of the next prediction mode is executed. To reduce the misjudgement ratio of the prediction mode, the Merge2N×2N and Inter2N×2N prediction modes are exploited and added to the candidate mode list. Finally, the optimal prediction mode of the current CU is determined by comparing the RDCs of all the prediction modes in the candidate mode list. Table 2 gives the pseudo code of the proposed PDTM based fast inter-frame prediction algorithm for encoding the current CU in MV-HEVC.

Table 2 The pseudo code of the proposed PDTM based fast inter-frame prediction algorithm in MV-HEVC

IF (the current CU belongs to a dependent view image) {
    calculate SSEbjnd with the BJND values of the current CU;
    calculate the ES of the current CU;
    IF (ES > Tes) {
        calculate a and b of the current CU according to the encoded previous frame of the dependent view;
        estimate SSEe of the current CU using a and b;
    } ELSE
        estimate SSEn of the current CU;
    calculate JTC of the current CU based on the PDTM;
    FOR (all different prediction modes) {
        calculate SSEmode of the CU in the current prediction mode;
        IF (SSEmode ≤ JTC) {
            terminate the search of other prediction modes; BREAK; }
        IF (the current prediction mode is the last one amongst the inter-frame prediction modes) {
            terminate the search of other prediction modes; BREAK; }
    }
    select the optimal prediction mode amongst Merge2N×2N, Inter2N×2N and the other searched prediction modes;
} ELSE
    perform the MV-HEVC standard algorithm for encoding the current CU;
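For readers who prefer code to pseudocode, a hypothetical Python rendering of the decision loop of Table 2 is given below; `encode` stands for the encoder routine that codes the CU with one prediction mode and reports its RD cost and SSE.

```python
def choose_inter_mode(cu, modes, jtc, encode):
    """Sketch of the early-termination loop of Table 2.

    modes:  candidate inter-frame prediction modes, with Merge2Nx2N and
            Inter2Nx2N placed first so they are always examined.
    jtc:    PDTM threshold of Eq. (15) for this CU.
    encode: callable that codes the CU with one mode and returns (rd_cost, sse).
    """
    tried = []
    for mode in modes:
        rd_cost, sse_mode = encode(cu, mode)
        tried.append((rd_cost, mode))
        if sse_mode <= jtc:       # distortion already below the perceptual threshold
            break                 # skip the remaining prediction modes
    return min(tried, key=lambda t: t[0])[1]   # best RD cost among the modes tried
```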

3. Experimental Results and Discussions

To test the reliability of the proposed algorithm, the HTM14.1 software of MV-HEVC is selected as the benchmark [38]. The hardware configuration is as follows: an Intel(R) Core(TM) i3 CPU with a main frequency of 2.4 GHz, 4 GB of memory, and a 64-bit Windows 7 operating system, with VS2013 as the development tool. A total of 8 standard test sequences [39], Newspaper, Balloons, Kendo, Gtfly, Poznanhall2, Poznanstreet, Shark and Undodancer, are selected, each with three views. In each test sequence, one of the three views is the independent view, while the others are dependent views. The test configuration is as follows: the HBP coding structure, a GOP length of 8, and an I-frame period of 24. The initial QPs of the independent views are 22, 27, 32 and 37, respectively. The 8 standard test sequences are tested under the common test conditions [40]. The time saving rate of the test algorithm compared with the original HTM14.1 algorithm is calculated by

ΔT = (T - TOrg) / TOrg × 100%    (17)

where TOrg represents the coding time of the original HTM14.1 algorithm and T is the coding time of the test algorithm. Table 3 shows the time saving rate of the proposed algorithm compared to the original HTM14.1 algorithm, wherein ΔTV1, ΔTV2 and ΔTVdep indicate the time saving rates of the two dependent views V1 and V2 and the average of the two views, respectively. Table 4 shows the experimental results of the proposed algorithm compared to the original HTM14.1 algorithm using different quality assessment methods. In the quality assessment, BDBRPSNR (%), the percentage change of the bitrate under the same peak signal-to-noise ratio (PSNR), is used as the index to measure the fast algorithm. To more fairly evaluate the performance of the proposed algorithm, BDBRSSIM (%), the percentage change of the bitrate under the same structural similarity, and BDBRGMSM (%), the percentage change of the bitrate under the same gradient magnitude similarity mean (GMSM) [41], are calculated simultaneously. The performance indicators of views V1 and V2 and of the total of three views are shown in Table 4.

Table 3 Time saving ratio of the proposed algorithm for the different dependent views

Test sequences    ΔTV1      ΔTV2      ΔTVdep
Balloons          -54.2%    -54.4%    -54.3%
Kendo             -54.9%    -54.2%    -54.5%
Newspaper         -54.3%    -54.1%    -54.2%
Poznanhall2       -48.4%    -49.5%    -48.9%
Gtfly             -53.6%    -53.7%    -53.7%
Poznanstreet      -50.9%    -51.7%    -51.3%
Undodancer        -53.0%    -53.4%    -53.2%
Shark             -53.1%    -53.6%    -53.4%
Average           -52.8%    -53.1%    -52.9%

Table 4 BDBRPSNR, BDBRSSIM, BDBRGMSM and BDRDMOS of the proposed algorithm

Test sequences    BDBRPSNR                BDBRSSIM                BDBRGMSM                BDRDMOS
                  V1     V2     total    V1     V2     total    V1     V2     total
Balloons          1.4%   0.6%   0.5%     1.5%   0.4%   0.5%     1.5%   0.6%   0.6%     0.8%
Kendo             1.7%   3.1%   1.2%     1.1%   2.5%   0.8%     0.8%   2.8%   0.9%     1.4%
Newspaper         1.3%   1.4%   0.7%     1.5%   1.5%   0.7%     1.4%   1.6%   0.7%     1.8%
Poznanhall2       2.9%   2.2%   1.6%     1.4%   0.8%   0.5%     1.6%   1.4%   0.8%     1.1%
Gtfly             1.3%   1.3%   0.7%     1.1%   0.9%   0.5%     1.2%   1.2%   0.6%     0.2%
Poznanstreet      2.0%   2.2%   1.0%     1.6%   2.0%   0.8%     1.8%   2.2%   1.0%     0.3%
Undodancer        3.0%   4.3%   1.6%     1.4%   2.9%   0.8%     1.8%   3.2%   1.0%     0.9%
Shark             0.4%   0.6%   0.5%     0.1%   0.1%   0.2%     0.2%   0.4%   0.4%     0.8%
Average           1.8%   2.0%   1.0%     1.2%   1.4%   0.6%     1.3%   1.7%   0.7%     0.9%

As shown in Table 3 and Table 4, compared with the original HTM14.1 algorithm, the proposed algorithm saves 52.9% of the coding time on average, and the comprehensive BDBRPSNR of the three views increases only by 1.0%. Besides, the comprehensive BDBRSSIM and BDBRGMSM of the three views increase only by 0.6% and 0.7%, respectively. For the Kendo and Undodancer sequences, the performance of the proposed algorithm is slightly worse due to their relatively intense motion and poor temporal correlation. Certain errors occur when updating the parameters of the current coding frame using the coded information of the previous frame in the salient blocks. Furthermore, some errors also occur when applying the linear weighting using the correlation between the spatiotemporal domain and the views in the non-salient blocks. These errors lead to the slightly lower coding performance in the end. However, this does not affect the stability of the overall rate-distortion performance. By contrast, due to the presence of many salient blocks in the Newspaper sequence, a relatively large degree of distortion can be tolerated, which improves the overall rate-distortion performance and saves more coding time.

Furthermore, a subjective experiment is carried out to more fairly evaluate the performance of the proposed algorithm. The Double Stimulus Continuous Quality Scale (DSCQS) method [42] is used to evaluate the reconstructed video quality. Following the subjective experiment standard [43], twenty-three observers (five of them having previous 3D video subjective scoring experience while the rest are naive) are invited to the subjective perceptual quality assessment. According to the quality of the transmitted video, the observers give the corresponding subjective grades, from which the Mean Opinion Score (MOS) of each reconstructed test video is finally obtained. The subjective scoring criteria are presented in Table 5. The subjective experiment directly reflects the quality of the video coding. We use the Bjontegaard measure to compute the BD-rate based on MOS (BDRDMOS) for evaluating the efficiency of the proposed algorithm. The BDRDMOS of each test video sequence is also listed in Table 4. Under the same subjective quality condition, the bitrate increase ranges from 0.4% to 1.8%. The average BDRDMOS is only 0.9%, which means that the proposed algorithm can save much coding time at little bitrate cost.

Table 5 Subjective scoring criteria

Video quality    Subjective score
Excellent        5
Good             4
Fair             3
Poor             2
Bad              1
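For reference, the Bjontegaard measure used for BDBR and BDRDMOS can be computed as sketched below (a standard formulation, not code from the paper): the logarithmic rate is fitted with a cubic polynomial against the quality score, and the average difference over the common quality interval is converted into a percentage bitrate change.

```python
import numpy as np

def bd_rate(rate_ref, qual_ref, rate_test, qual_test):
    """Bjontegaard delta bitrate (%) between two RD curves.

    rate_*: bitrates (kbps) and qual_*: quality scores (PSNR or MOS) measured
    at the four test QPs; positive output means the test codec needs more bits
    for the same quality, as reported in Tables 4 and 6.
    """
    p_ref = np.polyfit(qual_ref, np.log10(rate_ref), 3)    # cubic fit of log-rate vs quality
    p_test = np.polyfit(qual_test, np.log10(rate_test), 3)
    lo = max(min(qual_ref), min(qual_test))                # common quality interval
    hi = min(max(qual_ref), max(qual_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)        # mean log10-rate difference
    return float((10.0 ** avg_log_diff - 1.0) * 100.0)
```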

Fig. 8 Subjective and objective RD performance comparison: (a) objective RD performance (PSNR vs. bitrate) of the Kendo sequence; (b) objective RD performance (PSNR vs. bitrate) of the Poznanstreet sequence; (c) subjective RD performance (MOS vs. bitrate) of the Kendo sequence; (d) subjective RD performance (MOS vs. bitrate) of the Poznanstreet sequence

Figs. 8(a) and 8(b) show the objective RD performance curves of the HTM14.1 algorithm and the proposed algorithm for the Kendo sequence and the Poznanstreet sequence, while Figs. 8(c) and 8(d) show the subjective RD performance curves of the two algorithms for the same sequences, in which the subjective assessment score, namely the MOS, is used as the quality metric. In addition to the comparison with the original HTM14.1 algorithm, a crosswise comparison was performed with a state-of-the-art fast encoding algorithm, Jung's algorithm (denoted as Jung CSVT [15]). Jung CSVT adopts adaptive ordering of modes to reduce computational complexity. As shown in Table 6, the average RD performance loss of the proposed algorithm (BDBRPSNR = 1.9%) is slightly higher than that of Jung CSVT (BDBRPSNR = -0.1%). The average time saving of the proposed algorithm is -52.9%, which exceeds the -33.0% of Jung CSVT. Therefore, the proposed algorithm can save more coding time than the Jung CSVT algorithm, while its objective RD performance declines slightly.

Table 6 Coding performance of the proposed algorithm compared with Jung CSVT's algorithm

Test sequences    BDBRPSNR                         ΔTVdep
                  Jung CSVT [15]   Proposed        Jung CSVT [15]   Proposed
Balloons          -0.2%            1.0%            -29.3%           -54.3%
Kendo             -0.4%            2.4%            -28.7%           -54.5%
Newspaper          0.0%            1.4%            -34.1%           -54.2%
Poznanhall2       -0.3%            2.6%            -40.0%           -49.0%
Gtfly              0.3%            1.3%            -33.4%           -53.7%
Poznanstreet       0.1%            2.1%            -53.8%           -51.3%
Undodancer        -0.2%            3.7%            -22.2%           -53.2%
Shark             -0.1%            0.5%            -22.5%           -53.4%
Average           -0.1%            1.9%            -33.0%           -52.9%

Figs. 9(a), 9(b), 9(d) and 9(e) show the reconstructed images of view V2 of the Newspaper sequence and the Balloons sequence obtained with the proposed algorithm and the HTM14.1 algorithm when the coding QP is 35. Besides, Figs. 9(c) and 9(f) present the residual images that show the differences between the distortion values obtained by the proposed algorithm and those obtained by the HTM14.1 algorithm, where the white pixels denote that the distortion values obtained by the proposed algorithm are larger than those generated by the HTM14.1 algorithm. As observed in Figs. 9(a), 9(b), 9(d) and 9(e), the differences between the reconstructed images of the proposed algorithm and those obtained by the HTM14.1 algorithm cannot be recognized from the perspective of human visual

perception. Namely, the coding performance of the two algorithms is almost consistent subjectively. However, in Figs. 9(c) and 9(f), there are certain differences between the reconstructed images obtained by the two algorithms. The white pixels are basically located in the salient regions of the image. Namely, in the salient regions, the distortion values obtained by the proposed algorithm remain larger than those obtained by the HTM14.1 algorithm; however, the subjective effects of the two algorithms are comparable. It can be found that the proposed algorithm can save more bitrate and eliminate more perceptual redundancy in the salient blocks. The fact that the edge regions can conceal more distortion is also consistent with the characteristics of human visual perception. Therefore, the proposed algorithm is reliable because it can reduce the coding complexity without compromising the subjective quality.

Fig. 9 Subjective effect of the proposed algorithm and the HTM14.1 algorithm: (a) decoded image with the HTM algorithm; (b) decoded image with the proposed algorithm; (c) difference between the distortions of (a) and (b), enlarged for display; (d) decoded image with the HTM algorithm; (e) decoded image with the proposed algorithm; (f) difference between the distortions of (d) and (e), enlarged for display

Since the original HEVC encoder needs to traverse all possible modes to find the best way to encode the data, the large flexibility in block sizes and prediction modes causes a tremendous increase in the encoding time and energy consumption of HEVC encoders. The proposed algorithm can determine the best prediction mode in time by calculating the perceptible distortion threshold of the current CU before coding. This significantly reduces the number of alternative prediction modes, reduces the calculation scale, and speeds up the iterative computation of finding the optimal mode. Therefore, this algorithm can reduce the scale of memory operations and communication to a greater extent. Meanwhile, much CPU high-speed calculation time is saved. This is beneficial for devices with limited energy budgets, such as embedded systems.

However, this accelerated algorithm comes at the expense of perceptible distortion in the dependent views, because the perceptual distortion threshold terminates some "unnecessary" prediction modes in advance. Therefore, the quality of the dependent views will be slightly reduced, which makes the proposed algorithm not very suitable for single-view video coding. However, in a multi-view display system, the overall visual quality is not determined by the quality of a single view alone and is often affected by the quality of multiple views. Thanks to the masking effect, the view with relatively better quality contributes more to the overall stereoscopic image quality. Since the primary view still adopts the original algorithm and retains high quality, the overall perceived visual quality remains unchanged.

4. Conclusions

To reduce the computational complexity of multi-view video coding, a new fast multi-view video inter-frame prediction mode selection algorithm is proposed based on the perceptual distortion threshold model. Firstly, the dependent view images are divided into salient and non-salient regions. Then, the SSE of the current coding unit in the different regions is estimated by different methods: the SSE in the salient regions is estimated based on the D-Q model, and that in the non-salient regions is estimated by applying a linear weighting to the distortions of the reference CUs of the current CU. Further, the binocular perceptual distortion threshold SSEbjnd of the current CU is calculated based on the BJND model. Finally, the estimated SSE and the binocular perceptual distortion threshold SSEbjnd are combined into the decision threshold of the current CU to early terminate the selection of inter-frame prediction modes of the dependent view. Experimental results demonstrate that, compared with the HTM14.1 algorithm, the proposed algorithm can save 52.9% of the coding time of the dependent views, while the comprehensive BDBRPSNR, BDBRSSIM and BDBRGMSM of the three views increase by only 1.0%, 0.6% and 0.7%, respectively. Compared with the state-of-the-art fast algorithm, the proposed algorithm can save more coding time while the bitrate under the same PSNR increases slightly. Further research may explore the relationship between the quantization parameters and the SSEs in the non-salient regions. Moreover, the BJND model may be employed to guide the in-depth division of CUs in dependent views, thereby further reducing the coding complexity of dependent views. The proposed algorithm is a fast inter-frame prediction method, and if it is combined with other fast intra-frame prediction methods, the computational complexity of MVC could be reduced further.

Acknowledgements

This work was supported by the Natural Science Foundation of China under Grant nos. U1301257, 61671258 and 61620106012, the National High-tech R&D Program of China under Grant no. 2015AA015901, and the Natural Science Foundation of Zhejiang Province under Grant nos. LY15F010005 and LY16F010002. It is also sponsored by the K.C. Wong Magna Fund in Ningbo University, the General Scientific Research Project of the Education Department of Zhejiang Province (Y201636754), and the open funding of the top priority subject (Information and Communication Engineering) of Zhejiang Province (xkxl1418).

References

[1] J. Zhang, Y. Cao, Z. Zha, Z. Zheng. A unified scheme for super-resolution and depth estimation from asymmetric stereoscopic video. IEEE Transactions on Circuits and Systems for Video Technology, 26(3) (2016) 479-493.
[2] B. Allen, T. Hanley, B. Rokers, C. S. Green. Visual 3D motion acuity predicts discomfort in 3D stereoscopic environments. Entertainment Computing, 13 (2016) 1-9.
[3] F. Shao, W. Lin, G. Jiang, M. Yu. Low-complexity depth coding by depth sensitivity aware rate-distortion optimization. IEEE Transactions on Broadcasting, 62(1) (2016) 94-102.
[4] C. Rosewarne, B. Bross, M. Naccari, et al. High efficiency video coding (HEVC) test model 16 (HM 16) improved encoder description update 2. Joint Collaborative Team on Video Coding (JCT-VC) Document: JCTVC-T1002, 2015.
[5] Y. Chen, G. Tech, K. Wegner, et al. Test Model 11 of 3D-HEVC and MV-HEVC. Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) Document: JCT3V-K1003, 2015.
[6] G. Sullivan, J. Ohm, W. Han, et al. Overview of the high efficiency video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12) (2012) 1649-1668.
[7] N. Zhang, D. Zhao, Y. Chen, J. Lin, W. Gao. Fast encoder decision for texture coding in 3D-HEVC. Signal Processing: Image Communication, 29(9) (2014) 951-961.
[8] T. Da Silva, L. Agostini, C. Da Silva. Fast intra prediction algorithm based on texture analysis for 3D-HEVC encoders. Journal of Real-Time Image Processing, 12(2) (2016) 357-368.
[9] Y. Zhang, S. Kwong, G. Zhang, Z. Pan, Y. Hui, G. Jiang. Low complexity HEVC INTRA coding for high-quality mobile video communication. IEEE Transactions on Industrial Informatics, 11(6) (2015) 1492-1504.
[10] Y. Song, K. Jia. Early merge mode decision for texture coding in 3D-HEVC. Journal of Visual Communication and Image Representation, 33 (2015) 60-68.

[11] G. Chi, X. Jin, Q. Dai. A quad-tree and statistics based fast CU depth decision algorithm for 3D-HEVC. IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2014: 1-5.
[12] Q. Zhang, H. Chang, Q. Wu, Y. Gan. Fast motion and disparity estimation for HEVC based 3D video coding. Multidimensional Systems & Signal Processing, 27(3) (2016) 743-761.
[13] Z. Pan, Y. Zhang, J. Lei, et al. Early DIRECT mode decision based on all-zero block and rate distortion cost for multiview video coding. IET Image Processing, 10(1) (2016) 9-15.
[14] H. Tohidypour, M. Pourazad, P. Nasiopoulos. A low complexity mode decision approach for HEVC-based 3D video coding using a Bayesian method. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014: 895-899.
[15] S.-H. Jung, H. W. Park. A fast mode decision method in HEVC using adaptive ordering of modes. IEEE Transactions on Circuits and Systems for Video Technology, 26(10) (2016) 1846-1858.
[16] J. Han, Y. Ma, J. Huang, et al. An infrared small target detecting algorithm based on human visual system. IEEE Geoscience and Remote Sensing Letters, 13(3) (2016) 452-456.
[17] H. Wei, X. Zhou, W. Zhou, et al. Visual saliency based perceptual video coding in HEVC. International Symposium on Circuits and Systems (ISCAS), 2016: 2547-2550.
[18] X. Yang, W. Lin, Z. Lu, et al. Motion-compensated residue preprocessing in video coding based on just-noticeable-distortion profile. IEEE Transactions on Circuits and Systems for Video Technology, 15(6) (2005) 742-752.
[19] Y. Jia, W. Lin, A. A. Kassim. Estimating just noticeable distortion for video. IEEE Transactions on Circuits and Systems for Video Technology, 16(7) (2006) 820-829.
[20] Z. Wei, K. N. Ngan. Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain. IEEE Transactions on Circuits and Systems for Video Technology, 19(3) (2009) 337-346.
[21] Z. Chen, C. Guillemot. Perceptually-friendly H.264/AVC video coding based on foveated just-noticeable-distortion model. IEEE Transactions on Circuits and Systems for Video Technology, 20(6) (2010) 806-819.
[22] D. V. S. X. De Silva, W. A. C. Fernando, G. Nur, et al. 3D video assessment with just noticeable difference in depth evaluation. International Conference on Image Processing, Hong Kong, Sept. 2010, pp. 4013-4016.
[23] Y. Zhao, Z. Chen, C. Zhu, et al. Binocular just-noticeable-difference model for stereoscopic images. IEEE Signal Processing Letters, 18(1) (2011) 19-22.
[24] S. Jung, J. Jeong, S. Ko. Sharpness enhancement of stereo images using binocular just-noticeable difference. IEEE Transactions on Image Processing, 21(3) (2012) 1191-1199.
[25] X. Wang, G. Jiang, J. Zhou, Y. Zhang, F. Shao, Z. Peng, M. Yu. Visibility threshold of compressed stereoscopic image: effects of asymmetrical coding. The Imaging Science Journal, 61 (2013) 172-182.
[26] M. Zhang, H. Bai, M. Liu, et al. Just noticeable difference based fast coding unit partition in HEVC intra coding. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E97(12) (2014) 2680-2683.
[27] W. Wu, B. Song. Just-noticeable-distortion-based fast coding unit size decision algorithm for high efficiency video coding. Electronics Letters, 50(6) (2014) 443-444.
[28] X. Shang, Y. Wang, L. Luo, et al. Fast mode decision for multiview video coding based on just noticeable distortion profile. Circuits, Systems and Signal Processing, 34(1) (2015) 301-320.
[29] Y. Zhu, M. Yu, X. Jin, et al. Fast mode decision algorithm for multiview video coding based on binocular just noticeable difference. Journal of Computers, 9(10) (2014) 2428-2434.
[30] Y. Wang, T. Jiang, S. Ma, W. Gao. Novel spatio-temporal structural information based video quality metric. IEEE Transactions on Circuits and Systems for Video Technology, 22(7) (2012) 989-998.
[31] L. Zhang, G. Tech, K. Wegner, S. Yea. 3D-HEVC test model 5. Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V) Document: JCT3V-E1005, Vienna, July 2013.
[32] T. Huang, H. Chen. Efficient quantization based on rate-distortion optimization for video coding. IEEE Transactions on Circuits and Systems for Video Technology, 26(6) (2016) 1099-1106.
[33] N. Kamaci, Y. Altunbasak, R. M. Mersereau. Frame bit allocation for the H.264/AVC video coder via Cauchy-density-based rate and distortion models. IEEE Transactions on Circuits and Systems for Video Technology, 15(8) (2005) 994-1006.
[34] C. Wu, P. Su. A content-adaptive distortion-quantization model for intra coding in H.264/AVC. International Conference on Computer Communications and Networks (ICCCN), 2011: 1-6.
[35] Y. Zhang, G. Jiang, M. Yu. Adaptive multiview video coding scheme based on spatiotemporal correlation analyses. ETRI Journal, 31(2) (2009) 151-161.
[36] L. Shen, Z. Zhang, Z. Liu. Adaptive inter-mode decision for HEVC jointly utilizing inter-level and spatiotemporal correlations. IEEE Transactions on Circuits and Systems for Video Technology, 24(10) (2014) 1709-1722.
[37] S. Ahn, B. Lee, M. Kim. A novel fast CU encoding scheme based on spatiotemporal encoding parameters for HEVC inter coding. IEEE Transactions on Circuits and Systems for Video Technology, 25(3) (2015) 422-435.
[38] https://hevc.hhi.fraunhofer.de/svn/svn_3DVCSoftware/tags/HTM-14.1.
[39] D. Rusanovskyy, K. Mueller, A. Vetro. Common test conditions of 3DV core experiments. Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V) Document: JCT3V-E1100, Vienna, July 2013.
[40] K. Mueller, A. Vetro. Common test conditions of 3DV core experiments. Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V) Document: JCT3V-G1100, San Jose, 2014.
[41] W. Xue, L. Zhang, X. Mou, A. C. Bovik. Gradient magnitude similarity deviation: a highly efficient perceptual image quality index. IEEE Transactions on Image Processing, 23(2) (2014) 684-695.
[42] Subjective video quality assessment methods for multimedia applications. ITU-T Rec. P.910, 2008.
[43] Methodology for the subjective assessment of the quality of television pictures. ITU-R BT.500-11, 2002.


Highlights

- A fast multi-view video inter-frame prediction mode selection algorithm based on a perceptual distortion threshold model (PDTM) is proposed to speed up the coding of dependent views.
- A modified distortion quantization model is established to estimate the sum of squared errors (SSE) of the current coding unit (CU) in salient regions.
- The SSE of the current CU in non-salient regions is estimated by using spatiotemporal and inter-view correlations.
- The PDTM is derived by combining the estimated SSE with the binocular just noticeable difference.