Parametric and nonparametric residual vector quantization optimizations for ANN search

Dan Guo (a,*), Chuanqing Li (a), Lv Wu (b)

(a) School of Computer and Information, Hefei University of Technology, Hefei, China
(b) School of Information Engineering, Wuhan University of Technology, Wuhan, China

* Corresponding author. E-mail addresses: [email protected] (D. Guo), [email protected] (C. Li), [email protected] (L. Wu).
Article history: Received 25 January 2016; Received in revised form 5 April 2016; Accepted 29 April 2016. Communicated by: Chennai Guest Editor.

Abstract
For approximate nearest neighbor (ANN) search in many vision-based applications, vector quantization (VQ) is an efficient compact encoding technology. A representative VQ approach is product quantization (PQ), which quantizes subspaces separately by a Cartesian product and achieves high accuracy, but its space decomposition still leads to quantization distortion. This paper presents two optimized solutions based on residual vector quantization (RVQ). Different from PQ, RVQ restores the quantization error with multi-stage quantizers instead of decomposing it. To further optimize the codebooks and the space decomposition, we seek a better-discriminated space projection, represented by an orthonormal matrix R. RVQ's nonparametric solution alternately optimizes R and the stage-codebooks by Singular Value Decomposition (SVD) over multiple iterations. RVQ's parametric solution assumes that data are subject to a Gaussian distribution and uses Eigenvalue Allocation to get each stage-matrix {R_l} (1 <= l <= L) at once, where L is the stage number of RVQ. Compared with various optimized PQ-based methods, our methods show a clear advantage in restoring the quantization error.
Keywords: Residual vector quantization; Vector quantization optimization; Parametric optimization; Nonparametric optimization; Stage codebook
1. Introduction

In many vision-based applications, large-scale approximate nearest neighbor (ANN) search is widely used [5,25,28,30], for example in geographic information systems [22], location-based services [13,14,18,33,36] and bioinformatics data classification [20,37,38]. ANN search can efficiently return the most approximate result, i.e., the one with the minimum distance over the whole vector space of data samples [3,4,16,35]. But the computation on high-dimensional vectors is usually very time consuming, and more and more vision-based applications are deployed on mobile devices, where memory usage and fast processing on low-end hardware are two key criteria for ANN search. Therefore, the vector quantization (VQ) technique was proposed [12]. VQ compacts a high-dimensional vector into a few bit codes with a codebook, so the main memory usage is only that of a small codebook. More importantly, with a small memory footprint, VQ still retains a high efficiency of nearest neighbor query [16].

Early VQ techniques index space partitions [23,24,27]. For example, FLANN [23] and E2LSH [8,26] simplify the indexing structure using tree-based structures or binary codes [21]. But they have to
keep the original high-dimensional vectors for final re-ranking. Later, more hash-based methods were proposed to calculate the Hamming distance between short hashed codes [1,15,29-32,35]. However, their accuracy is limited by the small number of possible Hamming distances under a fixed code length. Later on, product quantization (PQ) was proposed, which quantizes subspaces separately by a Cartesian product and achieves high accuracy [16]. PQ achieves a more accurate distance approximation than spectral hashing (SH), and it is computationally efficient and attractive for its precision and rapid query even in exhaustive search over large-scale data. [16] also proposes a non-exhaustive search strategy named IVFADC, which applies an inverted file system to asymmetric distance computation.

To optimize search performance [16], some works use PCA projection to decorrelate the data [6,32,34,35], and other works focus on projecting the data to balance the variance of each component [17]. For example, [17] assumed data components with balanced variances and optimized them by a Householder transformation or a random rotation of the data. In addition, to reduce quantization error, optimized product quantization (OPQ) was proposed, which formulates product quantization with two solutions [9,10]: a non-parametric optimization without any hypothesis on the data distribution and a parametric optimization under a Gaussian distribution assumption. OPQ has been verified to perform better than PQ [16], transform coding (TC) [6] and iterative quantization (ITQ) [11]. Furthermore, LOPQ [19] combines inverted
lists and a multi-index structure with OPQ to achieve non-exhaustive search.

However, all the above PQ-based methods share a defect: their space decomposition still leads to quantization distortion [7]. Therefore, problems remain unaddressed in codebook and quantization distortion optimization. This paper is dedicated to two aspects:

(1) We introduce two quantization distortion properties of the residual vector quantization (RVQ) model. The essence of RVQ is to treat each vector as a whole and to simulate the quantization error by multi-stage quantizers. In other words, RVQ tries to restore the distortion error instead of decomposing it. An L-stage RVQ's quantization distortion decreases monotonically with increasing L, where L is the total stage number of RVQ.

(2) Given an L-stage RVQ, we optimize it with the following nonparametric and parametric solutions:
- The RVQ's nonparametric solution (RVQ-NP) does not make any data distribution assumption. It alternately optimizes an orthonormal matrix R and the stage-codebooks by Singular Value Decomposition (SVD) over multiple iterations.
- The RVQ's parametric solution (RVQ-P) assumes that data are subject to a Gaussian distribution and uses Eigenvalue Allocation to get each stage-matrix {R_l} (1 <= l <= L) at once.
Both R and {R_l} are optimized to project the data into a better-discriminated data space. The paper is organized as follows: we summarize the models of RVQ, RVQ-NP and RVQ-P in Section 2; we then introduce the details of RVQ in Section 3 and the details of RVQ-P and RVQ-NP in Section 4; Section 5 evaluates the search performance; finally, we conclude in Section 6.
2. Our basic models

Quantization distortion is used as an objective function to measure quantization performance [12]. To compare the "optimality" of quantization as in the PQ [16] and OPQ [9,10] models, we write RVQ, RVQ-NP and RVQ-P in the same formulation. The details of the RVQ-based models are presented in Section 3 and Section 4, respectively.

2.1. RVQ's optimality formulation

An L-stage RVQ in this paper has L different quantizers {Q_1, ..., Q_l, ..., Q_L}, where L is the stage number of RVQ. Each stage quantizer has a stage-codebook. The l-th stage-codebook of Q_l is C_l = {c_l(i)}, where c_l(i) is the i-th codeword (centroid) in codebook C_l. Any vector x \in \mathbb{R}^D has, on the l-th stage, a residual vector \xi_{x,l}. The objective function for RVQ is essentially:

\min_{C_1,\ldots,C_l,\ldots,C_L} \sum_{x} \Big\| x - \sum_{1 \le l \le L} c_l\big(i(\xi_{x,l})\big) \Big\|^2 \quad \text{s.t. } c_l \in C_l,\ 1 \le l \le L,   (1)

where ||.|| is the Euclidean norm and i(\xi_{x,l}) is the index of the centroid in C_l nearest to \xi_{x,l}. Different from PQ, which splits the data space into M separate subspaces and optimizes the distortion over the vector partition [16], RVQ optimizes the distortion over the extended residual vectors. In other words, RVQ tries to restore the distortion error instead of decomposing it.

2.2. Two optimized RVQs' optimality formulations

To further optimize the stage-codebooks and the space decomposition, we seek a better-discriminated space projection, represented by an orthonormal matrix R. Similarly to the use of R in OPQ [9,10], R can also be used to transform the data space in RVQ. We apply either a non-parametric optimization over multiple iterations or an L-stage parametric optimization to the L-stage RVQ to optimize R. The detailed derivations of the optimized RVQ models are presented in Section 4; here we only summarize the models. The objective function for the non-parametric RVQ-NP is essentially:

\min_{R,\,C'_1,\ldots,C'_l,\ldots,C'_L} \sum_{x} \Big\| Rx - \sum_{1 \le l \le L} c'_l\big(i(\xi_{x,l})\big) \Big\|^2 \quad \text{s.t. } c'_l \in C'_l,\ R^{T}R = I,\ 1 \le l \le L.   (2)

The objective function for the parametric RVQ-P is essentially:

\min_{\{R_l\},\,\{C'_l\}} \sum_{x} \Big\| x - \sum_{1 \le l \le L} \Big[ \prod_{1 \le j \le l} R_j^{T}\, c'_l\big(i(\xi_{x,l})\big) \Big] \Big\|^2 \quad \text{s.t. } c'_l \in C'_l,\ R_l^{T}R_l = I,\ 1 \le l \le L.   (3)

The differences are as follows. The nonparametric RVQ-NP, without any data distribution assumption, optimizes a single orthonormal matrix R for the whole L-stage RVQ and can be run for multiple iterations. The parametric RVQ-P assumes that data are subject to a Gaussian distribution and uses Eigenvalue Allocation to get each stage-matrix {R_l} (1 <= l <= L) at once; thus RVQ-P performs L stages of parametric optimization by Eigenvalue Allocation.
3. Residual vector quantization

In this section, we introduce the optimality of RVQ's distortion in detail and give two properties of RVQ's quantization distortion.
3.1. RVQ model

Residual vector quantization [2,7] is a branch of the VQ technique. RVQ uses multiple low-complexity quantizers to reduce the quantization error sequentially. An L-stage RVQ in this paper has L different quantizers {Q_1, ..., Q_l, ..., Q_L}. As shown in Fig. 1, each stage-quantizer Q_l has its corresponding stage-codebook C_l (1 <= l <= L) and generates the corresponding quantization outputs {x_l} and residual vectors {\xi_l}. First, for l = 1, RVQ maps a vector x \in \mathbb{R}^D to its nearest centroid x_1 in the first stage-codebook C_1 = {c_1(i)} by the first stage-quantizer Q_1 and obtains a new residual vector \xi_1:
x_1 = Q_1(x) = c_1\big(i(x)\big) = \arg\min_{c_1(i) \in C_1} \| x - c_1(i) \|, \qquad \xi_1 = x - Q_1(x) = x - x_1.   (4)
Then, for 2 <= l <= L, the l-th stage-quantizer Q_l quantizes the residual vector \xi_{l-1} passed down from the previous stage-quantizer Q_{l-1} and produces a new residual vector \xi_l:
\xi_l = \xi_{l-1} - Q_l(\xi_{l-1}) = \xi_{l-1} - x_l,   (5)
where x_l denotes the quantized output of the l-th stage-quantizer Q_l. Finally, after RVQ's L-stage quantization, as in Eq. (6), we approximate the vector x by restoring its L stages' quantization outputs {x_1, x_2, ..., x_L}, while the last non-quantized residual vector \xi_L is omitted:
RVQ(x) = \sum_{1 \le l \le L} x_l + \xi_L \approx \sum_{1 \le l \le L} x_l.   (6)

Fig. 1. The flowchart of the proposed RVQ model.
3.2. RVQ's quantization process

3.2.1. Quantization learning

As shown in Fig. 2, the learning process constructs the stage-codebooks {C_l} (1 <= l <= L). Each stage-codebook is trained by the k-means algorithm. At the first stage, the training data X = {x} are used to learn the first stage-codebook C_1, where X is a D-by-n data matrix (each column is a vector sample x \in \mathbb{R}^D) and n is the number of samples. After that, we take the difference between X and X's quantization output matrix and denote it as the residual vector set E_1, which is used to learn the next stage-codebook C_2. Analogously, we obtain the l-th stage residual vector set E_l and use it as the training set for learning the next stage-codebook C_{l+1}.
Finally, the stage-quantizers {<Q_1: C_1>, ..., <Q_l: C_l>, ..., <Q_L: C_L>} are obtained, and the stage-codebooks C_l (1 <= l <= L) are stored in memory.

3.2.2. Quantization encoding

In the encoding process, an input vector x in the database is quantized by the stage-quantizers {Q_1, ..., Q_l, ..., Q_L} in turn. We thus obtain x's L-stage quantization outputs {x_1, ..., x_l, ..., x_L} and store the corresponding indices into the stage-codebooks {C_l} as Bit = {b_1, ..., b_l, ..., b_L} (1 <= l <= L), where b_l is the index of the centroid in C_l nearest to x's (l-1)-th stage residual vector \xi_{l-1}, with \xi_0 = x. After encoding, the code Bit is stored in memory. Hence, for an L-stage RVQ with k codewords in each stage-codebook, the compact code length is L*log2(k) bits per vector.

3.2.3. Quantization decoding

After encoding, we can recover the L-stage quantization outputs {x_1, ..., x_l, ..., x_L} by looking up the stage-codebooks {C_l} with Bit = {b_1, ..., b_l, ..., b_L} (1 <= l <= L), and we restore the vector x by the quantization calculation in Eq. (6).
Fig. 2. The quantization process of L-stage RVQ.
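To make the quantization process of Fig. 2 concrete, the following is a minimal NumPy/scikit-learn sketch of L-stage RVQ training, encoding and decoding as described in Section 3.2. It is our own illustration under the stated definitions; the function names, the use of scikit-learn's k-means and the random test data are assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def rvq_train(X, L, k, seed=0):
    """Learn the L stage-codebooks C_1..C_L by k-means on successive residuals (Section 3.2.1)."""
    codebooks, residual = [], X.copy()                 # E_0 = X
    for _ in range(L):
        km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(residual)
        C = km.cluster_centers_                        # stage-codebook, shape (k, D)
        codebooks.append(C)
        residual = residual - C[km.predict(residual)]  # E_l feeds the next stage
    return codebooks

def rvq_encode(X, codebooks):
    """Store the index of the nearest centroid at every stage: Bit = {b_1, ..., b_L} (Section 3.2.2)."""
    codes, residual = [], X.copy()                     # xi_0 = x
    for C in codebooks:
        d = (residual ** 2).sum(1, keepdims=True) - 2.0 * residual @ C.T + (C ** 2).sum(1)
        b = d.argmin(1)                                # b_l, nearest centroid in C_l
        codes.append(b)
        residual = residual - C[b]                     # xi_l = xi_{l-1} - x_l, Eq. (5)
    return np.stack(codes, axis=1)

def rvq_decode(codes, codebooks):
    """Restore x as the sum of the selected codewords, Eq. (6) (Section 3.2.3)."""
    return sum(C[codes[:, l]] for l, C in enumerate(codebooks))

# Example: L = 4 stages with k = 256 codewords -> 4 * log2(256) = 32 bits per vector.
X = np.random.randn(10000, 128).astype(np.float32)
books = rvq_train(X, L=4, k=256)
codes = rvq_encode(X, books)
X_hat = rvq_decode(codes, books)
print(((X - X_hat) ** 2).sum(1).mean())                # average quantization distortion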
3.3. RVQ's quantization distortion

Property 1. The (l+1)-th stage-quantizer Q_{l+1}'s quantization distortion is less than Q_l's.

Proof. By the definition of the distortion function and Eq. (5), the (l+1)-th stage-quantizer Q_{l+1}'s quantization distortion Dis(Q_{l+1}) can be expanded as follows:

Dis(Q_{l+1}) = \frac{1}{n}\|E_{l+1}\|^2 = \frac{1}{n}\sum_{\xi_{l+1} \in E_{l+1}} \xi_{l+1}^{T}\xi_{l+1} = \frac{1}{n}\sum_{\xi_l \in E_l} (\xi_l - x_{l+1})^{T}(\xi_l - x_{l+1}) = Dis(Q_l) + \frac{1}{n}\sum_{\xi_l \in E_l} \big( x_{l+1}^{T}x_{l+1} - 2\,\xi_l^{T}x_{l+1} \big)

\Rightarrow \Delta = Dis(Q_{l+1}) - Dis(Q_l) = \frac{1}{n}\sum_{\xi_l \in E_l} \big( \|x_{l+1}\|^2 - 2\,\xi_l^{T}x_{l+1} \big),

where n is the total number of data samples and E_l is the l-th residual vector set of the training data X generated by Q_l. Suppose that the stage-codebook C_{l+1} has k codewords \{c_{l+1}^{1}, \ldots, c_{l+1}^{i}, \ldots, c_{l+1}^{k}\}. Then:

1) x_{l+1} = \arg\min_{c_{l+1}^{i} \in C_{l+1}} \| \xi_l - c_{l+1}^{i} \|;
2) \sum_{\xi_l \in E_{l,i}} \xi_l = n_{l+1}^{i} \cdot c_{l+1}^{i}, where E_{l,i} is the subset of E_l that is used to generate (i.e., is assigned to) codeword c_{l+1}^{i}, n_{l+1}^{i} is the number of data samples in E_{l,i}, 1 <= i <= k, and k is the number of codewords in each stage-codebook.

Thus

\Delta = \frac{1}{n}\sum_{\xi_l \in E_l} \big( \|x_{l+1}\|^2 - 2\,\xi_l^{T}x_{l+1} \big) = \frac{1}{n}\sum_{1 \le i \le k} \big[ n_{l+1}^{i}\,\|c_{l+1}^{i}\|^2 - 2\, n_{l+1}^{i}\, c_{l+1}^{i\,T} c_{l+1}^{i} \big] = -\frac{1}{n}\sum_{1 \le i \le k} n_{l+1}^{i}\,\|c_{l+1}^{i}\|^2 \le 0.

Therefore, Dis(Q_{l+1}) <= Dis(Q_l) is proved.

Property 2. An L-stage RVQ's quantization distortion decreases monotonically with increasing L.

Proof. By Eq. (6), X - RVQ(X) = E_L, so RVQ's quantization distortion is the last stage-quantizer Q_L's quantization distortion Dis(Q_L). Property 2 then follows from Property 1.
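As a quick empirical sanity check of Properties 1 and 2 (our own illustration, reusing rvq_train from the sketch in Section 3.2), the per-stage distortion Dis(Q_l) = (1/n)||E_l||^2 measured on the training data should not increase as stages are added:

import numpy as np

X = np.random.randn(20000, 128).astype(np.float32)
books = rvq_train(X, L=8, k=64)                  # rvq_train: see the Section 3.2 sketch
residual = X.copy()
for l, C in enumerate(books, start=1):
    d = (residual ** 2).sum(1, keepdims=True) - 2.0 * residual @ C.T + (C ** 2).sum(1)
    residual = residual - C[d.argmin(1)]         # E_l
    print(l, float((residual ** 2).sum(1).mean()))   # Dis(Q_l), non-increasing in l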
4. Optimized residual vector quantization

As summarized in Section 2.2, we try to find a better orthonormal matrix R for a better data space projection. This paper provides a nonparametric and a parametric optimized RVQ. As shown in Fig. 3, the nonparametric RVQ-NP, without any data distribution assumption, optimizes R over multiple iterations, whereas the parametric RVQ-P assumes that the data follow a Gaussian distribution and uses Eigenvalue Allocation to obtain each stage-matrix {R_l} (1 <= l <= L) at once.

4.1. Nonparametric RVQ-NP

For L-stage quantization, RVQ-NP optimizes the orthonormal matrix R and its stage-codebooks {C'_l} (1 <= l <= L) in an alternating way. Optimizing R yields a better-discriminated data projection, and optimizing the stage-codebooks {C'_l} is the general way to optimize vector encoding. We apply the non-parametric solution of [9-11] to our RVQ to optimize R.

Optimize the stage-codebooks {C'_l} (1 <= l <= L): since R is orthonormal, Eq. (7) holds:

\Big\| x - \sum_{1 \le l \le L} c_l\big(i(\xi_{x,l})\big) \Big\| = \Big\| Rx - \sum_{1 \le l \le L} R\,c_l\big(i(\xi_{x,l})\big) \Big\|.   (7)

Then the objective distortion function Eq. (1) is deformed into Eq. (2), as introduced in Section 2.2, where c'_l(.) = R c_l(.). When R is fixed, we project the training data by R and run k-means on the projected training data to learn {c'_l(i)}, which gives the transformed stage-codebooks {C'_l} (1 <= l <= L).

Optimize R: with the transformed stage-codebooks {C'_l} (1 <= l <= L) fixed, for a vector x we have Rx \approx x' = \sum_{1 \le l \le L} c'_l\big(i(\xi_{x,l})\big). Given the training data matrix X, we have RX \approx X'. Note that X' is already fixed by the stage-codebooks {C'_l} (1 <= l <= L). Eq. (2) is then deformed as:

\min_{R} \sum_{x} \big\| Rx - x' \big\|^{2} \quad \text{s.t. } R^{T}R = I.   (8)
Singular Value Decomposition (SVD) is applied to solve Eq. (8), as in [9-11]: we compute XX'^T = USV^T and take R = VU^T, which completes the optimization of R. As shown in Fig. 3(a), we alternately optimize the orthonormal matrix R and the stage-codebooks {C'_l} (1 <= l <= L). A pseudocode of RVQ-NP is given in Algorithm 1.

Fig. 3. Illustration of optimized residual vector quantization.

Algorithm 1. RVQ-NP
Input: training data X = {x}, stage number L, codeword number k, iteration threshold num
Output: orthonormal matrix R, stage-codebooks {C'_l}, stage-indices {i(x'_l)} for each vector x (1 <= l <= L)
1: Initialize R.
2: while the iteration number is less than or equal to num
3:   X'_1 = RX. // data projection
4:   for l = 1 to L do
5:     run k-means on the data samples X'_l to construct the l-th stage-codebook C'_l; // optimize stage-codebooks
6:     for each x'_l in X'_l: set i(x'_l) to the index of the codeword c'_l in C'_l that is nearest to x'_l; end for
7:     get the residual vector set E_l = X'_l - {c'_l(i(x'_l))} and set the next-stage data samples X'_{l+1} = E_l;
8:   end for
9:   optimize R by Eq. (8).
10: end while
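The alternating structure of Algorithm 1 can be summarized in a few lines of NumPy. The sketch below is our own illustration (it reuses rvq_train, rvq_encode and rvq_decode from the Section 3.2 sketch and is not the authors' implementation); the R update is the SVD-based solution of Eq. (8).

import numpy as np

def rvq_np(X, L, k, num=4):
    """Alternately optimize the rotation R and the stage-codebooks {C'_l} (Algorithm 1)."""
    D = X.shape[1]
    R = np.eye(D)                                     # step 1: initialize R
    for _ in range(num):                              # steps 2-10
        Xp = X @ R.T                                  # step 3: X'_1 = RX (data projection)
        books = rvq_train(Xp, L, k)                   # steps 4-8: learn {C'_l} on projected data
        Y = rvq_decode(rvq_encode(Xp, books), books)  # X', the quantized approximation of RX
        # step 9: solve Eq. (8), min_R ||RX - X'||^2 with R^T R = I, via SVD:
        U, _, Vt = np.linalg.svd(X.T @ Y)             # X X'^T = U S V^T (column-vector convention)
        R = Vt.T @ U.T                                # R = V U^T
    return R, books

A call such as R, books = rvq_np(X, L=4, k=256) would mirror the earlier example, after which database vectors are encoded in the projected space X @ R.T.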
4.2. Parametric RVQ-P

Since RVQ-NP has the inconvenience that its optimality is limited by R's initialization and by the iteration threshold num, a parametric solution is proposed to determine R. The distortion optimization in [9,10] hypothesizes that each dimension of a sample is subject to a mutually independent Gaussian distribution with zero mean. Under the Gaussian distribution x ~ N(0, \Sigma), the bound of the quantization distortion Dis can be expressed as [9,10]:

Dis \ge k^{-\frac{2}{D}}\, D\, |\Sigma|^{\frac{1}{D}}.   (9)

We also apply this hypothesis to our RVQ and denote the result as RVQ-P. RVQ-P tries to find the minimum distortion Dis by applying R to project the data.

Distortion bound of RVQ-P: when k and D are fixed, the minimum distortion depends on |\Sigma|. By R's orthogonality, \Sigma' = R\Sigma R^T. Furthermore, by Property 2, RVQ's quantization distortion is the last stage-quantizer Q_L's quantization distortion. The distortion bound is therefore expressed as follows:

\min_{R} |\Sigma| = \min_{R} |\Sigma'| = \min_{R} |R \Sigma R^{T}| = \min_{R} |\Sigma'_{L}| \quad \text{s.t. } R^{T}R = I.   (10)

To minimize the distortion bound, we have to gradually decrease each stage-quantizer's distortion. As shown in Fig. 3(b), on each stage the Eigenvalue Allocation (EA) of [9,10] is used to minimize the distortion caused by the data projection; it fits the Gaussian distribution hypothesis well. EA satisfies two criteria: (i) mutual independence of dimensions, via PCA projection, and (ii) balanced partition variance, via principal direction re-ordering. A pseudocode of this parametric algorithm RVQ-P is given in Algorithm 2; it is a greedy algorithm in which, on each stage, the data are transformed by R_l.

Algorithm 2. RVQ-P
Input: training data X = {x}, stage number L, codeword number k
Output: orthonormal matrices {R_l}, stage-codebooks {C'_l}, stage-indices {i(x'_l)} for each vector x (1 <= l <= L)
1: X_1 = X.
2: for l = 1 to L do
3:   R_l = eigenvalue_allocation(X_l); // optimize R_l
4:   X'_l = R_l X_l. // data projection
5:   run k-means on the data samples X'_l to construct the l-th stage-codebook C'_l;
6:   for each x' in X'_l: set i(x'_l) to the index of the codeword c'_l in C'_l that is nearest to x'; end for
7:   get the residual vector set E_l = X'_l - {c'_l(i(x'))} and set the next-stage data samples X_{l+1} = E_l;
8: end for

Finally, the quantization output of a vector x is:

x' = RVQ\_P(x) = \sum_{1 \le l \le L} \prod_{1 \le j \le l} R_j^{T} x_l + \xi_L \approx \sum_{1 \le l \le L} \prod_{1 \le j \le l} R_j^{T} x_l.   (11)

Therefore, the objective distortion function Eq. (1) for RVQ-P is deformed into Eq. (3), as summarized in Section 2.2.
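A rough sketch of the greedy, stage-wise procedure of Algorithm 2 is given below. It is our own illustration, not the authors' code; in particular, the eigenvalue_allocation helper here only performs the PCA-projection part of Eigenvalue Allocation [9,10] and omits the principal-direction re-ordering that balances the variance.

import numpy as np
from sklearn.cluster import KMeans

def eigenvalue_allocation(X):
    """Return an orthonormal matrix whose rows are the principal directions of X (PCA part only)."""
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigval)[::-1]          # principal directions sorted by variance
    return eigvec[:, order].T                 # R_l, with R_l R_l^T = I

def rvq_p(X, L, k, seed=0):
    """Stage-wise RVQ-P (Algorithm 2): rotate, quantize, then pass the residuals on."""
    rotations, codebooks, Xl = [], [], X.copy()   # X_1 = X
    for _ in range(L):
        Rl = eigenvalue_allocation(Xl)        # step 3: optimize R_l
        Xp = Xl @ Rl.T                        # step 4: X'_l = R_l X_l
        km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(Xp)
        Cl = km.cluster_centers_              # step 5: stage-codebook C'_l
        Xl = Xp - Cl[km.predict(Xp)]          # steps 6-7: residuals E_l become X_{l+1}
        rotations.append(Rl)
        codebooks.append(Cl)
    return rotations, codebooks

Decoding then has to undo the per-stage rotations as in Eq. (11), i.e., each stage's codeword is mapped back through the transposes of the accumulated rotations.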
5. Experiments and results

We use two public datasets for ANN search: SIFT1M and GIST1M [16]. We compare our non-parametric RVQ-NP and parametric RVQ-P with the following methods: (1) PQ: the original PQ [16]; (2) PQ-RO: PQ with data projection by randomly ordering the dimensions [9,10]; (3) PQ-RR: PQ with data projection by both PCA and random rotation; (4) OPQ-P: parametric OPQ; (5) OPQ-NP: nonparametric OPQ; (6) RVQ: the original RVQ. We run 20 k-means iterations in each method. To keep the time complexity of the nonparametric solutions comparable to that of the parametric ones, we set the iteration thresholds of RVQ-NP and OPQ-NP to L and M, respectively. We follow the Symmetric Distance Computation (SDC) and Asymmetric Distance Computation (ADC) strategies for ANN search [9,10,17]. [9] and [10] have pointed out that SDC is less accurate than ADC at the same time complexity, so we mainly experiment with ADC and compare our methods under SDC and ADC at the end of this paper.
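As a brute-force illustration of the difference between the two strategies on RVQ codes (our own sketch, reusing rvq_encode and rvq_decode from the Section 3.2 sketch; practical implementations typically use precomputed lookup tables rather than explicit reconstruction): ADC compares the raw query with the decoded database vectors, while SDC quantizes the query as well.

import numpy as np

def adc_search(query, db_codes, codebooks, topk=100):
    """Asymmetric Distance Computation: raw query vs. quantized database vectors."""
    db_hat = rvq_decode(db_codes, codebooks)
    d = ((db_hat - query) ** 2).sum(1)
    return np.argsort(d)[:topk]

def sdc_search(query, db_codes, codebooks, topk=100):
    """Symmetric Distance Computation: the query is quantized too."""
    q_hat = rvq_decode(rvq_encode(query[None, :], codebooks), codebooks)[0]
    d = ((rvq_decode(db_codes, codebooks) - q_hat) ** 2).sum(1)
    return np.argsort(d)[:topk]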
5.1. Performances of RVQ-based methods with different L and k

First, we compare our RVQ-based methods. The recall@10 comparison on SIFT1M is shown in Fig. 4. We can see that RVQ, RVQ-P and RVQ-NP show the same trend with respect to the parameters L and k: the search accuracy increases monotonically with increasing L or k, and under the same bit length a bigger k gives better accuracy than a bigger L. The performance of RVQ-NP is close to that of RVQ; RVQ-NP is slightly better than RVQ, while RVQ-P clearly outperforms both RVQ and RVQ-NP.
Fig. 4. Recall@10 associated with L and k on SIFT1M (ADC).
Fig. 5. Recall@1 on SIFT1M and GIST1M versus bit allocation per point (ADC). (a)-(d): results with k = 2^8 = 256 codewords on SIFT1M for 16, 32, 64 and 128 bits (L = M = 2, 4, 8, 16). (e)-(h): results with k = 2^4 = 16 codewords on GIST1M for 16, 32, 64 and 128 bits (L = M = 4, 8, 16, 32).
Fig. 6. Recall@R on SIFT1M versus bit allocation per point (ADC).
Fig. 7. Recall@R on GIST1M versus bit allocation per point (ADC).
Fig. 8. Mean average precision vs. code-length on SIFT1M and GIST1M (ADC).
5.2. Accuracy comparison

We then compare the accuracy of the different methods, using both recall@R and mean average precision (mAP). To be fair, our experiments set L = M, where L is the stage number in the RVQ-based methods and M is the number of subspaces in the PQ-based methods. Moreover, to cover different quantization compression ratios, we deliberately assign a small bit budget to the high-dimensional features: the SIFT descriptor is a local 128-d descriptor and the GIST descriptor is a global 960-d descriptor. On SIFT1M, each quantizer is assigned 8 bits, i.e., k = 2^8 = 256 codewords in each stage-codebook or subspace codebook. On GIST1M, each quantizer is assigned only 4 bits, i.e., k = 2^4 = 16 codewords in each stage-codebook or subspace codebook. The results of the recall@R and mAP comparisons are as follows.

Recall@1 reflects the rapid-response ability of each method. Fig. 5 gives the recall@1 results with different code-lengths on the two datasets. On both SIFT1M and GIST1M the trends are consistent: the RVQ-based methods (RVQ, RVQ-P and RVQ-NP) and the optimized PQ methods (OPQ-P and OPQ-NP) perform better than the original PQ-based methods (PQ, PQ-RO and PQ-RR). This indicates that RVQ has a better accuracy than PQ and that RVQ-based methods have an advantage in restoring quantization error. When the bit allocation per point is small, the RVQ-based
methods clearly have a better accuracy than the other methods. When the bit allocation per point becomes larger, RVQ, RVQ-NP, OPQ-P and OPQ-NP become close on SIFT1M; this is because the quantization compression becomes weaker and the effect of the global optimization of the orthonormal matrix R in these methods becomes subtle. In any case, RVQ-P is always the best: it successfully optimizes the quantization distortion through its stage-matrices {R_l}. What is more, except for RVQ-P, the other methods are sensitive to the different compression settings, such as the 128-d SIFT descriptor with k = 2^8 = 256 codewords and the 960-d GIST descriptor with k = 2^4 = 16 codewords: their accuracies on GIST1M are much lower than on SIFT1M. RVQ-P, however, remains robust across the different compression settings.

Figs. 6 and 7 give the recall@R curves on SIFT1M and GIST1M. In both figures, RVQ-NP and RVQ-P outperform the PQ-based methods, and RVQ-P is still the best; its advantage is more significant when the code-length is short.

The mAP results are shown in Fig. 8, where the true neighbors are taken as the first 200 Euclidean nearest neighbors. We reach the same conclusion: RVQ-P and RVQ-NP have better accuracy, and RVQ-P remains the best. Given a longer code-length, each method's accuracy gets closer to that of the original, uncompressed data, and the difference between RVQ-based and PQ-based methods is reduced.
Fig. 9. Distortion of learning set vs. code-length.
Fig. 10. Distortion of database set vs. code-length (ADC).
RVQ-P shows an obvious superiority when the code-length is shorter. This means that RVQ-P retains more of the original information than the other methods, so RVQ-P is well suited to low-end mobile devices.

5.3. Distortion comparison

Figs. 9 and 10 compare the quantization distortion at different code-lengths on SIFT1M and GIST1M. Fig. 9 shows the distortion on the learning set and Fig. 10 shows the
distortion on the database set. We can see that RVQ-P and RVQ-NP have similar distortions, and their distortions are lower than the others'. Moreover, the difference between RVQ-based and PQ-based methods is bigger in Fig. 10 than in Fig. 9. In particular, with a 16-bit code-length the distortions of the RVQ-based methods on the database set are only about one tenth of those of the PQ-based methods. This indicates that the PQ-based methods are sensitive to the training (learning) data, whereas the performance of the RVQ-based methods is not limited by the learning data.
Fig. 11. Recall@R on SIFT1M versus bit allocation per point (SDC). (a) L = M = 2, 16 bits; (b) L = M = 4, 32 bits; (c) L = M = 8, 64 bits; (d) L = M = 16, 128 bits.
Table 1
Speed comparison on SIFT1M with 32-bit code length (ADC), L = M = 4, k = 256.

Method        | Training (s) | Encoding (s) | Decoding/search (s) | R = 1 (%) | R = 10 (%) | R = 100 (%)
PQ [16]       | 774.50       | 22.21        | 8.23                | 5.76      | 58.82      | 105.40
PQ-RO [16]    | 765.56       | 16.92        | 8.31                | 4.09      | 49.29      | 103.29
PQ-RR [16]    | 766.61       | 10.15        | 8.08                | 2.31      | 35.32      | 103.21
OPQ-P [9,10]  | 764.03       | 22.24        | 8.31                | 5.68      | 58.01      | 103.55
OPQ-NP [9,10] | 1389.29      | 26.21        | 7.82                | 6.34      | 64.03      | 99.83
RVQ           | 2826.48      | 33.52        | 14.34               | 9.80      | 72.12      | 101.17
RVQ-P         | 2862.39      | 67.44        | 18.04               | 29.21     | 93.53      | 102.15
RVQ-NP        | 5064.58      | 34.84        | 13.04               | 9.51      | 72.80      | 94.82
RVQ-based methods have better stability and robustness than PQ-based methods. The reason is that RVQ-based methods not only restore the quantization error, but also well maintain the data distance relations in the compressed space.

5.4. Comparison of the ADC and SDC strategies

Finally, we compare the SDC and ADC strategies in ANN search. Using the same parameters as in the ADC results on SIFT1M in Fig. 6, we run the SDC strategy; the recall@R curves under SDC are shown in Fig. 11. [9] and [10] have pointed out that SDC is less accurate than ADC. As shown in Fig. 11, all methods except RVQ-P follow this pattern, but RVQ-P is not limited by this defect of the SDC strategy: it still achieves a very good accuracy with SDC, even at a small code-length. This again verifies the good stability and robustness of RVQ-P.

5.5. Speed comparison

Given D-dimensional data, the time complexity of the PQ-based methods in the search process is O(M*t(D/M)), where t(d) is the time to pick the nearest codeword in a d-dimensional codebook; it corresponds to picking the nearest (D/M)-dimensional centroid in each of M sub-codebooks. The time complexity of the RVQ-based methods is O(L*t(D)), i.e., picking the nearest D-dimensional centroid in each of L stage-codebooks. Here L = M, so the RVQ-based methods have a higher time complexity than the PQ-based methods.

Table 1 lists the training, encoding and search times. All methods are implemented in Matlab on a server with an Intel Xeon E5-2620 V2 2.1 GHz CPU and 32 GB RAM. The results are consistent with the theoretical analysis: the RVQ-based methods consume more time than the PQ-based methods. Supported by high-performance computing in real applications, the time spent on the offline training and encoding processes is generally tolerable, and with the same code length there is relatively little difference among the methods in the decoding/search process. Therefore, the time consumption of the RVQ-based methods is well acceptable. In addition, RVQ-P has a much higher accuracy than the others; for example, at recall@1 the accuracy of RVQ-P is at least 2.98 times that of the others.
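As a rough back-of-the-envelope check of the complexity argument above (our own numbers, following the O(M*t(D/M)) vs. O(L*t(D)) analysis for the Table 1 setting, D = 128 and L = M = 4, k = 256):

D, M, L, k = 128, 4, 4, 256
pq_ops_per_vector  = M * k * (D // M)   # nearest sub-centroid in M sub-codebooks: 32768 multiply-adds
rvq_ops_per_vector = L * k * D          # nearest full-dimensional centroid in L stages: 131072 multiply-adds
print(rvq_ops_per_vector / pq_ops_per_vector)   # ratio = L, i.e., about 4x more work here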
6. Conclusions

We have introduced two optimized solutions of residual vector quantization for ANN search. Both RVQ-NP and RVQ-P optimize the quantization distortion through data projection with an orthogonal matrix. Different from the local optimization of PQ-based methods over subspace partitions, RVQ-P and RVQ-NP perform a global restoration of the residual error with multiple stage quantizers. RVQ-NP and RVQ-P both work well; in particular, RVQ-P has a very obvious advantage in accuracy and robustness over the other methods. RVQ-P is not sensitive to the training data, under either the SDC or the ADC strategy in ANN search, and it performs particularly well with a short bit allocation per point. In future work we will improve the time efficiency of the training process.
Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under grant 61305062. We would also like to thank the authors of [16] and [9,10] for sharing their code on the web; with the Matlab code of PQ [16] and OPQ [9,10], we could carry out the experimental comparisons more conveniently.
References

[1] A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, FOCS IEEE 51 (1) (2006) 459-468.
[2] L. Ai, J. Yu, T. Guan, Y. He, Efficient approximate nearest neighbor search by optimized residual vector quantization, in: Proceedings of the CBMI, IEEE, 2014, pp. 1-4.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is "Nearest Neighbor" meaningful? Database Theory—ICDT'99, Springer, Berlin Heidelberg, 1999, pp. 217-235.
[4] C. Böhm, S. Berchtold, D. Keim, Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases, ACM CSUR 33 (2001) 322-373.
[5] O. Boiman, E. Shechtman, M. Irani, In defense of nearest-neighbor based image classification, in: Proceedings of the CVPR, IEEE, 2008, pp. 1-8.
[6] J. Brandt, Transform coding for fast approximate nearest neighbor search in high dimensions, in: Proceedings of the CVPR, IEEE, 2010, pp. 1815-1822.
[7] Y. Chen, T. Guan, C. Wang, Approximate nearest neighbor search by residual vector quantization, Sensors 10 (12) (2010) 11259-11273.
[8] M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in: Proceedings of the 20th Annual ACM SoCG, 2004, pp. 253-262.
[9] T. Ge, K. He, Q. Ke, J. Sun, Optimized product quantization for approximate nearest neighbor search, in: Proceedings of the CVPR, IEEE, 2013, pp. 2946-2953.
[10] T. Ge, K. He, Q. Ke, J. Sun, Optimized product quantization, IEEE Trans. Pattern Anal. Mach. Intell. 36 (4) (2014) 744-755.
[11] Y. Gong, S. Lazebnik, Iterative quantization: a procrustean approach to learning binary codes, in: Proceedings of the CVPR, IEEE, 2011, pp. 817-824.
[12] R.M. Gray, Vector quantization, ASSP Mag. IEEE 1 (2) (1984) 4-29.
[13] T. Guan, Y. He, L. Duan, J. Yu, Efficient BOF generation and compression for on-device mobile visual location recognition, IEEE Multimed. 21 (2) (2014) 32-41.
[14] T. Guan, Y. Wang, L. Duan, R. Ji, On-device mobile landmark recognition using binarized descriptor with multifeature fusion, ACM Trans. Intell. Syst. Technol. 7 (1) (2015) 1-29.
[15] K. He, F. Wen, J. Sun, K-means hashing: an affinity-preserving quantization method for learning binary compact codes, in: Proceedings of the CVPR, IEEE, 2013, pp. 2938-2945.
[16] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2010) 117-128.
[17] H. Jegou, M. Douze, C. Schmid, P. Perez, Aggregating local descriptors into a compact image representation, in: Proceedings of the CVPR, IEEE, 2010, pp. 3304-3311.
[18] R. Ji, Y. Gao, W. Liu, X. Xie, Q. Tian, X. Li, When location meets social multimedia: a comprehensive survey on location-aware social multimedia, ACM Trans. Intell. Syst. Technol. 6 (1) (2015) 1-18.
[19] Y. Kalantidis, Y. Avrithis, Locally optimized product quantization for approximate nearest neighbor search, in: Proceedings of the CVPR, IEEE, 2014, pp. 2321-2328.
[20] C. Lin, W. Chen, C. Qiu, Y. Wu, S. Krishnan, Q. Zou, LibD3C: ensemble classifiers with a clustering and dynamic selection strategy, Neurocomputing 123 (2014) 424-435.
[21] H. Liu, R. Ji, Y. Wu, W. Liu, Towards optimal binary code learning via ordinal embedding, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[22] Y. Luo, T. Guan, B. Wei, H. Pan, J. Yu, Fast terrain mapping from low altitude digital imagery, Neurocomputing 156 (2015) 105-116.
[23] M. Muja, D.G. Lowe, Fast approximate nearest neighbors with automatic algorithm configuration, in: Proceedings of the VISAPP, 2009, pp. 331-340.
[24] D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: Proceedings of the CVPR, IEEE, 2006, pp. 2161-2168.
[25] H. Pan, T. Guan, Y. Luo, L. Duan, Y. Tian, L. Yi, Y. Zhao, J. Yu, Dense 3D reconstruction combining depth and RGB information, Neurocomputing 175 (2016) 644-651.
[26] G. Shakhnarovich, P. Indyk, T. Darrell, Nearest-neighbor methods in learning and vision: theory and practice, Pattern Anal. Appl. (2006).
[27] C. Silpa-Anan, R. Hartley, S. Machines, A. Canberra, Optimised KD-trees for fast image descriptor matching, in: Proceedings of the CVPR, IEEE, 2008, pp. 1-8.
[28] J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proceedings of the ICCV, IEEE, 2003, pp. 1470-1477.
[29] D. Song, W. Liu, D.A. Meyer, D. Tao, R. Ji, Rank preserving hashing for rapid image search, in: Proceedings of the DCC, IEEE, 2015, pp. 353-362.
[30] A. Torralba, R. Fergus, Y. Weiss, Small codes and large image databases for recognition, in: Proceedings of the CVPR, IEEE, 2008, pp. 1-8.
[31] B. Wang, Z. Li, M. Li, W.Y. Ma, Large-scale duplicate detection for web image search, in: Proceedings of the ICME, IEEE, 2006, pp. 353-356.
[32] J. Wang, S. Kumar, S.F. Chang, Semi-supervised hashing for scalable image retrieval, in: Proceedings of the CVPR, IEEE, 2010, pp. 3424-3431.
[33] B. Wei, T. Guan, L. Duan, J. Yu, T. Mao, Wide area localization and tracking on camera phones for mobile augmented reality systems, Multimed. Syst. 21 (2015) 381-399.
[34] B. Wei, T. Guan, J. Yu, Projected residual vector quantization for ANN search, IEEE Multimed. 21 (3) (2014) 41-51.
[35] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Proceedings of the NIPS, vol. 282 (3), 2008, pp. 1753-1760.
[36] Y. Zhang, T. Guan, L. Duan, B. Wei, J. Gao, T. Mao, Inertial sensors supported visual descriptors encoding and geometric verification for mobile visual location recognition applications, Signal Process. 112 (2015) 17-26.
[37] Q. Zou, X. Li, W. Jiang, Z. Lin, G. Li, K. Chen, Survey of MapReduce frame operation in bioinformatics, Brief. Bioinform. 15 (4) (2014) 637-647.
[38] Q. Zou, J. Zeng, L. Cao, R. Ji, A novel features ranking metric with application to scalable visual and bioinformatics data classification, Neurocomputing 173 (2016) 346-354.
Chuanqing Li received the B.S. degree from China University of Mining and Technology, China. He is currently an M.E. student at the School of Computer and Information, Hefei University of Technology. His research interests include multimedia retrieval, digital image analysis and processing.
Lv Wu is a Ph.D. candidate at Wuhan University of Technology, China, where she also received her B.S. degree. Her research interests include machine learning and non-parametric Bayesian learning in computer vision.
Dan Guo received the Ph.D. degree from Huazhong University of Science & Technology, China. She is currently an Associate Professor at the School of Computer and Information, Hefei University of Technology. Her research interests include multimedia analysis, video analysis and pattern recognition.