Discrete Applied Mathematics xxx (xxxx) xxx
Contents lists available at ScienceDirect
Discrete Applied Mathematics journal homepage: www.elsevier.com/locate/dam
Dynamic index and LZ factorization in compressed space ∗
Takaaki Nishimoto a , , Tomohiro I c , Shunsuke Inenaga b , Hideo Bannai b , Masayuki Takeda b a b c
RIKEN Center for Advanced Intelligence Project, Japan Department of Informatics, Kyushu University, Japan Kyushu Institute of Technology, Japan
article
info
Article history: Received 2 March 2017 Received in revised form 30 December 2018 Accepted 15 January 2019 Available online xxxx Keywords: Dynamic texts LZ77 factorization Dynamic index Compressed data structures Dynamic data structures
a b s t r a c t In this paper, we propose a new dynamic compressed index of O(w ) space for a dynamic text T , where w = O(min(z log N log∗ M , N)) is the size of the signature encoding of T , z is the size of the Lempel–Ziv77 (LZ77) factorization of T , N is the length of T , and M ≥ N is the maximum length of T . Our index supports searching for a pattern P in T in O(|P |fA +log w log |P | log∗ M(log N +log |P | log∗ M)+occ log N) time and insertion/deletion of a substring of length y in O((y + log N log∗ M) log w log N log∗ M)√time, where occ is log log M log log w
log w
the number of occurrences of P in T and fA = O(min{ log log log M , log log w }). Also, we propose a new space-efficient LZ77 factorization algorithm for a given text T , which runs in O(NfA + z log w log3 N(log∗ N)2 ) time with O(w ) working space. © 2019 Elsevier B.V. All rights reserved.
1. Introduction 1.1. Dynamic compressed index Given a text T , the string indexing problem is to construct a data structure, called an index, so that querying occurrences of a given pattern in T can be answered efficiently. As the size of data is growing rapidly in the last decade, many recent studies have focused on indices working in compressed text space (see e.g. [9,10,15,16]). However most of them are static, i.e., they have to be reconstructed from scratch when the text is modified, which makes difficult to apply them to dynamic texts. Formally, a dynamic text index for T of length N is the data structure supporting the following queries:
• INSERT (Y , i): update T ← T [1..i − 1]YT [i..|T |] for a given string Y of length y and an integer i. • DELETE(j, y): update T ← T [1..j − 1]T [j + y..|T |] for two given integers j and y. • FIND(P): return all occurrence positions of P in T . In this paper, we consider the dynamic compressed text indexing of maintaining a dynamic text index in compressed space. Although there exist several dynamic non-compressed text indices (see e.g. [3,12,36] for recent work), there has been little work for the compressed variants. Hon et al. [21] proposed the first dynamic compressed index of O( 1ϵ (NH0 + N)) bits of space which supports FIND(P) queries P in O(|P | log2 N(logϵ N + log |Σ |) + occ log1+ϵ N) time and update operations in √ O((y + N) log2+ϵ N) amortized time, where 0 < ϵ ≤ 1 and H0 ≤ log |Σ | denotes the zeroth order empirical entropy of ∗ Corresponding author. E-mail addresses:
[email protected] (T. Nishimoto),
[email protected] (T. I),
[email protected] (S. Inenaga),
[email protected] (H. Bannai),
[email protected] (M. Takeda). https://doi.org/10.1016/j.dam.2019.01.014 0166-218X/© 2019 Elsevier B.V. All rights reserved.
Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
2
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
T [21]. Salson et al. [38] also proposed a dynamic compressed index, called dynamic FM-Index. Although their approach works well in practice, updates require O(N log N) time in the worst case. To our knowledge, these are the only existing dynamic compressed indices to date. We propose a new dynamic index in compressed space, as follows: Theorem 1. Let M be the maximum length of the dynamic text to index. There exists a dynamic index of size O(w ) space for a string T supporting update operations in amortized O((y + log N log∗ M) log w log N log∗ M) time, FIND(P) in O(|P |fA + log w log |P | log∗ M(log N + log |P | log∗ M) + √ occ log N) time, where w = O(min(z log N log∗ M , N)), occ = |FIND(P)|, z = |LZ77(T )|, and fA = O(min{
log log M log log w log log log M
,
log w log log w
}).
fA is the time for predecessor/successor queries on a set of w integers from an M-element universe [4] and LZ77(T ) is the factorization of T which is called Lempel–Ziv77 (LZ77) factorization [41].1 The size of our index is small when z is small. Our index can find pattern occurrences faster than the index of Hon et al. [21] when M = O(N) holds. This is because fA = o(log N), (log∗ M)2 = o(logϵ N), and w = O(N) hold. Our index √ also allows faster substring insertion/deletion on the text since log w log N log∗ M = o(log2+ϵ N) and log N log∗ M = o( N). Moreover, our dynamic index supports INSERT ′ (j, y, i) operation in amortized O(log w (log N log∗ M)2 ) time, where INSERT ′ (j, y, i) updates T ← T [..i − 1]T [j..j + y − 1]T [i..] for given integers i, j, and y. 1.1.1. Related work To achieve the above result, technically speaking, we use the signature encoding G of T and the locally consistent parsing technique [28], which will be defined in Section 3. Locally consistent parsing is a factorization of an input text such that each factor is determined by a short context in the string and the signature encoding of T is a context-free grammar (CFG) representing T which is based on locally consistent parsing. Signature encoding has the property that the number of changed nodes is not many when T is edited. This is because each node of the derivation tree of T is determined by a context in T thanks to locally consistent parsing. Mehlhorn et al. proposed signature encodings for equality testing on a dynamic set of strings [28]. Since then, the signature encoding and the related ideas have been used in many applications. In particular, Alstrup et al. proposed dynamic index [2,3] (not compressed) which is based on the signature encoding of strings, while improving the update time of signature encodings [3] and the locally consistent parsing algorithm [2] using the simplified representation of dynamic strings and a precomputed table to compute factors. Recently, Gawrychowski et al. proposed a new dynamic index, which is based on the randomized version of signature encodings [18,19]. They did not analyze the space usage of their index, but the randomized signature encoding can be represented in compressed space with high probability [8]. Our data structure uses Alstrup et al.’s fast string concatenation/split algorithms (update algorithm) and linear-time computation of locally consistent parsing, however, our searching algorithm is different from their algorithm. Especially, their searching algorithm requires to maintain specific locations called anchors over the parse trees of the signature encodings. More precisely, an anchor is the substring represented by a node sequence over parse trees. Signature encodings can be represented in compressed space, but their index needs O(N) anchors to find patterns. For this reason, their index is not compressed. On the other hand, our index does not use anchors and finds patterns in compressed space. Our searching algorithm is based on the searching algorithm of the compressed index proposed by Claude and Navarro [9]. The text of their index is represented by a CFG, and the searching algorithm works in compressed space using range reporting queries on variables in the CFG. Since our dynamic text is also represented by a CFG, we can use their searching algorithm in compressed space. In addition, we show that by modifying their searching algorithm, it works faster when the text is represented by signature encodings. We can also maintain our data structure for the search algorithm under the edit operations to the text. As a result, we get Theorem 1, which is one of our contributions. Our index has a close relationship with the ESP-indices [39,40], which are compressed indices and the texts are represented by signature encodings. However the definition of signature encoding is slightly different from the definition by Mehlhorn et al. There are two significant differences between ours and ESP-indices: The first difference is that the ESPindex [39] is static and its online variant [40] allows only for appending new characters to the end of the text, while our index is fully dynamic allowing for insertion and deletion of arbitrary substrings at arbitrary positions. The second difference is that the pattern search time of the ESP-index is proportional to the number occ c of occurrences of the so-called ‘‘core’’ of a query pattern P, which corresponds to a maximal subtree of the ESP derivation tree of a query pattern P. If occ is the number of occurrences of P in the text, then it always holds that occ c ≥ occ, and in general occ c cannot be upper bounded by any function of occ. In contrast, as can be seen in Theorem 1, the pattern search time of our index is proportional to the number occ of occurrences of a query pattern P. This became possible due to our discovery of a new property of signature encodings [2] (stated in Lemma 7). As another application of signature encodings, Nishimoto et al. [31] showed that signature encodings for a dynamic string T can support Longest Common Extension (LCE) queries on T efficiently in compressed space. They also showed signature encodings can be updated in compressed space (Lemma 4). Our algorithm uses properties of signature encodings shown 1 Strictly speaking, we should refer to the LZ76 factorization [26], but we stick to the name LZ77 as it has been widely used to represent LZ76 and its variants. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
3
in [31] (the full version is [30]), more precisely, Lemmas 3–4, but Lemma 7 is a new property of signature encodings not described in [31]. In relation to our problem, there exists the library management problem of maintaining a text collection (a set of text strings) allowing for insertion/deletion of texts (see [29] for recent work). While in our problem a single text is edited by insertion/deletion of substrings, in the library management problem a text can be inserted to or deleted from the collection. Hence, algorithms for the library management problem cannot be directly applied to our problem. 1.2. Computing LZ77 factorization in compressed space As an application of our dynamic compressed index, we propose a new LZ77 factorization algorithm in compressed space. Although the primary use of LZ77 factorization is data compression, it has been shown that it is a powerful tool for many string processing problems [16,17]. Hence the importance of algorithms to compute LZ77 factorization is growing. Particularly, in order to apply algorithms to large-scale data, reducing the working space is an important matter. Hence we focus on LZ77 factorization algorithms working in compressed space. The following is our main result. Theorem 2. Given the signature encoding G of a string T , we can compute the LZ77 factorization of T in O(z log w log3 N(log∗ N)2 ) time and O(w ) working space, where w = O(min(z log N log∗ N , N)) is the size of G and z = |LZ77(T )|. In [30], it was shown that the signature encoding G can be constructed efficiently from various types of inputs, in particular, in O(NfA ) time and O(w ) working space from uncompressed string T . Therefore we can compute LZ77 factorization of a given T in O(NfA + z log w log3 N(log∗ N)2 ) time and O(w ) working space. 1.2.1. Related work Many LZ77 factorization algorithms using O(N log |Σ |) or O(N log N) bits of space have been proposed (e.g., [6,14,24]). For example, the algorithm proposed by Fischer et al. can compute the LZ77 factorization of T in O(N) time and O(N log |Σ |) bits of space [14]. It is faster than our algorithm, however, our algorithm works in small space when z ≪ N. As already mentioned, we focus on LZ77 factorization algorithms working in compressed space. Goto et al. [20] showed how, given the grammar-like representation for string T generated by the LCA algorithm [37], to compute the LZ77 factorization of T in O(z log2 m log3 N + m log m log3 N) time and O(m log2 m) space, where m is the size of the given representation. Sakamoto et al. [37] claimed that m = O(z log N log∗ N) and hence our algorithm seems to use less space than the algorithm of Goto et al. [20]. Recently, Fischer et al. [13] showed a Monte-Carlo randomized algorithms to compute an approximation of the LZ77 factorization with at most 2z factors in O(N log N) time, and another approximation with at most (i + ϵ )z factors in O(N log2 N) time for any constant ϵ > 0, using O(z) space each. Another line of research is LZ77 factorization working in compressed space in terms of Burrows–Wheeler transform (BWT) based methods. Policriti and Prezza recently proposed algorithms running in NH0 + o(N log |Σ |) + O(|Σ | log N) bits of space and O(N log N) time [32], or O(R log N) bits of space and O(N log R) time [33], where R is the number of runs in the BWT of the reversed string of T . Their and our algorithms are established on different measures z and R of compression, and thus, cannot be easily compared. Our algorithm is more space efficient than their algorithm when z = o( log N Rlog∗ N ) but it is not clear when it happens. In practice, it has been observed that z is often much smaller than R [5,33]. On the other hand, several theoretical relations between z and R are known: First, z = O(R log2 (N /R)) always holds [23], and thus, z cannot be significantly worse than R. Second, there exist string families such that R = Ω (z log N) [5,34] or z = Ω (R log N) [34] holds. We also refer to LZ-End [25], which is a variant of LZ77. The compressed size (the number of factors) z ′ of LZ-End is comparable to that of LZ77 in practice, and it can support fast random access. Kempa and Kosolobov [22] presented an LZ-End compression algorithm which runs in O(N log N) expected time and O(z ′ + ℓ) space, where ℓ is the length of the longest factor. Their algorithm has an option to fix ℓ in advance and compute LZ-End like factorization in which all factor lengths are limited to up to ℓ. Our algorithm is more space efficient than their algorithm when w = o(z ′ + ℓ). 2. Preliminaries Let Σ be an ordered alphabet. An element of Σ ∗ is called a string. The empty string ε is a string of length 0. Let Σ + = Σ ∗ − {ε}. Let T be a string with its length denoted by |T |. Let N = |T | throughout this paper. For a string s = xyz, x, y and z are called a prefix, substring, and suffix of s, respectively. For any 1 ≤ i ≤ j ≤ |s|, s[i] denotes the ith character of s, s[i..j] denotes the substring of s that begins at position i and ends at position j, s[i..] = s[i..|s|], and s[..i] = s[1..i]. Let sR denote the reversed string of s, that is, sR = s[|s|] · · · s[2]s[1]. For any strings x and y, let LCP(x, y) (resp. LCS(x, y)) denote the length of the longest common prefix (resp. suffix) of x and y, and also, for two integers i, j, let LCE(x, y, i, j) = LCP(x[i..], y[j..]). Let Occ(x, y) denote all occurrence positions of x in y, namely, Occ(x, y) = {i | x = y[i..i + |x| − 1], 1 ≤ i ≤ |y| − |x| + 1}. For a string p, we say that (x, y) stabs p at position i (1 ≤ i < |p|) if p[..i] is a suffix of x and p[i + 1..] is a prefix of y. A string sequence f1 , . . . , fk is called a factorization of a string s if f1 · · · fk = s holds, where fi ̸ = ε for any 1 ≤ i ≤ k. Each string of a factorization is called a factor and k is the size of the factorization of s. Let RLE(s) denote the factorization of s such that each factor is a maximal run of the same character a as ak , where k is the length of the run. For example, let s = aabbbbbabb. Then RLE(s) = a2 , b5 , a1 , b2 . RLE(s) can be computed in O(|s|) time. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
4
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
LZ77(s) is the factorization f1 , . . . , fz of s such that (1) f1 = s[1], (2) for 1 < i ≤ z, if the character s[|f1 ..fi−1 | + 1] does not occur in s[|f1 ..fi−1 |], then fi = s[|f1 ..fi−1 | + 1], otherwise fi is the longest prefix of fi · · · fz which occurs in f1 · · · fi−1 . For example, let s = abababcabababcabababcd. Then LZ77(s) = a, b, ab, ab, c , abababc , abababc , d. Let z = |LZ77(T )| throughout this paper. The base of logarithms is two in this paper. Our model of computation is the unit-cost word RAM with machine word size of log M bits, and space complexities will be evaluated by the number of machine words, where M is the maximum length of the input text T , i.e., N ≤ M always holds. Bit-oriented evaluation of space complexities can be obtained with a log M multiplicative factor. Since commodity computers nowadays have a word size of 64 bits, our RAM model can handle the text of maximal length 264 , which is enough size in practice. Therefore our assumption is reasonable. 2.1. Straight-line programs A straight-line program (SLP) is a context-free grammar in the Chomsky normal form that generates a single string. Formally, an SLP that generates T is a quadruple S = (Σ , V , D, S), such that Σ is an ordered alphabet of terminal characters; V = {X1 , . . . , Xn } is a set of positive integers, called variables; D = {Xi → expr i }ni=1 is a set of deterministic productions (or assignments) with each expr i being either of form Xℓ Xr (1 ≤ ℓ, r < i), or a single character a ∈ Σ ; and S = Xn ∈ V is the start symbol which derives the string T . In this paper, we use variables as integers. We also assume that the grammar neither contains redundant variables (i.e., there is at most one assignment whose righthand side is expr) nor useless variables (i.e., every variable appears at least once in the derivation tree of S ). The size of S is the number n of productions in D. See also Example 1. Notation. Let val(X ) be the derived string by X for an X ∈ V , and let val+ (y) = val(y[1]) · · · val(y[|y|]) for any variable sequence y ∈ V + . For two variables Xi , Xj ∈ V , we say that Xi occurs at position c in Xj if Xi derives val(Xj )[c ..c + |val(Xi )| − 1] in the derivation tree of Xj . We define the function vOcc(Xi , Xj ) which returns all positions of Xi in the derivation tree of Xj . For a variable Xi → Xℓ Xr ∈ D and a string P, we say that Xi stabs P at position 1 ≤ j ≤ |P | if (val(Xℓ ), val(Xr )) stabs P at position j. Let pOcc S (P , j) be the set of variables stabbing P at position j in S . Since an occurrence of P (of length at least 2) is stabbed by a variable in the derivation tree of S , the following observation holds. Observation 1 (E.g. [9]). Occ(P , T ) = where |P | > 1.
⋃
1≤i<|P |
{k + |val(Xℓ )| − i | X ∈ pOcc S (P , i), X → Xℓ Xr ∈ D, k ∈ vOcc(X , S)} holds,
See also Example 2. Example 1 (SLP). Let S = (Σ , V , D, S) be the SLP such that Σ = {A, B, C }, V = {X1 , . . . , X11 }, D = {X1 → A, X2 → B, X3 → C , X4 → X3 X1 , X5 → X4 X2 , X6 → X5 X5 , X7 → X2 X3 , X8 → X1 X2 , X9 → X7 X8 , X10 → X6 X9 , X11 → X10 X6 }, S = X11 . Then the derivation tree of S represents T = CABCABB CABCABCAB. Example 2 (Stab). Let S be the SLP of Example 1 and P = BCAB. Then Occ(P , T ) = {3, 7, 10, 13}. On the other hand, X6 , X9 and X11 stab P at position 1, 2 and 1 respectively. Hence pOcc S (P , 1) = {X6 , X11 }, pOcc S (P , 2) = {X9 } and pOcc S (P , 3) = ∅. Occ(P , T ) = {3, 7, 10, 13} since vOcc(X6 , S) = {1, 11}, vOcc(X9 , S) = {7}, and vOcc(X11 , S) = {1}. See also Fig. 3. 2.1.1. Repetitive straight-line programs We define repetitive SLPs (RSLPs), as an extension to SLPs, which allow run-length encodings in the righthand sides of productions, i.e., D might contain a production X → Xˆ k , where (Xˆ , k) ∈ V × N . We naturally extend some notations for SLPs to RSLPs. The size of the RSLP is still the number of productions in D as each production can be encoded in constant space. Similarly, the DAG of the RSLP can be represented in linear space. Example 3 is the example of an RSLP and Fig. 1 shows the derivation tree and DAG of the RSLP. Example 3 (RSLP). Let S = (Σ , V , D, S) be the RSLP such that Σ = {A, B, C }, V = {X1 , . . . , X10 }, D = {X1 → A, X2 → B, X3 → C , X4 → X3 X1 , X5 → X4 X2 , X6 → X54 , X7 → X2 X3 , X8 → X1 X2 , X9 → X7 X8 , X10 → X6 X9 }, and S = X10 . Then the derivation tree of S represents T = CABCABCABCABBC AB. See also Fig. 1. 2.2. 2D range reporting query Let X and Y denote subsets of two ordered sets, and let p ∈ X × Y be a point on the two-dimensional plane. Sometimes we call p a 2D point. We consider following operations for a set of 2D points R ⊆ X × Y , where |X |, |Y | = O(|R|).
• insert R (p, xpred , ypred ): given a point p = (x, y), xpred = max{x′ | x′ ∈ X , x′ ≤ x} and ypred = max{y′ | y′ ∈ Y , y′ ≤ y}, insert p into R, x after xpred , and y after ypred .
• deleteR (p): given a point p = (x, y) ∈ R, delete p from R, x from X , and y from Y . • report R (x1 , x2 , y1 , y2 ): given a rectangle (x1 , x2 , y1 , y2 ) with x1 , x2 ∈ X and y1 , y2 ∈ Y , returns {(x, y) ∈ R | x1 ≤ x ≤ x2 , y1 ≤ y ≤ y2 }. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
5
Fig. 1. (1) The left figure is the derivation tree of RSLP S of Example 3, which derives the string T . Variables are represented by positive integers. The red rectangles on T represent all occurrences of P = BCA in T . (2) The right figure is the DAG for S of Example 3. In the DAG, the black and red arrows represent X → Xℓ Xr and X → Xˆ k respectively . (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
A data structure for R is called dynamic 2D range reporting data structure if it supports above all operations. In this paper, we use the following data structure. Lemma 1 ([7]). There exists a dynamic 2D range reporting data structure which supports report R (x1 , x2 , y1 , y2 ) in O(log |R| + occ(log |R|/log log |R|)) time, and insert R (p, i, j), deleteR (p) in amortized O(log |R|) time, where occ is the number of the elements to output. This structure uses O(|R|) space.2 2.2.1. Range reporting queries for SLPs For an SLP S = (Σ , V , D, S), we can compute pOcc S (P , i) by a single reporting query on the set of 2D points representing variables in S . To explain this fact simply, we define some notations as follows. Let RS = {(val(Xℓ )R , val(Xr )) | X → Xℓ Xr ∈ V }. Then, we can regard each variable X → Xℓ Xr of S as a 2D point on the two-dimensional plane defined by XS = {val(X )R | X ∈ V } and YS = {val(X ) | X ∈ V }, where elements in XS and YS are sorted by lexicographic order. For a string P, let yPpred (resp. yPsucc ) denote the lexicographically smallest (resp. largest) element in YS that has P as a prefix. Namely, the strings of the range [yPpred ..yPsucc ] in YS have P as a prefix. Similarly, let xPpred (resp. xPsucc ) denote the lexicographically P [..i]
P [..i]
smallest (resp. largest) element in XS that has P R as a prefix. Then for any variable X → Xℓ Xr such that Xℓ is on [xpred ..xsucc ] and Xr is on [
P [i+1..] ypred
..
P [i+1..] ysucc
], X stabs P at position i. Hence the following observation holds.
Observation 2 (E.g. [9]). For a string P of length at least 2 and an integer 1 ≤ i < |P |, the following equation holds. P [..i]
P [i+1..]
P [..i] , ypred pOcc S (P , i) = report RS (xpred , xsucc
P [i+1..] , ysucc ).
See also Example 4. Example 4. Let S be the SLP of Example 1. Then, XS = {x1 , x4 , x2 , x8 , x5 , x9 , x6 , x11 , x10 , x3 , x7 }, YS = {y1 , y8 , y2 , y7 , y9 , y3 , y4 , y5 , y6 , y10 , y11 },
where xi = val(Xi )R , yi = val(Xi ) for any Xi ∈ V . See also Figs. 2 and 3. 2.2.2. 2D range minimum query We say that a 2D point p is a weighted 2D point if p has an integer representing the weight, namely p ∈ X × Y × N . Let R ⊆ X × Y × N be a set of weighted 2D points, where |X |, |Y | = O(|R|). An RmQ R (x1 , x2 , y1 , y2 ) query returns the weight of the lightest point in the rectangle represented by (x1 , x2 , y1 , y2 ) on R, namely, min{d | (x, y, d) ∈ report R (x1 , x2 , y1 , y2 )}. In this paper, we use the following lemma to support RmQ queries. Lemma 2 ([1]). Consider n weighted 2D points R on a two-dimensional plane. There exists a data structure which supports an RmQ R query in O(log2 n) time, occupies O(n) space, and requires O(n log n) time to construct. 2 The original problem considers a real plane in the paper [7], however, his solution only requires to compare any two elements in R in constant time. Hence we can apply his solution to our range reporting problem by maintaining X and Y using the data structure of order maintenance problem proposed by Dietz and Sleator [11], which enables us to compare any two elements in a list and insert/delete an element to/from the list in constant time. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
6
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
Fig. 2. The figure shows strings in XS and YS of Example 4.
Fig. 3. The left figure is the derivation tree of SLP S of Example 1, which derives the string T . Variables are represented by positive integers. The red rectangles on T represent all occurrences of P = BCAB in T . The right grid represents the relation between XS , YS and RS of Example 4. The red P [..i] P [..i] P [i+1..] P [i+1..] P [..i] P [..i] P [i+1..] P [i+1..] rectangle on the grid is a query rectangle (xpred , xsucc , ypred , ysucc ), where i = 1, xpred = x2 , xsucc = x10 , ypred = y5 and ysucc = y11 . Therefore, P [..i]
P [..i]
P [i+1..]
report RS (xpred , xsucc , ypred
[i+1..] , yPsucc ) = {X6 , X11 }.
3. Signature encoding and locally consistent parsing In this section, we explain locally consistent parsing and signature encodings. Locally consistent parsing [28] is a factorization of a run-free string s, i.e., |RLE(s)| = |s|. This factorization has a property such that inner factors of common substrings are same because the form of each factor is determined only by a short substring close to the factor. The factorization is based on the technique called the alphabet reduction. We consider integer array s′ [1..|s|] obtained as follows: For any position i, we look at the bit representations of s[i − 1] and s[i] (let s[0] be a special character not occurring in s), and let s′ [i] = 2l + a, where l is the length of the longest match in the least significant bits of s[i − 1] and s[i] and a is the first mismatched bit in s[i]. By definition, s′ is still run-free and each character is represented by an integer in {0, . . . , 2⌊log |Σ |⌋ + 1}, i.e., the alphabet size is logarithmically reduced. See also Fig. 4. Applying this procedure recursively log∗ |Σ | + 2 times to s, we obtain integer array s6 [1..|s|] over the alphabet {0, 1, 2, 3, 4, 5} of size 6 while keeping the run-free property (Lemma 2 of [28]). Note that s6 [i] depends only on s[i − log∗ |Σ | − 2..i]. We can further reduce the alphabet into {0, 1, 2} by, for each c ∈ {3, 4, 5}, replacing each occurrence of c with the smallest integer in {0, 1, 2} that is not in the neighbors. Let s3 denote the resulting array. Since the replacement of each c ∈ {3, 4, 5} introduces additional dependencies to its neighbors, s3 [i] depends on s[i − log∗ |Σ | − 5..i + 3]. Now let b[1..|s|] be a bit sequence such that for 3 < i < |s| − 2, b[i] = 1 iff s3 [i] is local maximum (i.e., s3 [i − 1] < s3 [i] > s3 [i + 1]). By the definition of s3 [i], b[4..|s| − 3] satisfies the condition that there is no run of 1 and no run of 0 longer than 3. The first and last three bits of b[1..|s|] are defined rather arbitrarily so that b[1..|s|] satisfies the same condition; let b[1..2] = 10, b[|s| − 1..|s|] = 00, b[3] = 1 iff b[4] = 0, and b[|s| − 2] = 1 iff b[|s| − 3] = 0. Remark that the value b[i] depends on s[i − ∆L ..i + ∆R ], where ∆L = log∗ |Σ | + 6 and ∆R = 4. Namely, b[i] = b[i′ ] holds for any i′ with s[i − ∆L ..i + ∆R ] = s[i′ − ∆L ..i′ + ∆R ]. Given a run-free string s of length at least two, we can compute the bit sequence b in linear time using a precomputed table of size o(log |Σ |), which can be computed in o(log |Σ |) time [2]. Using the bit sequence b defined above, the locally consistent parsing f1 , . . . , fd of s is defined as follows: fi = s[pi ..pi+1 − 1] for 1 ≤ i ≤ d, where pi is the position of the ith 1 in b, pd+1 = |s|+ 1, and d is the number of 1 in b. In other words, pi represents the starting position of fi for 1 ≤ i ≤ d. This factorization has the following properties by b: (1) The length of each factor is from 2 to 4 if |s| > 1. (2) For two occurrences i, i′ ∈ Occ(P , s) of a string P, let px , . . . , py (resp. px′ , . . . , py′ ) be starting positions of factors on s[i + ∆L ..i + |P | − 1 − ∆R ] (resp. s[i′ + ∆L ..i′ + |P | − 1 − ∆R ]), i.e., x (resp. x′ ) is the minimum integer such that i + ∆L ≤ px (resp. i′ + ∆L ≤ px′ ) and y (resp. y′ ) is the maximum integer such that py + ∆R ≤ i + |P | − 1 (resp. py′ + ∆R ≤ i′ + |P | − 1). Then fx , . . . , fy−1 = fx′ , . . . , fy′ −1 holds if y − x ≥ 1. Intuitively, ∆L + 1 and |P | − ∆R represent the Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
Fig. 4. An example of the alphabet reduction. Let s 1, 31, 0, 9, 2, 1, 24, 6.
7
= 32767, 65535, 234, 986, 0, 12345, 57401,1 and |Σ | = 65536. Then we obtain s′ =
Fig. 5. The derivation trees of a signature encoding G = (Σ , V , D, S) (left) and a signature encoding represented by an RSLP (right), where Σ = {A, B}, ∗ V = {1..24}, D = {1 → A, . . . , 24 → 23} and S = 24. Signatures are given as positive integers. Let log 4M = 4, and hence ∆L = 10 and ∆R = 4. Assuming LC (PowT0 ) = (3, 4)3 , (3, 5, 3), (4, 3, 4), (3, 4, 6, 4), (3, 4), (3, 4, 3), LC (PowT1 ) = (11, 12), (13, 14), (15, 16) and LC (PowT2 ) = (20, 21, 22) in G .
starting and ending positions of inner common factors on P occurring in s. In other words, P [∆L + 1..|P | − ∆R ] contains inner common factors. See also Example 5. Note that the second property holds between two occurrences of P in s and another run-free string s′ if s and s′ are on Σ . Example 5 (Locally Consistent Parsing). Let LC (s) be the locally consistent parsing of a run-free string s.
An example of locally consistent parsing of s, where let log∗ |Σ | = 2, and hence ∆L = 8 and ∆R = 4. Since the string in round brackets occurs twice in s, the factors on two underlined substrings are same by locally consistent parsing. 3.1. Signature encoding A signature encoding [28] of a string T is a context-free grammar G = (Σ , V , D, S) representing T by recursively applying locally consistent parsing and run-length encoding to T until a start symbol S is obtained. The derivation tree of S is balanced and consists of node sequences of same height from bottom to top. Formally, let ShrinkT0 , PowT0 . . . , ShrinkTh , PowTh = S be the sequences from bottom to top in the derivation tree. Then ShrinkT0 , . . . , PowTh are defined as follows : ShrinkTt PowTt
Assgn+ (T [1], . . . , T [|T |])
for t = 0,
Assgn+ (LC (PowTt−1 ))
for 0 < t ≤ h,
{ =
+
= Assgn
(RLE(ShrinkTt ))
for 0 ≤ t < h,
ˆ = {0, . . . , 4M }, and Assgn+ is the where h is the minimum integer satisfying |PowTh | = 1, LC is locally consistent parsing on Σ ∗ injective function which maps V or Σ to V with respect to D, that is, each input factor f ∈ V ∗ is mapped to a variable e such that e → f ∈ D. We call each variable of the signature encoding a signature and say that a node is in level t in the derivation tree of S if the node is produced by ShrinkTt or PowTt . We can also define the RSLP version of signature encodings when each factor of locally consistent parsing is represented by a binary tree instead of a single node. Fig. 5 shows the derivation trees of signature encodings represented by context-free grammars and RSLPs. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
8
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
Fig. 6. An abstract image of the common sequence of P in the derivation tree of the signature encoding of T . Black rectangles represent Uniq(P), which occurs on every occurrence of P in T and Uniq(P) represents P, i.e., val+ (Uniq(P)) = P.
We can apply locally consistent parsing to PowTt for any 1 ≤ t ≤ h because PowTt is a run-free string in V ∗ . Therefore signature encodings have the following properties : (1) The number of nodes in the derivation tree of S is less than 4N ≤ 4M. (2) ∆L = log∗ 4M + 6 and ∆R = 4 in LC . (3) h = O(log N) holds because |ShrinkTt | ≤ |PowTt−1 |/2 for any 0 < t ≤ h. (4) Let w = |V |, then w = O(min(z log N log∗ M , N)) holds [35]. (5) Since any signature represents a variable sequence of length from 2 to 4, a run of a signature or a character, every right-hand in production rules can be represented in O(1) space. Therefore we can represent G in O(w ) space. This means that G requires only compressed space. Next, we explain the most important property of the signature encoding, which ensures the existence of common short signature sequence to all occurrences of same substrings because the run-length encoding and locally consistent parsing replace same substrings with almost same factors, recursively. Formally, we define such a sequence by XShr Pt and XPowPt , which are defined as follows: Definition 1. For a substring P of T , let XShr Pt =
Assgn+ (P [1], . . . , P [|P |])
for t = 0,
Assgn+ (subseq(LC (XPowPt−1 ), ∆L , ∆R ))
for 0 < t ≤ h′ ,
{
XPowPt = Assgn+ (subseq(RLE(XShr Pt ), 1, 1)) for 0 ≤ t < h′ , where h′ is the minimum integer such that |RLE(XShr Pt )| ≤ δ = 4(∆L + ∆R ) + 2 and subseq(f , i, j) = fi+1 , . . . , fk−j for a factorization f = f1 , . . . , fk and two integers i, j such that i + j ≤ k. Note that XShr Pt+1 ̸ = ε when |RLE(XShr Pt )| > δ because |XPowPt | > 4(∆L + ∆R ), and every length of factors created by locally consistent parsing is at most 4. XShr P0 , XPowP0 , . . . , XPowPh′ −1 , XShr Ph′ represent the forest representing P because XPowPt and XShr Pt occur on XShr Pt and XPowPt−1 respectively. Let Uniq(P) be the variable sequence represented by roots of the forest. Then |RLE(Uniq(P))| = O(log |P | log∗ M) holds since h′ = O(log |P |) and |RLE(XShr Ph′ )| ≤ δ . This forest always occurs in the derivation tree of T and represents the substring T [i..i + |P | − 1] = P for any i ∈ Occ(P , T ). This is because XPowPt always occurs on XShr Pt by locally consistent parsing and XShr Pt always occurs on XPowPt−1 by run-length encoding. Therefore Uniq(P) occurs on any occurrence of P in T and we call Uniq(P) the common sequence of P.3 Figs. 6 and 7 show an abstract image and an example of Uniq(P). In summary, the following lemma holds. Lemma 3 (Common Sequences [35]). For a substring P of T , Uniq(P) has following properties: (1) Uniq(P) represents T [i..i + |P | − 1] = P in the derivation tree of S for any i ∈ Occ(P , T ), (2) |RLE(Uniq(P))| = O(log |P | log∗ M), (3) Uniq(P)[1] → P [1] ∈ D, i.e., the first signature of Uniq(P) represents P [1], and (4) for any occurrence Uniq(P) in the derivation tree of T , the number of ancestors of nodes corresponding to the occurrence is O(log N + log |P | log∗ M) [30]. We can compute Uniq(val(e)[j..j + y − 1]) for a given e and two integers j, y in O(log |val(e)| + log y log∗ M) time using the DAG of G [30]. 3.2. Dynamic signature encoding We consider a dynamic signature encoding G of T supporting INSERT (Y , i), INSERT ′ (j, y, i) and DELETE(j, y). During updates, we recompute ShrinkTt and PowTt for some part of new T , and also, we add new signatures to G and remove useless signatures from G appropriately. If we naively recompute these, it needs Ω (|T |) time. However we can update efficiently G using the following lemma. 3 The common sequences are conceptually equivalent to the cores [27] which are defined for the edit sensitive parsing of a text, a kind of locally consistent parsing of the text. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
9
Fig. 7. An example of the common sequence of P. Let ∆L = 3 and ∆R = 2 although ∆L is at least 7 and ∆R = 4 by the definition of locally consistent parsing. We assume that subseq(LC (XPowP0 ), ∆L , ∆R ) = (8, 9), (7, 8), (7, 8, 7), (8, 9, 5, 7). Yellow rectangles on P represent XShr P0 , XPowP0 , and XShr P1 , i.e., XShr P1 = 11, 12, 13, 14, XPowP0 = 8, 7, 8, . . . , 7, 9, 8, and XShr P0 = 1, 2, 1, . . . , 3, 2, 3. h′ = 1 since |RLE(XShr P0 )| = 23 > δ = 22 and |RLE(XShr P1 )| = 4 ≤ δ . Therefore Uniq(P) = 1, 8, 7, 8, 7, 8, 9, 11, 12, 13, 14, 9, 7, 9, 8, 3 and it occurs on both occurrences of P in T by Lemma 3.
Lemma 4 (Dynamic Signature Encoding [30]). Given G of size w , we can enhance G in O(w fA ) time, by adding data structures of O(w ) space, to support INSERT (Y , i) and DELETE(j, y) in O(fA (y + log N log∗ M)) time and INSERT ′ (j, y, i) in amortized O(fA log N log∗ M) time. And also, G supports computing the signature encoding of a given string P in O(|P |fA ) time and O(|P |) space using D, which means that Lemma 3 holds in the derivation tree of P. We can upper bound the number of signatures added to or removed from G after a single update operation using the following lemma.4 Lemma 5. After INSERT (Y , i) or DELETE(j, y) operation, O(y + log N log∗ M) signatures are added to or removed from G , where |Y | = y. After INSERT ′ (j, y, i) operation, O(log N log∗ M) signatures are added to or removed from G . Proof. Consider INSERT ′ (j, y, i) operation. Let T ′ = T [..i − 1]T [j..j + y − 1]T [i..] be the new text. This means that the signature encoding of T ′ is created over X = Uniq(T [..i − 1])Uniq(T [j..j + y − 1])Uniq(T [i..]) by Lemma 3. Since the number of the ancestor nodes of X is O(log N log∗ M) by Property 4 in Lemma 3, O(log N log∗ M) new signatures can be added to G . Also, O(log N log∗ M) signatures, which were created over Uniq(T [..i − 1])Uniq(T [i..]), may be removed. For INSERT (Y , i) operation, we additionally think about the possibility that O(y) signatures are added to create Uniq(Y ). Similarly, for DELETE(j, y) operation, O(y) signatures, which are used in and under Uniq(T [j..j + y − 1]), can be removed. □ We also maintain the DAG of G using O(w ) space to remove useless signatures from G . We remark that there is a small difference in our DAG representation from the one in [30]; each node e has a doubly-linked list representing the parents of the node. Therefore we can check if a signature is useless or not by checking if the list is empty or not in constant time. Since we also let every parent e′ of e have the pointer to the corresponding element in the list, the lists can be maintained in constant time after adding/removing an assignment. The DAG can also support the following queries: (1) We can compute vOcc(e, S) in O(|vOcc(e, S)| log N) time for e ∈ V since h = O(log N). (2) We can compute the LCE of two represented strings (or reversed strings) s1 and s2 for two given signatures in O(log |s1 | + log |s2 | + log ℓ log∗ M) time, where ℓ is the answer to the LCE query [31]. 4. Dynamic compressed index In this section, we show Theorem 1 by proposing our dynamic compressed index based on signature encodings. In Section 4.1, we describe the idea of our searching algorithm, introducing needed data structures. In Section 4.2, we show how to dynamize the data structures. Note that we focus on the case |P | > 1 because we can compute easily Occ(P , T ) when |P | = 1. 4.1. The idea of our searching algorithm As already mentioned in the introduction, our strategy for pattern matching is different from that of Alstrup et al. [2]. It is rather similar to the one taken in the static index for SLPs of Claude and Navarro [9]. We show that the idea can be extended to RSLPs, and hence, to signature encodings (recall that signature encodings can be represented in RSLPs). Besides, we show how to speed up pattern matching by utilizing the properties of signature encodings. We roughly explain our searching algorithm. We can compute occurrences of a given pattern P using Observation 1 on an SLP. Since Observation 1 means that any occurrence of P is stabbed by a variable at a position 1 ≤ i < |P |, we can compute 4 The property is used in [30], but there is no corresponding lemma to state it clearly. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
10
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
Fig. 8. Examples of variables stabbing P in an SLP and a signature encoding. The variable in the SLP can stab P at |P | − 1 positions but the signature can stab P at O(log |P | log∗ M) positions by the common sequence of P.
all occurrences of P by checking |P | − 1 positions using range reporting queries according to Observation 2. Since a signature encoding is a context-free grammar, we can apply this idea to the pattern search on the signature encoding of T . Besides, we can reduce the number of range queries to O(log |P | log∗ M) using properties of common sequences. The common sequence of P is represented by a signature sequence whose run-length encoded size is O(log |P | log∗ M), and it occurs on any occurrence of P in the derivation tree of T . This means that there are O(log |P | log∗ M) positions such that a signature stabs P (see also Fig. 8). Hence we can speed up pattern matching by utilizing the properties of signature encodings and dynamize our index using dynamic signature encodings and dynamic range reporting data structures. Index for SLPs. For a given SLP S , we can create a static index of the SLP using Observations 1 and 2. Namely, we can compute Occ(P , T ) by computing pOcc S using range reporting queries and all occurrences of variables in pOcc S (P) using the DAG of S , where pOcc S (P) = {(X , i) | X ∈ pOcc(P , i), 1 ≤ i < |P |}. Let pred/succ(P) query denote a query which asks for xPpred , xPsucc , yPpred and yPsucc in S for a given string P. Let qpred/succ , qreport and qv Occ denote respectively query times for computing pred/succ(P) queries, for reporting queries in RS , and for computing all occurrences of variables in pOcc S (P) for a given string P. Then the following lemma holds by Observations 1 and 2. Lemma 6 (E.g. [9]). For an SLP S representing a string T , given a pattern P of length at least 2, we can compute Occ(P , T ) in O(|P |(qpred/succ + qreport ) + qv Occ ) time. Index for RSLPs. Next, we extend the above idea for SLPs to RSLPs. The difference from SLPs is that we have to deal with occurrences of P that is ‘stabbed’ by a node labeled with X → Xˆ k ∈ D in the derivation tree of S . This ‘stab’ means that (val(Xˆ )c , val(Xˆ )k−c ) stabs P at some position for some integer 1 ≤ c < k. We can find such occurrences of P by range reporting queries, i.e., by creating 2D points representing (val(Xˆ )R , val(Xˆ )1 ), . . . , (val(Xˆ )R , val(Xˆ )k−1 ) for X → Xˆ k ∈ D and adding the points to the data structure for pred/succ(P) queries. Therefore we can also construct an index for RSLPs as that for SLPs. However, the size of this index can go beyond compressed space because the number of the 2D points can be as large as N. For example, when S → X N and X → a, the number of the 2D points is clearly N − 1. Since our goal is creating a dynamic index in compressed space, we reduce the number of 2D points using the following observation. Observation 3. For a production X → Xˆ k in D, if (val(Xˆ ), val(Xˆ )k−c ) stabs a string P at position i and |P [i + 1..]| ≤ |val(Xˆ )k−c −1 | holds for some integer 1 ≤ c < k, then (val(Xˆ ), val(Xˆ )k−c −1 ) also stabs P at position i. Roughly speaking, Observation 3 implies that X ‘stabs’ periodically P at position i. Since we can compute such periodic occurrences of P about i in constant time using |val(Xˆ )| and k if we know that (val(Xˆ ), val(Xˆ )k−1 ) stabs P at position i, it is sufficient that we create one 2D point for X → Xˆ k ∈ D. As a conclusion, we modify the definition of ‘stab’ for X → Xˆ k as follows: we say that X stabs P at position i if (val(Xˆ ), val(Xˆ )k−1 ) stabs P at position i, and we modify the definition of pOcc S (P , i) and Observation 1 appropriately. Then Lemma 6 is still valid for RSLPs. See Example 3 and Fig. 9. (val(X5 ), val(X5 )3 ), (val(X5 )2 , val(X5 )2 ) and (val(X5 )3 , val(X5 )1 ) stab P = BCA at position 1 in X6 → X54 . On the other hand, X6 located at P [..i] P [..i] P [i+1..] P [i+1..] (x5 , y35 ) = (val(X5 ), val(X5 )3 ) is captured by a query rectangle (xpred , xsucc , ypred , ysucc ), where i = 1. Index for signature encodings. For simplicity, we explain our index using the signature encoding G represented by an RSLP. Therefore, we can use the RSLPs version of Lemma 6 for G . Note that the RSLP still has properties mentioned in Section 3. Moreover, we can reduce the number of range reporting queries in Lemma 6 because P is represented by the common sequence u = Uniq(P) of run-length O(log |P | log∗ M). The following lemma is stronger than Observation 1, and this is a new property of signature encodings. Lemma 7. pOcc G (P , i) = ∅ holds for i ∈ [1..|P | − 1] \ P , where P is defined as follows :
{ P =
{|val+ (u[1..i])| | 1 ≤ i < |u|, u[i] ̸= u[i + 1]} {1}
if |RLE(u)| > 1, otherwise.
Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
11
Fig. 9. The grid represents the relation between XS , YS and RS of the RSLP S in Example 3 as the grid in Fig. 3. xki and yki represent ((val(Xi ))k )R and P [..i]
P [..i]
P [i+1..]
(val(Xi ))k respectively. The red rectangle on the grid is a query rectangle (xpred , xsucc , ypred P [i+1..]
ypred
[i+1..] [..i] P [..i] P [i+1..] P [i+1..] = y4 and yPsucc = y10 . Therefore, report RS (xPpred , xsucc , ypred , ysucc ) = {X6 }.
[i+1..] P [..i] P [..i] , yPsucc ), where i = 1, P = BCA, xpred = x2 , xsucc = x10 ,
P represents starting positions of factors of RLE(u) on P when |RLE(u)| > 1. Note that |P | = O(|RLE(Uniq(P))|) = O(log |P | log∗ M). To show Lemma 7, we use the following observation and lemma.
Observation 4. If a signature e stabs P at position j, then j = |val+ (u[..i])| for some integer i (1 ≤ i < |u|). Proof. Recall that the substring P stabbed by a signature e at position j is represented by u for any (e, j) ∈ pOcc G (P) by Lemma 3 because the derivation tree of e is a subtree of the derivation tree of G . In other words, u is derived by e. Hence Observation 4 holds by the definition of the ‘stab’. □ Lemma 8. If a signature e stabs P at position j = |val+ (u[..i])| with u[i] = u[i + 1] for some integer i (1 ≤ i < |u|), then e → eˆ k ∈ D, eˆ = u[i], i = 1. In other words, P = a|P | , j = 1, and e represent ak for some character a ∈ Σ . Proof. Recall that u is derived by e. u[i] and u[i + 1] are in ShrinkTt for some level t because u[i] = u[i + 1] and PowTt is a run-free string. Hence u[i] and u[i + 1] are encoded into the same signature by RLE in the derivation tree of e, and that the parent of two nodes corresponding to u[i] and u[i + 1] has a signature e′ in the form e′ → u[i]k ∈ D. Since e stabs P at position j = |val+ (u[..i])|, e = e′ holds. Hence i = 1 holds by the definition of the ‘stab’ for e → eˆ k . Note that u[1] → P [1] by Lemma 3. □ We show Lemma 7. Proof of Lemma 7. Note that pOcc G (P , i) = ∅ clearly holds for i ∈ [1..|P | − 1] \ P ′ by Observation 4, where P ′ = {|val+ (u[1])|, . . . , |val+ (u[..|u| − 1])|}. And also pOcc G (P , i) = ∅ for any i ∈ P ′ with u[i] = u[i + 1] and i > 1 by Lemma 8. If |RLE(u)| = 1, then P = a|P | for some character a ∈ Σ by Lemma 3. Hence pOcc G (P , i) = ∅ directly holds for 1 < i < |P |. We focus on the case |RLE(u)| > 1. We assume that pOcc G (P , 1) ̸ = ∅ holds with u[1] = u[2]. This means that |RLE(u)| = 1 by Lemma 8, which contradicts |RLE(u)| > 1. Therefore the statement holds. □ Lemma 7 means that all we have to do is to compute pOcc(P , i) for i ∈ P . Note that for computing P , we need to compute Uniq(P) in O(|P |fA + log |P | log∗ M) time using Lemma 4. In addition, we can compute vOcc(e, S) using the DAG of G and get qvOcc = O(occ log N). Hence the following lemma holds by Lemmas 6 and 7. Lemma 9. Let G be a dynamic signature encoding of T using Lemma 4. Given a pattern P, we can compute Occ(P , T ) in O(|P |fA + log |P | log∗ M + |P |(qreport + qpred/succ ) + occ log N) time. 4.2. Dynamic compressed index for signature encodings In this section, we propose our dynamic compressed index based on Lemma 9. Since the signature encoding G can be dynamized using Lemma 4, our remaining tasks are considering dynamic data structures that support pred/succ(P) queries, and reporting queries on 2D points RG . We can construct a dynamic data structure supporting pred/succ(P) queries using the following lemma. Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
12
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
Lemma 10. There exists a data structure of size O(w ) which can compute pred/succ(P) for a pattern P given as a substring of a signature e ∈ V in O(log w (log N + log |e| log∗ M)) time. For a given signature e added to/removed from G , xepred , and yepred , this data structure can be updated in O(log w ) time, where xepred = max{x′ ∈ XG | x′ ≤ val(e)R } and yepred = max{y′ ∈ YG | y′ ≤ val(e)}. Proof. Consider two self-balancing search trees for XG and YG . We can efficiently compute pred/succ(P) using LCE queries on G and binary search on these trees. And also, for a given signature e added to/removed from G , we can update XG and YG in O(log w ) time using xepred and yepred . Hence Lemma 10 holds. □ Next, we can immediately get the following lemma by applying Lemma 1 to RG . Lemma 11. There exists a data structure of size O(w ) which can compute report RG (x1 , x2 , y1 , y2 ) in O(log w+occ(log w/log log w )) eℓ er time. For a given signature e → eℓ er added to/removed from G , xpred , and ypred , this data structure can be updated in amortized O(log w ) time. Note that we can update this data structure in amortized O(log w ) time for a given e → eˆ k added to/removed from G . Finally, considering a dynamic index of size O(w ) which consists of a dynamic signature encoding G and two data structures of Lemmas 10 and 11, we get Theorem 1. Proof of Theorem 1. By Lemma 10, we get qpred/succ = O(log w (log N + log |P | log∗ M)) for P [..i] and P [i + 1..], where 1 ≤ i < |P |. By Lemma 11, we get qreport = O(log w + occ(log w/log log w )). Hence, by Lemma 9, our dynamic index supports FIND queries in O(|P |fA + log w log |P | log∗ M(log N + log |P | log∗ M) + occ log N) time. Next, we consider the update of our dynamic index. For a given signature e → eℓ er added to/removed from G , we can eℓ er compute xepred , yepred , xpred and ypred in O(log w log N log∗ M) time using Lemma 10. It is the same for a given e → eˆ k . Since we can update G by Lemma 4, the remaining task is to update the data structures of Lemmas 10 and 11 for G . Since the number of signatures added to or removed from G during a single update operation is upper bounded by Lemma 5, we can get the desired update time from Lemmas 4, 10 and 11. □ 5. LZ77 factorization in compressed space In this section, we show Theorem 2 using signature encodings represented by RSLPs. Let LZ77(T ) = f1 , . . . , fz and each fi can be represented by the pair (xi , |fi |), where xi is an occurrence position of fi in f1 · · · fi−1 . By modifying our dynamic index, we can construct an index of T that can answer the leftmost occurrence of a given pattern efficiently, and then, we compute (xi , |fi |) incrementally using the index. Formally, let Fst(j, k) = i be the leftmost occurrence of T [j..j + k − 1], i.e., the minimum integer i such that i < j and T [i..i + k − 1] = T [j..j + k − 1] if it exists. Our algorithm is based on the following fact: Fact 1. Given f1 , . . . , fi−1 (2 ≤ i ≤ z), we can compute fi with O(log |fi |) calls of Fst(j, k) (by doubling the value of k = 1, followed by a binary search), where j = |f1 · · · fi−1 | + 1. We compute Fst(j, k) using the following lemma. Lemma 12. Given a signature encoding G = (Σ , V , D, S) of T , we can construct a data structure of O(w ) space in O(w log w log N log∗ M) time to support queries Fst(j, k) in O(log w log k log∗ M(log N + log k log∗ M)) time. Proof. Let P = T [j..j + k − 1]. Consider a signature e and integer i such that e stabs P at position i. Although there exist many nodes representing e in the derivation tree of G , only the leftmost one can contain the leftmost occurrence of P in T . We compute Fst(j, k) using this idea. Formally, let e. min = min vOcc(e, S) + |eℓ | for a signature e → eℓ er ∈ D. We define e. min for e → eˆ k ∈ D in a similar way. We also define FstOcc(P , i) for a string P and an integer i as follows: FstOcc(P , i) = min{e. min | e ∈ pOcc G (P , i)} Then Fst(j, k) can be represented by FstOcc(P , i) as follows: Fst(j, k) = min{FstOcc(T [j..j + k − 1], i) − i | i ∈ {1, . . . , k − 1}}
= min{FstOcc(T [j..j + k − 1], i) − i | i ∈ P }, where P is the set of integers in Lemma 7 with P = T [j..j + k − 1]. Hence we can compute Fst(j, k) efficiently by computing P [..i] P [..i] P [i+1..] P [i+1..] FstOcc(P , i). Fortunately, we can compute FstOcc(P , i) by an RmQ (xprec , xsucc , yprec , ysucc ) query of Lemma 2 because we can regard a signature e as a weighted 2D point of weight e. min (recall that we regarded e as a 2D point in Section 4.1). P [..i] P [..i] P [i+1..] P [i+1..] Note that we already described how to compute xprec , xsucc , yprec and ysucc , and also, we can compute Uniq(P) in ∗ O(log N +log k log M) time using the DAG for G . Hence we can compute Fst(j, k) in O(log k log∗ M(log w (log N +log k log∗ M)+ log2 w )) = O(log w log k log∗ M(log N + log k log∗ M)) time. For construction, we first compute e. min in O(w ) time for all e ∈ V using the DAG of G . Next, given G , XG (and YG ) can be sorted in O(w log w log N log∗ M) time by LCE algorithm on G and a standard comparison-based sorting. Finally, we build the data structure of Lemma 2 in O(w log w ) time. □ Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
13
Theorem 2 immediately follows Fact 1 and Lemma 12. We remark that we can similarly compute the LZ77 factorization with self-reference of a text (defined below) in the same time and same working space. Definition 2 (Lempel–Ziv77 Factorization with Self-reference). LZ77 factorization of a string T with self-references is a factorization f1 , . . . , fz of s such that f1 = T [1], and for 1 < i ≤ z, if the character T [|f1 ..fi−1 | + 1] does not occur in T [|f1 ..fi−1 |], then fi = T [|f1 ..fi−1 | + 1], otherwise fi is the longest prefix of fi · · · fk which occurs at some position p, where 1 ≤ p ≤ |f1 · · · fi−1 |. Acknowledgments We would like to thank Paweł Gawrychowski for drawing our attention to the work by Alstrup et al. [2,3] and for fruitful discussions. Tomohiro I was supported by JSPS KAKENHI Grant Number 16K16009. References [1] P.K. Agarwal, L. Arge, S. Govindarajan, J. Yang, K. Yi, Efficient external memory structures for range-aggregate queries, Comput. Geom. 46 (3) (2013) 358–370. [2] S. Alstrup, G.S. Brodal, T. Rauhe, Dynamic pattern matching, Tech. rep., Department of Computer Science, University of Copenhagen, 1998. [3] S. Alstrup, G.S. Brodal, T. Rauhe, Pattern matching in dynamic texts, in: Proceedings of SODA, 2000, pp. 819–828. [4] P. Beame, F.E. Fich, Optimal bounds for the predecessor problem and related problems, J. Comput. System Sci. 65 (1) (2002) 38–72. [5] D. Belazzougui, F. Cunial, T. Gagie, N. Prezza, M. Raffinot, Composite repetition-aware data structures, in: Proceedings of Combinatorial Pattern Matching, 2015, pp. 26–39. [6] D. Belazzougui, S.J. Puglisi, Range predecessor and Lempel-Ziv parsing, in: Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2016, pp. 2053–2071. [7] G.E. Blelloch, Space-efficient dynamic orthogonal point location, segment intersection, and range reporting, in: Proceedings of SODA, 2008, pp. 894– 903. [8] A.R. Christiansen, M.B. Ettienne, Compressed indexing with signature grammars, in: Proceedings of 13th Latin American Symposium, 2018, pp. 331– 345. [9] F. Claude, G. Navarro, Self-indexed grammar-based compression, Fund. Inform. 111 (3) (2011) 313–337. [10] F. Claude, G. Navarro, Improved grammar-based compressed indexes, in: Proceedings of String Processing and Information Retrieval, 2012, pp. 180– 192. [11] P.F. Dietz, D.D. Sleator, Two algorithms for maintaining order in a list, in: Proceedings of the 19th Annual ACM Symposium on Theory of Computing, 1987, pp. 365–372. [12] A. Ehrenfeucht, R.M. McConnell, N. Osheim, S.W. Woo, Position heaps: A simple and dynamic text indexing data structure, J. Discrete Algorithms 9 (1) (2011) 100–121. [13] J. Fischer, T. Gagie, P. Gawrychowski, T. Kociumaka, Approximating LZ77 via small-space multiple-pattern matching, in: ESA 2015, 2015, pp. 533–544. [14] J. Fischer, T. I, D. Köppl, K. Sadakane, Lempel-Ziv factorization powered by space efficient suffix trees, Algorithmica 80 (7) (2018) 2048–2081. [15] T. Gagie, P. Gawrychowski, J. Kärkkäinen, Y. Nekrich, S.J. Puglisi, A faster grammar-based self-index, in: Proceedings of LATA, 2012, pp. 240–251. [16] T. Gagie, P. Gawrychowski, J. Kärkkäinen, Y. Nekrich, S.J. Puglisi, LZ77-based self-indexing with faster pattern matching, in: Proceedings of LATIN 2014, 2014, pp. 731–742. [17] T. Gagie, P. Gawrychowski, S.J. Puglisi, Approximate pattern matching in LZ77-compressed texts, J. Discrete Algorithms 32 (2015) 64–68. [18] P. Gawrychowski, A. Karczmarz, T. Kociumaka, J. Lacki, P. Sankowski, Optimal dynamic strings, CoRR abs/1511.02612. URL http://arxiv.org/abs/1511. 02612, 2015. [19] P. Gawrychowski, A. Karczmarz, T. Kociumaka, J. Lacki, P. Sankowski, Optimal dynamic strings, in: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 2018, pp. 1509–1528. [20] K. Goto, S. Maruyama, S. Inenaga, H. Bannai, H. Sakamoto, M. Takeda, Restructuring compressed texts without explicit decompression, CoRR abs/1107.2729, 2011. [21] W.K. Hon, T.W. Lam, K. Sadakane, W. Sung, S.M. Yiu, Compressed index for dynamic text, in: Proceedings of Data Compression Conference, 2004 pp. 102–111. [22] D. Kempa, D. Kosolobov, LZ-end parsing in compressed space, in: Proceedings of Data Compression Conference, 2017, pp. 350–359. [23] D. Kempa, N. Prezza, At the roots of dictionary compression: String attractors, in: Proceedings of ACM Symposium on the Theory of Computing (STOC), 2018, URL http://arxiv.org/abs/1710.10964. [24] D. Köppl, K. Sadakane, Lempel-Ziv computation in compressed space (LZ-CICS), in: Proceedings of Data Compression Conference, 2016, pp. 3–12. [25] S. Kreft, G. Navarro, LZ77-like compression with fast random access, in: Proceedings of Data Compression Conference, 2010, pp. 239–248. [26] A. Lempel, J. Ziv, On the complexity of finite sequences, IEEE Trans. Inf. Theory 22 (1) (1976) 75–81. [27] S. Maruyama, M. Nakahara, N. Kishiue, H. Sakamoto, ESP-index: A compressed index based on edit-sensitive parsing, J. Discrete Algorithms 18 (2013) 100–112. [28] K. Mehlhorn, R. Sundar, C. Uhrig, Maintaining dynamic sequences under equality tests in polylogarithmic time, Algorithmica 17 (2) (1997) 183–198. [29] J.I. Munro, Y. Nekrich, J.S. Vitter, Dynamic data structures for document collections and graphs, CoRR abs/1503.05977, 2015. [30] T. Nishimoto, T. I, S. Inenaga, H. Bannai, M. Takeda, Fully dynamic data structure for LCE queries in compressed space, CoRR abs/1605.01488, 2016. [31] T. Nishimoto, T. I, S. Inenaga, H. Bannai, M. Takeda, Fully dynamic data structure for LCE queries in compressed space, in: Proceedings of 41st International Symposium on Mathematical Foundations of Computer Science, 2016, pp. 72:1–72:15, URL http://dx.doi.org/10.4230/LIPIcs.MFCS.2016. 72. [32] A. Policriti, N. Prezza, Fast online Lempel-Ziv factorization in compressed space, in: Proceedings of String Processing and Information Retrieval, 2015, pp. 13–20. [33] A. Policriti, N. Prezza, LZ77 computation based on the run-length encoded BWT, Algorithmica 80 (7) (2018) 1986–2011. [34] N. Prezza, Compressed Computation for Text Indexing, Ph.D. thesis (Ph.D. thesis), University of Udine, 2016. [35] S.C. Sahinalp, U. Vishkin, Data compression using locally consistent parsing, Technical report, University of Maryland Department of Computer Science, 1995. [36] S.C. Sahinalp, U. Vishkin, Efficient approximate and dynamic matching of patterns using a labeling paradigm (extended abstract), in: Proceedings of FOCS, 1996, pp. 320–328.
Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.
14
T. Nishimoto, T. I, S. Inenaga et al. / Discrete Applied Mathematics xxx (xxxx) xxx
[37] H. Sakamoto, S. Maruyama, T. Kida, S. Shimozono, A space-saving approximation algorithm for grammar-based compression, IEICE Trans. 92-D (2) (2009) 158–165. [38] M. Salson, T. Lecroq, M. Léonard, L. Mouchard, Dynamic extended suffix arrays, J. Discrete Algorithms 8 (2) (2010) 241–257. [39] Y. Takabatake, Y. Tabei, H. Sakamoto, Improved ESP-index: Apractical self-index for highly repetitive texts, in: Proceedings of Experimental Algorithms - 13th International Symposium, 2014, pp. 338–350. [40] Y. Takabatake, Y. Tabei, H. Sakamoto, Online self-indexed grammar compression, in: Proceedings of String Processing and Information Retrieval, 2015, pp. 258–269. [41] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory IT-23 (3) (1977) 337–349.
Please cite this article as: T. Nishimoto, T. I, S. Inenaga et al., Dynamic index and LZ factorization in compressed space, Discrete Applied Mathematics (2019), https://doi.org/10.1016/j.dam.2019.01.014.