Bidirectional data verification for cloud storage


Mohammad Iftekhar Husain a,*, Steven Y. Ko b, Steve Uurtamo b, Atri Rudra b, Ramalingam Sridhar b

a Department of Computer Science, California State Polytechnic University, Pomona, CA, United States
b Department of Computer Science & Engineering, University at Buffalo, The State University of New York, NY, United States


Article history: Received 13 September 2013 Received in revised form 25 February 2014 Accepted 6 July 2014 Available online 23 July 2014

Abstract. This paper presents a storage enforcing remote verification scheme, PGV (Pretty Good Verification), as a bidirectional data integrity checking mechanism for cloud storage. At its core, PGV relies on the well-known polynomial hash; we show that the polynomial hash provably possesses the storage enforcement property and is also efficient in terms of performance. In addition to the traditional application of a client verifying the storage content at a remote server, PGV can also be applied to de-duplication scenarios where the server wants to verify whether the client possesses a significant amount of information about a file (and not just partial knowledge or a fingerprint of the file) before granting access to an existing file. While existing schemes are often developed to handle a malicious adversarial model, we argue that such a model is often too strong an assumption, resulting in over-engineered, resource-intensive mechanisms. Instead, the storage enforcement property of PGV aims at removing a practical incentive for a storage server to cheat in order to save on storage space, in a covert adversarial model. We theoretically prove the power of PGV by combining Kolmogorov complexity and list decoding, and experimentally show the simplicity and low overhead of PGV by comparing it with existing schemes. Altogether, PGV provides a good, practical way to perform storage enforcing remote verification. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Storage enforcement; Proof of retrievability; Cloud storage; Proof of data possession; Proof of ownership

1. Introduction

The general structure of the cloud storage data integrity verification problem is the following: a verifier (client) (i) uploads its data to a remote prover (cloud storage), (ii) deletes the local copy of the data (to save on storage), and (iii) at some later point, tries to verify that the prover is storing the data correctly (assurance of integrity) and that the data can be retrieved when necessary (retrievability). This verification should take place without retrieving the complete data from the remote storage (to save on bandwidth). Existing hash functions such as MD5 and SHA1 can provide integrity but fail to provide retrievability. This motivated some interesting contributions to this domain, such as proof of data possession (PDP) and proof of retrievability (POR) schemes (Ateniese et al., 2007; Bowers et al., 2009a; Golle et al., 2002; Juels and BSK, 2007; Schwarz and Miller, 2006; Wang C et al., 2009; Wang Q et al., 2009). Existing PDP and POR schemes consider a malicious adversarial model (Canetti, 2006), which translates to assuming that legitimate cloud storage providers such as Amazon will behave



arbitrarily in order to tamper with clients' data. In order to satisfy this strict adversarial model, existing schemes are complex (including modification of the original data; Bowers et al., 2009a; Wang C et al., 2009; Golle et al., 2002) and inefficient in both time and space (Ateniese et al., 2007; Golle et al., 2002; Juels and BSK, 2007; Wang Q et al., 2009). The primary motivation for this paper is to show that if we relax this strict adversarial model and focus instead on a more practical one, we can develop a simpler, far more lightweight verification scheme. Therefore, instead of considering a malicious adversarial model, we consider a covert adversarial model (Aumann and Lindell, 2010). This model assumes that the adversary is willing to cheat if (i) it has some incentive and (ii) it will not be caught. It nicely captures many real-world scenarios, including remote storage verification, where there is a practical incentive for a provider to cheat in order to save on storage as long as it is not caught. This is realistic since storage is the main commodity for storage providers: a storage provider typically charges its clients primarily for the amount of storage that each client uses. Also, the provider incurs a significant cost in handling and managing the storage for its clients, such as the costs for hard drives, storage area networking, and power consumption (Moore et al., 2007; Allalouf et al., 2009). Thus, there is a practical incentive for a provider to cheat in order to save on storage.


To remove this incentive, a client has to be able to verify that the storage provider is actually committing as much storage space as the amount of data that the client requested to store. In other words, we need a verification scheme with the property of storage enforcement. With this verification, clients can safely assume that they are rightfully paying for the service, i.e., the amount of server-side storage they are getting. In addition, clients should be able to do this without asking the server to return the data it claims to store, to avoid communication overhead. The main contribution of this paper is a storage enforcing remote verification scheme, PGV (Pretty Good Verification). Storage enforcement means, roughly, that in order to pass our verification a prover (cloud storage) has to commit as much storage space as the information content of the original data; this removes storage saving as an incentive for the prover to cheat. More specifically, if the prover passes our verification of the original data x with probability ε > 0, then the prover has to store C(x) bits of data, up to a very small additive factor. C(x) is the plain Kolmogorov complexity of x, which is the size of the smallest algorithmic description of x (Li and Vitanyi, 2008; Kolmogorov, 1965). The reason we enforce C(x) instead of |x| is that we cannot prevent a prover from compressing x; C(x) is a natural way to represent the amount of information stored in x. PGV is built upon the polynomial hash (Bierbrauer et al., 1993; Freivalds, 1977). We show that it has the storage enforcement property, i.e., it can be used to verify that a remote party actually stores as much data as it claims to store; in addition, we show that it has a weaker form of the proof of retrievability property, i.e., it can be used to verify that a remote party has enough information to recreate the data that it claims to store with good probability. Most importantly, the polynomial hash provides these properties with a significant performance benefit, as it is just a simple hash function and, unlike existing schemes, it does not modify the original data. Simply put, PGV works as follows: the client (i) picks a random number (our key) β ≠ 0 from a Galois field (Plank, 2003), (ii) divides the data block into equal-sized symbols S_0, …, S_{k−1}, where the symbol size equals the field-element size, and (iii) computes the polynomial hash H_c = Σ_{i=0}^{k−1} S_i β^i. The client then stores β and H_c locally, sends the data to the cloud server and deletes the local copy of the data. To verify a data block, the client sends β to the server. In return, the server computes and sends back the hash value H_s of the data block using the polynomial hash function and β. If H_s equals the H_c stored locally at the client, the client declares that the verification is successful. For clarity of presentation, we illustrate our storage enforcement scheme with only one scenario, where a client verifies a remote server (or multiple servers). However, our scheme is much more flexible and applicable to many scenarios; as we detail in Section 2, it is applicable in proof of ownership (Halevi et al., 2011) applications, where a remote server verifies a client before granting access to its data; this yields the bidirectional verification property of the PGV scheme.
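To make the flow above concrete, the following is a minimal, hypothetical sketch of the client-side logic in C. It is not the paper's implementation: for brevity it computes the hash over a prime field (integers modulo 2^31 − 1) rather than the GF(2^32) arithmetic the paper uses (see Section 4), it ignores file I/O and networking, and the names pgv_hash, hc and hs are illustrative only.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy field: integers modulo the Mersenne prime 2^31 - 1 (the paper uses GF(2^32)). */
#define P 2147483647ULL

/* H = S_0 + S_1*beta + ... + S_{k-1}*beta^(k-1) (mod P), evaluated Horner-style. */
static uint64_t pgv_hash(const uint32_t *sym, size_t k, uint64_t beta)
{
    uint64_t h = 0;
    for (size_t i = k; i > 0; i--)            /* process S_{k-1} down to S_0 */
        h = (h * beta + sym[i - 1] % P) % P;
    return h;
}

int main(void)
{
    /* Client side: one data block viewed as 4-byte symbols (toy data here). */
    uint32_t block[] = { 1u, 2u, 3u, 0xdeadbeefu };
    size_t k = sizeof block / sizeof block[0];

    uint64_t beta = (uint64_t)(rand() % (P - 1)) + 1;   /* random key, beta != 0 */
    uint64_t hc   = pgv_hash(block, k, beta);           /* kept by the client    */

    /* ... the block is shipped to the server and the local copy is deleted ... */

    /* Verification: the client sends beta; the server recomputes over its copy. */
    uint64_t hs = pgv_hash(block, k, beta);             /* server's answer       */
    printf(hs == hc ? "verification passed\n" : "server is cheating\n");
    return 0;
}

In the real scheme the server's answer comes back over the network and β is drawn from a cryptographically strong source; the final comparison step is unchanged.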
Our proofs, combining Kolmogorov complexity and list decoding, show that this simple construction can guarantee many of the desired properties for remote storage verification, as follows (details in Section 3):

Simplicity of construction: our scheme does not require expensive primitives or significant pre-processing. As a result, our hash (also known as tag, token, or authenticator) generation and verification incur very little overhead.

Storage enforcement: if a cloud storage stores y that is different from the original data x and is able to pass our verification, the cloud storage has to commit as much storage space as the size of the original data, i.e., |y| ≈ |x|.


Proof of retrievability: if a cloud storage stores y that is different from the original data x and is able to pass our verification, then it is possible to reconstruct x from y.

Non-transformation of data: we do not transform the original data before storing it. More specifically, a cloud storage does not need to store any extra information, and our verification ability has little effect on normal read/write operations. This property also helps in the extended application of PGV to the proof of ownership problem.

Resource optimality: a client requires a minimal amount of storage to store challenges, and a server does not need to store anything other than the original data. In addition, verification requires a minimal amount of communication, since both β and H_c are very small (usually a few hundred bits).

Computationally unbounded adversary: we allow an adversary to have unbounded computational power, unlike cryptographic assumptions where security holds only against randomized polynomial-time adversaries. We prove our results against much stronger adversaries that are only required to halt on all inputs (but do not have any pre-specified time bounds).

Repeatability: PGV has practical constructions to support the repeatability (unlimited verification) feature that is common to existing schemes with asymmetric cryptographic support (Section 3.6).

Complete detection: our scheme provides guarantees over the entire data, unlike many previous schemes (Juels and BSK, 2007; Wang C et al., 2009; Bowers et al., 2009a; Ateniese et al., 2007; Wang Q et al., 2009; Golle et al., 2002) that provide verification guarantees only for some fraction of the data (sampling). This fraction can be large but is nevertheless not complete. For a fair comparison with existing schemes, PGV also provides a sampling construction with a probabilistic guarantee (Section 3.5).

We have implemented our storage enforcing remote verification scheme, PGV, and evaluated it through extensive experiments with data files of sizes 1 MB to 1 GB. Our experimental results demonstrate that PGV is significantly more efficient than existing proof of data possession (PDP; Ateniese et al., 2007) and proof of ownership (PoW; Halevi et al., 2011) schemes. The rest of the paper is organized as follows. Section 2 discusses the diverse applications of the PGV scheme. A detailed construction of the PGV scheme, with proofs of its different features, is given in Section 3. Section 4 discusses practical considerations for the PGV implementation. We measure the performance of the PGV scheme and compare it with an existing remote verification scheme, PDP, in Section 5. We also compare the performance of PGV with an existing de-duplication mechanism, PoW, in Section 6. Section 7 provides a comprehensive comparison of the features of existing mechanisms as well as their drawbacks. Finally, Section 8 concludes the paper.

2. Applications of storage enforcing remote verification

Our storage enforcing verification scheme is potentially applicable in many scenarios in addition to the aforementioned scenario of a client verifying a remote server. An additional application is proof of ownership (Halevi et al., 2011; Zheng and Xu, 2012). Consider the case where a client wants to upload its data x to a storage service (such as Dropbox) that uses 'deduplication'. In particular, if x is already stored on the server, then the server will ask the client not to upload x and will give it ownership of x. To save on communication, the server asks the client to send a hash h(x), and if it matches the hash of the x stored on the server, the client is granted ownership of x. As identified by Halevi et al. (2011), this simple deduplication scheme can be abused for malicious purposes. For example, if a malicious user gets hold of the hashes of the files stored on the server (through a server break-in



or unintended leakage), it will get access to those files. The simple deduplication scheme can also be exploited as an unintended content distribution network: a malicious user can upload a potentially large file and share the hash of the file with accomplices. All these accomplices can then present the hash to the storage server and get access to the large file, as if it were a content distribution network. A storage enforcing remote verification scheme such as PGV can address this situation. Since the roles of prover and verifier are reversed relative to the client-verifying-storage-provider scenario, the performance restriction is even more severe: the computation that runs at the client side has to be lightweight because of the limited capacity of client devices such as smartphones. Further, we cannot expect the client to modify the data to aid in the verification process (as is needed in some existing PDP/POR schemes). There are application scenarios where storage enforcement alone suffices. For example, we can use storage enforcement to provide a remote "kill" switch that wipes a remote untrusted hard drive clean: we can send a random string of the same size as the hard drive and verify that the remote end stored a string of the same size. (For a random string x it is known that, with high probability, C(x) ≥ |x| − O(1), so this forces the remote hard disk to be wiped clean.) Another example is a storage provider with different classes of servers, e.g., premium servers and normal servers, where premium servers provide better QoS but are more expensive to operate. In this scenario, storage enforcement alone removes the incentive for the provider to move a premium client's data from the premium servers to the normal servers; the premium servers still need to store just as much data, and the provider cannot reduce its cost of operation by "cheating".

3. Pretty good verification

This section presents the general construction of our storage enforcing remote verification scheme, PGV, and formal proofs of the properties that our scheme provides. For clarity of presentation, we illustrate our storage enforcement scheme with only one scenario, where a client verifies a remote server (or multiple servers). However, our scheme is much more flexible and applicable to many scenarios; as we detail in Section 2, it is applicable in proof of ownership (Halevi et al., 2011) applications, where a remote server verifies a client before granting access to its data.

3.1. Basic scheme

Our basic scheme is very simple: we use a polynomial hash over a finite field to check whether the storage provider still has the data that we sent it. We generate one or more keys at random, hash the data with the keys, and store the keys and hashes together. When the time comes to do verification, we send a single key to the remote end, ask it to compute the keyed hash and send it back, and compare the result with what is stored. To compute a polynomial hash, a client needs to pick system-wide parameters once and perform three steps whenever exporting a data block to a remote storage. There are two important system-wide parameters: the finite field size q (e.g., 2^32, i.e. 4-byte field elements) and the block size (e.g., 64 KB). Picking the right values for these parameters involves considerations of performance and security; we discuss these considerations further in our evaluation in Section 5. Once the client has picked the system-wide parameters, it can compute a polynomial hash for each data block to be exported: (i) pick a random number (our key) β from the field, (ii) divide the data block into equal-sized symbols S_0, …, S_{k−1}, where the symbol size is equal to the field-element size (e.g., 4 B), and (iii) compute the polynomial hash H_c = Σ_{i=0}^{k−1} S_i β^i. This is equivalent to computing a single entry in a Reed–Solomon codeword.

The rest of the section proves that this simple hash can provide interesting properties for remote verification. This process is per-block; for a file, there is one additional step that divides the file into blocks, and the client then repeats the above three steps for each block.
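As a hedged illustration of this per-file step, the sketch below splits a buffer into fixed-size blocks and keeps one (β, H_c) token per block. It assumes a pgv_hash routine like the one sketched in the Introduction, and the block size, structure name and zero-padding of the last block are illustrative choices, not the paper's.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define P           2147483647ULL   /* toy prime field, as in the earlier sketch */
#define BLOCK_BYTES (4 * 1024)      /* block size; the paper evaluates 4-64 KB   */
#define SYM_BYTES   4               /* one symbol = one 32-bit word              */

struct pgv_token { uint64_t beta, hc; };   /* what the client keeps per block    */

/* Assumed to be provided by the earlier sketch. */
extern uint64_t pgv_hash(const uint32_t *sym, size_t k, uint64_t beta);

/* Pre-processing for a whole file: split `data` into blocks and compute one token
 * per block. Returns an array of *nblocks tokens (caller frees); the final short
 * block, if any, is zero-padded. */
struct pgv_token *pgv_tokenize(const uint8_t *data, size_t len, size_t *nblocks)
{
    *nblocks = (len + BLOCK_BYTES - 1) / BLOCK_BYTES;
    struct pgv_token *tok = malloc(*nblocks * sizeof *tok);
    uint32_t sym[BLOCK_BYTES / SYM_BYTES];

    for (size_t b = 0; tok && b < *nblocks; b++) {
        size_t off = b * BLOCK_BYTES;
        size_t n   = (len - off < BLOCK_BYTES) ? len - off : BLOCK_BYTES;
        memset(sym, 0, sizeof sym);
        memcpy(sym, data + off, n);                   /* pack bytes into symbols */
        tok[b].beta = (uint64_t)(rand() % (P - 1)) + 1;
        tok[b].hc   = pgv_hash(sym, BLOCK_BYTES / SYM_BYTES, tok[b].beta);
    }
    return tok;
}

Only the (β, H_c) pairs need to be retained locally; the blocks themselves can be deleted after upload, which is what keeps the client-side storage small.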

3.2. Summary of our guarantees and proof roadmap

We can summarize our two main results for remote verification as follows:

- (Proved in Section 3.4.1) If the server can pass our verification protocol with a probability greater than ε (where ε can be made very small, e.g. 1%), then the server must provably store almost as much as the Kolmogorov complexity of the user string (data). Another way to state this is that if, for example, ε = 1% and the server stores slightly less than the Kolmogorov complexity C(x), then the server will be caught with probability at least 99%.
- (Proved in Section 3.4.2) If the server can pass our verification protocol with a probability greater than 1/2 + γ (where γ can be very small, e.g. 1%), then the server has enough information to recreate the user string. Another way to state this is that if, for example, γ = 1% and the server cannot recreate the user data, then it will be caught with probability at least 49%.

In a nutshell, even if a storage provider intends to cheat, we are able to prove that it cannot simultaneously satisfy our requests with high probability and save on space usage. Moreover, if it does satisfy our requests with high enough probability, we could reconstruct our data from its request responses alone. The proofs take advantage of the fact that a polynomial hash is just a single entry in an appropriately chosen Reed–Solomon codeword. The proofs and the scheme itself in fact generalize to any error-correcting code and corresponding keyed hash, although this is out of the scope of this paper. Our proofs in the rest of this section present the results in terms of multiple servers, not a single server. The reason is not that PGV requires multiple servers; rather, it is that our proofs generalize easily to multiple servers.

3.3. Definitions

We use F_q to denote the finite field with q elements. We also use [n] to denote the set {1, 2, …, n}. Given any string x ∈ F_q^n, we use |x| to denote the length of x in bits. Additionally, all logarithms are base 2 unless otherwise specified. We now formally define the different parameters of a verification protocol. We use U to denote the user/client. We assume that U wants to store its data x ∈ F_q^k among s service providers P_1, …, P_s. In the pre-processing step of our setup, U sends x to P = {P_1, …, P_s} by dividing it up equally among the s servers; we denote the chunk sent to server i ∈ [s] as x_i.¹ Each server is then allowed to apply any computable function² to its chunk and to store a string y_i. Ideally, we would like y_i = x_i. However, since the servers can compress x, we would at the very least like to force |y_i| to be as close to C(x_i) as possible. For notational convenience, for any subset T ⊆ [s], we denote by y_T (x_T resp.) the concatenation of the strings {y_i}_{i∈T} ({x_i}_{i∈T} resp.). To enforce the conditions above, we design a protocol. We are primarily concerned with the amount of storage at the client side and the amount of communication, and we want to minimize both simultaneously while giving good verification properties.

¹ We will assume that s divides n. Our protocols do not need the x_i's to have the same size, only that x can be partitioned into x_1, …, x_s. For ease of exposition, we ignore this possibility for the rest of the paper.
² A computable function is one that can be computed by an algorithm that halts on all its inputs, though there is no pre-specified bound on its time complexity.


The following definition captures these notions. (The definition also allows for some of the servers to collude.)

Definition 1. Let s, c, m ≥ 1 and 0 ≤ r ≤ s be integers, let 0 ≤ ρ ≤ 1 be a real, and let f : [q]^n → R_{≥0} be a function. Then an (s, r)-party verification protocol with resource bound (c, m) and verification guarantee (ρ, f) is a randomized protocol with the following guarantee. For any string x ∈ [q]^k, U stores at most m bits and communicates at most c bits with the s servers. At the end, the protocol outputs either a 1 or a 0. Finally, the following holds for any T ⊆ [s] with |T| ≤ r: if the protocol outputs a 1 with probability at least ρ, then, assuming that every server i ∈ [s]\T followed the protocol and that every server in T possibly colluded with the others, we have |y_T| ≥ f(x_T).

Our protocol requires that we first pick a family of "keyed" hash functions. For its speed and flexibility, we use Reed–Solomon codes for the protocol. The protocol picks random keys and stores the corresponding hash values for x (along with the keys) during the pre-processing step. During the verification step, U sends one or more keys (depending upon the parameters desired in the guarantee) as a challenge to the s servers. Throughout this section, we will assume that each server i has an algorithm A_{x,i} such that on challenge β it returns an answer A_{x,i}(β, y_i) to U (but we have no pre-specified bound on the time complexity of A_{x,i}). The protocol then outputs 1 or 0 by applying a (simple) Boolean function to the answers and the stored hash values.

Next, we state some definitions related to codes that we need in our proofs. An (error-correcting) code H with dimension k ≥ 1 and block length n ≥ k over an alphabet of size q is any function H : [q]^k → [q]^n. A linear code H is any error-correcting code that is a linear function, in which case we identify [q] with F_q. A message of a code H is any element in the domain of H. A codeword of a code H is any element in the range of H. The Hamming distance Δ(x, y) of two same-length strings is the number of symbols in which they differ. The relative distance δ of a code is min_{x ≠ y} Δ(x, y)/n, where x and y range over distinct codewords of the code.

Definition 2. A (ρ, L)-list-decodable code is any error-correcting code such that for every vector e in the ambient space of the code, the set of codewords within Hamming distance ρn of e has size at most L.

Geometric intuition for a (ρ, L)-list-decodable code is that Hamming balls of radius ρn centered at arbitrary vectors in [q]^n always contain at most L codewords. Next we define Kolmogorov complexity.

Definition 3. The plain Kolmogorov complexity C(x) of a string x is the minimum sum of sizes of a compressed representation of x, its decoding algorithm D, and a reference universal Turing machine T that runs the decoding algorithm.

Because the size of the reference universal Turing machine is constant, it is useful to think of C(x) as simply measuring the amount of inherent (i.e. incompressible) information in a string x. Most strings cannot be compressed beyond a constant number of bits; this can be seen by a simple counting argument. C(x) measures the extent to which this is the case for a given string.

3.4. Protocol guarantees

For our hash function, we let H : F_q^k → F_q^n be the Reed–Solomon code with n = q. Recall that for such a code, given a message x = (x_0, …, x_{k−1}) ∈ F_q^k, the codeword is given by H(x) = (P_x(β))_{β ∈ F_q},


where P_x(Y) = Σ_{i=0}^{k−1} x_i Y^i. It is well known that such a code H has distance n − k + 1. By the Johnson bound (Guruswami, 2004), H is (1 − ε, 2q²)-list decodable, provided ε ≥ √(k/q). Note that given any x ∈ F_q^k and a random β ∈ F_q, H(x)_β corresponds to the widely used polynomial hash. Further, H(x)_β can be computed in one pass over x with storage of only a constant number of F_q elements. (Keep in mind that after reading each entry of x, the algorithm just needs to perform one addition and one multiplication over F_q.) Using such a hash, we obtain the following guarantees:

Theorem 1. Let ε > 0, let q be a prime power and let s ≥ 1 be an integer. Then:

(i) Provided k ≤ ε²sq, there exists an (s, s)-party verification protocol with resource bound ((s + 1) log q, 2s log q) and verification guarantee (ε, f), where for any x ∈ F_q^k, f(x) = C(x) − O(s log q).

(ii) Let k ≤ ε²q. Assuming at most e servers do not respond to challenges, there exists an (r, s)-party verification protocol with resource bound ((2r + e + 1) log q, 2s log q) and verification guarantee (ε, f), where for any x ∈ F_q^k, f(x) = C(x) − O(s + log q).

Further, in both protocols, honest parties can implement their required computation with a one-pass, O(log q)-space (in bits), Õ(log q)-update-time data stream algorithm.

The claim on the computational requirements of the honest parties follows from the fact that the hash value H(x)_β can be computed in one pass with the claimed storage and update time.

3.4.1. Verification by storage enforcement

For part (i) of Theorem 1, we implicitly assume the following: (a) we are primarily interested in whether some server was cheating and not in identifying the cheater(s), and (b) all servers always reply (possibly with an incorrect answer).

Proof of Theorem 1(i). For the proof, we will assume that H : F_q^{k/s} → F_q^q is the Reed–Solomon code that is (1 − ε, L)-list decodable with L = 2q². (Note that the assumption on k implies that ε ≥ √((k/s)/q), as required.) We begin by specifying the protocol. In the pre-processing step, the client U does the following on input x = (x_1, …, x_s) ∈ (F_q^{k/s})^s:

1. Generate a random β ∈ F_q.
2. Store (β, γ_1 = H(x_1)_β, …, γ_s = H(x_s)_β) and send x_i to server i for every i ∈ [s].

Server i, on receiving x_i, saves a string y_i ∈ [q]^n. The server is allowed to use any computable function to obtain y_i from x_i. During the verification phase, U does the following:

1. It sends β to all s servers.
2. It receives a_i ∈ F_q from server i for every i ∈ [s]. (a_i is supposed to be H(x_i)_β.)
3. It outputs 1 (i.e. none of the servers "cheated") if a_i = γ_i for every i ∈ [s]; otherwise it outputs 0.

Here we assume that server i, on receiving the challenge, uses some algorithm A_{x,i} : F_q × [q]^n → F_q to compute a_i = A_{x,i}(β, y_i) and sends a_i back to U. The claim on resource usage follows immediately from the protocol specification. Next we prove the verification guarantee. Let T ⊆ [s] be the set of colluding servers. We will prove that y_T is large by contradiction: if not, then using the list decodability of H, we will exhibit a description of x_T of size < C(x_T). Consider the following



algorithm that uses y_T and an advice string v ∈ ({0,1}^{⌈log L⌉})^{|T|}, which is the concatenation of shorter strings v_i ∈ {0,1}^{⌈log L⌉}, one for each i ∈ T:

1. For every j ∈ T, compute z_j = (A_{x,j}(β, y_j))_{β ∈ F_q}.
2. For every j ∈ T: by cycling through all u ∈ F_q^{k/s}, retain the set L_j ⊆ F_q^{k/s} such that for every u ∈ L_j, Δ(H(u), z_j) ≤ ρn.
3. For each j ∈ T, let w_j be the v_j-th string in L_j.
4. Output the concatenation of {w_j}_{j∈T}.

Note that by the definition of ρ, Δ(z_j, H(x_j)) ≤ ρq for every j ∈ T. Next, since H is (ρ, L)-list decodable, for every j ∈ T there exists an advice string v_j such that w_j = x_j. Thus, there exists an advice string v such that the algorithm above outputs x_T. Further, since H is the Reed–Solomon code, there is an algorithm E that can compute a description of H from k, q and s.³ (Note that using this description, we can generate any codeword H(u) in Step 2.) Thus, we have a description of x_T of size |y_T| + |v| + Σ_{j∈T} |A_{x,j}| + |E| + (O(log q + log k + log s) + s), where the term in parentheses accounts for encoding the different parameters and T. This means that if |y_T| < C(x_T) − |v| − Σ_{j∈T} |A_{x,j}| − |E| − (s + O(log q + log k + log s)) = f(x), then we have a description of x_T of size < C(x_T), which is a contradiction. Note that |v| ≤ s⌈log L⌉ = O(s log q), which implies that we need f(x) = C(x) − O(s log q), as desired. □

3.4.2. Proof of retrievability

We now argue how our protocol in part (i) gives a proof of retrievability. (The argument is already present in the proof above, but we make it explicit here.) For simplicity we focus on the case s = 1. As observed earlier, the Reed–Solomon code H : F_q^k → F_q^q has distance q − k + 1 and hence relative distance δ ≥ 1 − k/q ≥ 1 − ε² (where the last inequality follows from our bound on k). Let ρ < 1/2 − ε²/2 ≤ δ/2. Let y be the string stored by the server for the client string x. Further, assume that for a random β ∈ F_q, the server is able to return the correct answer (i.e. H(x)_β) with probability at least 1 − ρ (using an algorithm A_x). Then we claim that the server has enough information to recover x. The recovery algorithm (i.e. the algorithm to retrieve x from y) is the same as the four-step algorithm above, except that the list output in Step 2 will contain only x (and hence Steps 3 and 4 are not required).⁴ In particular, the algorithm is as follows:

1. Compute z = (A_x(β, y))_{β ∈ F_q}.
2. Run the unique decoding algorithm for the Reed–Solomon code on z and return the computed message x′.

The reason we will have x′ = x is that the string z computed in the first step has a unique closest codeword: H(x). Further, the second step can be implemented by any efficient unique decoding algorithm for the Reed–Solomon code (e.g. the well-known Berlekamp–Massey algorithm). One somewhat unsatisfactory part of the argument above is that the server can cheat with probability 1/2 + ε²/2. We can decrease this probability at the expense of increasing the client's storage.

³ In particular, the (k/s) × q generator matrix G of H has as its columns the vectors (1, α, α², …, α^{k/s − 1})^T for every α ∈ F_q. For every message x ∈ F_q^{k/s}, the codeword is H(x) = x·G.
⁴ One might not have access to the algorithm A_x; however, one can always send all the challenges β ∈ F_q to the server. In fact, one might be able to query only n ≤ q values of β ∈ F_q. To make sure the proofs go through, one has to ensure that n is large enough so that 1 − k/n ≥ 1 − ε².

3.4.3. Catching the cheaters and handling unresponsive provers

We now observe that since the protocol in the proof of Theorem 1 part (i) checks each answer a_i individually to see whether it equals γ_i, it can easily handle the case when some prover does not reply at all. Additionally, if the protocol outputs a 0, then it knows that at least one of the provers in the colluding set is cheating. (It does not necessarily identify the exact set T.⁵) At the cost of higher user storage and a stricter bound on the number of colluding provers, we show how to remove these shortcomings. A Reed–Solomon code RS : F_q^m → F_q^ℓ can be represented as a systematic code (i.e. the first m symbols in any codeword are exactly the corresponding message) and can correct r errors and e erasures as long as 2r + e ≤ ℓ − m. Further, one can correct r errors and e erasures in O(ℓ³) time. The main idea in the following result is to follow the same protocol as before, but instead of storing all s hashes, U stores only the parity symbols of the corresponding Reed–Solomon codeword.

Proof of Theorem 1(ii). Let H : F_q^k → F_q^q be the Reed–Solomon code that is (1 − ε, L)-list decodable for L = 2q². (Note that the assumption on k implies that ε ≥ √(k/q), as required.) We begin by specifying the protocol. First, define x̂_i, for i ∈ [s], to be the string x_i extended to a vector in F_q^k that has zeros in the positions that do not belong to prover i. Further, for any subset T ⊆ [s], define x̂_T = Σ_{i∈T} x̂_i. Finally, let RS : F_q^s → F_q^ℓ be a systematic Reed–Solomon code where ℓ = 2r + e + s.⁶ In the pre-processing step, the verifier U does the following on input x ∈ F_q^k:

1. Generate a random β ∈ F_q.
2. Compute the vector v = (H(x̂_1)_β, …, H(x̂_s)_β) ∈ F_q^s.
3. Store (β, γ_1 = RS(v)_{s+1}, …, γ_{2r+e} = RS(v)_ℓ) and send x_i to prover i for every i ∈ [s].

Prover i, on receiving x_i, saves a string y_i ∈ [q]^n. Again, the prover is allowed to use any computable function to obtain y_i from x_i. During the verification phase, U does the following:

1. It sends β to all s provers.
2. For each prover i ∈ [s], it either receives no response or receives a_i ∈ F_q. (a_i is supposed to be H(x̂_i)_β.)
3. It computes the received word z ∈ F_q^ℓ, where for i ∈ [s], z_i = ? (i.e. an erasure) if the i-th prover does not respond and z_i = a_i otherwise, and for s < i ≤ ℓ, z_i = γ_{i−s}.
4. It runs the decoding algorithm for RS to compute the set T′ ⊆ [s] of error locations. (Note that by Step 2, U already knows the set E of erasures.)

We assume that prover i, on receiving the challenge, uses an algorithm A_{x,i} : F_q × [q]^n → F_q to compute a_i = A_{x,i}(β, y_i) and sends a_i back to U (unless it decides not to respond). The claim on resource usage follows immediately from the protocol specification. We now prove the verification guarantee. Let T be the set of colluding provers. We will prove that with probability at least 1 − ρ, U using the protocol above computes

∅ ≠ T′ ⊆ T (and |y_T| is large enough). Fix a β ∈ [n]. If for this β U obtains T′ = ∅, then for every i ∈ [s] such that prover i responds we have a_i = H(x̂_i)_β. This is because, by our choice of RS, the decoding in Step 4 will return v (which in turn


allows us to compute exactly the set T′ ⊆ T such that for every j ∈ T′, a_j ≠ H(x̂_j)_β).⁷ Thus, if the protocol outputs T′ = ∅ with probability at least 1 − ρ over the random choices of β, this implies that Δ(H(x̂_T), (Σ_{j∈T} A_{x,j}(β, y_j))_{β∈[n]}) ≤ ρn. Using an argument similar to the proof of part (i), this further implies that |y_T| ≥ C(x_T) − s − O(log(sqL)). The proof is complete by noting that log L is O(log q). □

⁵ We assume that identifying at least one prover in the colluding set is motivation enough for provers not to collude.
⁶ Note that both H and RS are Reed–Solomon codes, but they are used for different purposes in the proof.

3.5. Sampling

Although PGV prefers, and is able to efficiently handle, complete verification of the data (i.e. all the data blocks in a file), PDP uses sampling to scale with increasing file sizes. To make a fair comparison, we now discuss a probabilistic framework for error detection using PGV. Let us assume a scenario where the server cheats⁸ on t blocks out of an n-block file. Let r_PDP and r_PGV be the number of sampled blocks during verification in PDP and PGV, respectively. Let p be the probability that the server can avoid detection of its cheating. For PDP, p is the probability that none of the sampled blocks hits a block on which the server cheats; i.e., p = (1 − t/n)^{r_PDP}, which gives r_PDP = log(1/p) / log(n/(n − t)). For PGV, p has two components: one component is due to the sampling error and the other is the verification error ε (the latter comes from Theorem 1). So p = (1 − t/n)^{r_PGV} + ε. If we use a weight 0 ≤ α ≤ 1 to split p between these two components, we have (1 − α)p = (1 − t/n)^{r_PGV} and αp = ε. The first equation gives r_PGV = log(1/((1 − α)p)) / log(n/(n − t)), which is equivalent to r_PGV = r_PDP + log(1/(1 − α)) / log(n/(n − t)). With the parameter setting used in Section 5.4 (t = 1% of n, a 99% detection rate, i.e. p = 1%, and α = 0.8), this gives r_PGV = r_PDP + 23 blocks. Now, from Theorem 1, we have ε ≤ √(k/q), or k ≤ ε²q, where q is the number of field elements and k is the number of symbols in each block. In turn, this bounds the block size for PGV to b ≤ ε²·q·log₂(q)/8 bytes. When we use GF(2^32) and want a 99% detection rate (p = 1%), we get b = 1677α² KB, since αp = ε. For α = 0.8, the block size can be as large as 1.04 MB.

3.6. Repeatability

PGV has bounded repeatability, i.e. a stored set of tokens cannot by itself support an unlimited number of remote verifications. We show that there are practical ways to overcome this issue, since token generation and verification are very fast in PGV (Section 5). Also, for the proof of ownership problem, repeatability is not an issue, since the server always has the original data. The client/verifier can keep a buffer of tokens when it sends out the data to the outsourced storage. When a normal read access occurs, the client checks whether it has enough tokens in the buffer; if not, it regenerates new tokens and brings the buffer up to date. Repeating this process, PGV can handle unlimited verification by piggybacking the generation of new tokens on regular block-access operations. To make the discussion more concrete, let us compare PDP (which has unbounded repeatability) with PGV (with the piggybacking scheme). In particular, let (g_PDP, v_PDP) and (g_PGV, v_PGV) be the per-block token generation and verification times for PDP and PGV, respectively, and define R_g = g_PDP / g_PGV. Let B be the maximum number of tokens that PGV keeps at any point in time per block. Since PDP has unbounded repeatability, it only needs to store one token per block. Next we try to figure out what

⁷ We will assume that T ∩ E = ∅; if not, just replace T by T\E.
⁸ Cheating in the case of PGV means that the server stores less than the Kolmogorov complexity of the block. This example can also handle the case when sufficiently many bits of a block have been erased. For PDP, cheating means that the server alters the block so that it cannot re-create the original block.


B should be in order to keep the overall overhead of PGV smaller than that of PDP. At the beginning, both PDP and PGV need to generate tokens: PGV spends B · g_PGV time while PDP spends g_PDP time. Thus, to be ahead of PDP at this stage, we need

B ≤ g_PDP / g_PGV = R_g.    (1)

Later on, we will have verifications for the block interspersed with normal reads of the block. Let us assume that C > 0 is the maximum number of verification calls between two normal reads for a block. Now consider any "run" of x consecutive verification calls followed by a normal read (so x ≤ C). Note that in this case, for PGV we (i) need to make sure that B ≥ x, as we cannot "replenish" the buffer until we get to the normal read, and (ii) need to make sure that the overhead incurred by PGV is smaller than that of PDP. For (ii), note that the time expended by PDP is x · v_PDP, while the time expended by PGV is x(v_PGV + g_PGV) (as it needs to perform x verifications and replenish as many tokens in its buffer). Thus, PGV has a lower overhead if

v_PGV + g_PGV ≤ v_PDP.    (2)

Note that we have ended up with the following constraints on B: C ≤ B ≤ R_g. In our experiments (Section 5), we have found that (2) is satisfied. (In fact, we observe that g_PGV ≥ v_PGV; further, v_PDP ≥ 10 · g_PGV, which implies that PDP is at least five times as slow as PGV with B tokens during the verification phase.) Finally, we observe in our experiments that R_g ≥ 10. So if we pick B = C and if C < 10 (the latter is a realistic assumption for the cloud computing applications we envision for PGV), then PGV comes out ahead even in the token generation part.

3.7. Extensions

Some of the existing schemes (Ateniese et al., 2007; Wang Q et al., 2009; Wang C et al., 2009; Bowers et al., 2009a) discuss public verifiability in terms of a trusted third-party auditor's ability to verify the remote storage. When the client is honest, this can be very straightforward: the client sends β and H(x_i)_β to the auditor, and the auditor queries server i's algorithm A_{x,i} and checks whether H(x_i)_β and A_{x,i}(β, y_i) are equal. However, this naïve approach fails in arbitration when the client is dishonest, because the client can send a wrong β and H(x_i)_β to the auditor in the first place. To the best of our knowledge, none of the existing schemes discuss or handle this scenario. We briefly outline two methods of handling public verifiability when the client is dishonest. The first is to retrieve the stored data from the other servers and check whether y_i is consistent as a codeword; this approach, although storage efficient, incurs heavy communication and computation overhead, since it is almost equivalent to the cost of reconstructing the original data. The other method is to use locally decodable codes (Yekhanin, 2007) to reconstruct only y_i without using all the blocks; this approach requires a considerable amount of extra storage, but is efficient in terms of communication and computation overhead.
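To make the bookkeeping of Section 3.6 concrete, the following hypothetical sketch shows one way a client could consume buffered tokens during verification and replenish them while a block is in memory for a normal read. The buffer size B, the structure layout and the helper names are assumptions for illustration; they are not part of the paper's implementation.

#include <stdint.h>
#include <stddef.h>

#define B 8   /* per-block token buffer; Section 3.6 wants C <= B <= R_g */

struct token { uint64_t beta, hc; };

struct block_state {
    struct token buf[B];
    size_t avail;                    /* unused tokens left for this block */
};

/* Assumed helper: generate a fresh (beta, H_c) pair from the block's symbols. */
extern struct token pgv_make_token(const uint32_t *sym, size_t k);

/* One verification consumes one token (each key should be used only once).
 * Returns 1 if the server's answer matches, 0 if it cheats, -1 if out of tokens. */
int pgv_verify(struct block_state *st, uint64_t (*server_answer)(uint64_t beta))
{
    if (st->avail == 0)
        return -1;                               /* must wait for a normal read */
    struct token t = st->buf[--st->avail];
    return server_answer(t.beta) == t.hc;
}

/* Piggyback on a normal read: while the block is in memory anyway, top the
 * buffer back up so that future verifications remain possible indefinitely. */
void pgv_on_normal_read(struct block_state *st, const uint32_t *sym, size_t k)
{
    while (st->avail < B)
        st->buf[st->avail++] = pgv_make_token(sym, k);
}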

3.8. Summary of properties

We can summarize the properties that our scheme provides as follows:

- The amount of storage for the verifier depends only on ε and the logarithm of the block size (Theorem 1).
- No additional storage for the provers (Theorem 1).
- Constant amount of bandwidth usage for a challenge (Theorem 1).
- Reconstructability of the data from challenge responses alone, provided the prover(s) pass the verification protocol with high enough probability (Section 3.4.2).
- Assurance that provers have stored as much data as there is inherent information in the data sent, provided the prover(s) pass the verification protocol with high enough probability (Section 3.4.1).
- Low computational complexity for the verifier and the prover (Theorem 1).
- The prover can store the original, unmodified data and follow the protocol (since our protocol does not modify the original data).

4. Considerations for implementation

Basic PGV operations include polynomial evaluation as well as addition and multiplication over a finite field. In practice, there are finite field libraries one can use, such as James Plank's Galois field library (Plank, 2003).

4.1. Polynomial evaluation over a finite field

The core of PGV is the evaluation of a polynomial over a finite field for token generation and verification. A naïve way of evaluating a polynomial of degree n, P(x) = a_n x^n + a_{n−1} x^{n−1} + ⋯ + a_0, over a finite field at a point β is to iteratively compute the powers β^i. This method is inefficient, as it requires 2n − 1 multiplications and n additions. In PGV, we instead use the efficient technique proposed by Horner (1819) for polynomial evaluation, which iteratively evaluates P(β) as ((⋯((a_n β + a_{n−1})β + a_{n−2})β + ⋯ + a_1)β + a_0). It is more efficient, as it requires only n multiplications and n additions.

4.2. Implementation of addition and multiplication of field elements

There are known methods to implement efficient addition and multiplication over a finite field. The "standard" representation of elements of F_q for q = p^s with prime p is as a polynomial of degree s − 1 over F_p, and addition of two elements of F_q is a simple polynomial addition. When p = 2, which is the case of interest to us, one can think of the polynomial representation as a bit string and of the addition operation as bit-wise XOR. For multiplication, it is convenient to think of every non-zero element of F_q in its "multiplicative" form, i.e. as a power of a generator γ: every element α ∈ F_q \ {0} can be represented as α = γ^j for some j ∈ {0, 1, …, q − 2}, so that γ^i · γ^j = γ^{(i+j) mod (q−1)}. This implies that the multiplication of two numbers in the standard representation (which is our default) can be done by two lookups into a "log" table to convert from the standard representation into the multiplicative representation, one lookup into an "anti-log" table to do the converse, and one modular addition. The simplicity of these operations contributes to PGV's overall simplicity and low overhead.
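As an illustration of the two building blocks just described (table-based GF(2^w) multiplication and Horner evaluation), here is a small, self-contained C sketch over GF(2^8). The paper's implementation uses GF(2^16)/GF(2^32) through James Plank's library, so the field size, the primitive polynomial 0x11d and the function names below are illustrative assumptions rather than the authors' code.

#include <stdint.h>
#include <stddef.h>

/* GF(2^8) with the primitive polynomial x^8+x^4+x^3+x^2+1 (0x11d); because the
 * polynomial is primitive, the element x (0x02) generates the multiplicative group. */
static uint8_t gf_log[256];
static uint8_t gf_exp[512];

static void gf_init(void)
{
    int x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;                        /* multiply by the generator x ...          */
        if (x & 0x100) x ^= 0x11d;      /* ... reducing modulo the field polynomial */
    }
    for (int i = 255; i < 512; i++)     /* duplicate so we can skip the mod (q-1)   */
        gf_exp[i] = gf_exp[i - 255];
}

/* Multiplication via the "log"/"anti-log" tables described in Section 4.2. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Horner evaluation of a_0 + a_1*beta + ... + a_{k-1}*beta^(k-1):
 * one multiplication and one XOR (addition in GF(2^w)) per coefficient. */
static uint8_t gf_poly_hash(const uint8_t *a, size_t k, uint8_t beta)
{
    uint8_t h = 0;
    for (size_t i = k; i > 0; i--)
        h = gf_mul(h, beta) ^ a[i - 1];
    return h;
}

A caller would invoke gf_init() once and then gf_poly_hash() per block; for the wider fields used in the paper, Plank's library provides the corresponding multiplication routines.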

5. Client verifying storage provider experiments

This section presents various performance aspects of PGV as well as comparisons with the performance of a crypto-based scheme, PDP (Ateniese et al., 2007). We emphasize that we have only considered the main schemes in PDP, although there are many variants of the main schemes; since the variants of each main scheme make design tradeoffs on multiple properties, it is difficult for us to consider those variants altogether. Our main metrics of comparison are token (hash) generation time (Section 5.3), challenge generation and verification time (Section 5.4), and storage overhead (Section 5.5). We demonstrate PGV's low overhead in these metrics compared to other schemes, which leads to the scalability of PGV.

5.1. Experimental platform and parameters

We use open source C libraries on an Intel Centrino Duo machine at 1.73 GHz with 2.5 GB memory running Linux kernel 2.6.32 to compare PGV with existing schemes. We use a single thread in all implementations and measurements. For cryptographic operations, we use OpenSSL 0.9.8k. For field operations and erasure coding, we use James Plank's Galois field library (Plank, 2003) and the Jerasure library (Plank, 2008).

5.2. Performance of basic primitives

Although we mainly compare PGV to PDP, we recognize that a direct comparison is inherently prone to unfairness, because the schemes rely on several different types of primitives and techniques. Thus, we first show benchmark results for the basic primitives. Table 1 shows a benchmark summary of cryptographic algorithms on the experimental platform. Table 2 shows a benchmark summary of Galois field multiplication and encoding. Table 3 summarizes the basic primitives used by the different schemes and their instantiated algorithms, including PoW, which we discuss in Section 6. On the experimental platform, the measurements were performed on files of different sizes ranging from 1 MB to 1024 MB. Schemes are configured to detect any corruption in the file blocks as well as to use sampling. We report performance both including and excluding the I/O, to give a clear idea of the complete scheme and the core operation overhead, respectively. When comparing with PDP, we have chosen parameters that are favorable for it: with PDP, we use GF(2^32) and a 16 KB block size, since it performs best at a 16 KB block size.

Table 1. Performance of different primitives (rate in MB/s).

Algorithm | 1024 B blocks | 8192 B blocks
AES       | 81            | 83
RC4       | 184           | 184
MD5       | 231           | 288
SHA1      | 148           | 170

Table 2. Performance of multiplication and encoding.

Field     | Multiplication (MB/s) | Encoding (MB/s)
GF(2^16)  | 269                   | 110
GF(2^32)  | 106                   | 25

Table 3. Basic primitives for different schemes.

Scheme | Primitives             | Algorithms used
PDP    | Modular exponentiation | RSA
PoW    | Hash tree              | SHA256
PGV    | Multiplication         | Poly hash

5.3. Token generation time

Since the constructions and terminologies differ from scheme to scheme, we first briefly discuss what we measure as pre-processing and token generation time. We note that a token is also referred to as a hash, tag, or authenticator in the literature. In all of the schemes, the file was divided into blocks of the same size and tokens were generated for each block. PGV's token generation time includes the generation of random numbers as keys, file I/O, and poly hash generation for the file blocks. Figure 1 shows PGV's token generation time for various file sizes (64 MB and 256 MB) and block sizes (4 KB, 8 KB, 16 KB, 32 KB, and 64 KB), including and excluding the I/O. Although we have done experiments with file sizes up to 1024 MB, for better visibility we plot data for a few file sizes only. As the block size increases, token generation becomes faster; PGV can generate tokens for a 1 GB file with a 64 KB block size in 36.91 s excluding the I/O. For PDP, the token generation time includes the time required to generate the asymmetric keys, file I/O, and per-block token generation time. Figure 4 shows the token generation time comparison between PDP and PGV. (For both schemes we used a block size of 16 KB.) As shown in the figure, PGV has little overhead, as it just evaluates the polynomial hash. PDP has higher overhead, as it relies on computationally intensive cryptographic primitives: it uses an asymmetric cryptographic primitive based on modular exponentiation. On average, PGV is approximately 25 times faster than PDP in token generation across the different file sizes.

5.4. Challenge generation, verification, and sampling analysis

Now, we compare the performance of challenge generation and verification for the different schemes, including PGV. This comparison considers the per-verification overhead holistically (excluding the client–server communication overhead, which is one network round-trip); it includes all of the following if required by a scheme: challenge generation at the client side, proof computation at the server side, and verification at the client side. PGV's verification time includes the poly hash generation using the stored random keys and file I/O. Figure 2 shows PGV's verification time for various file sizes (64 MB and 256 MB) and block sizes (4 KB, 8 KB, 16 KB, 32 KB, and 64 KB), including and excluding I/O. (The verification time is for all the blocks in a file.) Although we have done experiments using files of size up to 1024 MB, again we show only a few in the plots for better visibility. PGV is able to verify a file of size 1 GB with 64 KB block size in 35.41 s excluding the I/O. Figure 5 compares PDP verification time with PGV, considering both completeness (i.e. in both cases we verify all the blocks) and sampling (Section 3.5, with the following parameter setting: t/n = 1%, p = 1%, α = 0.8 and q = 2^32). PDP incurs the most overhead due to asymmetric cryptographic operations in all three phases. The figure also shows the sampling verification time with a 99% detection rate. As discussed earlier, in the probabilistic framework, PDP can guarantee a 99% detection rate by sampling 460 blocks; PGV can guarantee the same detection rate by sampling 483 blocks. As in token generation, PGV outperforms PDP, since it involves only a number of simple finite field additions and multiplications. On average, PGV verification is almost 120 times faster than PDP verification.


Fig. 3. Tag storage for PGV for different file sizes.


Fig. 1. Token generation time for PGV.


Fig. 4. Preprocessing and token generation time comparison for PDP and PGV.


Fig. 2. Challenge generation and verification time for PGV.


Fig. 5. Challenge generation and verification time comparison for PDP and PGV.


5.5. Storage overhead

A remote storage verification scheme typically requires additional storage on top of the original data for storing tokens and any other necessary information. As with pre-processing, we do not consider keys in this overhead, since their size is negligible compared to the original data. This additional storage overhead can be present at the client side, the server side, or both. Figure 3 shows the PGV storage overhead for different file sizes and block sizes. A comparison of storage overhead with PDP is shown in Fig. 6. PDP stores its tokens (authenticators) on the server side; these tokens are usually 1024 bits long. Unlike these schemes, PGV does not have any server-side storage overhead; we only need to store tokens at the client side.

Fig. 6. Storage overhead comparison of PDP and PGV.

Fig. 7. Client time comparison for PoW and PGV.

multiplications. On average, PGV verification is almost 120 times faster than PDP verification.

PoW PGV

Naive PoW PGV

10

1

0.1

32 64

256 File Size (MB)

512

Fig. 8. Overall time comparison (client time þ network time) for PoW and PGV with a naive approach with no deduplication over a 100 Mbps network.

that PoW is optimized only for proof of ownership whereas PGV supports two-way verification with proof of storage enforcement.

6. Server verifying client experiments 6.2. Network time In this section, we compare PGV with PoW (Halevi et al., 2011) in terms of client time (the time required by the prover to compute the proof of ownership of deduplicated data) (Section 6.1). We also compare both the schemes' network time to a naive approach where, instead of deduplication, the complete file is transferred over the network (Section 6.2). Since, the results reported in Halevi et al. (2011) were performed on a Xeon 2.53 GHz machine, we also measure PGV performance on a similar platform (Xeon 2.4 GHz) for fair comparison. Other experimental parameters remain the same as described in Section 5.1. In particular, we only use a single thread for performance measurement. However, for PoW, it is not mentioned whether the measurements were parallelized or not. Also, we stress that the results reported for PoW was for their efficient scheme which gives the weakest security guarantee among their proposed schemes.

Network time in PGV corresponds to sending random numbers

β to the client and get the computed hashes back from the client.

As we have discussed in Section 5.4, PGV can guarantee 99% detection by checking 483 blocks. So, in total PGV transfers around 30 K data on the network as a part of the protocol data. For similar guarantee, PoW reports 20 K data transfer requirement as it needs to transfer 20 leaves to the provers. Figure 8 shows the overall time (client time and network time) required by PGV and PoW, compared to a naive scheme which always transfers the complete file over the network without deduplication for different file sizes. Our results show that PGV performs better than both PoW and naive approach. Note that, both PGV and PoW will require additional time to verify the hashes which is 0.5 s for PGV and 0.6 ms for PoW.

6.1. Client time For PGV, client time corresponds to the token generation time as mentioned in Section 5.3 that includes the generation of random numbers as key, file I/O and poly hash generation for the file blocks. In PoW, this time includes the time for reading the file from disk, computing its SHA256 value, performing reduction and mixing phase as well as computation of Merkle tree over this output. Figure 7 compares client times for PGV and PoW for different file sizes. As we can see from the figure, PGV performs faster than PoW in client time for the reported file sizes from 32 MB to 512 MB. However, as the file size increases, the performance follows similar trend. For instance, PoW and PGV takes 15.29 s and 16.98 s, respectively for a 1024 MB file. However, note

7. Related work

Previous work on remote storage verification spreads over two broad domains: coding and cryptography. Since there are many schemes that provide different properties, we briefly survey the design space and compare our scheme to others in this section. Although many previous schemes provide a good set of properties, we emphasize that our scheme can do so with a simple construction and low performance overhead. Table 4 summarizes our comparison. It is by no means a comprehensive list; rather, we have included a few schemes to obtain a representative set (see also Table 5).

Table 4
Summary of the existing schemes compared to the proposed scheme, PGV. Schemes: SFC (Schwarz and Miller, 2006); C-POR (Shacham and Waters, 2008); SDS (Ren et al., 2012); HAIL (Bowers et al., 2009a); PDP (Ateniese et al., 2007); EPV (Wang Q et al., 2009); SEC (Golle et al., 2002); PGV (proposed).

Metric/scheme                                SFC       C-POR     SDS       HAIL      PDP       EPV       SEC       PGV
Method                                       Code      Crypto    Code      Code      Crypto    Crypto    Crypto    Code
Proof of storage enforcement (Section 7.1)   No        No        No        No        No        No        Yes       Yes
Proof of retrievability (Section 7.2)        No        Yes       No        No        No        Yes       No        Yes
Data transformation (Section 7.3)            Yes       No        Yes       Yes       No        No        Yes       No
Detection completeness (Section 7.4)         Complete  Partial   Partial   Partial   Partial   Partial   Partial   Complete
Sampling (Section 7.4)                       No        Yes       Yes       Yes       Yes       Yes       Yes       No
Repeatability (Section 7.5)                  Yes       Yes       No        Yes       Yes       Yes       Yes       No

Table 5
Asymptotic performance comparison of existing schemes with the proposed scheme. We assume that the data contains n symbols, each of size log q bits, and that it is divided into s equal-sized blocks, with the blocks distributed among the servers. We compare token generation and verification for all s blocks. ξ is the fraction of symbols checked during each verification for the schemes that check partial data corruption. For cryptographic schemes, E_m is the cost of performing modular exponentiation modulo m. Token generation/verification complexity is measured in bit operations; storage and communication complexity is measured in bits. PGV-S refers to our proposed simple scheme, where we generate one token for each server, and PGV-E is a variation where a single token can verify multiple servers. Additional server storage refers to the amount of data that the server stores in addition to the original data, if any.

Scheme/operation                  Token generation   Challenge generation   Challenge verification   Client storage   Add. server storage   Communication complexity
SFC (Schwarz and Miller, 2006)    O(n log q)         O(n log q)             O(n log n log q)         O(1)             0                     O(s log q)
POR (Juels and BSK, 2007)         O(n^2 log q)       O((n/ξ) log q)         O(1)                     O(s log q)       0                     O((n/ξ) log q)
SDS (Ren et al., 2012)            O(n log q)         O((n/ξ) log q)         O(n log n log q)         O(s log q)       0                     O((n/ξ) log q)
HAIL (Bowers et al., 2009a)       O(n log n log q)   O((n/ξ) log q)         O(n log n log q)         O(1)             O(n log q)            O((n/ξ) log q)
PDP (Ateniese et al., 2007)       O(n E_m)           O((n/ξ) E_m)           O((n/ξ) E_m)             O(1)             O(s log m)            O(s log m)
EPV (Wang Q et al., 2009)         O(n E_m)           O((n/ξ) E_m)           O((n/ξ) E_m)             O(s log m)       O(1)                  O(s log m)
SEC (Golle et al., 2002)          O(n E_m)           O((n/ξ) E_m)           O((n/ξ) E_m)             O(s log m)       O(s log m)            O(s log m)
PGV-S                             O(n log q)         O(n log q)             O(1)                     O(s log q)       0                     O(s log q)
PGV-E                             O((n/s) log q)     O(n log q)             O(1)                     O(1)             0                     O(s log q)


7.1. Storage enforcement

Our scheme provides the property of storage enforcement. Roughly, this means that, in order to pass our verification, a cloud storage provider has to commit as much storage space as the size of the original data. This property removes storage saving as an incentive for a storage provider to cheat. To the best of our knowledge, only one previous scheme provides a similar property: Golle et al. (2002) construct a cryptographic primitive called Storage Enforcing Commitment (SEC) that probabilistically guarantees that a server commits as much storage space as the size of the original data in order to correctly answer the challenges of a client. However, SEC does not provide a proof of retrievability guarantee, and it relies on an expensive cryptographic primitive as well as a transformation of the original data.

7.2. Proof of retrievability

Our scheme provides a proof of retrievability. Roughly, this means that, if a cloud storage provider can pass our verification, then it is possible to reconstruct the original data from whatever the cloud storage actually stores, even if that differs from the original data. Intuitively, this property arises from the fact that if a client collects enough responses from its cloud storage, then the client can reconstruct the original data from the gathered responses. To the best of our knowledge, our scheme is the first to provably provide both storage enforcement and a proof of retrievability, by combining Kolmogorov complexity and list decoding as detailed in Section 3. (Further, we show that proof of retrievability implies storage enforcement.) Juels and BSK (2007) and Juels and Oprea (2013) were the first to propose this property. Since their scheme supports only a limited number of verifications (the subject of Section 7.5) and no public verifiability, Shacham and Waters (2008) constructed a scheme that provides both unlimited and public verifiability by integrating cryptographic primitives. Dodis et al. (2009) later studied variants of existing POR schemes, and Bowers et al. (2009b) showed how to tune different parameters to achieve various performance goals. Other schemes study a similar property called provable data possession (PDP), which provides a probabilistic proof that a third party stores a file; examples include Ateniese et al. (2007), Curtmola et al. (2008), Wang Q et al. (2009), Erway et al. (2009), and Ateniese et al. (2008). Although this guarantee sounds similar to proof of retrievability, the two are not equivalent: proof of retrievability guarantees both the existence and the extraction of the data, whereas provable data possession guarantees only the former. In addition, these schemes rely on expensive cryptographic primitives, which makes it difficult for them to scale up to a large amount of data without sacrificing completeness. We discuss this further in Section 7.4.

7.3. Data transformation

Unlike our scheme, some of the previous schemes require data transformation such as encryption, erasure coding, and insertion of extra blocks (Schwarz and Miller, 2006; Dodis et al., 2009; Golle et al., 2002; Juels and BSK, 2007; Juels and Oprea, 2013; Ren et al., 2012; Bowers et al., 2009a). The advantage of these schemes is that they can provide additional properties such as confidentiality with encryption and erasure tolerance with encoding.
However, the problem is that data transformation penalizes clients dealing with honest storage providers: a client needs significant preprocessing, resulting in substantial performance overhead for normal read and write operations. Also, the cloud storage might need to spend more space than necessary in order to store the extra information. Thus, we argue that these properties should be orthogonal to the core verification properties; in our scheme, we make no distinction between encrypted data and plain data, so a client can choose to encrypt data if desired. Moreover, we can extend our scheme to efficiently integrate erasure coding, as we discuss in Section 3.

7.4. Overhead, sampling, and completeness

In general, schemes that either rely on cryptographic primitives or require data transformation incur significant overhead during pre-processing or verification (Schwarz and Miller, 2006; Dodis et al., 2009; Golle et al., 2002; Juels and BSK, 2007; Juels and Oprea, 2013; Ren et al., 2012; Bowers et al., 2009a; Ateniese et al., 2007, 2008; Curtmola et al., 2008; Wang Q et al., 2009; Erway et al., 2009). Thus, it is difficult for these schemes to scale up and deal with a large amount of data. A typical way to mitigate this issue is sampling, which provides a probabilistic guarantee: a server accesses a (possibly small) fraction x% of the data in order to verify the entire data with some (possibly high) probability p. Although this technique reduces the overall verification overhead, it comes at the cost of sacrificing completeness; depending on the verification probability, corruption in a small portion of the data can go undetected under sampling, which may in turn weaken the guarantee that proof of retrievability provides. Our scheme favors completeness; however, we also provide probabilistic guarantees based on sampling, as discussed in Section 3. Moreover, our low verification overhead makes it easier to scale up to a large amount of data. Our evaluation in Section 5 demonstrates this scalability; for example, verification of a 10 MB file takes less than one-tenth of a second.

7.5. Repeatability

Repeatability is the ability to perform multiple, possibly very many, verifications of the data without pre-processing the data multiple times. Previous schemes such as Schwarz and Miller (2006), Bowers et al. (2009a), Ateniese et al. (2007), Wang Q et al. (2009), and Golle et al. (2002) provide this property. The low challenge generation overhead of PGV allows us to generate a large number of challenges quickly. Section 5 shows that, for a data size of 1000 KB, it takes roughly 9 s to generate enough challenges to verify monthly for the next 50 years on a commodity PC (Intel Centrino Duo). We also show how PGV can provide repeatability in an alternate way in Section 3.
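For concreteness, the sketch below shows one way such pre-generated challenges could be produced in a batch: the client draws many random points β and stores only the corresponding (β, token) pairs, consuming one pair per audit. The function names, prime modulus, and symbol size are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch: pre-generating (beta, token) pairs so that many audits
# can be run later without re-reading the data. Parameters are assumptions.
import secrets

P = (1 << 61) - 1  # illustrative prime modulus


def poly_hash(data: bytes, beta: int, symbol_size: int = 8) -> int:
    # Same illustrative polynomial hash (Horner's rule) as in the earlier sketch.
    symbols = [int.from_bytes(data[i:i + symbol_size], "big") % P
               for i in range(0, len(data), symbol_size)]
    acc = 0
    for c in reversed(symbols):
        acc = (acc * beta + c) % P
    return acc


def pregenerate_challenges(data: bytes, count: int) -> list[tuple[int, int]]:
    """Before deleting its local copy, the client keeps only these pairs,
    e.g. count = 50 * 12 for monthly audits over 50 years."""
    return [(beta, poly_hash(data, beta))
            for beta in (secrets.randbelow(P - 1) + 1 for _ in range(count))]
```

Each audit would then consume one stored pair: the client sends β, the storage server answers with its recomputed hash, and the client compares the answer against the stored token.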

8. Conclusions

This paper presents PGV (Pretty Good Verification), a multipurpose storage enforcing remote verification scheme that uses the polynomial hash for cloud storage verification. In particular, its simplicity allows for bidirectional verification: the client can verify that the server is storing its data, and the server can verify that a (new) client actually possesses the data it claims to have. Using a novel combination of Kolmogorov complexity and list decoding, we prove that the polynomial hash provides a strong storage enforcement property as well as a proof of retrievability. Our experimental results support our claims of low overhead in preprocessing, token generation, verification, and additional storage space compared to existing schemes. Overall, our results show that PGV is a good, practical choice for bidirectional cloud storage verification.


References

Allalouf M, Arbitman Y, Factor M, Kat RI, Meth K, Naor D. Storage modeling for power estimation. In: Proceedings of SYSTOR 2009: the Israeli experimental systems conference, SYSTOR '09. New York, NY, USA: ACM; 2009. p. 3:1–10.
Ateniese G, Burns R, Curtmola R, Herring J, Kissner L, Peterson Z, et al. Provable data possession at untrusted stores. In: Proceedings of the 14th ACM conference on computer and communications security (CCS 2007); 2007.
Ateniese G, Pietro RD, Mancini LV, Tsudik G. Scalable and efficient provable data possession. In: Proceedings of the 4th international conference on security and privacy in communication networks (SecureComm '08); 2008.
Aumann Y, Lindell Y. Security against covert adversaries: efficient protocols for realistic adversaries. J Cryptol 2010;23(2):281–343.
Bierbrauer J, Johansson T, Kabatianskii G, Smeets BJM. On families of hash functions via geometric codes and concatenation. In: Proceedings of the 13th annual international cryptology conference (CRYPTO); 1993. p. 331–42.
Bowers KD, Juels A, Oprea A. HAIL: a high-availability and integrity layer for cloud storage. In: Proceedings of the 16th ACM conference on computer and communications security (CCS 2009); 2009a.
Bowers KD, Juels A, Oprea A. Proofs of retrievability: theory and implementation. In: Proceedings of the 2009 ACM workshop on cloud computing security (CCSW '09); 2009b.
Canetti R. Security and composition of cryptographic protocols: a tutorial (Part I). SIGACT News 2006;37(3):67–92.
Curtmola R, Khan O, Burns R, Ateniese G. MR-PDP: multiple-replica provable data possession. In: Proceedings of the 28th international conference on distributed computing systems (ICDCS '08); 2008.
Dodis Y, Vadhan SP, Wichs D. Proofs of retrievability via hardness amplification. In: Proceedings of the 6th theory of cryptography conference (TCC); 2009.
Erway C, Küpçü A, Papamanthou C, Tamassia R. Dynamic provable data possession. In: Proceedings of the 16th ACM conference on computer and communications security (CCS 2009); 2009.
Freivalds R. Probabilistic machines can use less running time. In: IFIP congress; 1977. p. 839–42.
Golle P, Jarecki S, Mironov I. Cryptographic primitives enforcing communication and storage complexity. In: Proceedings of the 6th international conference on financial cryptography (FC '02); 2002.
Guruswami V. List decoding of error-correcting codes. Lecture notes in computer science, no. 3282. Springer; 2004. 〈http://www.springer.com/computer/security+and+cryptology/book/978-3-540-24051-8〉.


Halevi S, Harnik D, Pinkas B, Shulman-Peleg A. Proofs of ownership in remote storage systems. Cryptology ePrint Archive, Report 2011/207, 〈http://eprint.iacr.org/〉; 2011.
Horner WG. A new method of solving numerical equations of all orders, by continuous approximation. Philos Trans R Soc Lond 1819;109:308–35.
Juels A, Kaliski Jr BS. PORs: proofs of retrievability for large files. In: Proceedings of the 14th ACM conference on computer and communications security (CCS '07). New York, NY, USA: ACM; 2007. p. 584–97.
Juels A, Oprea A. New approaches to security and availability for cloud data. Commun ACM 2013;56(2):64–73. http://dx.doi.org/10.1145/2408776.2408793.
Kolmogorov AN. Three approaches to the quantitative definition of information. Probl Inf Transm 1965;1(1):1–7.
Li M, Vitányi PMB. An introduction to Kolmogorov complexity and its applications. Graduate texts in computer science. 3rd ed. New York: Springer; 2008.
Moore RL, D'Aoust J, Mcdonald R, Minor D. Disk and tape storage cost models. In: Archiving 2007; 2007. URL 〈http://www.imaging.org/conferences/archiving2007/details.cfm?pass=21〉.
Plank JS. GFLIB: C procedures for Galois field arithmetic and Reed-Solomon coding. Website 〈http://web.eecs.utk.edu/~plank/plank/gflib/index.html〉; 2003.
Plank JS. Jerasure: a library in C/C++ facilitating erasure coding for storage applications. Website 〈http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html〉; 2008.
Ren K, Wang C, Wang Q, et al. Security challenges for the public cloud. IEEE Internet Comput 2012;16(1):69–73.
Schwarz T, Miller EL. Store, forget, and check: using algebraic signatures to check remotely administered storage. In: Proceedings of the IEEE international conference on distributed computing systems (ICDCS '06); 2006.
Shacham H, Waters B. Compact proofs of retrievability. In: Proceedings of the 14th annual international conference on the theory and application of cryptology and information security (ASIACRYPT 2008); 2008.
Wang C, Wang Q, Ren K, Lou W. Ensuring data storage security in cloud computing. In: Proceedings of the 17th IEEE international workshop on quality of service (IWQoS 2009); 2009.
Wang Q, Wang C, Li J, Ren K, Lou W. Enabling public verifiability and data dynamics for storage security in cloud computing. In: Proceedings of the 14th European symposium on research in computer security (ESORICS 2009); 2009.
Yekhanin S. Locally decodable codes and private information retrieval schemes [Ph.D. thesis]. MIT; 2007.
Zheng Q, Xu S. Secure and efficient proof of storage with deduplication. In: Proceedings of the second ACM conference on data and application security and privacy, CODASPY '12. New York, NY, USA: ACM; 2012. p. 1–12. URL 〈http://doi.acm.org/10.1145/2133601.2133603〉.