Generic cost optimized and secured sensitive attribute storage model for template based text document on cloud

Sumathi M., Sangeetha S., Anu Thomas

To appear in: Computer Communications
PII: S0140-3664(19)31331-3
DOI: https://doi.org/10.1016/j.comcom.2019.11.029
Reference: COMCOM 6031

Received date: 3 October 2019
Revised date: 11 November 2019
Accepted date: 19 November 2019

Please cite this article as: Sumathi M., Sangeetha S. and A. Thomas, Generic cost optimized and secured sensitive attribute storage model for template based text document on cloud, Computer Communications (2019), doi: https://doi.org/10.1016/j.comcom.2019.11.029. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.


GENERIC COST OPTIMIZED AND SECURED SENSITIVE ATTRIBUTE STORAGE MODEL FOR TEMPLATE BASED TEXT DOCUMENT ON CLOUD

1*Sumathi M, 2Sangeetha S, 3Anu Thomas

1*Sumathi M, Research Scholar, Text Analytics Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India. Email: [email protected]; [email protected]. Phone: 9442663241.
2Sangeetha S, Assistant Professor, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India. Email: [email protected].
3Anu Thomas, Research Scholar, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India. Email: [email protected].

Abstract:

Cloud computing is a powerful technology for providing and managing resources on a pay-per-usage basis. Nowadays, user documents are stored in the cloud for easy access, lower maintenance cost and better services. Currently, user data stored in the cloud are mostly template-based unstructured text documents, which are stored in bulk by various organizations. In general, a template-based text document contains a large amount of common template information and common terms and conditions, and a small amount of sensitive information. To protect the values in these documents, present methods apply encryption algorithms to the whole document, which takes high encryption time and requires more storage space. Encrypting the entire document is an unnecessary and time-consuming task, since the terms, conditions and instructions are common to all documents and do not require any security; only the sensitive information differs from one user to another and requires protection. Therefore, an efficient way of segregating, storing and encrypting sensitive information with minimal storage and computational cost is required. To tackle these issues, a generic secure data storage model is proposed that uses information extraction techniques of Natural Language Processing for sensitive attribute value identification and Enhanced ECC for securing sensitive data based on a group key. Compared with the existing entire-document and partition-based encryption techniques, the proposed generic secure data storage model for the cloud takes less encryption time and storage space.

Keywords: Template based Text Document; Data Security; Security for Sensitive Data; Secure Data Storage Model for Cloud; Common Template; Common Terms and Conditions; Elliptic Curve Cryptography (ECC); Enhanced Elliptic Curve Cryptography (EECC).

1. Introduction

With the rapid development of digital devices, an enormous volume of digital data is handled by users every fraction of a second. Personalized storage devices are insufficient to store these volumes of data; hence, an alternative storage system is required to store the data efficiently.


Cloud storage provides Storage-as-a-Service to users for flexible and scalable storage on a pay-per-usage basis, through services such as Google Drive, Amazon and Microsoft OneDrive [1]. To utilize these services, users ranging from individuals to large organizations are gradually moving from personalized storage devices to the Cloud Storage System (CSS). Nowadays, many organizations handle large template-based text documents such as income tax reports, insurance policy documents, company contracts, legal agreements, land registry records, loan documents, orders and invoices. These documents are stored in bulk in the CSS. When such documents are stored in the CSS for crores of customers, a huge volume of storage space is occupied by information that is common to all customers together with information that is unique to each user. Hence, the cost of storing and processing these documents in the CSS increases [2].


A Template-Based Text Document (TTD) contains a large amount of Common Information in the Template (CIT) and Common Terms and Conditions (CTC), and a small amount of valuable Sensitive Attribute Values (SAV). TTDs are stored for groups of users by various organizations such as government tax departments, hospitals, insurance companies and banks. These organizations store income tax documents, medical reports, insurance documents, land registry records, loan documents, legal agreements, company contracts, orders and invoices, and each organization maintains a variety of information. Table 1 lists the information in various TTDs. The CIT and CTC, which relate to organization terms and conditions, headers, attribute names and so on, are common to all users and do not require security; applying security techniques to the CIT and CTC is therefore an unnecessary task that leads to high computational and storage complexity. To optimize the storage and computational complexity, the valuable SAV need to be segregated from the CIT and CTC.


Additionally, when user data is moved to the CSS, the Data Owners (Do) lose control over their data, because the data is maintained by third-party cloud service providers. To protect the data from outside and inside attackers, user data is usually encrypted by an organization member with a common protection technique and a common key. Nowadays, however, Do are also interested in protecting their data from attackers themselves. Hence, a Do-preference-based access control and secure storage technique is required to provide better security for user data in a CSS [3]. Present data protection in the CSS is performed through entire-document encryption, partition-based encryption, Attribute Based Encryption (ABE) and similar techniques. These encryption techniques are suitable for structured numerical and text data, whereas a TTD differs from document to document. Storing the TTD and encrypting it completely takes high computational cost and occupies more storage space in the CSS. Hence, a secure data model that optimizes storage space and computational cost without compromising SAV security is required. Figure 1 shows the CIT, CTC and SAV proportions in the TTD of the proposed system.


Nowadays, user information is accessed by inter-organizational members to provide fast and better services. For example, insurance and loan agents access the income details of a user from the bank to provide loans and policies; in a similar way, marketing agents access user information to improve their business. Hence, user information is viewed by authorized inter-organizational adversaries. The proposed system provides security to SAV through a group key mechanism and extends restricted access to inter-organizational adversaries.


Table 1. Information in Various Documents


In all the aforementioned TTDs, the information is maintained in a prescribed template format, and each TTD has a large CIT and CTC and a small SAV. Figure 1 shows the size proportions of CIT, CTC and SAV in an income tax document. The CIT occupies 56% of the storage space and the CTC occupies 26% of the storage space of a single document; as a whole, 82% of the information is common to all users and only 18% (the SAV) differs from user to user. Additionally, the CIT and CTC do not require security. When a TTD is encrypted and stored in its original format, the 82% of CIT and CTC is stored multiple times, with high storage and computational cost. Hence, in the proposed system a single copy of the CIT and CTC is stored in the CSS instead of multiple copies, and the remaining 18% of the data is stored encrypted as individual copies for each user.


Figure 1. Size Proportion of Subparts of TTD for Income Tax

Figure 1 exhibits the size proportions of a template-based text document for an income tax report. In this report, 56% of the information is common template information, 26% is common terms and conditions, and 18% is sensitive information, which includes the user name, permanent account number, income details, address, tax filing date and tax file number. This sensitive information is unique to each user. In the proposed method, the sensitive information is extracted and secured against outside attacks by using encryption algorithms [29].

The rest of this article is organized as follows. Section 2 discusses the existing SAV identification and protection techniques for text documents. Section 3 presents the proposed secure data model for the TTD together with its algorithms. Section 4 discusses the experimental results with the storage complexity, computational complexity and storage cost analysis. Section 5 presents the security analysis and attack types. Finally, the paper concludes with the features of the proposed work and future enhancements.

2. Related Work

2.1 Sensitive Attribute Identification in Text Document


In general, SAV require higher security than other attributes. For providing this security, SAV are identified in different ways, such as classification, clustering, semi-supervised methods and information-theoretic approaches. Among these, classification and clustering are suitable for structured numerical data, not for unstructured text documents, whereas the pattern matching and information extraction approaches of NLP are suitable for text documents. This section describes the merits and demerits of existing techniques for identifying SAV.


Pattern matching is the most preferred technique for identifying SAV in a TTD. Risto Vaarandi et al. used pattern matching to check sequences of terms: if a predefined term matches the input terms, the terms are identified as sensitive. Risto Vaarandi also discussed a LogCluster-based pattern matching technique for predicting the terms in a document, where words are identified by regular expression functions such as w-filter, w-search and w-replace; the accuracy of term prediction depends on the functions used in the regular expressions [4]. Ramanpreet Singh et al. presented a finite-automata-based pattern matching technique for term identification, in which the expected list of terms is identified through an N-gram process; the prediction accuracy depends on the list of predefined words [5]. Sreeja N.K. et al. focused on a pattern-matching-based classification algorithm for feature selection, in which instances and unlabeled samples are classified based on the attribute features, and an ant colony technique is used to optimize the accuracy of term selection [6]. Yung-Shen Lin et al. discussed a similarity-measure-based text classification and clustering technique for identifying similar terms in a given document; based on the similarity degree, the relevant terms are segregated from the other terms, and the accuracy of prediction depends on the similarity values [7].


Graham McDonald et al. [24] presented an automatic approach for identifying sensitive text in documents by gauging the quantity of sensitivity in a series of text. This approach increased the recall of sensitive text while achieving a very high level of precision, and when contrasted with a baseline the method was effective at identifying sensitive text in other domains. Manual identification of sensitive terms provides more accurate results than classification and pattern matching algorithms, but its processing speed is very low and it is a time-consuming process. Information-theoretic approaches are also used for identifying SAV in a text document.

Graham McDonald et al. proposed a semantic-feature-based sensitivity classification technique in which the SAV are classified by their term frequencies; to improve accuracy, n-gram features, grammatical features and semantic features are combined with each other and terms are classified from the text document [18]. David Sanchez et al. discussed an information-theoretic SAV identification process: in each document, specific terms provide more information content than the others, and by using inverse probability the authors measure the information value of a term 't'; the terms with high probability values are identified as sensitive [19]. Josep Domingo-Ferrer et al. surveyed privacy-preserving techniques for sensitive data in cloud computing. According to their survey, before the data is stored in the cloud, the sensitive data is partitioned into a number of subparts and these parts are stored in distributed cloud storage locations. The authors clearly show that semantic-relation-based partitioning works well for unstructured data such as documents and emails: according to semantic dependencies, the textual entities in a document are identified and grouped for encryption or sanitization [24]. Ziqi Yang et al. proposed an automated identification of sensitive data in a text document. The authors analyze the structure of the text through semantic, syntactic and lexical information: the Do defines the sensitivity requirement of the text and, through syntax analysis, the related terms are identified in the text using fixed syntactic patterns. If a pattern and the text match, the terms are extracted as sensitive. Here the meaning of each sensitive term is analyzed deeply before it is identified as sensitive, which makes this a time-consuming process [25].


Based on this analysis, pattern matching techniques and information-theoretic approaches are suitable for plain text documents, not for template-based text documents. Hence, an information-extraction-based segregation technique is required to provide fast and accurate results.

2.2 Data Protection Techniques in Cloud Storage


Conventional symmetric and asymmetric encryption techniques provide high security to SAV on a personalized local storage device. When data moves to the cloud, third-party Cloud Service Providers (CSP) maintain the user data and store it in different distributed locations. Hence, a large number of security issues occur in the CSS, and conventional security techniques are not sufficient to provide adequate security to SAV. To provide higher security to SAV, alternative encryption techniques have been proposed by various authors. This section focuses on existing and current security techniques for the CSS.


Nowadays, partition-based encryption techniques are used in the CSS, where the attributes are partitioned in vertical or horizontal representations. Ji-Jiang Yang et al. applied a vertical partition technique to an attribute list to partition the attributes into a number of parts: selected partitions are stored in ciphertext form and the other partitions are stored in plaintext form, and data merging is done by authorized users with record-level access. This technique works well for single access but not for multiple accesses [8]. Liudong Xing et al. focused on a sensitive-data partition approach within a document to avoid theft and corruption: the document is encrypted as a whole, divided into a number of parts and distributed to different virtual machines in the CSS, so attackers are unable to merge all the parts of the attribute values within a specific time period. However, it takes high computational time for encryption, partitioning and merging [9]. Yibin Li et al. discussed a security-aware distributed storage system in which the data packets are partitioned into two classes, sensitive and normal data; the sensitive data is split into a number of parts that are encrypted separately and uploaded to distributed cloud storage locations, so that inside and outside attackers are unable to access it [10].


Satheesh K S V Kavuri et al. discussed data authentication techniques in the CSS. The data partition depends on the selected data center information: user data is partitioned into a number of sub-partitions, and each partition is encrypted by an ABE technique and stored in a different data center. This partition technique ensures the integrity and authentication of user data [11]. Xuyun Zhang et al. proposed a top-down specialization approach for partitioning the data in two phases using MapReduce. The data is partitioned into a number of parts and each part is anonymized separately; the anonymized partitions are then merged and anonymized at a second level. The two-level anonymization provides better privacy for personal data, but if one level of the anonymization is identified by attackers, the other level can also be identified [12]. Hence, attribute- or column-based partition techniques provide better data protection and security in the CSS. Partition-based encryption techniques provide better security to SAV but increase storage and computational complexity.


In general, various types of ABE techniques, such as Ciphertext-Policy ABE (CP-ABE), Key-Policy ABE (KP-ABE) and Hierarchical ABE (HABE), are used in the CSS to protect data from inside and outside attackers. In an ABE scheme, encryption and decryption are performed on attributes [16]. L. Cheung et al. discussed Key-Policy ABE (KP-ABE), where the user's private key describes the access policy and the ciphertext is described by the roles of the attributes; in Ciphertext-Policy ABE, in contrast, the access policy is associated with the ciphertext and the user's private key is associated with the attributes [17]. In an ABE technique, attribute access depends on the number of attributes taken for encryption and on the role of the user accessing the attributes. Do-preference-based attribute protection has also been proposed in current protection techniques, as described here. Zheng Yan et al. proposed a Do-preference-based ABE technique in which the access control is set by the Do through individual trust levels based on personal interaction and experience. The Do encrypts the data with a symmetric secret key 'K' consisting of two parts, then sends one part of the key to the CSP and the other part to the requester. This technique provides more security for user data [20]. Sabrina De Capitani di Vimercati et al. used a fragmentation-based data protection technique: the Do partitions the data into a number of disjoint sets and encrypts them with an ABE technique, and the Do provides access control for authorized users to merge the fragments [21].


From this study, the merits and demerits of the existing techniques have been analyzed in depth. The major limitations are listed as follows:
 No methodology to handle or identify SAV in a TTD.
 High storage and computational cost due to entire-document encryption and partition-based encryption.
 No methodology for handling the security issues of template-based text documents.
 An equal level of security for all data in a document, which is unnecessary.


Motivation of the proposed work: The motivation of the proposed work is to develop a generic secure data storage model for cloud storage. Cloud storage is used to store user data with minimal storage and maintenance cost, but because a third-party service provider controls the data, many security issues occur in the CSS. The existing security techniques apply either entire-data encryption or partition-based encryption: in entire-document encryption the document is encrypted as a whole, and in partition-based encryption the document is divided into a number of chunks and each chunk is encrypted separately. Both of these techniques take high storage cost and computational time. No work has been proposed for a secure TTD storage system. A TTD has a distinct template structure, and it needs to be segregated / partitioned according to its data content and semantics, namely SAV, CIT and CTC. Additionally, applying entire-document encryption to a TTD is an unnecessary task, because the CIT and CTC are common to all users and do not require security. In each TTD only the SAV requires security, and inter-organizational adversaries require access to a subset of the SAV. Therefore, we propose a generic secure data model for TTD with the following goals:
 To identify SAV in a TTD using the information extraction techniques of NLP.
 To segregate the SAV from the CIT and CTC and store them efficiently to reduce storage cost.
 To reduce the computational cost by encrypting only the essential data instead of the complete document.
 To provide better security to SAV based on a group key.
Table 2 shows the terms used in the proposed system.

Table 2. Terms used in the proposed system

CSS – Cloud Storage System
TTD – Template based Text Document
CIT – Common Information in Template
CTC – Common Terms and Conditions
SAV – Sensitive Attribute Values
NSAV – Non-Sensitive Attribute Values
Do – Data Owner
Dok – Data Owner Key
ABE – Attribute Based Encryption
ECC – Elliptic Curve Cryptography
EECC – Enhanced Elliptic Curve Cryptography
Pr – Private Key
Pu – Public Key
Ok – Organization Key
Gk1 to Gkn – Group Keys 1 to n
EGk – Encryption using the Group Key
CSP – Cloud Service Provider
KE – Keyword Extraction
PM – Pattern Matching
KWList – List of Sensitive Attribute Keywords
SAPAT – Patterns for Sensitive Attribute Values
EnDoc – Entire Document
N – Number of Documents
n – Number of Groups
SZ – Size of ( )
$ – Storage Cost
ET – Encryption Time
πT – Partition Time
DT – Decryption Time
ŋ – Merging

3. Proposed Work

In order to achieve cost reduction and security for the data stored in the CSS, we propose SAV-based encryption for TTD storage on the cloud. The aim of the proposed system is to protect the SAV of a TTD in a hybrid cloud while optimizing the storage and computational cost. Figure 2 represents the flow diagram of the proposed work, comprising SAV extraction, grouping of the SAV, encryption of the SAV groups with group keys and upload into cloud storage. The valuable SAV are encrypted by the Enhanced Elliptic Curve Cryptography (EECC) technique [23]. A single copy of the CIT and CTC is stored in one cloud storage location (public cloud) and the encrypted SAV are stored in another cloud storage location (private cloud). Thus, the proposed generic storage model reduces storage cost and computational complexity.

Figure 2. Flow diagram of the Proposed Work


Initially, the TTD is sent to the database admin by the organization, and the Do sends a list of SAV to the information extraction system. Information extraction is then applied to extract the SAV from the TTD using keyword extraction and pattern matching. After that, the identified SAV are grouped into 'n+1' groups based on the organizational requirements, and the grouped SAV are encrypted with group keys by the EECC algorithm; that is, each SAV group is encrypted with a separate group key, EGk(SAV) = {Gk1(SAV1), Gk2(SAV2), ..., Gkn(SAVn)}. The group key is a combination of the ECC private key and a Do-defined secret key. The encrypted SAV are uploaded into the private cloud and the non-encrypted CIT and CTC are uploaded into the public cloud. For decryption, the adversary sends a request to the CSP; the CSP checks the authentication of the requestor and sends the encrypted SAV along with the CIT and CTC to the adversary, who then merges the CIT, CTC and SAV to obtain the original document. The following subsections describe SAV identification using information extraction, SAV grouping according to organization requirements, SAV encryption based on a key generation algorithm along with decryption, and finally SAV and template merging.
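To make the resulting storage layout concrete, the following minimal Python sketch builds the EGk(SAV) structure described above. The helper name, the stub cipher and the sample values are illustrative placeholders, not the paper's EECC implementation.

def encrypt_with_group_key(gk, value):
    # Stand-in for EECC encryption with group key Gk (see Sections 3.3 and 3.4).
    return f"E_Gk{gk}({value})"

# n+1 SAV groups: here n = 2 linked organizations plus one private group.
sav_groups = {
    "insurance": {"PAN": "AMHPG8135E"},
    "loan":      {"Income": "850000"},
    "private":   {"PIN": "4321"},
}
group_keys = {"insurance": 1, "loan": 2, "private": 3}

# Private cloud: one ciphertext per group, EGk(SAV) = {Gk1(SAV1), ..., Gkn+1(SAVn+1)}.
private_cloud = {
    group: {attr: encrypt_with_group_key(group_keys[group], value)
            for attr, value in attributes.items()}
    for group, attributes in sav_groups.items()
}
# Public cloud: a single plaintext copy of the common template and terms.
public_cloud = {"CIT": "<income tax template>", "CTC": "<common terms and conditions>"}
print(private_cloud)
print(public_cloud)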


3.1 Sensitive Attribute Value Identification using Information Extraction


The TTD is taken as input in PDF form, and PDF parsing is performed before the identification of SAV. The SAV are extracted from the TTD by using Keyword Extraction (KE) and Pattern Matching (PM). In general, a single PM algorithm is used for term identification in a text document, but a TTD consists of a variety of terms; hence, multiple information extraction techniques are applied to the TTD. A TTD contains a predefined CIT and CTC, and within a large TTD only a small set of terms differs from other TTDs. These differing terms are segregated from the CIT and CTC [18]. Figure 3 and Algorithm 1 show the SAV identification in the income tax document, and Figure 4 shows the document separation and merging of the proposed work.

Figure 3. Sensitive Attribute Value Extraction in TTD


1. Information Extraction using Pattern Matching – PM describes a specific text pattern and makes it easy to search for a specific word or string in a document. For example, in an income tax document the Account Number, Social Security Number, Reference Number, etc. follow a specific pattern and can often be better identified by sequence matching. Sequence matching is used to identify the terms that are relevant to the given sequence; the remaining terms are sent to the KE.

SAV ← Seq.                                                                 (1)

Based on equation (1), each pattern is matched exactly against the terms in the TTD. The same procedure applies to all patterns such as the Account Number, Reference Number, Social Security Number, Acknowledgement Number and Policy Number. Table 3 shows the list of patterns with examples.


Information Extraction, an application of NLP, deals with extracting key domain-specific or generic information (including names of persons, locations and organizations) from natural language text. Domain-specific information extraction includes the identification of entities that are relevant to a particular domain. For instance, FIR numbers, rules referred to and case numbers represent domain-specific entities in a court judgment, while in a medical record the patient name, patient registration number, etc. are the domain-specific entities. In the proposed methodology, information extraction techniques are used to extract key information from the text, and this key information is of a sensitive nature with respect to each domain, for example the PAN, Aadhar No., Insurance Policy No., etc.


In medical documents, the precision and recall of extracting sensitive attributes, which include the patient name, patient id, etc., depend on how the attributes appear in the text. In general, for any type of document, keyword-based techniques work well for attributes that are preceded by some domain-specific or attribute-specific keyword, for instance the PAN, Aadhar No., email id, Patient No., Insurance Policy No., etc. Regular expression methods work well for extracting tokens that follow specific patterns, as in the PAN, Aadhar No., etc. These two techniques can be used depending on how the attributes are mentioned in the input text. In some situations, both the keyword-based and the regular-expression-based techniques fail to extract attributes that are not pattern-specific and not preceded by domain-specific keywords; for example, a few medical records contain patient names that are not preceded by keywords or by prefixes like Mr., Ms., etc. This in turn affects the overall precision of extracting patient names from medical records, as in Table 3.

Figure 4. CIT, CTC and SAV Separation for the Income Tax Document (panels: Input – Income Tax Document; Non-Sensitive Template; Sensitive Attributes)


2. Information Extraction using Keyword Extraction – In many instances, terms or strings are preceded by keywords, and these keywords are identified by the KE. The keyword search identifies the location of the critical pieces of information in the text. In an income tax document, the account holder's name and address are preceded by the keyword "Name and Address:"; the text that follows "Name and Address:" is therefore extracted as the account holder's name and address. The same procedure is applied to the Social Security Number, gender, etc. Equation (2) represents the identification of the terms that follow keywords such as "Name and Address:" and "SSN No:".

SAV ←                                                                      (2)


Algorithm 1: Sensitive Attribute Value Identification – Pattern Matching (SAI-PM)


Input:
    TTD    : Template based Text Document
    KWList : List of Sensitive Attribute Keywords
    SAPAT  : Patterns for Sensitive Attribute Values
Output:
    SAV : Sensitive Attribute Values
    CIT : Common Information in Template
    CTC : Common Terms and Conditions
Method:
    1: Initialize SAV ← ∅
    2: for all text sequences Seq. ϵ TTD that match a pattern ϵ SAPAT do
    3:     SAV ← SAV ∪ {Seq.}
    4: end for
    5: for all text sequences Seq. ϵ TTD that match a keyword ϵ KWList do
    6:     SAV ← SAV ∪ {Seq.}
    7: end for
    8: return SAV

The extracted information is now segregated from the TTD. The benefit of data segregation is that it reduces the storage cost of TTDs in the cloud through one-time storage of the CIT and CTC: it reduces the storage space utilization and storage cost when a large number of instances are stored, and it also reduces the encryption time, since the CIT and CTC do not require security. The identified SAV are passed on to the next level for securing them.
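A minimal Python sketch of the SAI-PM flow of Algorithm 1 is given below. It assumes a small subset of the patterns in Table 3 and an illustrative keyword list; the authors' full SAPAT and KWList, and their PDF parsing step, are not reproduced here.

import re

# Illustrative patterns (subset of Table 3); the full SAPAT set is an assumption.
SAPAT = {
    "PAN":    re.compile(r"(?<=PAN:)\s?([A-Z]{5}\d{4}[A-Z])"),
    "AADHAR": re.compile(r"(?<=Aadhar Number:)\s?(\d{4}\s\d{4}\s\d{4})"),
}
# Keywords that precede free-text sensitive values (illustrative KWList).
KWLIST = ["Name and Address:", "SSN No:"]

def sai_pm(ttd_text):
    """Return (SAV, template): the extracted sensitive values and the
    remaining common template text with placeholders."""
    sav, template = {}, ttd_text
    # Pass 1: pattern matching with regular expressions.
    for label, pattern in SAPAT.items():
        for match in pattern.finditer(ttd_text):
            sav.setdefault(label, []).append(match.group(1))
            template = template.replace(match.group(1), f"<{label}>")
    # Pass 2: keyword extraction - take the text that follows each keyword
    # up to the end of the line.
    for kw in KWLIST:
        for match in re.finditer(re.escape(kw) + r"\s*(.+)", ttd_text):
            label = kw.rstrip(":")
            sav.setdefault(label, []).append(match.group(1).strip())
            template = template.replace(match.group(1), f"<{label}>")
    return sav, template

doc = "Name and Address: R. KUMAR, 12 North Street, Trichy\nPAN: AMHPG8135E"
print(sai_pm(doc))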


3.2 Sensitive Attribute Value - Grouping Process

Based on the Do's preferences, SAV can be given to various other organizations linked to the parent organization, for providing loans, insurance, marketing, etc. This leads to the need for a method that grants the right access to the right data from the available data. The grouping of attribute values is associated with the attribute access and security requirements. The SAV is partitioned into 'n+1' groups, where 'n' is the number of other linked organizations and one additional group holds the user's personal information that is never shared with others. To facilitate this, attributes are grouped as given below (a sketch follows the list):
 Organization-required attribute values – In real-life applications, user information is accessed by different types of authorized users, and each organization requires a different set of attributes; for example, income tax information is accessed by a loan organization, an insurance organization or marketing. Hence, user attributes are partitioned into 'n' groups with possibly intersecting attribute values.
 User personal attributes – Apart from the organization requirements, some valuable attribute values, such as the PIN number and password, are never shared with others. Hence, one more group is formed from these attribute values.
At the end of grouping, 'n+1' groups are generated from the SAV, and each group of SAV has to be encrypted. The next section describes the generation of the SAV encryption keys for different users using the EECC algorithm.
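As a sketch of the grouping step referenced above, the snippet below partitions the extracted SAV into 'n+1' groups. The organization-to-attribute mapping is a hypothetical example; in practice it comes from the Do and the linked organizations.

def group_sav(sav, org_requirements):
    """Partition extracted SAV into n+1 groups: one per linked organization
    plus one private group for values never shared (e.g. PIN, password)."""
    groups = {org: {} for org in org_requirements}
    groups["private"] = {}                      # the (n+1)-th group
    for attr, value in sav.items():
        shared = False
        for org, wanted in org_requirements.items():
            if attr in wanted:                  # intersecting attribute sets are allowed
                groups[org][attr] = value
                shared = True
        if not shared:
            groups["private"][attr] = value
    return groups

# Hypothetical requirements: which attributes each linked organization may see.
org_requirements = {
    "insurance": {"Name and Address", "PAN", "Income"},
    "loan":      {"Name and Address", "PAN", "Income", "Tax file number"},
}
sav = {"PAN": "AMHPG8135E", "Income": "850000", "PIN": "4321"}
print(group_sav(sav, org_requirements))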

Table 3. Sample Patterns with Examples for the Income Tax Document

Attribute | Pattern | Example
Permanent Account Number (PAN) | (?<=PAN:)\s?([A-Z]{5}\d{4}[A-Z]{1}) | PAN: AMHPG8135E
Aadhar Number | (?<=Aadhar Number:\s)\d{4}\s\d{4}\s\d{4}[^.\s] | -
Person Name | (?<=NAME:\n)((?:Mr\.|Mrs\.|Ms\.)\s*[A-Z][A-Z]+\s[A-Z]\.\s*(?:[A-Z]\.(?:\s*[A-Z][A-Z]+)?)) | NAME: Mr. MANIKANDAN S.
Person Name | ((?<=NAME:\n)(?:Mr\.|Ms\.)\s*[A-Z][A-Z]+\s[A-Z][A-Z]+(?:\s[A-Z][A-Z]+)?) | NAME: Mr. BABU MATHEW

3.3 Sensitive Attribute Value - Encryption Key Generation Algorithm


The construction of the proposed system's security framework is based on the Elliptic Curve Cryptography (ECC) technique and a key specified by the Do. Compared to other asymmetric key techniques, ECC requires a smaller key size and provides a higher level of security for SAV. ECC uses a comparatively short encryption key, a value that must be supplied to the algorithm to decode an encrypted message, yet attains with these shorter keys a level of security higher than traditional public-key cryptography. The short key is faster and needs less computing power, memory and battery life than first-generation public-key algorithms. The security strength of ECC depends on the complexity of the discrete logarithm problem. The basic ECC equation is

y^2 = x^3 + ax + b                                                         (3)

defined over the prime field F_p, where 'p' is a prime number greater than 3. The values a, b ∈ F_p must satisfy the condition 4a^3 + 27b^2 ≠ 0 mod p. The elliptic curve depends on the tuple (p, q, G, a, b), where 'p' is the prime field and G (the base point) is a generator of order 'q'.


Based on the EC values, the private, public and group keys are generated for the SAV groups' encryption process. Algorithm 2 specifies the private, public and group key generation process using EECC. Here, Pu and Pr are the public and private keys based on ECC, and the Organization key (Ok) is a shared secret key generated by the organization's members and used for the generation of the Group key (Gk). The proposed system combines the ECC private and public keys with the organization key; thus, the proposed encryption algorithm is referred to as the Enhanced Elliptic Curve Cryptography (EECC) algorithm. The 'n+1' groups are encrypted with 'n+1' Gk's, where each Gk is computed as the XOR of the user's private key (Pr) and the Organization key (Ok). 'N' TTDs are therefore encrypted with 'N*(n+1)' keys: 'N' TTDs each have 'n+1' groups, so 'N*(n+1)' keys are required for encryption.

Algorithm 2: Key Generation (Pu, Pr, Gk)


Input:
    EC : Elliptic Curve
    P  : Prime value - Random Point
    Ok : Organization key
Output:
    Gk : Group Key
Method:
    1. Choose an elliptic curve EC in a finite field FP.
    2. Find a point 'P' randomly on EC.
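Algorithm 2 is truncated in this pre-proof, so the Python sketch below reconstructs the key-generation flow from the description above: an ECC key pair (Pr, Pu = Pr·G) and a group key Gk = Pr XOR Ok. The tiny curve parameters are illustrative only and are not the curve or parameters used in [23].

import secrets

# Toy curve y^2 = x^3 + 2x + 3 over F_97 (illustrative only; a real deployment
# would use a standard curve such as secp256r1).
p, a, b = 97, 2, 3
assert (4 * a**3 + 27 * b**2) % p != 0    # non-singular curve: 4a^3 + 27b^2 != 0 mod p
G = (3, 6)                                # base point on the curve
q = 5                                     # order of G on this toy curve

def point_add(P, Q):
    """Add two curve points (None represents the point at infinity)."""
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        m = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        m = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (m * m - x1 - x2) % p
    return (x3, (m * (x1 - x3) - y1) % p)

def scalar_mult(k, P):
    """Compute k*P by double-and-add."""
    R = None
    while k:
        if k & 1:
            R = point_add(R, P)
        P = point_add(P, P)
        k >>= 1
    return R

def key_generation(Ok):
    """Return (Pr, Pu, Gk): ECC private/public keys and Gk = Pr XOR Ok."""
    Pr = secrets.randbelow(q - 1) + 1
    Pu = scalar_mult(Pr, G)
    Gk = Pr ^ Ok
    return Pr, Pu, Gk

print(key_generation(Ok=0b1011))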

3.4 Sensitive Attribute Value - Encryption Algorithm

In the encryption, 'N*(n+1)' Gk's are used. Depending on the organization requirements, the SAV is partitioned into 'n+1' groups, and each partition is encrypted with a separate Gk generated specifically for that organization's access. The encrypted SAV partitions are then uploaded onto the private cloud. Algorithm 3 shows the SAV encryption with Gk; in both encryption and decryption, Gk is used as the secret key.


Algorithm 3: Sensitive Attribute Value Encryption Algorithm (Gk, G(SAV))

Input: Group_key (Gk), Group_of_Sensitive_Attribute_Values (G(SAV)), Organization_Requirement (OR)
Output: Ciphertext of the Sensitive Attribute Values (C(SAV))
Method:
    for each G(SAV) do
        if (Gi(SAV) = ORi) then
            Ci(G(SAV)) = EGki[Gi(SAV)]
        else
            Ci+1(G(SAV)) = EDok[Gi+1(SAV)]
        end if
    end for
    return Ci(G(SAV)), Ci+1(G(SAV))
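Algorithm 3 leaves the cipher operation E_Gk abstract. As one concrete reading (an assumption for illustration, not the authors' EECC construction from [23]), the sketch below derives a 256-bit symmetric key from each group key and encrypts every SAV group separately with AES-GCM from the Python cryptography package.

import json, os, hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM   # pip install cryptography

def derive_key(gk: int) -> bytes:
    """Stretch an integer group key Gk into a 256-bit AES key (illustrative)."""
    return hashlib.sha256(gk.to_bytes(32, "big")).digest()

def encrypt_groups(groups: dict, group_keys: dict) -> dict:
    """Encrypt each SAV group with its own group key: Ci = E_Gki[Gi(SAV)]."""
    ciphertexts = {}
    for name, attributes in groups.items():
        aes = AESGCM(derive_key(group_keys[name]))
        nonce = os.urandom(12)
        plaintext = json.dumps(attributes).encode()
        ciphertexts[name] = nonce + aes.encrypt(nonce, plaintext, None)
    return ciphertexts

groups = {"insurance": {"PAN": "AMHPG8135E"}, "private": {"PIN": "4321"}}
group_keys = {"insurance": 14, "private": 7}       # Gk values from Algorithm 2
cipher = encrypt_groups(groups, group_keys)
print({k: v.hex()[:32] + "..." for k, v in cipher.items()})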


After the encryption, the encrypted SAV groups are stored in the private cloud, and the non-encrypted CIT and CTC are stored in plaintext form in the public cloud. The next section describes the decryption of the SAV groups.

3.5 Sensitive Attribute Value - Decryption Algorithm

Algorithm 4: Sensitive Attribute Value Decryption Algorithm (Gk, C(G(SAV)))

Input: Group_key (Gk), Ciphertext C(G(SAV)), Organization_Requirement (OR)
Output: Group of Sensitive Attribute Values (G(SAV))
Method:
    for each C(G(SAV)) do
        if (Ci(G(SAV)) = ORi) then
            Gi(SAV) = DGki[Ci(G(SAV))]
        else
            Gi+1(SAV) = DDok[Ci+1(G(SAV))]
        end if
    end for
    return Gi(SAV), Gi+1(SAV)
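Continuing the encryption sketch above (same AES-GCM assumption, reusing derive_key and cipher), an organization that holds only its own Gk can recover just its own group:

def decrypt_group(ciphertexts: dict, name: str, gk: int) -> dict:
    """Reverse of encrypt_groups(): split the nonce and body, then AES-GCM decrypt."""
    blob = ciphertexts[name]
    nonce, body = blob[:12], blob[12:]
    aes = AESGCM(derive_key(gk))
    return json.loads(aes.decrypt(nonce, body, None))

print(decrypt_group(cipher, "insurance", 14))   # only the insurance group opens; a wrong Gk raises InvalidTag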


Algorithm 4 describes the decryption procedure to obtain the plaintext for a particular group of SAV. When an authorized organization member sends a request message to the CSP for access to a particular user's TTD, the CSP checks the authenticity of the requester and sends the requested TTD with the encrypted group attributes. The requester then uses its shared secret group key for decryption: the decryption algorithm takes as inputs the Gk and the ciphertext C and outputs the plaintext 'P' of the specific group. The same procedure is repeated for all ciphertext groups with their own Gk's. The decrypted attribute groups are then transferred to the SAV, CIT and CTC merging step.


3.6 Sensitive Attribute Value and Template Merging

This section describes the merging of the CIT, CTC and decrypted SAV. The decrypted group attribute values are merged with the CIT and CTC to obtain the original TTD by using the PM defined in Section 3.1. Only the required SAV are decrypted, instead of the entire SAV, and the decrypted SAV are placed back into the TTD through the PM task, i.e. the reverse of the term extraction.


4 Performance and Complexity Analysis

In this section, we analyze the proposed generic storage model with respect to encryption and decryption time, storage space requirements and storage cost, using our own synthetic dataset consisting of income tax documents, legal contracts and insurance documents. In the income tax document, user information and personal details are maintained in a template format along with the CTC. In a medical document, user and doctor details are maintained along with details of diseases and the terms and conditions of treatment. The insurance document contains the policy and user details along with personal information and CTC. The generated documents therefore have unique personal information and the same CTC across different documents; that is, each document consists of CIT, CTC and SAV. The following sections describe the time and space complexity analysis of the proposed data model.


4.1 Storage Space Complexity Analysis

For the storage space analysis, the entire-document (EnDoc) storage space is compared with the storage space of the proposed system. The storage space requirement increases as the number of documents increases; that is, the storage space is directly proportional to the number of documents. When a large number of documents are stored in the cloud in their original format, a large volume of storage space is occupied by duplicate values, i.e. information that is common to all users. Figure 5 shows the storage space analysis of the EnDoc, CIT and CTC, and SAV: the CIT and CTC size grows as the number of documents increases. Figure 6 shows the storage space requirements of the proposed system; compared to the existing techniques, the proposed storage system requires much less storage space. Space complexity is measured by the amount of storage space occupied in cloud storage, and the size S_Z of a TTD is measured in bytes. Equation (4) measures the storage space requirement for storing a single document in the cloud without encryption:

S_Z(TTD) = S_Z(CIT) + S_Z(CTC) + S_Z(SAV)                                  (4)

Equation (5) represents the storage space requirement of 'N' TTDs in the cloud under EnDoc storage:

S_Z(TTD, N) = Σ_{i=1}^{N} S_Z(TTD)                                         (5)

The storage space requirement of the proposed data model is represented in equation (6):

S_Z(TTD, N) = S_Z(CIT) + S_Z(CTC) + Σ_{i=1}^{N} S_Z(SAV)                   (6)

For example, if the size of the entire TTD is 430KB, the size of the CIT is 147KB, the size of the CTC is 150KB and the size of the SAV is 133KB, then storing 10000 documents in their original format requires 10000 x 430KB = 4,300,000KB of storage space. In the proposed data model, the storage space requirement is 147 + 150 + (10000 x 133) = 1,330,297KB. Hence, the proposed data model requires only about 31% of the storage space of EnDoc storage, which reduces the storage cost drastically. Table 4 shows the storage space requirement of a single document and of 10000 documents for the traditional and the proposed storage systems.
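The arithmetic behind these figures (equations (5) and (6) applied to the sample sizes) can be checked with a few lines of Python:

# Sizes in KB quoted for the sample income tax TTD, and the document count.
CIT, CTC, SAV, N = 147, 150, 133, 10000

traditional = N * (CIT + CTC + SAV)      # equation (5): every document stored in full
proposed    = CIT + CTC + N * SAV        # equation (6): CIT and CTC stored only once

print(traditional)                       # 4300000 KB
print(proposed)                          # 1330297 KB
print(100 * proposed / traditional)      # ~31% of the traditional storage space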

Table 4. Storage Size Comparison of CIT, CTC and SAV

Document Part | Traditional Storage - Single Document | Proposed Storage Model - Single Document | Traditional Storage - 10000 Documents | Proposed Storage Model - 10000 Documents
Common information in template | 147KB | 147KB | 1,470,000KB | 147KB
Common terms and conditions | 150KB | 150KB | 1,500,000KB | 150KB
Sensitive attribute values | 133KB | 133KB | 1,330,000KB | 1,330,000KB
Total storage size | 430KB | 430KB | 4,300,000KB | 1,330,297KB
Storage space requirement (%) | 100% | 100% | 100% | 30.93%
Jo

urn al P

The data of Table 4 is shown graphically in Figure 5, which compares the storage space occupied by the EnDoc with that of the CIT and CTC and of the SAV, and clearly shows the strength of the proposed technique. The storage space is plotted against the number of documents: for 1000 documents, the storage space required by the entire document is large, whereas the CIT and CTC and the SAV require very little space in contrast to the EnDoc. The same holds for the remaining numbers of documents, which demonstrates the superior performance of the proposed method.

Figure 5. Storage Size Requirement of EnDoc, CIT and CTC, SAV


Figure 6. Storage Size Requirement of EnDoc and SAV


Figure 6 shows the storage size requirement of the EnDoc and the SAV. The figure clearly shows that if a document is stored using the proposed technique, its storage space on the cloud is reduced. In the existing systems, the document is either stored as a whole or divided into a number of chunks stored in different locations, and there is no mechanism to handle the common information in the document; in both cases the storage space requirement is very high. Compared to EnDoc storage, the proposed TTD storage technique requires much less storage space, because instead of repeatedly storing the CIT and CTC, the proposed system stores a single copy of them.

4.2 Time Complexity Analysis

The time complexity is measured by the amount of time required to partition (π) the CIT, CTC and SAV, to merge (ŋ) the CIT, CTC and SAV, to encrypt (E) the SAV and to decrypt (D) a specific attribute group. The encryption time (ET) is the time taken to convert plaintext to ciphertext, measured in milliseconds. Figure 7 shows the ET taken for EnDoc encryption and for SAV encryption. In general, encryption techniques are applied to the EnDoc without partitioning; in the proposed system, the TTD is partitioned into CIT, CTC and SAV, and encryption is applied to the SAV only. Hence, the time taken by the proposed system is the sum of the partition time and the ET. Equations (7) and (8) specify the partition time (πT) of the SAV, CIT and CTC for a single TTD and for 'N' TTDs:

π_T(TTD) = π(SAV, CIT, CTC)                                                (7)

π_T(TTD, N) = Σ_{i=1}^{N} π_T(TTD)                                         (8)

Equations (9) and (10) describe the time taken to encrypt an entire TTD without partitioning and the time taken to encrypt a TTD with partitioning, for different key sizes:

E_T(TTD, N) = Σ_{i=1}^{N} E(SAV, CIT, CTC)                                 (9)

E_T(SAV, N) = Σ_{i=1}^{N} E(SAV)                                           (10)

The graph in Figure 7 clearly shows that EnDoc encryption takes a higher ET than SAV encryption for both ECC and EECC, while ECC and EECC themselves take an almost equal amount of time. A major benefit of the proposed system is that the key identification time for EECC is higher than for ECC, so the security breakage time is also higher and the proposed EECC provides better security than ECC. For example, if the ET for a single entire TTD is 5ms, the ET for 10000 records is 5 x 10000 = 50000ms. Instead, the πT for a TTD is 0.8ms and the ET for the SAV of a single document is 0.3ms, so the total time for a single document is 1.1ms; applying partitioning and encryption to 10000 records takes 1.1 x 10000 = 11000ms. Hence the proposed system takes only 22% of the ET of EnDoc encryption. Table 5 shows the time complexity analysis of the proposed system. Figure 7 also exhibits the graphical view of the encryption time of the EnDoc and the SAV; from the figure, it is evident that the time taken for encrypting the SAV is low in contrast to the EnDoc. Equations (11) and (12) represent the time taken for merging a single TTD and 'N' TTDs:

ŋ(SAV, CIT, CTC) = TTD                                                     (11)

ŋ_T(TTD, N) = Σ_{i=1}^{N} ŋ(SAV, CIT, CTC)                                 (12)

Table 5. Time Complexity Analysis

Measure | Value
Single Document Encryption Time (ms) | 5
ET for 10000 Documents (ms) | 50000
Partition and ET for Single Document (ms) | 0.8 + 0.3 = 1.1
Partition and ET for 10000 Documents (ms) | 1.1 x 10000 = 11000
Time taken for the proposed system (percentage) | 22.00%


Table 5 compares the encryption time consumed for the entire document and for the proposed method. The encryption time for 10000 entire documents is 50000ms, whereas the time taken by the proposed method is only 11000ms, which demonstrates the strength of the proposed method.
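The same kind of check reproduces the Table 5 figures from the per-document timings reported above:

docs = 10000
endoc_encrypt_ms = 5.0                   # ET for one entire TTD
partition_ms, sav_encrypt_ms = 0.8, 0.3  # proposed: partition time + SAV encryption time

print(docs * endoc_encrypt_ms)                                          # 50000 ms (EnDoc)
print(docs * (partition_ms + sav_encrypt_ms))                           # ~11000 ms (proposed)
print(round(100 * (partition_ms + sav_encrypt_ms) / endoc_encrypt_ms))  # ~22 percent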

Figure 7. Encryption Time Analysis


The decryption time (DT) is the time taken to decrypt the encrypted attributes. In the EnDoc decryption method, the user must decrypt all the attributes, not only the required ones; in the proposed data model, only the required group of attribute values is decrypted instead of all the attribute values. Figure 8 shows the DT analysis. The decryption of the entire TTD and of the SAV alone is shown in equations (13) and (14):

D_T(TTD, N) = Σ_{i=1}^{N} D(SAV, CIT, CTC)                                 (13)

D_T(SAV, N) = Σ_{i=1}^{N} D(SAV)                                           (14)

Finally, the total time required for the encryption and decryption of the SAV is described in equations (15) and (16):

E_T(TTD, N) = π_T(TTD, N) + E_T(SAV, N)                                    (15)

D_T(TTD, N) = ŋ_T(TTD, N) + D_T(SAV, N)                                    (16)

Figure 8. Decryption Time Analysis


In EnDoc decryption, the time taken by ECC and EECC is equal because the key sizes are the same. In the proposed system, only the specific SAV group is decrypted instead of the EnDoc; hence, the DT is much lower than the EnDoc DT. Table 6 specifies the DT requirements of the proposed system.


Table 6. Time complexity analysis of the proposed system

Particulars | Time Requirement
Number of partitions / groups for a single Do (n = number of groups) | n+1
Number of partitions / groups for 'N' Do (N = number of Do) | N * (n+1)
If an adversary wants to decrypt all groups of 'N' users | [N * (n+1)]! times
If an adversary wants to decrypt any one group among 'N' users | [N * (n+1)!] times
The EnDoc DT | N!

The number of decryption attempts required is comparatively high, and hence it is difficult for an adversary to identify the secured data.


Computational cost (ET and DT): The proposed system requires less ET because, instead of the EnDoc, only the small-sized SAV is encrypted by EECC; the proposed approach needs only about 22% of the encryption time of EnDoc encryption. Hence, our model requires minimal ET.


4.3 Storage Cost Analysis

The main characteristic of cloud storage is pay-per-usage. The proposed system provides better results in securing the SAV with minimal storage cost. Figure 9 shows the storage cost ($) comparison of EnDoc storage and the proposed TTD storage; it clearly shows that the proposed TTD storage incurs a much lower storage cost than EnDoc storage.

Figure 9. Storage Cost Analysis

Storage cost (storage size and storage cost in $): In the proposed data storage model, the common part of the TTD is stored as a single copy instead of 'n' times, and only the unique SAV is stored for each of the 'n' users. As noted above, 82% of the data is the common part of the TTD and only 18% of the data is unique; thus, instead of storing 'n x 82%' of the data, only '1 x 82%' is stored in the CSS. Hence, the storage space occupied by the proposed system is very small. In the CSS, the storage cost depends on the storage space occupied by the user data; since the proposed model occupies very little space, the storage cost is also very low.

5 Security Analysis

This section focuses on the security analysis of the proposed system on the basis of key and plaintext identification, along with an analysis of different types of attack on the proposed system.


5.1 Security Goal


Our main security goal is to develop a generic secure data storage model for cloud storage. The system achieves the following security objectives:
 Confidentiality: The SAV is grouped into 'n+1' parts and each group is encrypted with a separate Gk; that is, instead of a single key, 'n+1' keys are used for encryption. Thus, the system provides high confidentiality for the SAV stored in the CSS.


 Access Control: A primary goal of our system is to provide efficient fine-grained access control to authorized users. Only a specific group of SAV is accessible to a specific adversary, not all attributes; that is, an adversary having access rights to a specific group is able to access that group only, and no other.


 Key Management: In the proposed system, 'n+1' keys are generated for SAV protection, and the user-preference-based Gk and Ok are used for encryption and decryption. The public key is openly available and hence does not require management, whereas Ok is a secret key that is sent to authorized requesters through a secure channel; thus, Ok must be maintained secretly by the Do. Hence, key management is not an issue in the proposed system.

5.2 Types of Attacks

This subsection analyzes different types of attack on the proposed system.
 Man-in-the-middle attack – A man-in-the-middle attack is not possible in the proposed TTD storage. Before moving to cloud storage, the SAV are separated from the TTD and encrypted by the EECC algorithm, which provides higher security than other asymmetric algorithms. In addition, the attribute values are divided into a number of groups and each group is encrypted with a separate Gk. Hence, a man-in-the-middle attack is not feasible in the proposed system.


 Insider attack – A major drawback of cloud storage is the inside attacker: because cloud data is maintained by third-party cloud service providers, user data can easily be accessed by them. To overcome this problem, the SAV are encrypted by EECC before being uploaded to the cloud. In general, the ECC security strength depends on random values, and additionally each group of attributes is encrypted with a different key. To obtain the EnDoc, decryption of all the groups and merging of the decrypted parts are required, which is a complex task for an inside attacker. Thus, insider attackers are not able to access the user's SAV.


 Brute force attack – In general, a brute force attack tries to identify the key used for the encryption. In the proposed system, instead of a key based only on a simple random value, a user-defined key is also used for the encryption, and each group of SAV is encrypted with separate key values. Thus, identifying each group key is a complicated task.

6 Conclusion and Future Work


Cloud storage is used to store huge quantities of data with minimal storage and maintenance costs, but security and privacy are main issues of the CSS along with the storage cost. The proposed work focuses on a template-based text document storage system. The proposed technique segregates the sensitive attribute values from the common information in the template and the common terms and conditions to optimize storage and computational cost, and the EECC technique is applied to this sensitive information. The common information is stored once without encryption, which in turn reduces the storage cost and encryption time. To improve security, the sensitive attributes are grouped into 'n+1' groups and each group is encrypted with a separate group key. Hence, the proposed system provides efficient and secure storage for template-based text documents. The experimental outcomes plainly exhibit that the encryption time and the decryption time are much lower than those of the existing techniques. Additionally, the proposed system's storage space requirement is much smaller than that of the existing techniques, which leads to a lower computational expense. In the future, an access control system can be implemented on top of the present work.

References:


References:

1. Tian Wang, Jiyuan Zhou, Xinlei Chen, Guojun Wang, Anfeng Liu and Yang Liu, "A three-layer privacy preserving cloud storage scheme based on computational intelligence in fog computing", IEEE Transactions on Emerging Topics in Computational Intelligence, Vol. 2, No. 1, 2018, PP 3-11.
2. S. Brintha Rajakumari and C. Nalini, "An efficient cost model for data storage with horizontal layout in the cloud", Indian Journal of Science and Technology, Indian Society for Education and Environment, Vol. 7, 2014, PP 45-46.
3. Lifei Wei, Haojin Zhu, Zhenfu Cao, Xiaolei Dong, Weiwei Jia, Yunlu Chen and Athanasios V. Vasilakos, "Security and privacy for storage and computation in cloud computing", Journal of Information Sciences, Elsevier, Vol. 258, 2014, PP 371-386.
4. Risto Vaarandi and Mauno Pihelgas, "LogCluster - A data clustering and pattern matching algorithm for event logs", 11th International Conference on Network and Service Management (CNSM), Barcelona, Spain, November 2015, PP 1-7.
5. Ramanpreet Singh and Ali A. Ghorbani, "Efficient PMM: Finite automata based efficient pattern matching machine", International Conference on Computational Science (ICCS 2017), Zurich, Switzerland, 2017, PP 1060-1070.
6. N. K. Sreeja and A. Sankar, "Pattern matching based classification using ant colony optimization based feature selection", Journal of Applied Soft Computing, Elsevier Science Publishers, Vol. 31, Issue C, 2015, PP 91-102.
7. Yung-Shen Lin, Jung-Yi Jiang and Shie-Jue Lee, "A similarity measure for text classification and clustering", IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 7, 2014, PP 1575-1590.
8. Ji-Jiang Yang, Jian-Qiang Li and Yu Niu, "A hybrid solution for privacy preserving medical data sharing in the cloud environment", Journal of Future Generation Computer Systems, Elsevier, Vol. 43-44, 2015, PP 74-86.
9. Liudong Xing and Gregory Levitin, "Balancing theft and corruption threats by data partition in cloud system with independent server protection", Journal of Reliability Engineering and System Safety, Elsevier, Vol. 167, 2017, PP 248-254.
10. Yibin Li, Keke Gai, Longfei Qiu, Meikang Qiu and Hui Zhao, "Intelligent cryptography approach for secure distributed big data storage in cloud computing", Journal of Information Sciences, Elsevier, Vol. 387, 2017, PP 103-115.
11. Satheesh K. S. V. Kavuri, Gangadhara Rao Kancherla and Basaveswara Rao Bobba, "Data authentication and integrity verification techniques for trusted/untrusted cloud servers", International Conference on Advances in Computing, Communication and Informatics, Noida, India, 2014, PP 2590-2596.
12. Xuyun Zhang, Laurence T. Yang, Chang Liu and Jinjun Chen, "A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud", IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 2, 2014, PP 363-373.


13. Anu Thomas and S. Sangeetha, "TIEX - A Tool for Extracting Structured and Semantic Information from Text Documents", Proceedings of the Fourth International Conference on Business Analytics and Intelligence, 2016, PP 1026-1032.
14. M. Sumathi and S. Sangeetha, "Sensitive Data Protection using Modified Elliptic Curve Cryptography", Proceedings of the Ninth International Conference on Advanced Computing, MIT Chennai, 2017, PP 485-491.
15. V. Kamalakannan and S. Tamilselvan, "Security Enhancement of Text Message Based on Matrix Approach using Elliptical Curve Cryptosystem", Proceedings of the Second International Conference on Nanomaterials and Technologies (CNT 2014), 2015, PP 489-496.
16. J. Bethencourt, A. Sahai and B. Waters, "Ciphertext-policy attribute-based encryption", Proceedings of the IEEE Symposium on Security and Privacy (SP), May 2007, PP 321-334.
17. L. Cheung and C. Newport, "Provably secure ciphertext policy ABE", Proceedings of the 14th ACM Conference on Computer and Communications Security, 2007, PP 456-465.
18. G. McDonald, C. Macdonald and I. Ounis, "Enhancing sensitivity classification with semantic features using word embeddings", 39th European Conference on Information Retrieval, Aberdeen, Scotland, 8-13 April 2017, PP 450-463.
19. David Sanchez, Montserrat Batet and Alexandre Viejo, "Detecting sensitive information from textual documents: an information-theoretic approach", in: Torra V., Narukawa Y., Lopez B., Villaret M. (eds), Modeling Decisions for Artificial Intelligence (MDAI 2012), 2012, PP 173-184.
20. Zheng Yan, Xuoyun Li, Mingjun Wang and Athanasios V. Vasilakos, "Flexible data access control based on trust and reputation in cloud computing", IEEE Transactions on Cloud Computing, 2015, PP 1-14.
21. Sabrina De Capitani di Vimercati, Sara Foresti and Pierangela Samarati, "Managing and accessing data in the cloud: privacy risks and approaches", 7th International Conference on Risks of Internet and Systems, 2012.
22. Hongwei Li, Yi Yang, Tom H. Luan, Xiaohui Liang, Liang Zhou and Xuemin Shen, "Enabling fine-grained multi-keyword search supporting classified sub-dictionaries over encrypted cloud data", IEEE Transactions on Dependable and Secure Computing, 2015, PP 1545-1559.
23. Sumathi M and Sangeetha S, "Enhanced Elliptic Curve cryptographic technique for protecting sensitive attributes in a cloud storage", IEEE International Conference, MIT, Chennai, December 2017.
24. Josep Domingo-Ferrer, Oriol Farras, Jordi Ribes-Gonzalez and David Sanchez, "Privacy-preserving cloud computing on sensitive data: a survey of methods, products and challenges", Computer Communications, Elsevier, 2019, PP 38-60.
25. Yang and Liang, "Automated identification of sensitive data from implicit user specification", Cybersecurity, Springer, 2018, PP 1-13.


CONFLICT OF INTEREST

From,


1* Sumathi M, Research Scholar, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India, Email id: [email protected], Phone: 9442663241.

2 Sangeetha S, Assistant Professor, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India, Email id: [email protected]


3Anu Thomas, Research Scholar, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India, Email id: [email protected]

Dear Editor-in-Chief,


I am submitting the paper titled "Generic Cost Optimized and Secured Sensitive Attribute Storage Model for Template Based Text Document on Cloud". None of the co-authors has any conflict of interest in submitting this manuscript to the journal.


Regards,
Sumathi M