Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 108C (2017) 445–454
International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland
Effective and Scalable Data Access Control in Onedata Large Scale Distributed Virtual File System
Michał Wrzeszcz1,2, Łukasz Opioła1,2, Konrad Zemek2, Bartosz Kryza1, Łukasz Dutka1, Renata Słota2, and Jacek Kitowski1,2
1 Academic Computer Centre CYFRONET-AGH, University of Science and Technology, Krakow, Poland
2 AGH University of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Department of Computer Science, Krakow, Poland
[email protected], [email protected]
Abstract
Nowadays, as large amounts of data are generated, either from experiments, satellite imagery or via simulations, access to this data becomes challenging for users who need to further process them, since existing data management makes it difficult to effectively access and share large data sets. In this paper we present an approach to enabling easy and secure collaborations based on the state of the art authentication and authorization mechanisms, advanced group/role mechanism for flexible authorization management and support for identity mapping between local systems, as applied in an eventually consistent distributed file system called Onedata.
Keywords: big data, open data, data management, authorization, security
© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the International Conference on Computational Science
1 Introduction
Today, more and more research and commercial applications rely heavily on distributed access to large data sets, including data collected from physical experiments as well as data obtained through pure simulations or statistical data collected from web applications. Such data sets are created in distributed infrastructures, by various organizations using heterogeneous storage systems and are often too large to be completely transferred between data centers for processing. These issues lead to several requirements that are necessary for a modern distributed large scale data management system, i.e.: transparent data access from any machine, access to large data sets without completely transferring them to the computational nodes, flexible metadata support enabling data discovery, support for single- and multi-tenant deployment, secure and easy data sharing, advanced group and role mechanisms for large groups of collaborators, support for open data publishing and data access using standard interfaces and protocols including POSIX and CDMI (Cloud Data Management Interface) [21].
1877-0509 © 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the scientific committee of the International Conference on Computational Science. 10.1016/j.procs.2017.05.054
However, existing data management platforms, which focus either on high-performance data access on a local network or on Dropbox-like solutions for desktop users, often have complex authentication and authorization mechanisms (for instance based on X.509 certificates, which users must manage manually) and are difficult for smaller user communities to deploy. Furthermore, users are accustomed to accessing and managing their personal data through Cloud-based services such as Dropbox or Google Drive, yet in order to access and process these data on virtual machines or containers in the Cloud, they still have to use legacy protocols such as FTP and share the data by exchanging URLs or email attachments.
In order to address these challenges, we have proposed a novel solution for global data access that gives users an experience and ease of use similar to commercial data management and file synchronization solutions, while providing the means for high-performance transparent data access and ensuring security at every step of access, including when data are stored on systems under the control of separate organizations. We have provided the corresponding architectural design and a practical implementation of software for global data access without barriers: Onedata, a distributed, eventually consistent virtual filesystem [6, 17, 25].
In the next sections we briefly describe the Onedata data management platform, discuss in detail our approach to data access control in a globally distributed file system, and review related work on large scale data management in distributed infrastructures, including aspects related to data access control.
2 Data management in Onedata
One of the main issues in modern large scale data management is how to manage and efficiently share data between large and distributed user communities. Onedata addresses this challenge by implementing a globally distributed storage system divided into zones (or federations), which are created by deploying a dedicated service called Onezone. Zones enable the creation of Onedata deployments that have no relation to other federations. Any organization, community or user group can deploy its own Onezone service (single-tenant mode) with a customized login page, or use a public Onedata deployment (e.g. onedata.org or datahub.egi.eu) and rely on the Onedata group mechanism for user authorization and isolation (multi-tenant mode). Storage providers can connect to selected zones to form storage federations based on heterogeneous storage backends, while still providing users with unified, transparent data access.
A typical distributed Onedata deployment is depicted in Fig. 1. The Onezone service is the main point of access for users and provides a Single Sign-On login mechanism (1) for all providers that have granted the user access to their resources. Based on the Onezone authentication and authorization decisions, Oneprovider instances running in storage providers' data centers control data access operations on the user spaces they support (2). Onezone also enables easy and secure sharing of data between users within a single zone, by means of a simple token exchange (3). Our system design also envisions support for data sharing based on trust established between different zones, for use cases requiring integration between storage federations (4). Furthermore, storage providers within a single zone are not assumed to trust each other: only data and metadata about users who are supported by specific providers within a zone are exchanged between selected providers, and data center administrators have full control over which users will be supported (5).
Once users have authenticated in the Onezone service, they can directly access the data by connecting to a selected Oneprovider service (e.g. the one closest to the user's computing node). Thanks to the interconnection between Oneprovider services within a single federation, users have transparent access to all files available from all storage providers that support their spaces (6). This allows users to access their data from any location without pre-staging via the efficient POSIX protocol, by simply mounting them on their local machines or attaching them to Cloud virtual machines or containers. On the lowest level, a special transfer protocol called RTransfer [6] has been developed, which enables efficient replication and real-time access to remote data between data centers as well as POSIX access for end users (8). The transparency of data access is particularly evident when running data processing jobs (including legacy applications) on remote computing nodes, which can use the native POSIX API to access and write files while all access permissions are delegated using bearer tokens generated during the first authentication (9). Finally, an important issue in every federated data management system is to allow local site administrators to enforce full control over which users have access to which storage resources. In our solution this is achieved via a special mechanism called LUMA (Local User MApping) [16], an extensible mechanism that allows storage administrators to provide a mapping between global user identities and local, storage-specific user credentials (10).

[Figure 1: Overview of a typical Onedata deployment. Components shown: a PC with oneclient or a web browser, Onezone 1 and Onezone 2, Providers 1 to 4 with their storage systems, a Computing Element running a user process with Oneclient, and the access interfaces (Web GUI, REST, CDMI, POSIX client). Numbered labels: 1. Different login methods; 2. Limited permissions to data containers (spaces) for providers; 3. Advanced cooperation of different providers' users; 4. Different providers' cooperation agreements; 5. Lack of trust between providers; 6. Single access point to multiprovider environment; 7. Small delays in permissions checking needed; 8. Access remote data on behalf of user; 9. Delegation of permissions for direct access; 10. Different authentication/permissions systems.]
3 Data access authentication and authorization
In order to enable effective data access and sharing in large distributed user communities, the data management system has to provide several features, such as unified identity management across different storage sites, distributed authorization, and flexible group and role management. This section details the data access control mechanisms that address these issues.
3.1 Identity management
One of the main issues HPC users face when accessing data is the complexity of certificate-based authentication and authorization systems, including certificate management and renewal procedures.
Onedata utilizes the OpenID and OpenID Connect (based on OAuth 2.0) standards to provide easy and unified identity management. From the users' point of view, this simplifies the registration and login process, as they can use one of their existing institutional or social accounts. The minimum required information is the email address, which is served by virtually any OpenID provider. Users can connect multiple OpenID accounts to an already existing account in Onedata, which gives them more login methods. Onezone serves as the account management center for users, where they can personalize their settings and authentication methods, or obtain client tokens (see 3.2) to authorize operations on their behalf across the whole system.
Internally, identity management is the responsibility of Onezone, which is the authentication and authorization center for all storage providers and users in a federation. Support for concrete OpenID providers is configurable and can be extended via plugins, which makes it easy to widen the range of supported providers or to customize the available authentication methods for each instance of Onezone independently. Onedata also supports basic (login/password) authentication, which is mostly targeted at system administrators or small isolated deployments.
Upon registration, the new user is given a unique ID, which is used universally in the system from then on. By storing user identifiers obtained from OpenID providers (subject id), Onezone can easily map OpenID accounts onto the unique user ID. Later, when access to resources or files is negotiated, this ID is used for privilege verification (see section 3.3).
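The mapping of several OpenID accounts onto one unique user ID can be illustrated with a minimal sketch. It is not the Onezone implementation: the class `AccountRegistry`, its methods and the in-memory dictionaries are hypothetical names chosen for illustration, and a real deployment would persist this state and validate the OpenID assertion before calling `login`.

```python
import uuid


class AccountRegistry:
    """Hypothetical in-memory stand-in for an account store (illustration only)."""

    def __init__(self):
        self._by_subject = {}  # (provider, subject_id) -> unique user ID
        self._linked = {}      # unique user ID -> set of (provider, subject_id)

    def login(self, provider, subject_id):
        """Return the unique user ID for an OpenID subject, registering it on first use."""
        key = (provider, subject_id)
        if key not in self._by_subject:
            user_id = uuid.uuid4().hex
            self._by_subject[key] = user_id
            self._linked[user_id] = {key}
        return self._by_subject[key]

    def link(self, user_id, provider, subject_id):
        """Connect one more OpenID account to an already existing user."""
        if user_id not in self._linked:
            raise KeyError("unknown user")
        key = (provider, subject_id)
        owner = self._by_subject.get(key, user_id)
        if owner != user_id:
            raise ValueError("account is already linked to another user")
        self._by_subject[key] = user_id
        self._linked[user_id].add(key)
```

After `link`, a login through either provider resolves to the same user ID, which is the value later used for privilege checks.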
3.2 Macaroon-based bearer tokens
Internally, the Onedata system delegates authority through the use of Macaroons [2]. Macaroons are a type of bearer credential that leverages chained MACs (message authentication codes) to allow the holder to add new caveats: contextual confinements that limit the scope or degree of the authorization. In particular, Macaroons allow adding third-party caveats, which can only be satisfied by presenting a macaroon-bound proof from the specified third party. All conditions imposed on the credentials, including those added later by subsequent holders, are verified by the authorizing party on each authorization request.
The basic use of Macaroons in Onedata resembles the OAuth model. Upon authentication in Onezone and redirection to a specific storage provider, the provider receives an authorization token in the form of a serialized Macaroon. The Macaroon is time-restricted but intended to be long-lived, and has an additional third-party caveat that requires the bearer to present proof that the user is authenticated. Before using the credentials, the storage provider first obtains the proof from Onezone. The proof is valid for a short period of time and has to be reacquired when it expires. The storage provider can interact with Onezone in the user's name only with the Macaroon and a valid proof of authentication.
Another use case of Macaroons in Onedata is authorizing native clients. In this case, the client is given only the long-lived token without the authentication caveat. Note that the authentication caveat serves to ensure that the client's actions are authorized not only by the authorization server (Onezone) but also by the user. The command-line client is under the full and constant control of the user and thus does not require reauthentication. However, the native client connects to a storage provider, which also requires authorization with Onezone to function properly and, unlike the native client, is outside of the user's control.
To mitigate risk to the user, the native client delegates its authorization to the storage provider via a Macaroon with a short expiration time, and refreshes the authorization periodically. The flow of authorization for the native client is shown in Fig. 2.

Figure 2: An overview of authorization flow for a native client of Onedata.

Both the third-party authentication caveat for the web-based interface and the delegation of a short-lived authorization Macaroon by the native client enable the whole system to work on the user's behalf while the user is actively using the system, and to revoke the authorization when the user stops using the system, leaving the user's identity under their control.
Macaroon-based authorization allows for refinement of granted access, and thus tightening of security, in future versions of the subsystem. Macaroons can leverage asymmetric encryption to enable third parties to determine whether the credentials are valid. This mechanism could be used by the storage provider to independently verify a given credential before, or even after, using it for authorization with Onezone. For example, the Macaroon might contain an access-type=read-only caveat that would be checked by the storage provider before a write operation. Examples of other possible refinements include restricting Macaroons to a specific space, a specific time period or a given pool of storage providers.
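The chained-MAC construction behind caveats can be sketched in a few lines. This is a simplified illustration using only first-party caveats and HMAC-SHA256; it is neither the serialization format of any Macaroons library nor the Onedata implementation, and third-party caveats are omitted. The function names and the `expires-before` caveat are assumptions made for the example, while `access-type=read-only` follows the example in the text.

```python
import hashlib
import hmac


def _chain(signature: bytes, data: str) -> bytes:
    # Each new signature is the MAC of the added data, keyed by the previous signature.
    return hmac.new(signature, data.encode(), hashlib.sha256).digest()


def mint(root_key: bytes, identifier: str) -> dict:
    """Issuer side: create a token bound to the secret root key."""
    return {"id": identifier, "caveats": [], "sig": _chain(root_key, identifier)}


def attenuate(token: dict, caveat: str) -> dict:
    """Holder side: add a caveat without knowing the root key."""
    return {
        "id": token["id"],
        "caveats": token["caveats"] + [caveat],
        "sig": _chain(token["sig"], caveat),
    }


def _satisfied(caveat: str, context: dict) -> bool:
    key, _, value = caveat.partition("=")
    if key == "access-type":
        # Only the read-only restriction is understood in this sketch.
        return value == "read-only" and context["operation"] == "read"
    if key == "expires-before":
        return context["now"] < float(value)
    return False  # unknown caveats are rejected


def verify(root_key: bytes, token: dict, context: dict) -> bool:
    """Issuer side: recompute the MAC chain, then check every caveat."""
    sig = _chain(root_key, token["id"])
    for caveat in token["caveats"]:
        sig = _chain(sig, caveat)
    if not hmac.compare_digest(sig, token["sig"]):
        return False
    return all(_satisfied(c, context) for c in token["caveats"])
```

Because each signature is keyed by the previous one, a holder can add caveats but cannot remove one: dropping a caveat from the list leaves a signature that no longer matches the recomputed chain.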
3.3 Groups and privileges mechanism
Existing data management systems tend to be either very complex, targeted at large user communities and with a steep learning curve, or very basic solutions, typically targeted at the long tail of science. In order to provide a unified data management solution that can scale from small user groups to large user communities, we have implemented a flexible mechanism based on nested groups. The usefulness of groups is best illustrated by the privileges system in Onedata. Privileges are fine-grained and concern the members of a specific resource, constraining the rights of members towards the resource. For example, each member of a space can be individually granted (or denied) the privileges to modify the space, invite new members, delete the space, or write data within the space, among others. The memberships and privileges of every user are crucial information that influences low-level decisions, e.g. whether a given user can write or read a certain file in a certain space.
Groups enable collaboration among users and other groups that wish to have shared access to some data sets and should share memberships and privileges towards other resources. For instance, a group can itself become a member of a space. In this case, all members of the group inherit the privileges of the group for the space. This way, by adding a single group to the space and setting proper privileges, the administrator can effectively set the privileges of a large pool of users. To achieve diversification of privileges among members, more groups can be added.
Besides spaces, other resources use the privileges system, and their privileges are analogous to space privileges (modify, invite members, delete, etc.). These resources include groups, handle services and handles (for instance Digital Object Identifiers). Handle services and handles are resources connected with Open Data publications, and they also have members (users or groups) with associated privileges. Besides privileges associated with specific resources, there are also general privileges that enable special features in Onezone. They are typically granted to system administrators and include the rights to view and modify resources in the system. These privileges can be granted to a user or a group of users.
The group system in Onedata is very flexible and allows for creating complicated structures of nested groups. In fact, groups can form an arbitrary graph, in which cycles are allowed. Nevertheless, such an unconstrained approach has one significant pitfall: how to efficiently verify the privileges of a given user towards a resource when the user might belong to it via a long chain of nested groups? The naive approach would be to analyse the graph of relations every time. However, resources can be accessed with high frequency (thousands of requests per second), especially because privileges must be checked during every file-system operation, so an efficient solution is required. In Onedata, we observed that the relations graph is not modified often (by adding or removing relations or updating privileges); in fact, entire organizations can run on the same group membership setup for months, with single users joining or leaving groups occasionally. Considering this, we devised an algorithm in which the relations graph is analysed incrementally to collect information about direct and indirect memberships and privileges, which we call effective relations and effective privileges (see Fig. 3). The algorithm operates on a graph of entities (users, groups, spaces
and other related resources). When a relation changes, a recalculation is scheduled which analyses only the affected entities. If the effective relations of an entity have changed because of the update, all adjacent entities are analysed recursively. The process spans wider and wider until all changes have been propagated. This way, shortly after each update, we obtain a graph of entities in which every entity carries pre-calculated information about all its effective relations and privileges. Thanks to this approach, verifying whether a user has a given privilege towards a resource is reduced to looking up a single record and its effective privileges. Most importantly, this ensures very low overheads on the file-system level.

[Figure 3: Simplified entity graph with pre-calculated effective members and their privileges. The graph shows Users 1 to 3 and Groups A to C connected by membership relations. Privileges: M = modify, I = invite members, W = write data, D = delete. Effective users: User 1: MIW-, User 2: -IW-, User 3: MIWD. Effective groups: Group A: -IW-, Group B: -IW-, Group C: MIWD.]

To enrich the functionality of groups, Onedata introduces roles, an attribute of each group that defines the characteristics of its members. There are several available roles:
• role: the simplest group type, associating members of a certain role in arbitrary organizations,
• team: a group of members that form a team,
• unit: a group of members that belong to the same administrative unit,
• organization: a group associating multiple units (virtual organization).
Roles allow for creating clear and orderly group structures for easier maintenance.
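The incremental recalculation of effective privileges described in this section can be sketched as a worklist propagation. The sketch below is an illustration under stated assumptions, not the Onedata algorithm itself: the class `EntityGraph` and its methods are hypothetical, users reached through a group are assumed to receive exactly the privileges that group holds towards the parent entity, privileges obtained over several paths are merged by set union, and the graph is assumed to be acyclic (the cyclic case permitted by Onedata needs additional care when relations are removed).

```python
from collections import defaultdict, deque


class EntityGraph:
    """Hypothetical sketch: entities are plain strings (users, groups, spaces)."""

    def __init__(self):
        self.direct = defaultdict(dict)     # parent -> {member: frozenset of privileges}
        self.parents = defaultdict(set)     # member -> entities it directly belongs to
        self.effective = defaultdict(dict)  # parent -> {user: frozenset of privileges}
        self.groups = set()                 # members that are groups rather than users

    def set_member(self, parent, member, privileges, member_is_group=False):
        if member_is_group:
            self.groups.add(member)
        self.direct[parent][member] = frozenset(privileges)
        self.parents[member].add(parent)
        self._propagate(parent)

    def remove_member(self, parent, member):
        self.direct[parent].pop(member, None)
        self.parents[member].discard(parent)
        self._propagate(parent)

    def _recompute(self, entity):
        # Rebuild the effective users of one entity from its direct relations only.
        result = defaultdict(frozenset)
        for member, privileges in self.direct[entity].items():
            if member in self.groups:
                for user in self.effective[member]:
                    result[user] |= privileges
            else:
                result[member] |= privileges
        return dict(result)

    def _propagate(self, start):
        # Re-analyse only affected entities; spread upwards while something changes.
        queue = deque([start])
        while queue:
            entity = queue.popleft()
            updated = self._recompute(entity)
            if updated != self.effective[entity]:
                self.effective[entity] = updated
                queue.extend(self.parents[entity])

    def has_privilege(self, user, resource, privilege):
        # The check itself is a single record lookup, independent of nesting depth.
        return privilege in self.effective[resource].get(user, frozenset())
```

The cost of graph traversal is paid once per membership change, while `has_privilege`, which would run on every file-system operation, only reads the pre-calculated record.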
3.4 Mapping user identities to local accounts
In order to enable direct mapping between global user identities registered in the Onezone service and local storage user identities, Onedata provides an extensible mechanism called Local User MApping (LUMA) [16]. It allows site administrators to provide a simple RESTful service (or to use our reference implementation) that returns a mapping from the global user identity, as registered in the Onezone service, to a local user account, which can be storage system specific. Currently, LUMA supports mapping to the following storage systems: Unix uid/gid identifiers, Amazon S3, OpenStack SWIFT and Ceph; more storage systems can be easily integrated by site administrators. An example mapping returned by this service is presented below:

    {
      "storageId": "a5ec372b-9f47-44e2-8d98-87d62f055a12",
      "storageType": "POSIX",
      "spaceName": "Space1",
      "userDetails": {
        "name": "User One",
        "connectedAccounts": [],
        "alias": "user.one",
        "emailList": ["[email protected]"]
      }
    }
LUMA also supports the Onedata feature that allows multiple external identity providers (e.g. Facebook, Google, GitHub) to be connected to a single user identity in the system, so that users can authenticate through different identity providers depending on their context.
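The kind of lookup a site-specific mapping service might perform can be sketched as follows. This is a hypothetical illustration, not the LUMA API: the field names `storageType`, `userDetails` and `alias` are taken from the example in this section, but using that document as the lookup input, the `LOCAL_ACCOUNTS` table, the function `map_user` and the uid/gid result shape are all assumptions of the sketch.

```python
import json

# Hypothetical site-local table: Onedata alias -> local POSIX account.
LOCAL_ACCOUNTS = {
    "user.one": {"uid": 1001, "gid": 1001},
}


def map_user(document: str) -> dict:
    """Resolve a global identity description to a local uid/gid pair."""
    details = json.loads(document)
    if details["storageType"] != "POSIX":
        # Other storage types would need their own credential format.
        raise ValueError("this sketch only maps POSIX storage")
    alias = details["userDetails"]["alias"]
    if alias not in LOCAL_ACCOUNTS:
        raise LookupError(f"no local account configured for {alias!r}")
    return LOCAL_ACCOUNTS[alias]
```

Keeping the table under the control of the site administrator reflects the design goal stated above: the storage site, not the federation, decides which global users map to which local accounts.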
4 Related work
Several data management solutions have emerged that try to deal with the increasing requirements of user applications in terms of large scale data processing, some of which address the needs of scientific Grid computing infrastructures [10, 11, 13].
ownCloud [14] is an open-source framework for creating self-managed file hosting services similar to Dropbox, i.e. sync-and-share. It enables users to maintain full control over data location and transfers, while hiding the underlying storage infrastructure, which can be composed of multiple storage resources. The main features of ownCloud include abstracting file storage available through directory structures or WebDAV, file synchronization between various operating systems, user group administration, sharing of files using public URLs, online text editing, viewers for various file formats, and support for external Cloud storage services (e.g. Dropbox or Google Drive).
The Integrated Rule-Oriented Data System (iRODS) [20] is open source data management software used to manage and take control of users' data regardless of the device used to store the data. Its main features include data discovery using a triple-based metadata catalog; support for data workflows, with a rule engine allowing any action to be initiated by any trigger on any server or client in the grid; secure collaboration; and data virtualization, which allows access to distributed storage assets under a unified namespace and frees organizations from being locked in to single-vendor storage solutions.
Distributed, parallel filesystems such as Lustre [12] or Ceph [24] can be classified as high performance data access solutions. They are mature and widely used solutions designed especially for single data centres that maintain locally distributed data on multiple storage systems.
Globus Connect [4] is a client-server solution allowing users and researchers to use the Globus transfer service. It simplifies the creation of Globus endpoints, the different locations where data can be moved to or from using the Globus service. It is free to install and use for users at non-profit research and education institutions.
An emerging requirement for data management systems is support for open data publishing, in particular to enable easy integration with open access services such as DataCite [5] or OpenAIRE [19]. These services rely on established standards such as OAI-PMH [22], which enable them to integrate with existing platforms for publication metadata harvesting, and identify datasets through globally unique handles such as DOI [9] or PID. However, while these services enable discovery and identification of open data sets, they do not directly address the issue of end users accessing the underlying data. Moreover, the publication of data sets often involves publishing a URL where the dataset is available, along with the DOI or PID for resolution of the data set.
With respect to authentication and authorization methods, most authentication and authorization in data management systems has classically been based on X.509 certificates and their extensions for role and attribute information [23, 3]. However, several new mechanisms have evolved recently, mainly addressing the need for easy-to-use and secure single sign-on identity management and authorization. OpenID Connect [15] is a simple authentication mechanism which allows users to be identified to remote clients based on an authentication to an OIDC provider.
On the other hand, SAML 2.0 is a protocol for exchanging both authentication and authorization security tokens, which can contain various authorization and identity assertions [18]. In federated data management systems, a common problem is the mapping of global user identities to local user accounts within the storage systems. So far this has been addressed with solutions such as local mapping files, which raised several administrative issues [1].

The choice of tools and systems for distributed data management is wide and diverse, but they typically offer only selected features and cannot comprehensively address the needs of users operating in organizationally distributed environments. This is depicted in Table 1. The innovative approach of Onedata is to fulfill all of the presented requirements within a single, unified platform.

| Classification | Examples | Disadvantages |
|---|---|---|
| File synchronization services | ownCloud, Dropbox | Limits on storage size and transfer speed |
| Services for fast data movement | Globus Connect | Lack of location transparency |
| High-performance parallel file systems | Lustre, Ceph | Centralized management |
| Widely distributed data storage systems | iRODS | Manual management of data location and low efficiency |

Table 1: Summary of existing data management solutions
5 Conclusions
In this paper we have presented the Onedata distributed data management platform and its support for effective data access authentication and authorization in a distributed storage system. Onedata has a very strong focus on enabling users to easily and securely access and share their data, regardless of whether they work in small teams or large international collaborations. At the same time, Onedata ensures that storage system administrators retain full control over their storage resources. Performance tests conducted in the PLGrid production environment also confirmed that Onedata offers good data access performance [25]. These features were made possible by the development of a flexible authentication and authorization mechanism based on OpenID Connect and Macaroons.

Onedata is targeted at global, highly distributed environments and was developed to support large data volumes and a large user base. Performance scalability is achieved thanks to advanced block replication mechanisms. Files are split into blocks, and only the required blocks are replicated to the site where the data is processed. Local access to blocks ensures maximum efficiency, while the blocks are simultaneously synchronized and available globally.

The main novelty of the Onedata platform in the context of data access control lies in the provision of a unified data access control mechanism for diverse types of user communities, scalable from small research groups to large international communities. Privileges are easy to manage thanks to automatic computation of the effective group membership and effective privileges of each user, implemented using fast lookups in the user and group graph structure to keep overheads low irrespective of user base growth. All data access requests are independently authorized, which ensures that data remains secure across the whole system, including the storage systems.

Currently, Onedata is used in several international projects and initiatives, including PLGrid, EGI-Engage and INDIGO-DataCloud, and serves as the basis for EGI DataHub [7], a public service for provisioning large reference data sets.
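The computation of effective group membership and effective privileges mentioned above can be illustrated with a small graph traversal. The following is only a minimal sketch under stated assumptions, not the Onedata implementation: the function names, the dictionary-based graph layout and the example users, groups and privilege names are all hypothetical, and the sketch recomputes the result on each call, whereas a production system would more likely cache or precompute it to obtain fast lookups.

```python
from collections import deque


def effective_groups(user, direct_groups, parent_groups):
    """Collect all groups a user belongs to, directly or via nested groups."""
    seen = set()
    queue = deque(direct_groups.get(user, ()))
    while queue:
        group = queue.popleft()
        if group in seen:  # already visited; this also guards against cycles
            continue
        seen.add(group)
        # A member of a group is also an effective member of its parent groups.
        queue.extend(parent_groups.get(group, ()))
    return seen


def effective_privileges(user, direct_groups, parent_groups,
                         user_privileges, group_privileges):
    """Union of the user's own privileges and those of every effective group."""
    privileges = set(user_privileges.get(user, ()))
    for group in effective_groups(user, direct_groups, parent_groups):
        privileges |= set(group_privileges.get(group, ()))
    return privileges


# Hypothetical example data: alice is in team-a, which is nested in institute,
# which is in turn nested in consortium.
DIRECT_GROUPS = {"alice": ["team-a"]}
PARENT_GROUPS = {"team-a": ["institute"], "institute": ["consortium"]}
USER_PRIVILEGES = {"alice": {"share"}}
GROUP_PRIVILEGES = {"institute": {"read"}, "consortium": {"write"}}

print(sorted(effective_groups("alice", DIRECT_GROUPS, PARENT_GROUPS)))
print(sorted(effective_privileges("alice", DIRECT_GROUPS, PARENT_GROUPS,
                                  USER_PRIVILEGES, GROUP_PRIVILEGES)))
```

In this sketch a user inherits privileges from every group reachable through nesting, so adding a small team to a larger group is enough to grant its members the larger group's privileges without editing individual accounts.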
Recently, Onedata has also been accepted for the second phase of the Helix Nebula Science Cloud procurement for enabling high-throughput scientific data processing on commercial Cloud infrastructures [8]. Future work will include integration of the SAML 2.0 identity service, enabling integration with additional community identity providers, and implementation of a P2P mechanism for establishing trust between different zones.

Acknowledgements

This work has been partially funded under Horizon 2020 EU projects: INDIGO-DataCloud (Project ID: 653549) and EGI-Engage (Project ID: 654142). RS and JK are grateful for AGH-UST grant no. 11.11.230.124. ŁO is grateful for his doctoral grant at AGH-UST.
References

[1] Alfieri, R., Cecchini, R., Ciaschini, V., dell’Agnello, L., Frohner, A., Lorentey, K., and Spataro, F. From gridmap-file to VOMS: managing authorization in a Grid environment. Future Generation Comp. Syst., 21(4):549–558, 2005.
[2] Birgisson, A., Politz, J. G., Erlingsson, U., Taly, A., Vrable, M., and Lentczner, M. Macaroons: Cookies with contextual caveats for decentralized authorization in the cloud. In NDSS. The Internet Society, 2014.
[3] Chadwick, D. W., Otenko, A., and Ball, E. Role-based access control with X.509 attribute certificates. IEEE Internet Computing, 7(2):62–69, 2003.
[4] Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S., and Foster, I. Globus data publication as a service: Lowering barriers to reproducible science. In 11th IEEE International Conference on eScience, 2015.
[5] DataCite. DataCite: helping you to find, access, and reuse research data, 2011. http://datacite.org.
[6] Dutka, Ł., Wrzeszcz, M., Lichoń, T., Slota, R., Zemek, K., Trzepla, K., Opiola, L., Slota, R. G., and Kitowski, J. Onedata - a step forward towards globalization of data access for computing infrastructures. In Koziel, S., Leifsson, L. P., Lees, M., Krzhizhanovskaya, V. V., Dongarra, J., and Sloot, P. M. A., editors, ICCS, volume 51 of Procedia Computer Science, pages 2843–2847. Elsevier, 2015.
[7] EGI. EGI DataHub website, 2016. Available at http://datahub.egi.eu.
[8] HNSciCloud. Helix Nebula Science Cloud website, 2016. Available at http://www.hnscicloud.eu/.
[9] International DOI Foundation, editor. DOI Handbook. 2012.
[10] Kapanowski, M., Slota, R., and Kitowski, J. Resource storage management model for ensuring quality of service in the cloud archive systems. Computer Science, 15(1):3–18, 2014.
[11] Korcyl, K., Chwastowski, J., Plazek, J., and Poznanski, P. Selected issues on histograming on GPUs. Computing and Informatics, 35(2):282–298, 2016.
[12] Lustre. Lustre website, 2016. Available at http://lustre.org/.
[13] Marco, J. et al. The interactive European grid: Project objectives and achievements. Computing and Informatics, 27(2):161–171, 2008.
[14] Martini, B. and Choo, R. Cloud storage forensics: ownCloud as a case study. Digital Investigation, 10(4):287–299, 2013.
[15] Mladenov, V., Mainka, C., Krautwald, J., Feldmann, F., and Schwenk, J. On the security of modern single sign-on protocols: OpenID Connect 1.0. CoRR, abs/1508.04324, 2015.
[16] Onedata. Local User MApping service documentation, 2016. Available at https://onedata.org/docs/doc/administering_onedata/luma.html.
[17] Onedata. Onedata project website, 2016. Available at http://onedata.org.
[18] Organization for the Advancement of Structured Information Standards. Security Assertion Markup Language (SAML) v2.0, 2005.
[19] Rettberg, N. and Principe, P. Paving the way to open access scientific scholarly information: OpenAIRE and OpenAIREplus. In Baptista, A. A., Linde, P., Lavesson, N., and de Brito, M. A., editors, International Conference on Electronic Publishing, ELPUB. IOS Press, 2012.
[20] Roblitz, T. Towards implementing virtual data infrastructures - a case study with iRODS. Computer Science, 13(4):21–34, 2012.
[21] SNIA. Cloud Data Management Interface. Technical report, April 2010. Available at http://www.snia.org/cdmi.
[22] Sompel, H. Van De, Nelson, M., Lagoze, C., and Warner, S. Resource harvesting within the OAI-PMH framework. D-Lib Magazine, 10(12), 2004.
[23] Venturi, V., Stagni, F., Gianoli, A., Ceccanti, A., and Ciaschini, V. Virtual organization management across middleware boundaries. In eScience, pages 545–552. IEEE Computer Society, 2007.
[24] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), pages 307–320, 2006.
[25] Wrzeszcz, M., Trzepla, K., Słota, R., Zemek, K., Lichoń, T., Opioła, Ł., Nikolow, D., Dutka, Ł., Słota, R., and Kitowski, J. Metadata organization and management for globalization of data access with Onedata. In Wyrzykowski, R. et al., editors, Parallel Processing and Applied Mathematics: 11th Intnl. Conf., PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I, pages 312–321, Cham, 2016. Springer International Publishing.