Data & Knowledge Engineering 41 (2002) 183–204 www.elsevier.com/locate/datak
Data management issues in mobile and peer-to-peer environments
Budiarto, Shojiro Nishio *, Masahiko Tsukamoto
Department of Information Systems Engineering, Graduate School of Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka 565-0871, Japan
Received 5 December 2001; received in revised form 10 December 2001; accepted 19 December 2001
Abstract

Mobile computing is a revolutionary technology, born as a result of remarkable advances in the development of computer hardware and wireless communication. It enables us to access information anytime and anywhere, even in the absence of a physical network connection. More recently, there has been increasing interest in introducing ad hoc networks into mobile computing, resulting in a new distributed computing style known as peer-to-peer (P2P) computing. In this paper, we discuss the data management issues in mobile and P2P environments. The use of wireless communication makes data availability the most important problem here, so we focus on this problem and provide a detailed discussion of replicating mobile databases. We then extend our discussion to the mobile-P2P environment. Finally, we discuss the general data management issues in the P2P environment. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Mobile database; Replication; Peer-to-peer environment
1. Introduction

Mobile applications have become increasingly popular in recent years. Today, it is not uncommon to see people playing games or reading e-mail on handphones. Furthermore, as many
* Corresponding author. E-mail addresses: [email protected] (Budiarto), [email protected] (S. Nishio), [email protected] (M. Tsukamoto).
0169-023X/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0169-023X(02)00040-X
Budiarto et al. / Data & Knowledge Engineering 41 (2002) 183–204
banks and stock trading companies begin to consider mobile technology in their market strategies, people have more recently begun to manage their inventory or to buy and sell stocks via handphones, palmtops, or laptops. Japan is well known as a place where mobile communication quickly gained wide popularity. In Japan, the number of mobile subscribers increases by about 10 million each year. Mobile devices have also become a popular alternative for accessing data on the Internet. According to a recently published study, 10 million of the 27 million Internet accesses in Japan came from mobile devices. This result shows that the use of mobile devices is quickly shifting from voice communication to non-voice (data) communication. Currently, non-voice traffic accounts for only 5% of all mobile traffic, but it is expected to increase to 50% in the next 5 years, and in 2010 it will account for 70–80% of the traffic.

The remarkable popularity of mobile communication in Japan also reflects Japan's success in performing advanced research in the area of information and communication. In 1996, the Japanese Government formulated the science and technology basic plan, under which the government agreed to allocate 17 trillion Yen to a 5-year term of science and technology development (fiscal years 1996–2000, also known as the first term). Information and communication was among the areas given high priority. Currently, we are in the second term of our science and technology basic plan. Having learned from the success of the first term, the government agreed to increase the budget to 24 trillion Yen for the next 5 years of science and technology development (fiscal years 2001–2005). One of the important new policies in this term is to strategically give top priority to development in four areas, including the area of information and communication.
In the area of information and communication, we give top priority to the development of the following technologies:
• Advanced network technologies that securely support our activities anytime and anywhere, such as mobile computing networks.
• Advanced computing technologies that support the speedy analysis, processing, storage, and searching of the huge amount of multimedia content circulating in our society.
• Human interface technologies that enable anyone to take part in the information society without having to perform complex machine operations.
• Device and software technologies that provide the infrastructure for developing the above-mentioned technologies.

With the support of our government, we will be able to develop a more advanced information and communication infrastructure, and thus have a greater chance of introducing new services and technologies to mobile subscribers in Japan. For example, NTT Docomo, the largest mobile telecommunication company in Japan, has recently introduced its next-generation mobile telecommunication system called FOMA [38], an abbreviation of freedom of mobile multimedia access. FOMA is the first service in the world to employ the IMT2000 [41] standard in practice. It offers a maximum downlink speed of 384 kbps, which enables its users to access high-bandwidth data such as video. With FOMA, it is now possible not only to hold teleconferences but also to multitask, for example browsing the Internet while talking, using only a FOMA handphone.
Mobile infrastructure has enabled the introduction of new applications, ranging from simple ones such as mobile games to quite complex ones such as mobile banking and mobile multimedia. From both business and technology perspectives, data management technology that supports easy data access from and to mobile devices is among the main concerns in mobile information systems. Mobile networks, in particular, have characteristics that make it difficult to employ currently available database solutions, as most of them were developed for use in fixed network environments. Mobile database has become a popular term for the data management technology that enables the use of databases in mobile computing environments. Such a database is more advanced and challenging than fixed distributed databases, as it offers the following features:
• Data are available anywhere, independent of the availability of a fixed network connection. With mobile-ready devices, users can store part of a database and use it while mobile. When a mobile user needs data that is not available locally, she can activate the wireless communication of her mobile device and initiate a connection to the network via the closest mobile support station (MSS). Once connected, she can access publicly available data by using applications such as Internet browsers, or her system can take part in a distributed database environment in which she can access the specific data to which she has been granted access. In this way, mobile users can access virtually any data, anywhere and anytime, even in the absence of a fixed network connection.
• Databases on both mobile and fixed hosts can be shared seamlessly. In mobile information systems, databases spread over both mobile and fixed hosts naturally form a distributed database system. In general, techniques to support data sharing in distributed databases are more complex than those in centralized databases.
Mechanisms such as distributed transaction processing and commit protocols, for example, are known to depend on numerous reliable network connections. A mobile environment, however, also involves the use of a wireless network, which is known to be prone to frequent disconnections of unpredictable duration. In order to support seamless data sharing among mobile and fixed hosts, we need to employ distributed computing technologies that also work properly in disconnection-prone environments.

In recent years, we have also witnessed a move toward another style of distributed computing: peer-to-peer (P2P) computing. While the serverless file sharing behind P2P is not new, the explosive growth of Internet use in recent years has brought the idea back in new forms. With P2P, computers can communicate directly and share both data and resources. At the same time, there has been increasing interest in ad hoc networks, which are constructed solely from mobile hosts [4,7]. Mobile ad hoc networking is among the fertile research areas related to mobile computing, because the ad hoc network is considered the best way to exploit the flexibility of wireless networks. In an ad hoc network, mobile hosts also play the role of routers and communicate with each other directly, without the intervention of a host such as the MSS in the mobile environment. Naturally, distributed computing applications in an ad hoc network will take the P2P style. For this reason, we believe that P2P will become a significant distributed computing style in the future, and thus it is worthwhile to discuss data management issues in P2P environments.
In this paper, we discuss data management issues in mobile and P2P environments. The rest of this paper is organized as follows. In Section 2, we describe the model of the mobile environment and the important characteristics that need to be considered when using a database in a mobile environment. After arguing that low data availability is the most important problem in the mobile environment, we discuss replication for mobile databases in Section 3. In Section 4, we discuss replication strategies for mobile databases in detail, covering three important aspects of replication, i.e., replication dynamics, replication level, and replica placement. In the last part of Section 4, we present a comprehensive performance evaluation of five representative replication strategies. In Section 5, we extend our discussion of replication to the mobile-P2P environment, which combines the features of the mobile environment and the ad hoc network. In Section 6, we discuss the data management issues in the general P2P environment. Finally, we conclude this paper in Section 7.

2. Characteristics of mobile environment

In order to realize the above features of mobile databases, we first need to consider the characteristics of mobile databases that are likely to affect our way of thinking about current database technology. Fig. 1 presents the model of the mobile environment. A mobile environment consists of two distinct sets of entities: mobile hosts and fixed hosts. Some of the fixed hosts, called MSSes, are augmented with wireless interfaces to communicate with the mobile hosts located within their radio coverage area, called a cell. Mobile hosts are connected by wireless connections to the MSS of the cell in which they currently reside. A mobile host can move within a cell or between two cells while retaining its network connection. Further, we assume that every host and cell in the system is associated with a unique identifier.
As a part of a mobile database system, a mobile host acts as a data client and a data server at the same time. As a data server, a mobile host must support basic transaction operations such as read, write, commit, and abort.
Fig. 1. The model of mobile environment.
Fig. 2. A hierarchical structure of location servers.
A fixed host called a location server keeps the location information of all mobile hosts within its management coverage. Location servers act as routers and are connected hierarchically by a high-speed fixed network, where leaf nodes represent MSSes and non-leaf nodes represent location servers (see Fig. 2). The most important characteristics that need to be considered are as follows:

(1) The environment where mobile databases are deployed is a mix of two different networks, i.e., the fixed network and the wireless network. The fixed network is characterized by fixed host locations, relatively high capacity, high reliability, and low connection cost. The wireless network, on the other hand, supports a dynamic network topology, but with relatively low capacity, low reliability, and high connection cost. Several techniques have been proposed so far to avoid compromising database performance due to the use of the wireless network, including
• Reducing the amount of data exchanged via the mobile network [23].
• Reducing the response time of accessing data via the mobile network [23].
• Providing a data cache on the mobile host [12].

(2) The resources available to mobile users are generally very limited [22]. As a result, mobile hosts tend to be highly personalized. From the data management point of view, mobile users will likely carry only the fraction of data they need to access frequently while mobile. A new challenge arises in coping with the consistency requirements on databases (on both mobile and fixed hosts), especially when those fractions are not completely independent of each other. Many techniques have already been proposed to address this problem, including
• Transaction management for mobile databases [27,36].
• Allocation of mobile database replication (materialized view) on the fixed network [8,9,25].

(3) In general, mobile hosts have low security. In the worst case, for example, the data on a mobile host would be completely lost if the host were stolen.

As a consequence of the above characteristics, mobile databases, in general, have a high degree of unavailability. It is not too much to say that most data management issues in mobile information systems are related, directly or indirectly, to the problem of low data availability; data availability is thus the central issue in mobile data management. Accordingly, addressing the problem of low data availability would contribute significantly to the establishment of mobile database technology.
3. Replication in mobile databases

Replication is a general technique for increasing data availability. However, the generally available replication technologies assume deployment in a fixed distributed environment. Replicas of mobile databases that are allocated on the fixed network offer many benefits, such as

(1) Higher data availability and lower cost of remote access
When replicas of mobile databases are available on the fixed network, the data remain available even when the mobile hosts that hold the master data are disconnected from the network. Moreover, since replicas are available on the fixed network, other mobile users can use them instead of the master copies (from now on, we use master to refer to the master copy), and thus one wireless connection can be omitted from the path of a remote read [8].

(2) More efficient access to data
When replicas of mobile databases are available on the fixed network, we can use the computing resources on the fixed network to perform resource-intensive data processing. The mobile hosts then receive only the result of the data processing. For example, agent-based access [30] is one option for this approach.

(3) Higher security
When replicas of mobile databases are available on the fixed network, the replicas can be used as active backups of the masters on the mobile hosts. The existence of active backups can reduce data loss when the mobile hosts holding the masters encounter a situation that disconnects them for a long time, or an even more catastrophic situation such as permanent damage or theft.

On the other hand, using replicas generally increases the complexity of data management. The following points should be considered when deploying replication in mobile databases:
• Increasing the number of replicas increases the costs of updates and signaling.
• Mobile hosts can move anywhere and anytime. Depending on the network architecture, the distance of a movement in the network can differ from that of the movement in the real world. A small movement in reality might result in a distant movement from the network point of view.
• Caching is becoming an alternative for applications that do not need strict consistency. In this paper, we consider mobile databases for which consistency is a special concern. Accordingly, we do not discuss caching further. Interested readers should see [5,10,12,15,31] for more details on caching issues in mobile computing environments.
4. Replication strategies in mobile databases

In general, a replication strategy determines the behavior of a replication system from the viewpoints of dynamics (the ability to adapt to changes in the working environment), replication level (the number of replicas), and placement (location). In this paper, we discuss these matters from the mobile database point of view and compare some representative replication strategies for mobile databases. In selecting suitable replication strategies, we need to consider many characteristics of the environments in which the strategies will be deployed. As discussed briefly in the introduction, the use of a wireless network and the limited computing resources of mobile hosts are the most important characteristics that distinguish the mobile computing environment from other kinds of distributed environments. In order to cope with the problems these characteristics might introduce, we have suggested employing fixed-network-resident replication for mobile databases. In the following, we discuss in detail the strategies for replicating mobile databases on the fixed network.

4.1. Replication dynamics

According to the behavior of replicas, replication schemes can be divided into two categories, i.e., static replication and dynamic replication. In static replication schemes, the location and the number of replicas are chosen prior to deployment. The traditional replication schemes discussed in [13,19,21], for example, fall into this category. Manual recalculation of the access cost and redistribution of replicas are necessary to reflect new access patterns. This is acceptable in traditional distributed environments, since the participating sites have fixed locations and the access patterns are relatively static.
In a mobile environment, however, static replication schemes may not perform well since the assumptions of fixed locations and static access patterns no longer hold [3]. In dynamic replication schemes, on the other hand, the location and/or the number of replicas changes according to the access patterns of the data being replicated. Dynamic replication schemes, such as those in [1,34], try to overcome the above-mentioned problem by continuously maintaining statistics about access patterns and/or system workload so as to dynamically recalculate access
cost and reconfigure the replication structure to adapt to changes in access patterns. In general, this is desirable for a mobile environment [20,32].

4.2. Replication level

The replication level has a significant effect on the performance of a system with replication. There is a trade-off that must be considered carefully before deciding on the level of replication: a high level of replication tends to decrease the cost of queries but to increase the maintenance costs, i.e., the costs of storage and update. In this paper, we consider two general policies on replication level, i.e., single replication and multireplication. According to the study in [28], single replication suffices to serve distributed environments under moderate conditions (that is, queries are issued at most 4–7 times more often than updates), and it is disadvantageous to have more than a few replicas unless the update to query ratio is unduly low. In a mobile environment, however, mobile hosts are dynamic. They can move anywhere and for an unpredictable length of time. Furthermore, the users of replicas may need to work at several "well-known" sites. In such cases, it may be more advantageous to deploy multireplication, i.e., to place a replica at each "activity center" of its users.

4.3. Replica placement

The placement of replicas significantly affects the overall performance of a system with replication. In general, the benefit of creating a replica is high if it is placed close to its readers. In a mobile computing environment, however, mobile hosts can move anywhere and anytime, resulting in a highly dynamic system. Accordingly, the center of activity of replica readers is generally not static. Moreover, the mobility of users makes it more costly to find the current location of mobile users and, if the replicas are dynamic, of the replicas as well. In this sense, it is necessary to strike a balance between the cost of finding locations and the cost of replica maintenance.
In this paper, we consider three general replica placement policies for mobile databases, i.e., replication at home (RAH), replication close to the writer (RCW), and replication close to the reader (RCR). The home location can be a good place for a replica, because in a mobile computing environment it is generally easier to find the home location of a mobile host than its current location. Furthermore, considering current business styles, it is common for a company to have already employed RAH, in the sense that it pools databases in a server located at the company's main office and the mobile users access them remotely. Because replicas at the home location never relocate, RAH is a kind of static replication. In contrast, RCW and RCR are dynamic. In RCW, the replica "follows" the movement of the mobile host holding the master (primary copy). By keeping the replicas close to the writer(s), we intend to keep the cost of updates low while still taking advantage of replication. Primary tracking replica allocation (PTRA) [8] is the adaptation of RCW for single replication under a special assumption, i.e., that the master holder updates the data most frequently. RCR, on the other hand, aims to take maximum advantage of replication to reduce the cost of reads.
This is done by placing a replica at the "center of activity" of its readers. User majority replica allocation (UMRA) [8,9] is an adaptation of RCR for single replication.

4.4. Comparison of various replication strategies in mobile databases

Here, we compare five representative replication strategies for mobile databases, which combine the three aspects of replication policy mentioned above. The five strategies are static-single, static-multi, dynamic-single, dynamic-multi, and primary-tracking. We intend to show how some important parameters related to the characteristics of data access in the mobile environment affect the performance of these strategies. The comparison was done by simulation experiments, and our goal is to suggest to readers which strategy is best to deploy. As the performance measure, we use the average access cost of data. In a networked environment, cost is mostly associated with the number of network packets transferred to complete the activity being observed. In a mobile computing environment, network packets can generally be divided into two classes, i.e., data packets and signaling packets. A data packet consists of user data transferred from the server to the client and vice versa. A signaling packet, on the other hand, consists of data used by the system, such as routing information, location lookups, and location updates. However, for the reasons given below, we simply ignore signaling packets in our model.
1. Generally, the size of a signaling packet is much smaller than that of a data packet. Thus, ignoring signaling packets improves the clarity of the analysis and of the resulting observations on the characteristics of each strategy.
2. In the mobile environment, signaling packets are exchanged over a channel separate from that of data packets.
For example, the SS7 [26] signaling network is used for this purpose in personal communications service (PCS). Therefore, separating the analysis of these two kinds of packets is more logical and brings our model closer to the real situation in mobile computing environments.

In our simulation, we assume that the mobile network forms a tree network with fanout f and h levels. With regard to the model of the mobile environment, the leaf nodes of the network represent registration areas (RAs) while the inner nodes represent routers. Each mobile user can move to another RA independently. Furthermore, we assume that a replica server exists in each RA. We consider two types of user events, i.e., move and data access. User events in the mobile environment occur randomly according to the Poisson distribution, and among user events, move operations occur with ratio m. We assume that the events of a user are mutually exclusive, i.e., a mobile user cannot perform both a data access and a move operation at the same time. Furthermore, user events are synchronous in the sense that a user cannot start another event until the current event has completed. For fairness of cost evaluation, the time-related parameters are based on the number of accesses, not on clock time. We run each simulation for 10 000 accesses. Accordingly, the average access cost is calculated by dividing the cumulative access cost by 10 000.
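The event model described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulator used for the reported experiments: the function names, the arrival rate, the convention that the root counts as level 1 (so that a tree with h levels has f^(h-1) leaves), and the uniform choice of the destination RA are all hypothetical choices made only for this sketch.

```python
import random


def num_registration_areas(fanout: int, levels: int) -> int:
    """Number of leaves (RAs) in a complete tree.

    Assumption: the root is level 1, so a tree with `levels` levels
    has fanout ** (levels - 1) leaves.
    """
    return fanout ** (levels - 1)


def generate_events(move_ratio, n_accesses, n_areas, rate=1.0, seed=0):
    """Yield (time, kind, area) tuples for a single mobile user.

    Events arrive as a Poisson process (exponential inter-arrival times).
    Each event is a move with probability `move_ratio`, otherwise a data
    access. Events are strictly sequential, so a user never moves and
    accesses data at the same time. Generation stops after `n_accesses`
    data accesses, mirroring the access-count-based stopping rule.
    """
    if not 0.0 <= move_ratio < 1.0:
        raise ValueError("move_ratio must be in [0, 1)")
    rng = random.Random(seed)
    clock, area, accesses = 0.0, 0, 0
    while accesses < n_accesses:
        clock += rng.expovariate(rate)
        if rng.random() < move_ratio:
            # Assumption: the destination RA is chosen uniformly at random.
            area = rng.randrange(n_areas)
            yield clock, "move", area
        else:
            accesses += 1
            yield clock, "access", area
```

For example, `generate_events(0.25, 10000, num_registration_areas(2, 4))` produces one user's event stream for m = 0.25 on a binary tree with 8 RAs; the average access cost would then be the cumulative cost of the access events divided by 10 000.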
Without loss of generality, we assume that 50 mobile users share their databases with each other. The accesses include both reads and updates. Access requests arrive according to the Poisson distribution, and the access configuration, i.e., the proportions of reads and updates, is determined by the write ratio w. We assume that an update operation is preceded by a read operation. In implementing the replication strategies, we first assume that each mobile user holds a database that is considered the master database. A portion of the master database on each mobile user is shared with other users and therefore becomes the subject of replication. The replication strategies work in the read-one-write-all (ROWA) [6,11,29] context to ensure one-copy serializability [6]. We assume that the shared database portions are all of the same size. Furthermore, since we ignore signaling packets in our model, and in order to simplify the evaluation, we take a replica as the unit of access cost, i.e., the replica size is 1.

We implemented the static-single replication strategy by giving each master database a single replica stored statically at the home location. In this way, the static-single replication strategy is in essence equal to RAH. For the static-multireplication strategy, on the other hand, we replicated each master database in every RA (specifically, in the replica server of each RA). In other words, the static-multireplication strategy is implemented as static, full replication. The dynamic replication strategies are implemented as follows. As indicated by its name, the dynamic-single replication strategy is implemented by giving each master database a single replica. In this strategy, each replica is initially allocated at the home location. As the simulation runs, mobile users start to move and access the shared data.
Every read request from a mobile user is recorded by the replication manager in the RA where the user currently resides. Periodically, the access statistics from all RAs are collected and compared. Based on the result of the comparison, the system relocates a replica to the RA from which most of the read accesses to it are requested, and the statistics are reinitialized. In this way, the dynamic-single replication strategy is equal to UMRA. In the simulation, the access statistic evaluation period is set to 150 accesses.

For the dynamic-multireplication strategy, we adopt adaptive data replication (ADR) [34] with some modifications. The original ADR, metaphorically, forms a variable-size amoeba that stays connected at all times and constantly moves towards the "center of read–write activity". The replication scheme expands as the read activity increases, and it contracts as the write activity increases. In our model, we have assumed that the replicas are allocated on the replica servers associated with RAs. This means that the dynamic-multireplication strategy does not assume any connected situation. However, as in the original ADR, read requests are served by the closest replica and all access requests (including updates) are recorded. In each RA, the access statistics are tested periodically. A replica of the data is made available in an RA if, during the access statistic evaluation period, the number of its reads is greater than the number of its writes. Otherwise, the RA ceases to keep the replica. In this way, the replication level changes dynamically following the read–write patterns, but it is guaranteed that at least one replica of each data item exists in the fixed network.

The primary-track replication strategy is also dynamic. In this strategy, the replica is also kept single and is expected to reside close to its master database (primary copy). Accordingly, when an
update on the master holder needs to be reflected in the replica, the replica manager of the RA where the master holder currently resides first checks whether the replica exists locally. If it does not, the replica manager issues a relocation request for the replica. After the replica has relocated to the RA, the update is synchronized, and the cost of the update is virtually 0. This strategy, in fact, resembles the PTRA strategy.

For the sake of simplicity, the access cost is evaluated based on the fixed connection distance model. In this model, the access cost is calculated as follows:
1. Each node is connected to its neighbors at the same distance.
2. The access cost is calculated based on the number of inter-node connections that must be traversed in order to access the data.
3. Only the fixed connections are considered.
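The fixed connection distance model can be illustrated with a short Python sketch. It is a minimal example under assumptions that are not taken from the simulation itself: RAs are numbered 0, 1, 2, ... from left to right as the leaves of a complete tree with fanout f, every update is sent to each replica as a separate transfer, and the function names are hypothetical. Reads go to the closest replica and updates go to all replicas, following the ROWA context, with the replica size taken as 1.

```python
def hop_distance(a: int, b: int, fanout: int) -> int:
    """Number of fixed connections between leaves a and b.

    Both leaves climb one level at a time (index // fanout is the parent's
    index within its level) until they meet at a common ancestor. Each step
    adds one connection on each side of the path.
    """
    hops = 0
    while a != b:
        a //= fanout
        b //= fanout
        hops += 2
    return hops


def read_cost(reader_ra: int, replica_ras, fanout: int) -> int:
    """Read-one: the request is served by the closest replica."""
    return min(hop_distance(reader_ra, r, fanout) for r in replica_ras)


def update_cost(writer_ra: int, replica_ras, fanout: int) -> int:
    """Write-all: the update is propagated to every replica.

    Assumption: each replica is updated by its own transfer, so the cost
    is the sum of the individual distances.
    """
    return sum(hop_distance(writer_ra, r, fanout) for r in replica_ras)
```

With fanout 2 and 8 RAs, a user in RA 5 pays `read_cost(5, [0], 2) == 6` and `update_cost(5, [0], 2) == 6` when the only replica is in RA 0. Under full replication, `read_cost(5, range(8), 2) == 0` but `update_cost(5, range(8), 2) == 34`, which shows why the update cost dominates when every RA holds a replica.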
4.4.1. The effect of network scale

In general, the average access cost increases in all strategies when the network scales up. However, the way the access cost increases differs slightly between strategies, depending on the move and update frequencies. Fig. 3 shows several simulation results. In Fig. 3, the horizontal axis indicates the network scale in terms of the level h of the tree network. This means that the number of RAs grows exponentially with h.

In Fig. 3(a)–(d), we can first see clearly that the static-multi strategy does not scale well. The access cost increases exponentially with the scale of the network. Since static-multi is implemented using full replication, the cost of reads is virtually 0, so it is clear that this behavior is caused by update operations. The static-single (RAH) and dynamic-single (UMRA) strategies behave almost identically in all cases, but RAH performs slightly better than UMRA. The performance difference, however, widens as the network scales up. Since we do not take control packets into consideration, the ways in which the static-single and dynamic-single strategies serve access requests are essentially the same. Accordingly, replica relocation is considered the main factor in the performance difference, since it involves one remote read. This means that making a single replica dynamic is not very useful when the users' activities cover a wide area.

The above observation, however, does not apply to the primary-track (PTRA) strategy, although PTRA also adopts the single replication policy. For example, when m = 0.25 and w = 0.75, PTRA beats the other replication strategies. In contrast, when the system is highly mobile (m is high) or the network scales up, PTRA's performance can be worse than that of the other strategies, even those that implement multireplication such as the dynamic-multi strategy.

4.4.2. The effect of movement frequency

Fig.
4 shows some experimental results related to the effect of move/event ratio. In all figures, except for the static-multi strategy, the access cost when m ¼ 0:00 is equal to 0. The reason is that the mobile users do not perform any movement and they reside in the home location until the end of a simulation run. In that case, the behavior of the static-single, dynamic-single, primary-track and dynamic-multi strategies are similar, i.e., they keep a replica only on the home location. As
Fig. 3. The effect of varying the level h of the mobile network: (a) m = 0.25, w = 0.25; (b) m = 0.25, w = 0.75; (c) m = 0.75, w = 0.25; (d) m = 0.75, w = 0.75.
for the static-multi strategy, the cost cannot be 0 even in that case, since replicas are placed on all RAs from the beginning and network connections are therefore needed to propagate updates to all replicas.
When h = 2, we can observe that the behavior of all strategies is quite similar until m reaches 0.5. When m = 0.75, however, the performance of the primary-track strategy clearly degrades. This phenomenon can be explained as follows. When mobility is high, mobile users have a high probability of moving. In PTRA, since the replica must be relocated to follow the master holder, an increase in mobility directly increases the number of relocation operations, each of which needs a remote read from the destination.
When h = 8, the data access cost of the static-multi strategy is so high that it is intentionally not shown in Fig. 4(c) and (d). The fact that PTRA suffers from increased movement can be observed more clearly in this case. On the other hand, the static-single, dynamic-single and dynamic-multi strategies are not affected by the increase in the movement ratio. In particular, when users relocate frequently, RAH becomes the best choice. However, when the movement ratio is low, we can see that PTRA is quite effective, especially when the update ratio is high. It
Fig. 4. The effect of varying move/event ratio m: (a) h = 2, w = 0.25; (b) h = 2, w = 0.75; (c) h = 8, w = 0.25; (d) h = 8, w = 0.75.
should be noted here that no setting in the simulation makes the master holder dominate the updates.
4.4.3. The effect of update frequency
Fig. 5 shows the effect of varying the write/access ratio w on the access cost. When the network scale is small, all strategies perform quite similarly, except for PTRA, which suffers if users relocate too frequently. However, among the strategies compared here, PTRA is the least affected by changes in the access pattern, even when the network scale is large. In particular, when the movement rate is low, a high write ratio favors PTRA. In general, the multireplication strategies perform much worse than the single replication strategies as the update ratio increases.
4.4.4. The effect of concentrated activities
So far, we have discussed the performance of the replication strategies without considering users' movement patterns. In reality, however, it is often the case that mobile users have certain
Fig. 5. The effect of varying write/access ratio w: (a) h = 2, m = 0.25; (b) h = 2, m = 0.75; (c) h = 8, m = 0.25; (d) h = 8, m = 0.75.
moving patterns. For example, a professor may be scheduled to be at the university during weekdays, go back home at night and go fishing during the weekend. A salesman has customers that he visits regularly. In this part, we discuss the effect of movement patterns on the access cost.
In the simulation experiments, we model the users' movement patterns as a skew in the selection of movement destinations. In other words, a parameter controls the simulation program so that users move to specific cells with the probability set in the parameter. We selected three RAs to which any user goes with the probability indicated by the parameter concentration ratio. Figs. 6 and 7 show the simulation results of varying the activity concentration ratio for the cases in which the centers of activities are close to and far from the home location, respectively. It should be noted that, since the data access cost of the static-multi replication strategy is incomparably higher than that of the other strategies, the static-multi strategy is omitted from the figures. In any case, the static-multi replication strategy is not affected by concentrated accesses.
Fig. 6. Varying the concentration ratio when the centers of activities are close to the home location: (a) h = 8, w = 0.50, m = 0.25; (b) h = 8, w = 0.50, m = 0.75.
Fig. 7. Varying the concentration ratio when the centers of activities are far from the home location: (a) h = 8, w = 0.50, m = 0.25; (b) h = 8, w = 0.50, m = 0.75.
When mobile users are concentrated, we can observe that the concentration has a good effect on the performance of all replication strategies except static-multi. Among them, the dynamic-multi replication strategy receives the most significant performance improvement; typically, its performance increases by about 50% when the concentration ratio increases by about 200%.
When users are concentrated around the home location, it is not surprising that the static-single strategy achieves the best performance. However, its performance gain due to user concentration is still slightly smaller than that of the dynamic-multi strategy. PTRA also gains performance, but the movement ratio remains the most influential performance factor. When the movement rate is high, however, an increase in the concentration ratio gives the biggest saving, since the replica will "follow" the master holder moving to the place crowded by other users who will access it.
When the centers of activities are far from the home location, most replication strategies are not affected, except the static-single strategy, whose performance tends to go down as the concentration ratio increases. This is to be expected, however, since the cost of accessing replicas is
high both for read and update for most users, thus decreasing the benefit of allocating replicas on the fixed network.
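The skewed destination selection used to model concentrated activities might look like the following sketch. It assumes that, with probability equal to the concentration ratio, a moving user picks one of the selected RAs uniformly, and otherwise picks uniformly among all RAs; how the remaining probability is distributed is not stated in the text, and the function and parameter names are hypothetical.

```python
import random


def pick_destination(all_ras, hot_ras, concentration_ratio, rng):
    """Choose the next RA for a moving user.

    With probability concentration_ratio the user goes to one of the
    selected centers of activity (hot_ras); otherwise the destination is
    drawn uniformly from all RAs (an assumption of this sketch).
    """
    if rng.random() < concentration_ratio:
        return rng.choice(hot_ras)
    return rng.choice(all_ras)


if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed so runs are repeatable
    all_ras = list(range(100))       # hypothetical RA identifiers
    hot_ras = [3, 17, 58]            # three centers of activity
    moves = [pick_destination(all_ras, hot_ras, 0.6, rng) for _ in range(10000)]
    share = sum(m in hot_ras for m in moves) / len(moves)
    print(f"share of moves ending in a center of activity: {share:.2f}")
```

Placing the three selected RAs near or far from the home location would then correspond to the two cases shown in Figs. 6 and 7.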
5. Replication in mobile–P2P environment
Since the P2P environment is very closely related to the mobile environment, it is interesting to discuss the database replication issue in the mobile–P2P environment, and how and under what conditions replication can help solve the problem of data availability there.
In a mobile–P2P environment, the most important characteristic that affects data availability is the nature of the ad hoc network. In an ad hoc network, hosts are connected to the network temporarily. Furthermore, hosts also play the role of routers, and they communicate with each other directly without the intervention of dedicated hosts like the MSSs in the mobile environment. Accordingly, there is no fixed address, and moreover there is no principle of location.
Since there are no dedicated hosts that act as routers, the network connections obviously become more prone to disconnection. When the role of router is given to mobile hosts, the probability that the network gets disconnected is, roughly, proportional to the square of that in the mobile environment (assuming that MSSs never fail). Consequently, in the mobile–P2P environment, data availability will degrade by the same order of magnitude. Thus, replication will play an even more important role in the mobile–P2P environment. Recently, some replica allocation methods for improving data accessibility in ad hoc networks have been proposed [17,18].
Although replication can improve data accessibility, using replication in the way we have discussed so far may not be adequate. The fact that hosts are connected to the network temporarily makes it difficult to guarantee one-copy serializability in a timely and efficient manner, since we rely on mobile hosts, not fixed hosts, to communicate with other hosts that are not directly reachable.
When hosts are disconnected more often and the applications have high transaction rates, the deadlock and reconciliation rates will experience cubic growth. The database at each host diverges further and further from the others as reconciliation fails. Each reconciliation failure implies differences among hosts. Soon, the system suffers system delusion: the database is inconsistent and there is no obvious way to repair it [16].
In a P2P environment, all member hosts have the same level of autonomy. From the resource sharing point of view, the roles of client and server do exist, but there is no fixed assignment of the roles to particular hosts. Any host can be a server and a client at the same time. At any time, a member host can request unused resources available on other hosts, or may have to provide its unused resources to others who request them. If we apply this principle to systems with replication, then there will be no principle of master and replica. Accordingly, a scheme such as two-tier replication [16], which relies on execution on the master, cannot be used in the mobile–P2P environment. It seems that we need to abandon serializability in favor of the convergence property, i.e., if no new transactions arrive and all the hosts are connected together, they will all converge to the same replicated state after exchanging replica updates. Lotus Notes [24] is a good example of a system that supports the convergence property.
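One simple way to obtain the convergence property is a timestamped-replace (last-writer-wins) rule, sketched below. This is an illustrative assumption, not the mechanism of any particular system: each replica is modeled as a dictionary mapping a key to a (timestamp, host id, value) triple, and all names are hypothetical.

```python
from typing import Dict, Tuple

Version = Tuple[int, str, str]   # (timestamp, host id, value)
Replica = Dict[str, Version]


def merge(local: Replica, remote: Replica) -> Replica:
    """Return the state of `local` after receiving the updates held by `remote`.

    For every key the version with the larger (timestamp, host id) pair wins;
    the host id only breaks ties between equal timestamps.
    """
    merged = dict(local)
    for key, version in remote.items():
        if key not in merged or version[:2] > merged[key][:2]:
            merged[key] = version
    return merged


if __name__ == "__main__":
    host_a: Replica = {"price": (1, "A", "100")}
    host_b: Replica = {"price": (2, "B", "120"), "stock": (1, "B", "7")}
    # Once the hosts are connected and exchange updates in either order,
    # both end up with the same replicated state.
    print(merge(host_a, host_b) == merge(host_b, host_a))
```

Because the rule simply discards the older write to "price", some updates are lost rather than serialized, which is the price paid for giving up serializability in favor of convergence.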
6. Data management issues in general P2P environment
In a distributed environment, ideally, each participant should be on an equal level. It is clear that the mobile computing environment is not yet an ideal distributed environment, because the MSSs hold "too much" of a role. In a mobile computing environment, a mobile host must first initiate a connection to the closest MSS before establishing a network connection with other hosts. Clearly, such an architecture makes MSSs the most crucial systems for mobile hosts to exist as parts of the distributed environment. Since an MSS is responsible for managing all network traffic to and from the mobile hosts in its coverage, the scalability of MSSs is our primary concern. MSSs can be the bottleneck in the mobile computing environment.
The ideal solution to this problem is to free any participant in a distributed environment to initiate a connection directly to others as needed. P2P is exactly the architecture for that purpose. Currently, the P2P architecture is receiving much attention from researchers as an alternative for realizing a more autonomous distributed environment.
6.1. Data management issues in P2P computing environment
The concept of P2P is not really new. The ad hoc network is a network architecture that is closely related to the P2P paradigm and has been around for several years. So far, the ad hoc network has often been tightly associated with mobile computing, since it assumes that the network is built on wireless communication. However, P2P exists not only in the wireless world. In the Internet world, Napster [45] popularized the P2P concept by introducing a music sharing service, which unfortunately was banned recently due to a problem related to copyright violation. Nevertheless, P2P has brought us closer to the realization of an ideal distributed environment.
So far, many applications have taken advantage of P2P technology.
Napster, Gnutella [39], and Freenet [37] let users directly exchange music files. ICQ [40] enables its users to exchange personal messages. In systems like SETI@home [48], computers exchange available computing cycles. Moreover, a system like LOCKSS [44] lets sites exchange storage resources to archive document collections.
In the following, we discuss the general data management issues in P2P computing environments. As mentioned above, a P2P system can be built on a fixed or a wireless network. When it is built on a fixed network, the network connection is relatively stable and the availability of sufficient computing resources is not a problem. However, the situation is different if the P2P system is built on an ad hoc wireless network. In the following discussion, we focus on the data management issues in the wireless P2P environment.
6.1.1. Query processing
In P2P, the biggest challenge is enabling peers to find one another. This issue, however, is not specific to P2P but also applies to other environments that employ ad hoc networks. P2P requires peers to access decentralized resources that do not have a fixed address or location. Because of this, P2P will require a new IP-addressing scheme.
The difficulty of finding resources has a significant effect on query processing performance. In Gnutella, for example, users search for files by flooding the network with queries, and having each
computer look for matches on its local disk. Clearly, this type of solution may have difficulty scaling to large numbers of sites or complex queries. In order to alleviate the drawback of such an expensive search strategy, many systems prefer to use a mix of P2P and client/server (CS) systems. Napster and Pointera [47] fall into this category. In such hybrid P2P systems, some tasks such as searching are done in a centralized manner, which is much more efficient. Yang and Garcia-Molina studied the performance of hybrid P2P systems in file-sharing applications and discussed the tradeoffs among some general architectures for such systems [35].
The constraints of the P2P environment and the requirement for timely query processing will also make approximate answers to queries more acceptable, especially if we intend to stick to pure P2P systems. In pure P2P systems, no element has the specific role of processing requests in a centralized manner. Even tasks such as routing management must be handled by the participating peers alone. In this situation, some peers whose data is to be queried might not be reachable, especially mobile peers. In this sense, queries issued in pure P2P systems are likely to receive approximate rather than exact answers. Results potentially adaptable to the P2P environment include quasi-copies [2] and, if the queries have regularity, semantic data caching [14].
When performing distributed query processing, query optimization is also among the important issues. In a P2P system, it may be useful to consider the availability of hosts on the network in query optimization. For example, for hosts that are hard to find, we can cache their data locally. During query optimization, the query is then decomposed into two parts. The part originally intended to be processed at those hosts can be processed using the local cache. In this way, we can buy time by sacrificing data freshness.
For the other part, we can perform the usual remote processing.
6.1.2. Transaction processing
Although the concept of P2P is not new, the range of applications that can benefit from the P2P environment is currently still limited, and most of them are related to sharing files or other resources. Such applications, in general, do not need to be transactional.
So far, databases have always been associated with applications that need strict data consistency, where transactions play an important role. Accordingly, we have been using databases as if we were working with a single database, even though they are distributed. As a result, a tight connection between nodes has been required.
It is clear that the conditions in P2P environments are quite different from those of existing database systems. In P2P, peers do not have a fixed location. Furthermore, since no participant has the specific role of finding a certain peer, finding all the peers that hold the data to be updated is not an easy task, and the result cannot be guaranteed. Accordingly, it is almost impossible to impose strict consistency on P2P databases, and the situation is much worse than that of mobile databases. We might therefore need to use a more relaxed consistency criterion than serializability. For example, the above-mentioned convergence property can be used in place of serializability.
Due to the nature of an ad hoc network, which can get disconnected at any time, we might need to use a model different from the traditional transaction model. In the P2P environment in particular, peers have the same level of autonomy, and therefore the transaction model might need to support a high degree of autonomy in transaction processing. Furthermore, it is also desirable that transactions executed in a distributed fashion over several hosts do not abort just because contact with some hosts is lost. The open nested transaction model [33] is an example of an advanced transaction model that supports autonomy.
6.1.3. Security
In a P2P environment, peers communicate directly with each other and must handle tasks such as routing management. However, the routing requests to be processed include not only local requests but also requests from other peers. In general, peers are required to share some of their computing resources in order to keep the environment working. Unfortunately, if such resource sharing is done in an uncontrolled fashion, it will not only hurt the peers but might also destroy the entire environment. All of this also applies to data. In general, it is not desirable to share resources or data with unauthorized users. Accordingly, security is among the most important issues in a P2P environment. However, many of the security issues in the P2P environment are not specific to database systems.
Besides resource sharing, peers might need the help of unauthorized peers in order to transfer data to others. In that case, the peers that provide routing could access the data illegally. Accordingly, it may be important to encrypt data so that only the peers that participate in the data sharing know how to decrypt it. A security system like public-key infrastructure (PKI) [46] might be useful for this purpose.
6.1.4. Interoperability
In a P2P environment, peers will communicate and have to work with other peers using a variety of operating systems, networks and applications. Currently, there are many computer systems and applications, and there is no interoperability among them. Most notably, binary data created on computers running Windows is, for example, very likely unreadable on a Macintosh computer. Database systems are no exception.
In P2P, data should be readable by any peer that is eligible to access it, regardless of the systems and software used.
Today's P2P applications generally perform relatively simple tasks, such as transferring MP3 music files, so they are able to work with simple interoperability technologies such as translation scripts and wrappers. In the future, however, P2P applications may be required to perform complex tasks that involve direct manipulation of data, as in systems that use replication. At that time, more advanced interoperability technologies such as XML-based data description will be needed.
Interoperability can also be achieved by using applications implemented on a platform-neutral system, such as the Java language. Java, for example, includes the JDBC [42] package, which is useful for handling a database in a system-independent way from within Java. Recently, a Java package that supports P2P, called JXTA [43], has been released by Sun. It makes Java more and more attractive for realizing P2P applications.
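As a rough illustration of what an XML-based data description could look like, the sketch below serializes a database record into a self-describing XML string and reads it back, using only the Python standard library. The element and attribute names (record, field, table, name) are hypothetical and do not follow any particular standard.

```python
import xml.etree.ElementTree as ET


def record_to_xml(table: str, record: dict) -> str:
    """Describe one record as platform-neutral XML text."""
    root = ET.Element("record", {"table": table})
    for name, value in record.items():
        field = ET.SubElement(root, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str):
    """Recover the table name and the field values (as strings) from the XML text."""
    root = ET.fromstring(xml_text)
    fields = {f.get("name"): f.text for f in root.findall("field")}
    return root.get("table"), fields


if __name__ == "__main__":
    message = record_to_xml("orders", {"item": "notebook", "quantity": 3})
    print(message)
    print(xml_to_record(message))
```

In this sketch all values come back as strings; carrying type information (for example in an additional attribute) would be one possible extension if peers need to manipulate the data directly.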
7. Conclusion
The availability of the mobile computing environment is a result of incredible research work in the areas of computer hardware and communication technologies. In a mobile computing
environment, we can access data anytime and anywhere, even in the absence of a fixed network connection. In this paper, we have discussed replication strategies that can improve the availability of mobile databases. Because of the use of wireless networks, mobile databases have very low availability without replication, which makes data sharing among mobile users difficult. Through simulation studies, we showed that the performance of replication strategies depends heavily on many conditions, including the network scale, mobility, access ratio and access concentration. We showed that in most circumstances, single replication strategies such as RAH, UMRA and PTRA perform better than the multiple replication strategies.
We then discussed replication and general data management issues in the P2P environment. Currently, the P2P environment is receiving much attention from many researchers and is considered the distributed computing style of the future, one that better utilizes the computing resources proliferating around the world. Since many of the current notions of distributed computing, such as address, location and client/server, do not exist in P2P, we generally need more advanced data management technologies that either integrate new alternatives to replace that information or can work without it.
Acknowledgements
This research was supported in part by the Special Coordination Funds for Promoting Science and Technology from the Ministry of Education, Culture, Sports, Science and Technology, Japan under the Project "Establishing P2P Information Infrastructure for Mobile Computing Environment", by the Research for the Future Program of Japan Society for the Promotion of Science under the Project "Advanced Multimedia Content Processing (Project No. JSPSRFTF97P00501)" and by the Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science, Sports and Culture, Japan under grants numbered 08244103 and 09230212.
References
[1] S. Acharya, S.B. Zdonik, An efficient scheme for dynamic data replication, Technical Report CS-93-43, Department of Computer Science, Brown University, 1993.
[2] R. Alonso, D. Barbara, H. Garcia-Molina, Data caching issues in an information system, ACM Transactions on Database Systems 15 (3) (1990) 359–384.
[3] B.R. Badrinath, T. Imielinski, Replication and mobility, in: Proceedings Second Workshop on the Management of Replicated Data, 1992, pp. 9–12.
[4] D.J. Baker, J. Wieselthier, A. Ephremides, A distributed algorithm for scheduling the activation of links in a self-organizing, mobile, radio network, in: Proceedings IEEE ICC-82, 1982, pp. 2F6.1–2F6.5.
[5] D. Barbara, T. Imielinski, Sleepers and workaholics: caching strategies in mobile environments, VLDB Journal 4 (4) (1995) 567–602.
[6] P. Bernstein, V. Hadzilacos, N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
[7] J. Broch, D.A. Maltz, D.B. Johnson, Y.C. Hu, J. Jetcheva, A performance comparison of multi-hop wireless ad hoc network routing protocols, in: Proceedings MOBICOM-98, 1998, pp. 85–97.
[8] Budiarto, K. Harumoto, M. Tsukamoto, S. Nishio, T. Takine, Replica allocation strategies for mobile databases, IEICE Transactions on Information and Systems E81-D (1) (1998) 112–121.
[9] Budiarto, K. Harumoto, M. Tsukamoto, S. Nishio, On relocation decision policies of mobile databases, IEICE Transactions on Information and Systems E82-D (2) (1999) 412–421.
[10] O.A. Bukhres, J. Jing, Performance analysis of adaptive caching algorithms in mobile environments, Information Sciences 95 (1) (1996) 1–27.
[11] S. Ceri, G. Pelagatti, Distributed Database Principles and Systems, McGraw-Hill, New York, 1984.
[12] B.Y.L. Chan, A. Si, H.V. Leong, Cache management for mobile databases: design and evaluation, in: Proceedings ICDE-98, 1998, pp. 54–63.
[13] B. Ciciani, D.M. Dias, P.S. Yu, Analysis of replication in distributed database systems, IEEE Transactions on Knowledge and Data Engineering 2 (2) (1990) 247–261.
[14] S. Dar, M.J. Franklin, B.T. Jonsson, D. Srivastava, M. Tan, Semantic data caching and replacement, in: Proceedings VLDB-96, 1996, pp. 330–341.
[15] C.C.F. Fong, J.C.S. Lui, M.H. Wong, Quantifying complexity and performance gains of distributed caching in a wireless mobile computing environment, in: Proceedings ICDE-97, 1997, pp. 104–113.
[16] J. Gray, P. Helland, P.E. O'Neil, D. Shasha, The dangers of replication and a solution, in: Proceedings SIGMOD-96, 1996, pp. 173–182.
[17] T. Hara, S. Nishio, Replica allocation strategies for improving data accessibility in ad hoc networks, in: Proceedings 2000 Symposium on Multimedia Distributed Cooperative and Mobile Systems (DICOMO 2000), 2000, pp. 7–12 (in Japanese).
[18] T. Hara, Effective replica allocation in ad hoc networks for improving data accessibility, in: Proceedings IEEE INFOCOM-01, 2001, pp. 1568–1576.
[19] A.A. Helal, A.A. Heddaya, B.B. Bhargava, Replication Techniques in Distributed Systems, Kluwer Academic Publishers, Dordrecht, 1996.
[20] Y. Huang, A.P. Sistla, O. Wolfson, Data replication for mobile computers, in: Proceedings ACM SIGMOD-94, 1994, pp. 13–24.
[21] S.Y. Hwang, K.K.S. Lee, Y.H. Chin, Data replication in a distributed system: a performance study, in: Proceedings DEXA-96 (LNCS 1134), Springer-Verlag, 1996, pp. 708–717.
[22] T. Imielinski, B.R. Badrinath, Data management for mobile computing, SIGMOD Record 22 (1) (1993) 34–39.
[23] S.J. Lai, A.B. Zaslavsky, G.P. Martin, L.H. Yeo, Cost efficient adaptive protocol with buffering for advanced mobile database applications, in: Proceedings DASFAA-95, 1995, pp. 87–94.
[24] T.L. Lai, E. Turban, One organization's use of Lotus Notes, Communications of the ACM 40 (10) (1997) 19–21.
[25] S.W. Lauzac, P.K. Chrysanthis, Programming views for mobile database clients, in: Proceedings DEXA-98, 1998, pp. 408–413.
[26] Y.B. Lin, S.K. De Vries, PCS network signaling using SS7, IEEE Personal Communications 2 (3) (1995) 44–55.
[27] S. Mazumdar, P.K. Chrysanthis, Achieving consistency in mobile databases through localization in PROMOTION, in: Proceedings Mobile in Database and Distributed Systems (MDDS99) Workshop, 1999, pp. 82–89.
[28] S. Nishio (Muro), T. Ibaraki, H. Miyajima, T. Hasegawa, Evaluation of the file redundancy in distributed database systems, IEEE Transactions on Software Engineering SE-11 (2) (1985) 199–205.
[29] M.T. Ozsu, P. Valduriez, Principles of Distributed Database Systems, Prentice-Hall, 1991.
[30] E. Pitoura, B. Bhargava, A framework for providing consistent and recoverable agent-based access to heterogeneous mobile databases, SIGMOD Record 24 (3) (1995) 44–49.
[31] Q. Ren, M.H. Dunham, Using semantic caching to manage location dependent data in mobile computing, in: Proceedings MOBICOM-00, 2000, pp. 210–221.
[32] N. Shivakumar, J. Jannink, J. Widom, Per-user profile replication in mobile environment: algorithms, analysis, and simulation results, MONET 2 (2) (1997) 129–140.
[33] G. Weikum, H.J. Schek, Concept and applications of multilevel transactions and open nested transactions, in: A.K. Elmagarmid (Ed.), Database Transaction Models for Advanced Applications, Morgan Kaufmann, 1991, pp. 515–553.
[34] O. Wolfson, S. Jajodia, Distributed algorithms for dynamic replicated data, in: Proceedings ACM PODS-92, 1992, pp. 149–163.
[35] B. Yang, H. Garcia-Molina, Comparing hybrid peer-to-peer systems, in: Proceedings VLDB-01, 2001, pp. 561–570.
[36] L.H. Yeo, A.B. Zaslavsky, Submission of transactions from mobile workstations in a cooperative multidatabase processing environment, in: Proceedings ICDCS-94, 1994, pp. 372–379.
[37] Freenet homepage. http://freenet.sourceforge.org.
[38] FOMA homepage. http://foma.nttdocomo.co.jp.
[39] Gnutella homepage. http://gnutella.wego.com.
[40] ICQ homepage. http://www.icq.com.
[41] IMT-2000 homepage. http://www.arib.or.jp/IMT-2000.
[42] JDBC homepage. http://www.javasoft.com/products/jdbc.
[43] JXTA homepage. http://www.jxta.org.
[44] LOCKSS homepage. http://lockss.stanford.edu.
[45] Napster homepage. http://www.napster.com.
[46] PKI homepage. http://csrc.nist.gov/pki.
[47] Pointera homepage. http://www.pointera.com.
[48] SETI@home homepage. http://setiathome.ssl.berkeley.edu.
Budiarto works as a research associate at the Department of Information Systems Engineering, Osaka University, Japan. He received his B.E., M.E., and Ph.D. degrees in Information Systems Engineering from Osaka University in 1994, 1996, and 1999, respectively. His current research interests include database systems, distributed computing systems and multimedia systems. He is a member of the Information Processing Society of Japan (IPSJ).
Shojiro Nishio received his B.E., M.E., and Dr.E. degrees from Kyoto University, Kyoto, Japan, in 1975, 1977, and 1980, respectively. From 1980 to 1988, he was with the Department of Applied Mathematics and Physics, Kyoto University. In October 1988, he joined the faculty of the Department of Information and Computer Sciences, Osaka University, Osaka, Japan. Since August 1992, he has been a full professor in the Department of Information Systems Engineering of Osaka University. He has been serving as the founding director of Cybermedia Center of Osaka University since April 2000. His current research interests include database systems, multimedia systems, and distributed computing systems. Dr. Nishio has served on the editorial board of IEEE Transactions on Knowledge and Data Engineering, and is currently involved in the editorial boards of ACM Transactions on Internet Technology, Data and Knowledge Engineering, New Generation Computing, International Journal of Information Technology, Data Mining and Knowledge Discovery, and The VLDB Journal. He is a member of eight learned societies, including ACM and IEEE.
Masahiko Tsukamoto received his B.E., M.E., and Dr.E. degrees from Kyoto University, Kyoto, Japan, in 1987, 1989, and 1994, respectively. From 1989 to 1995, he was a research engineer at Sharp Corporation. From 1995 to 1996, he was an Assistant Professor at the Department of Information Systems Engineering, Osaka University, and since 1996 he has been an Associate Professor in the same department. He is a member of eight learned societies, including ACM and IEEE. His current research interests include wearable computing and its applications.