G Model JOCS-111; No. of Pages 10
ARTICLE IN PRESS Journal of Computational Science xxx (2012) xxx–xxx
Contents lists available at SciVerse ScienceDirect
Journal of Computational Science journal homepage: www.elsevier.com/locate/jocs
Efficient content distribution in social-aware hybrid networks
Bernd Klasen
SES ASTRA TechCom, L-6815 Betzdorf, Luxembourg
Article info
Article history: Received 14 July 2011. Received in revised form 25 October 2011. Accepted 15 December 2011. Available online xxx.
Keywords: Content distribution; Hybrid networks; Bandwidth optimization; YouTube analysis
Abstract
Internet usage is increasing in two dimensions: the time users spend online and the amount of data they send and receive. The result is increasing stress on the Internet infrastructure. Current developments such as the trend towards video on demand and IPTV introduce even more bandwidth-intensive services, which amplifies the demand for an efficient content delivery model. This article presents such a model, which incorporates social awareness and is based on the combination of complementary networks, referred to as a hybrid network. Its applicability is further investigated and discussed by means of a YouTube video request analysis. © 2011 Elsevier B.V. All rights reserved.
1. Introduction
The Internet has significantly changed the way we work, the way we think, how we communicate with friends and many facets of our recreational activities. While this change started many years ago, it is still not complete. Progress in ubiquitous computing and the popularity of mobile Internet access lead to uninterrupted usage of online services. The introduction of cloud computing and online storage provides us with pervasive access to our files and to computational power. While this can make our work more efficient and increase convenience, it also causes amplified message exchange and network traffic. The latter is mostly driven by extremely popular and bandwidth-intensive services, such as software downloads and access to multimedia content. Video files are responsible for more than 32% of all HTTP traffic [1]. As a consequence of the trend towards video on demand (VoD) services and IPTV, combined with increasing video quality, the relevance of multimedia content for the total amount of web traffic will increase further in the future. This has also been postulated in an Internet traffic forecast by Cisco (Source: http://www.cisco.com – Cisco Visual Networking Index), which predicts a fourfold increase of web traffic from 2010 to 2015. According to the Cisco Visual Networking Index, the main traffic drivers will be Internet video, delivered either to PCs or to television sets, and P2P file exchange. It is worth noting that the latter also partly consists of multimedia
E-mail addresses:
[email protected],
[email protected] URL: http://research.berndklasen.de.
or, more specifically, video content. This indicates that online video is the most significant traffic driver. The importance of an efficient content distribution model for video services has already been stated by [2], and the aforementioned developments make it even more important for meeting quality of service (QoS) requirements. The authors of [3] also predict that the Internet backbone will become a serious bottleneck for network performance under tomorrow’s Internet traffic. Efficient content delivery not only avoids failures or delays in data distribution, it also saves money and energy. The latter is becoming more and more important due to global energy saving agreements. This article introduces, in Section 2, an approach that enables increased efficiency in content delivery and reduced traffic by combining the strengths of broadcast and unicast networks within what we call a hybrid network. It further describes how this approach can benefit from the utilization of social networks (Section 3.3) in order to facilitate demand estimation, recommendations and efficient P2P redistribution (Section 3.5). The way this combination works is illustrated by means of a future television model (Section 3.1). This use case was chosen because of the impact of that type of data on worldwide Internet traffic [1,4]. Nevertheless, the described technology can be used for any other type of content, too. Furthermore, this example concentrates on stationary use at home, which utilizes Internet and satellite networks, but it can easily be extended to mobile applications by use of other or additional broadcast networks or different frequency bands. In order to obtain evidence for the applicability of the presented approach, we present an evaluation of request behavior at the online video portal YouTube and discuss its results in Section 4. These results as well as the proposal of the network
1877-7503/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jocs.2011.12.003
Please cite this article in press as: B. Klasen, Efficient content distribution in social-aware hybrid networks, J. Comput. Sci. (2012), doi:10.1016/j.jocs.2011.12.003
Fig. 1. Hybrid network distribution model.
model for content distribution are the main contributions of this paper.
2. The hybrid network approach
We use the term hybrid network for the conjunction of at least one unicast and one broadcast network which appear to the users as one single network that they connect to with a multi-network router. Within this article, we assume a satellite network as the broadcast network, while the unicast network is represented by the Internet as we see it today. Since the Internet itself consists of multiple interconnected networks, for a better distinction we use the term Internet for the wired conjunction of autonomous systems which appears to function as one large network by means of the IP overlay. In order to build a hybrid network, other networks could be used as well, such as cable TV, cellular or terrestrial radio networks. The decision for satellite and Internet is justified by their nearly orthogonal strengths. The Internet provides low latencies and a fast back channel. While satellites cannot offer these qualities, they have great coverage as well as constant latency and bandwidth for all receivers within the same satellite footprint. In other words, they offer unbounded scalability with regard to the number of receivers. Currently the two networks co-exist and are used independently, but often for redundant services. For example, modern set-top boxes and television sets can present TV broadcasts delivered via DVB-S as well as VoD or streaming content from Internet servers. But there is a strong separation: content is always delivered exclusively by the network in which it has been requested, and little advantage is taken of the availability of both networks. For example, a satellite TV channel is broadcast even if no or just very few viewers exist. On the other hand, there are live TV transmissions via the Internet which cause high server loads and tremendous traffic due to high viewer counts. The reason is that the delivery in this case relies on unicast.
Both of the aforementioned scenarios are examples of an inefficient usage of those networks. The hybrid network approach presented in this article is intended to overcome this limitation by introducing a complete fusion of both networks, where transfers can be dynamically switched from Internet to satellite and vice versa (see Fig. 1) depending on the number of recipients. Thus it is expected to reduce Internet traffic – in terms of the number of IP packets – by avoiding peaks caused by very popular files. In order to facilitate this, we assume that the broadcast network (e.g. the satellite)
functions as a hub for IP packets (footnote 1), which gives us a uniform addressing scheme. Furthermore, on the client side a hardware device is needed which can receive incoming data from both networks. We refer to such a device as the home gateway (HGW). The HGW is the mediator between an arbitrary number of Internet devices – such as mobile phones, computers, television sets and many more – and the two delivery networks. Thus it can be seen as the successor of today’s broadband modems with integrated router and WiFi access point functionality. It receives incoming packets from both carriers and forwards them to the corresponding devices. As a further requirement, it must provide a sufficiently large cache, whose purpose is explained below. In the proposed hybrid network model a decision must be made whether to deliver a file or an IP packet via broadcast or unicast. This delivery network selection depends on the suitability of the transfer, which is basically related to the number of receivers and the temporal accumulation of requests. In the simplest case, we already know in advance that a sufficiently large number of users will request a specific content. Then it can be broadcast, stored in the HGW’s cache and delivered to the connected devices when needed. This can be presumed for updates of widespread software such as the commonly used operating systems [5] or popular games, but also when a large number of subscriptions exists (analyzed in Section 4). In most other cases there is no upfront knowledge about the number of recipients. One way to deal with this is to make assumptions about the development of requests, in other words, to predict the future popularity, which will be discussed in Section 3.4. In this context the integration of social networks is of great importance, since they provide a better understanding of information spread and mutual influence among individuals, as described by [6,7].
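The role of the HGW cache in the in-advance case can be illustrated with a minimal sketch. The class name, the byte-bounded least-recently-used eviction policy and the `fetch_unicast` callback are assumptions made for illustration only; the article does not specify how the cache is organized.

```python
from collections import OrderedDict


class HomeGatewayCache:
    """Hypothetical byte-bounded LRU cache for content received via broadcast."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # content_id -> payload (bytes)

    def store_broadcast(self, content_id, payload):
        """Cache a broadcast file, evicting least recently used entries if needed."""
        if len(payload) > self.capacity_bytes:
            return False  # larger than the whole cache, cannot be stored
        if content_id in self._items:
            # Replace an older copy of the same content.
            self.used_bytes -= len(self._items.pop(content_id))
        while self.used_bytes + len(payload) > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)  # drop oldest entry
            self.used_bytes -= len(evicted)
        self._items[content_id] = payload
        self.used_bytes += len(payload)
        return True

    def serve(self, content_id, fetch_unicast):
        """Answer a device request from the cache, else fall back to unicast."""
        if content_id in self._items:
            self._items.move_to_end(content_id)  # mark as recently used
            return self._items[content_id], "cache"
        return fetch_unicast(content_id), "unicast"
```

A device request that hits the cache causes no Internet traffic at all, which is the saving the in-advance broadcast is meant to achieve.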
Both of the approaches just presented aim at distributing the content before users declare a demand for it. Another way to select the appropriate network is to constantly monitor the number of requests and switch the delivery network accordingly. For example, for new online content the number of active downloads is zero at the beginning. Then users might start downloading and the corresponding data is sent through the unicast network. If the number of active downloads exceeds a certain threshold, a broadcast is reasonable. At that point, the file will already have been partially delivered to many recipients. Those pieces of the file that have not been delivered to any of the clients so far are broadcast and potentially cached. The pieces which have already been downloaded by most users are unicast to the remaining fraction (or possibly also exchanged in a peer-to-peer manner in order to reduce the load on the original server, as described in Section 3.5). While this might potentially deliver the best results in terms of maximum bandwidth savings, it is not applicable on a global scale. Monitoring all online files and comparing the number of requests in such a large, distributed and highly dynamic system is simply not possible. A further drawback is that web traffic follows a time-of-day curve. Thus the broadcast network bandwidth might be unused at night, while it becomes a scarce resource during the day. Nevertheless it might still be a viable solution for content providers that operate their own satellite uplink, as is the case today for the major television broadcasting organizations. However, this particular selection strategy will not be investigated further in this article, which concentrates on the two approaches mentioned first. Since the proposed model touches on many research aspects, this article cannot provide evaluations for all of them. Instead it focuses on one criterion which is a crucial prerequisite for the suitability
Footnote 1: The choice of a specific encapsulation method for the IP packets is not the subject of this work.
of the model: can we make assumptions about the future popularity of online content, and are enough requests concentrated in a sufficiently small time interval that the resulting traffic savings would justify a satellite broadcast? Section 4 presents the analysis of request distribution at the online video portal YouTube, which has been performed in order to answer this question. The aspects that influence the selection of the delivery network are discussed in more detail in Sections 3.2 and 3.4, while a corresponding evaluation is subject to future work.
3. Hybrid networks with social awareness
This section gives an introduction to a possible future television concept (Section 3.1), followed by a more detailed description of the core aspects of the proposed model (Sections 3.2, 3.4 and 3.5). For a better understanding, it gives a summary of the social network paradigm and its general benefits (Section 3.3) before its specific application in this model is described.
3.1. Tele-vision of the future
The increasing popularity of on-demand video reception indicates that people increasingly prefer to decide freely what and when to watch. This contrasts with the traditional television model, where users are offered a predefined schedule of content for each TV channel. After many successful years this model has reached a critical point. On the one hand, the growth in the number of available channels has led to an increased probability – at least in theory – that at an arbitrary moment in time there is at least one TV program which matches the interest of a specific user. On the other hand, the probability that the corresponding channel will be found has thereby decreased. A desirable new television model should offer a personalized choice from all available content in accordance with individual user preferences that are stored in a user profile.
Such a choice must be small enough not to overwhelm a viewer and broad enough to allow an individual decision that can compensate for any limitations of computer-based recommendations. User preferences can be generated explicitly by telling the system about one’s interests, implicitly by observation of user behavior (footnote 2), or interactively by giving users the opportunity to rate content. Since nobody wants to lose the freedom of choice – no matter whether we ever make use of it or not – it should still be possible to access all content, regardless of any preferences. For a better understanding, consider the following scenario, which could realistically appear on TV screens in a few years: Mrs. X comes home on a Friday evening and switches on her TV. The system has learned that she likes to see movies when she watches TV on Friday evenings. Thus it immediately offers a choice of eight videos, which have been selected according to her personal preferences. The latter are determined by her previous behavior. The video selection has been composed according to the following information: she likes thrillers, romantic movies and documentaries about pre-Christian Egypt. She has subscribed to the channels of 21 providers. Four of them are TV stations that produce professional content of mixed genres at a high frequency and thus can provide a continuous playlist with a 24/7 program. The other 17 are private users who add their self-generated content on a more or less regular basis. Furthermore, she has 42 videos in her favorites list, organized in 7 categories. The system checked her social network and found 23 movies that were seen by at least 10 of her friends during the last 3
Footnote 2: This is subject to privacy protection and must require user approval. Privacy protection is not within the scope of this paper.
weeks and which have received good ratings from at least 80% of them. Besides that, two of her friends explicitly recommended movies to her. A further comparison is made with the favorite movies of people who are not necessarily part of her social network but who liked the same movies as Mrs. X. Additionally, the two most viewed movies of the last 24 h were added to the choice, too. But Mrs. X wants to see something different. Today at lunch, she remembered an old movie that she liked very much when she was a child, and she wants to see that one. So she uses the search function, finds that movie and immediately starts watching. After 20 min, she realizes that the memory was much better than the movie itself. She pushes the Channel-Plus button and starts zapping. Zapping selects, with a high probability, content that matches her preferences, but sometimes also brings up randomly chosen videos. It can be configured to start playback always at the beginning of a video or at a random position in order to get an impression more quickly. Mrs. X finds an interesting documentary, not on Egypt but about the Roman Empire. When her friend calls 15 min later and invites her to a party, she adds that movie to her watch-it-later list before she switches off the TV. She will see the rest of it tomorrow and probably write a review afterwards, as she often does for documentaries. Many aspects described in this scenario are not new compared to similar online services we already know. What is new is the fact that they appear on our TV screens in high definition quality and probably also in 3D. However, this is no revolutionary invention either, and it is not meant to be. But a global implementation of the described functionality demands an efficient content distribution model.
The originality of this paper lies in the proposed hybrid network as a suitable infrastructure, combined with a distribution logic that implements social awareness, and in the corresponding analysis of YouTube video request rates (Section 4).
3.2. Broadcast-ability
As mentioned before, using the Internet for global-scale delivery of video on demand content will result in an unmanageable amount of traffic. Conversely, while broadcast networks can serve an arbitrary number of customers simultaneously, they are not designed for individual services. With currently available technology, a broadcast comes with much higher costs than one Internet unicast of the same data. Obviously, a minimum number of concurrent recipients is needed for an economically reasonable broadcast. This raises some important questions. First, it must be clarified whether a sufficiently large number of concurrent requests for the same content occurs at all in an on-demand environment. The results of [8,9] suggest that this should be the case, but this assumption needs to be substantiated by real-world measurements. This article provides an insight into this problem and attempts to give an answer in Section 4. The second question is how such content can be identified in real time. Since bandwidth in a broadcast network – and particularly on satellites – is limited, we further need a metric that returns the degree of appropriateness of a broadcast for a certain content. This will allow us to decide which content should be prioritized in case of competing requests. Such a metric will result in a property that can be called the broadcast-ability of online content. Fig. 2 gives a visual impression of how broadcast-ability depends on the temporal correlation of requests (burstiness) and the popularity. It shows the expected results for several categories of content. In the upper right corner we have the highest degree of broadcast-ability. In the lower left corner it is zero.
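The two ideas in this section, a minimum number of recipients and a ranking of competing content, can be sketched as follows. This is an illustration under stated assumptions, not the metric of this article, which remains subject to ongoing research. The cost model (one fixed broadcast cost against a per-recipient unicast cost) and the product of normalized popularity and burstiness as a toy score are both hypothetical; the product was chosen only because it is highest when both inputs are high and zero when either is zero, which matches the corners described for Fig. 2.

```python
import math


def min_recipients(broadcast_cost, unicast_cost):
    """Smallest number of concurrent recipients n with n * unicast_cost > broadcast_cost.

    Assumes (hypothetically) one fixed cost per broadcast and a constant
    cost per unicast delivery of the same data.
    """
    if broadcast_cost < 0 or unicast_cost <= 0:
        raise ValueError("costs must be non-negative and unicast_cost positive")
    return math.floor(broadcast_cost / unicast_cost) + 1


def broadcast_ability(popularity, burstiness):
    """Toy score in [0, 1] from two inputs that are each normalized to [0, 1]."""
    for value in (popularity, burstiness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be normalized to [0, 1]")
    return popularity * burstiness


def prioritize(candidates):
    """Order (name, popularity, burstiness) tuples by descending toy score."""
    return sorted(
        candidates,
        key=lambda c: broadcast_ability(c[1], c[2]),
        reverse=True,
    )
```

With a broadcast that costs 100 times as much as one unicast, `min_recipients(100, 1)` yields 101 recipients as the break-even point under this simplified cost model.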
The question is where the boundary between broadcast-ability and non-broadcast-ability lies and how it can be determined. While this is subject to ongoing research, this paper concentrates on the question of whether sufficiently high numbers of concurrent requests occur at all in on-demand environments (Section 4). Since an exact number for
the number of receivers needed for an economical broadcast may depend on the specific technology that is used (e.g. the type and capacity of the satellite), a criterion that allows easy comparison is applied. The objective is to identify bursts in file request characteristics. A burst is the concurrent or temporally close occurrence of requests for a certain content. More formally, let a burst be the occurrence of at least M_B requests within a time period T, where M_B is adjusted to the specific use case.
Fig. 2. Broadcast-ability of content.
3.3. Social network paradigm
Before we continue and see how social networks can support the described hybrid-network-based content distribution model, it is important to recall the remarkable characteristics of social networks. Today’s understanding of networks in general and computer networks in particular increasingly benefits from the observation that most of them exhibit – deliberately or unknowingly – social network properties. Formally, a social network resembles a small-world graph where the node degree follows a power law distribution. In other words, social networks are scale-free networks, which allows even weak information – e.g. information shared by only one network node – to experience an epidemic spread. This property has been analyzed and described by [10,11], and it is mainly caused by the fact that – due to the power law distribution of node degrees – there is always an ample probability for nodes with a high or even very high connectivity to other nodes. These hubs and super-hubs are guarantors of most of the remarkably good properties of social networks – especially with respect to scalability issues. From a pragmatic viewpoint, social networks resemble communities that are intertwined by these hubs. The connectivity within a community is typically above average, so that essential information can be communicated nearly instantaneously. Hubs accelerate the interaction within a community even more, but in particular they serve as gateways to disseminate sufficiently important information among different communities. It is important to note that by these means every social network exhibits sometimes adverse but most of the time beneficial nonlinear self-accelerating effects. Many modern distributed applications either explicitly establish social networks (e.g. Facebook, Google+, LinkedIn, MySpace) or benefit indirectly from social network structures (e.g. Amazon, Akamai). Applications are working towards utilizing these networks of communities for a wide variety of reasons: to provide a focused interface to functions and products that might be of interest to an individual, to enrich raw information through community-driven annotations, or to try proactively replicating crucial resources such as videos by anticipating predictable interests of large communities. In all these cases, the hub-based interdependencies among communities and the non-linear acceleration in the dissemination of hype-like information are key to their success. Due to their importance, social networks have moved into the focus of interdisciplinary research. Especially the modeling of social networks and the specific information diffusion characteristics of their network topology have been the subject of many studies, such as [12–20]. They all agree on the great significance of social networks, or rather of scale-free networks, and their effect on information diffusion. Thus their findings make it reasonable to utilize social networks for our distribution model.
3.4. Popularity prediction and in-advance distribution
As we argued in Section 2, an important aspect of enabling content distribution via the proposed hybrid network is to make assumptions about the future development of popularity. According to the analysis of [21], such predictions can be made by analyzing the development of requests in an early phase of the content lifetime. It states that an accuracy of at least 80% can be achieved two days after the video was launched. As we shall see in Section 4, there are certain groups of files for which such predictions can already be made at the moment when they are first shared. In these cases, the predictions are based on experience from preceding observations, on user subscriptions or on favorites lists. Nevertheless, there is also a fraction of content for which predictions on this basis are currently impossible. These are videos which are uploaded and remain without any noteworthy number of views for an arbitrary amount of time until suddenly a steep increase begins. Terms like viral video, to go viral, hype or (Internet) meme describe this content or its behavior. An example of such a video is the recording of a song from a TV show that was made in 1976 and which became famous as the trololo man. It was uploaded to YouTube on November 26, 2009 and was hardly noticed until February 25, 2010. Then an epidemic spread started which ended up with more than seven million views. Such a development can hardly be foreseen and predicted by view count observation. However, there are other indicators that might enable early detection of such behavior. They rely on the number of occurrences of links to a specific video shared in social network services like Facebook, Google+ or Twitter. In particular, being able to identify the opinion leaders and hubs within these networks will significantly increase the ability to recognize future trends.
This is one of the reasons why social awareness is an important feature for reaching efficiency in the described content distribution model, since it relies on in-advance distribution. A detailed analysis of the correlation between link occurrences and the emergence of Internet memes is subject to future work and not part of this paper. However, there are existing research results that demonstrate a significant influence of link sharing in social networks on the development of online video requests, as well as studies that aim at describing influence models for social networks, for example [22–24]. According to [25], the strong effect of social networks on future popularity plays a most important role during the first period directly after the launch of the specific video. While not all details and rules behind this mutual influence have been explored, there is evidence that people within the same clique or even with a direct connection in the social graph have a much higher probability of influencing each other compared to nodes with a lower correlation. In the context of online multimedia content this means that people who know each other are likely to watch similar movies. This tendency can be further
specified by considering other user information such as personal preferences and prior behavior. In order to harness social network structures and to get access to the social graph of users, the proposed model preferably utilizes one or even several existing social networks instead of implementing its own. Regardless of how we arrived at a prediction, how should a file be treated after we have diagnosed a high probability of great future demand? While there is still satellite bandwidth available, the corresponding files are broadcast and cached on the user side. This should be done as early as possible, which makes it possible to broadcast those files that allow an early prediction – such as software updates – at night, which saves bandwidth during the day for short-term predictions and live transmissions. If the available bandwidth is completely occupied and we thus have to deal with competing candidates, those files are broadcast for which we expect the highest savings compared to unicast transmission. Parameters for this are the number of predicted requests, the file size and the reliability of the prediction. A mathematical formula that defines a metric in accordance with these parameters must be evaluated and is subject to ongoing research. In case of a false prediction, the reduction of Internet traffic is lower than expected. As long as we have more than one recipient, there is still at least some saving. However, as we argued above, correct forecasts should be possible for at least 80% of all files.
3.5. Peer-to-peer based re-distribution
The approach for content distribution in hybrid networks is operable as it has been described so far. Nevertheless, it bears the potential for further improvements in terms of reducing load on the Internet backbone and also on the content provider’s servers. Let us assume that a certain video file was broadcast some time ago. Thus it has been cached by a large number of users or, more precisely, their HGWs.
We know that this number is large since there would not have been a broadcast for a small group of recipients. Now let us further assume a user wants to watch that video but it is not cached at his HGW. This could be due to a power loss or a hardware failure during the broadcast or – more probably – because it did not match his preferences. Furthermore, there is also a non-negligible number of households that have no satellite dish and hence can never cache broadcast files. No matter what the reason may be, the consequence would be that the file needs to be transmitted again from the content provider to the client. Considering that there are already many copies of this file in place at numerous HGWs, there is an alternative solution that reduces the server load. From a P2P file sharing point of view, having files broadcast and cached at the recipients results in a high number of potential seeders. Thus an obvious measure is to make use of this situation and let the requesting user receive the corresponding file from these seeders in a P2P exchange. The potential to reduce server load for content providers by using P2P for video on demand distribution has been analyzed by [26,27]. Both agree on possible savings of 68% and more. It is worth noting that in their studies the files were delivered directly from the origin server during the bootstrap phase until enough seeders were created. Since users are commonly equipped with an asymmetric Internet link – with relatively small upload bandwidth – a high number of seeders per viewer is needed if live playback is targeted, which results in a long bootstrap phase. In the scenario described above, and thus for the hybrid network distribution model, the situation is different since seeders already exist after a file has been broadcast. Being able to perform P2P file exchanges while avoiding the bootstrap problem further increases the potential for server load reduction.
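The remark about small upload bandwidth can be made concrete with a short calculation. The function below is a simplified sketch: it assumes that every seeder dedicates its full upload rate to a single viewer and it ignores protocol overhead. The function name and the example rates are hypothetical and not taken from the cited studies.

```python
import math


def seeders_needed(playback_kbps, upload_kbps_per_seeder):
    """Seeders required so that their combined upload rate sustains live playback."""
    if playback_kbps <= 0 or upload_kbps_per_seeder <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(playback_kbps / upload_kbps_per_seeder)
```

For instance, a stream of 4000 kbit/s served by peers with 512 kbit/s of upload each gives `seeders_needed(4000, 512) == 8`, which illustrates why a pool of seeders that already exists after a broadcast is valuable.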
While in this case mainly the content providers benefit from the P2P extension for hybrid networks, there are other aspects that potentially reduce the Internet backbone and Internet service
provider’s peering traffic. This can be achieved by utilizing the presence of social network structures again, which allows P2P overlays to be built based on the node correlation in the social network graph. When we observe social networks, it can be asserted that people with a small virtual distance often also exhibit a close physical relation. This has an additional positive influence on the P2P traffic. Peers that are located within the same area have a high probability of being part of the same provider network. Thus the described overlay structure can reduce peering costs, keep traffic at the edges of the Internet and thereby decrease the backbone traffic. The described effect corresponds to the spatial locality which is well known from storage access patterns. Considering the sum of all HGWs as a distributed storage, this perfectly matches the established rules. The reduction of peering traffic, or rather the introduction of locality awareness to P2P protocols, is a very important aspect, since studies presented by [28] state that current P2P implementations double the ISP cost and their outbound peak loads. Furthermore, they find that 24% of all participating peers experience a 50% increase in download speed when locality awareness is supported. In the context of P2P file exchange we can further benefit from the availability of a broadcast network. Using a modified BitTorrent protocol for the P2P exchange, the metadata – the torrent files – can be broadcast as well, and thus information about which files are available via P2P is present at the HGWs. Hence these devices can immediately check and decide whether a file is already cached, whether to download it from a server directly or whether to join a P2P overlay and request it from the corresponding peers. In order to achieve this, the broadcast network operators provide trackers which ensure up-to-date information about peers and file-piece availability.
Users without access to the broadcast network could benefit by using these trackers in accordance with the conventional BitTorrent protocol. One might argue that an approach based on caching content after it has been broadcast once is unsuitable, since the original files might change in the meantime and the caches would then need to be refreshed. However, especially in the case of online video, this does not apply. According to [29], files hosted at online video services seldom change. They are therefore suitable for being cached over long periods of time without becoming invalid. The same applies to software updates, which are commonly released once a month or, in the case of game updates, even more rarely. The evaluation of these assumptions is part of the author’s current studies, and results will be presented in upcoming papers.
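The decision an HGW takes for a requested file (serve it from the cache, join a P2P overlay, or download it from the server) can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the names Source, TorrentInfo and choose_source, the min_peers parameter and the policy of preferring P2P whenever the tracker lists enough peers are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    CACHE = "cache"
    P2P = "p2p"
    SERVER = "server"


@dataclass(frozen=True)
class TorrentInfo:
    """Hypothetical metadata for one file, as received via broadcast."""
    file_id: str
    peers_with_pieces: int  # peers the tracker currently lists for this file


def choose_source(file_id, cache, torrents, min_peers=1):
    """Decide where an HGW fetches file_id from (assumed policy).

    cache:    set of file ids already stored locally
    torrents: dict mapping file id -> TorrentInfo known from broadcast metadata
    """
    # A locally cached copy avoids any network traffic.
    if file_id in cache:
        return Source.CACHE
    # Otherwise prefer the P2P overlay if enough peers hold pieces.
    info = torrents.get(file_id)
    if info is not None and info.peers_with_pieces >= min_peers:
        return Source.P2P
    # Fall back to a direct download from the content server.
    return Source.SERVER
```

Because the torrent metadata is assumed to be available locally, the decision requires no additional request to the server; only the peer counts depend on how recent the tracker information is.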
4. The YouTube burst study
In the field of VoD offers, YouTube is probably the best known and most used service worldwide [4]. It receives more than two billion requests for its videos every day [30]. Every minute, 24 h of new video material are uploaded [31] by professionals, companies or private users. The latter produce what is referred to as user generated content (UGC). This is the key aspect of the Web 2.0 and introduced a paradigm change in Internet usage. According to [32], most files on YouTube are user generated content, but they seldom appear among the most popular videos, a category dominated by professional content. Nevertheless, the most subscribed channels on YouTube are user channels, and due to the Zipf-like distribution of user generated content popularity [33], a relatively small fraction of these videos causes a significant portion of the traffic. The importance and popularity of YouTube as well as the existence of a sophisticated application programming interface (API) were the most important criteria for choosing this video platform for an analysis of video-request correlation. The aim of this study is to gain more knowledge about the burstiness of video requests and thus to determine the broadcast-ability of existing material.
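The remark that a Zipf-like popularity distribution concentrates traffic on a small fraction of the videos can be illustrated numerically. The sketch below assumes an idealized Zipf law in which the video of rank r receives a share of requests proportional to 1/r^s; the function name, the exponent s = 1 and the catalogue size used in the usage note are assumptions for illustration and are not values taken from [33].

```python
def zipf_top_share(n_videos, top_fraction, exponent=1.0):
    """Share of all requests that goes to the most popular videos.

    Assumes an idealized Zipf law: the video of rank r is requested
    with a weight proportional to 1 / r**exponent.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_videos + 1)]
    top_k = max(1, round(n_videos * top_fraction))
    return sum(weights[:top_k]) / sum(weights)
```

Under these assumptions, zipf_top_share(10_000, 0.01) yields slightly more than one half, that is, the top 1% of a catalogue of 10,000 videos would account for more than half of all requests.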
Please cite this article in press as: B. Klasen, Efficient content distribution in social-aware hybrid networks, J. Comput. Sci. (2012), doi:10.1016/j.jocs.2011.12.003
Table 1
Observed video categories.

Category              Description                                          # Videos
Most popular (today)  Videos with the most views today                     797
Most popular (week)   Videos with the most views during the current week   134
Most recent           Most recently added videos                           5248
Most responded        The videos with the most responses                   65
Most shared           Most often shared on Facebook or Twitter             1099
Top rated             Most highly rated YouTube videos                     57
Trending videos       Videos from YouTube trends                           964

Fig. 3. YouTube: R.W. Johnson view count curves.
Related research was presented in [34], where the authors concentrated on the analysis of the long tail of the popularity distribution. They also studied the view-count development, but not at the granularity needed for our burst detection. The same applies to the results presented in [35], which exhibit significant peaks on a daily granularity. In order to be able to calculate with realistic values, several measurements on YouTube videos have been performed. During each run a set of YouTube standard feeds (Table 1), supplemented where needed by additional, varying request results, has been observed. In 60 min intervals the feeds were checked for new videos and all observed videos for changes in their statistical data, such as the number of views, comments, rating and channel subscribers. Videos appearing in more than one category were not counted twice. Unless stated otherwise, the results presented in this paper are based on a measurement period from 2011/04/05 to 2011/04/16. It must be admitted that there is a slight problem with the videos returned by some of these feeds. The most popular and most responded videos have already gained a certain degree of popularity, otherwise they would not appear in the corresponding lists. Thus the obtained data misses the most interesting phase, in which the videos became popular. Among the most recent videos, in contrast, the probability of finding even one video that will gain notable popularity is very low. To guarantee that very popular videos are observed from the beginning, additional video queries have been added to the list, namely for the videos published by the YouTube users RayWilliamJohnson and BarelyPolitical. Their channels are at the top of the US all-time most subscribed list. Ray William Johnson’s videos reliably get at least 3 million views during the first 4 days after their publication. For BarelyPolitical, the situation is a little different.
Most of this channel's videos do not reach such high view counts; they usually stay between 100,000 and 500,000 views. There are several exceptions, however, which have millions of views. Since the primary objective of monitoring these channels was to capture newly added videos that are practically guaranteed to become highly popular, only videos not older than 30 days are included. This resulted in 16 observed videos for BarelyPolitical’s channel and 10 for RayWilliamJohnson’s. As already mentioned, the measurement interval is set to 60 min. This is not as fine grained as would be desirable for an exact impression of the burstiness, but it follows from the limitations of the data provided by the YouTube API. YouTube updates these values in intervals that may vary from 30 to 120 min (Source: www.youtube.com). During earlier measurements, the video statistics were requested every 15 min. The results show that YouTube seldom provides updates of the view-count value in intervals shorter than 60 min. This especially applies to very popular videos whose values change heavily. The statistical data is thus coarse grained in any case, which causes the results to appear as step curves, as can be seen in Fig. 3. Interestingly, the comment statistics of videos are updated much more frequently. The data observed there could therefore be used to approximate the views between the updates. This has not been done for this study, since it is desired to work only with strictly verifiable data. In spite of the coarse grained updates of the statistics, the obtained results provide valuable information. The burst period T (see Section 3.2) is adjusted to span a full YouTube update interval in order to comply with this granularity. Further, we define a burst as an increase where the number of views gained since the last measurement is greater than 10,000 (MB = 10,000). Results for other values of MB are presented later; the corresponding MB is explicitly specified there. Fig. 3 shows the view-count curves for five different videos by YouTube user RayWilliamJohnson within the ten-day period of the measurement. The upper two curves belong to older videos, which no longer exhibit significant changes. The other videos are younger; some of them were uploaded during the measurement. They exhibit a fast increase within the first days after publication and a decreasing gradient after day 3. This is followed by a rather flat progression for the remainder, as observed for the two older clips. The same effect can be noticed for all videos from this channel during all measurements, and it seems to be characteristic for channels with many subscribers, except for a varying duration of the distinct phases. For example, the view-count statistics for several videos from BarelyPolitical’s channel have a similar shape but a lower maximum view count. The visual impression of burst occurrences within the first days after a video is uploaded is substantiated by the results shown in Table 2, which is based on the same data as Fig. 3. The oldest video in the table does not exhibit any bursts at all, even though it still gets numerous requests.
This shows that requests for older videos occur less frequently and are rather evenly distributed. None of the other, even older videos in this category had any bursts either. Again we find that most bursts happen within the first 2 or 3 days after a video was uploaded. This effect is what we would expect for channels with many subscribers, since a subscription implies a notification about new videos by email. It can be assumed that sharing a video URL on Twitter has a similar effect on the popularity, and thus on the view-count development, of a specific video. While a detailed analysis is subject to future work, YouTube searches for two trending topics on Twitter have been performed and the results were added to the observation list. The outcome is shown in Table 3. When we compare these values to the data given in Table 4, we notice a very low average number of bursts per observed video. Obviously, being a trending topic does not automatically result in high popularity or a high number of bursts. This might be partially caused by duplicate videos or quality disparities. Views resulting from external links have not been considered here; it might be assumed that the videos with many bursts have a lot of external links. Nevertheless, we find that even with such a rudimentary integration of additional knowledge, significantly better results can be achieved than by picking videos randomly or from the YouTube feed most recent.

Table 2
Bursts in Ray William Johnson’s videos during the measurement period.

Upload date   # Views during observation   # Bursts   ∅ Increase per burst
2011/04/15    1,500,502                    8          187,562
2011/04/08    4,721,440                    54         87,339
2011/04/05    4,790,816                    73         64,681
2011/03/22    326,883                      4          11,971
2011/01/21    65,565                       0          0

Fig. 4. Visualization of measurement results.

Table 3
Bursts for trending Twitter topics (MB = 10,000).

Search string   # Videos   Videos with bursts   # Bursts   ∅ Increase per burst   ∅ Bursts per video
Fukushima       313        38                   418        33,651                 1.3355
Osama           348        46                   339        26,473                 0.9741

Table 4
Bursts by category (MB = 10,000).

Category              Videos with bursts   # Bursts   ∅ Increase per burst   ∅ Bursts per video
MostPopular (today)   450                  2358       29,278                 2.9586
MostPopular (week)    53                   570        20,671                 4.2537
MostRecent            0                    0          0                      0
MostResponded         2                    10         13,974                 0.1538
Most shared           301                  2567       35,233                 2.3358
Top Rated             44                   1186       42,027                 20.8070
Trending videos       280                  2412       33,659                 2.5021
RayWilliamJohnson     7                    234        58,696                 23.400
BarelyPolitical       7                    43         15,478                 2.6875

In order to identify factors that might allow early detection or prediction of bursts, or that determine the probability of a burst within a group of videos with certain constraints, the results have been analyzed further. One result is that videos with bursts tend to have a high rating and almost always have an average rating of 4 or better, as we can see in Fig. 4a. It shows the relation of ratings and bursts for videos that were returned for the search string “Fukushima”. Fig. 4b shows this relation for a 30-day measurement in April 2011, where a similar tendency can be observed. In a further refinement, the Fukushima results were restricted to those videos whose age at the end of the measurement was less than 23 days. This limits the videos to those that were uploaded after the severe natural disaster in Japan, which was followed by a meltdown at the nuclear power plant at Fukushima and which caused the high interest in the topic. An interesting aspect here is that the peak in burst occurrences appears several days after the disaster (Fig. 4c). This delay would make it possible to broadcast such videos even before the bursts emerge and thus to reduce the induced traffic. Further, Fig. 4d shows a correlation between the number of bursts and the total views of the videos within this group. Apparently the requests for very popular content are not uniformly distributed but occur concentrated in bursts. This has been analyzed for other measurements with a similar result. Fig. 5 (data set A), Fig. 6 (data set B) and Fig. 7 (data set C) show the view-count burst correlation for different data sets. Data set A corresponds to a measurement at the beginning of March, data set B was recorded at the end of March and data set C during a whole month in April 2011.
We observe an effect similar to the one described for the Fukushima data set.
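The burst definition used throughout this section (a view-count increase greater than MB between two consecutive measurements) can be sketched as follows. The function name detect_bursts and the sample values in the usage note are assumptions for illustration; the input is taken to be the cumulative view count of one video, sampled once per measurement interval.

```python
def detect_bursts(view_counts, min_burst=10_000):
    """Return the view-count increases that qualify as bursts.

    view_counts: cumulative view counts of one video, one sample per
                 measurement interval (60 min in the study).
    min_burst:   threshold MB; an increase must be strictly greater
                 than this value to count as a burst.
    """
    # Difference between each pair of consecutive samples.
    increases = [later - earlier
                 for earlier, later in zip(view_counts, view_counts[1:])]
    return [increase for increase in increases if increase > min_burst]
```

For the hypothetical samples [0, 5000, 20000, 20000, 150000, 155000], the increases are 5000, 15,000, 0, 130,000 and 5000, so two bursts are found for MB = 10,000 and one for MB = 50,000. Repeated values, as they occur in the step curves when YouTube has not yet updated a counter, simply contribute an increase of zero.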
Fig. 5. Set A.
Except for the already discussed user channel of Ray William Johnson, most of the observed video feeds exhibit much lower mean burst rates per video; the only exception is the feed Top Rated. This is a new, experimental feed from YouTube which evidently performs well at detecting trending videos early. Far less impressive is the feed Trending Videos, which is also experimental. Its low number of bursts per video reveals that the trend-detection method applied here still needs improvement. The number of bursts, the mean view-count increase observed for each feed and the mean number of bursts per video are shown in Table 4. The subsequent Tables 5 and 6 show the results for higher values of MB. In the latter we observe an average of about 200,000 views per hour for each burst. This indicates the existence of videos with a very high broadcast-ability and thus a great potential for traffic savings. The density of bursts within the specific YouTube feeds remains almost unchanged. While this might raise the impression that feeds and videos with high numbers of small bursts also have an increased probability of bigger bursts, this is not universally valid.
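The per-feed columns of Tables 4 to 6 can be derived from the hourly view-count series of all videos in a feed. The following sketch assumes that the mean increase per burst is the arithmetic mean of all burst sizes in the feed and that the bursts per video are the number of bursts divided by the number of observed videos; FeedStats and summarize_feed are hypothetical names, and the code is not the evaluation tool used for the study.

```python
from dataclasses import dataclass


@dataclass
class FeedStats:
    videos: int
    videos_with_bursts: int
    bursts: int
    mean_increase: float
    bursts_per_video: float


def summarize_feed(series_per_video, min_burst=10_000):
    """Aggregate burst statistics for one feed.

    series_per_video: dict mapping a video id to its cumulative view
                      counts, one sample per measurement interval.
    min_burst:        threshold MB (increase must be strictly greater).
    """
    burst_sizes = []
    videos_with_bursts = 0
    for counts in series_per_video.values():
        bursts = [b - a for a, b in zip(counts, counts[1:])
                  if b - a > min_burst]
        if bursts:
            videos_with_bursts += 1
        burst_sizes.extend(bursts)
    n_videos = len(series_per_video)
    n_bursts = len(burst_sizes)
    return FeedStats(
        videos=n_videos,
        videos_with_bursts=videos_with_bursts,
        bursts=n_bursts,
        mean_increase=sum(burst_sizes) / n_bursts if n_bursts else 0.0,
        bursts_per_video=n_bursts / n_videos if n_videos else 0.0,
    )
```

Calling summarize_feed with min_burst set to 10,000, 50,000 and 100,000 corresponds to the three thresholds of Tables 4, 5 and 6. The assumed definition of bursts per video is consistent with the published values: for the feed Most popular (today), 2358 bursts over the 797 videos of Table 1 give 2358 / 797 ≈ 2.9586, the value listed in Table 4.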
Fig. 6. Set B.

Fig. 7. Set C.

Table 5
Bursts by category (MB = 50,000).

Category              Videos with bursts   # Bursts   ∅ Increase per burst   ∅ Bursts per video
MostPopular (today)   91                   278        101,692                0.3488
MostPopular (week)    8                    25         71,388                 0.1866
MostRecent            0                    0          0                      0
MostResponded         0                    0          0                      0
Most shared           82                   388        116,010                0.3530
Top Rated             9                    193        154,690                3.3860
Trending videos       70                   333        117,314                0.3454
RayWilliamJohnson     4                    76         127,308                7.600
BarelyPolitical       0                    0          0                      0

Considering the user channel of BarelyPolitical, for which we already noted that the view-count curve has a shape similar to that of Ray William Johnson’s videos, a completely different behavior can be observed. While the results in Tables 4–6 show a very low burst probability for this channel, this changes drastically if we set MB = 5000, which might still be enough to justify a broadcast. But even if the relative number of bursts per video is low for many feeds, their absolute number is often high, since several feeds contain very large numbers of files. Yet the presumably bursty videos are much harder to find there. What might help is using further information provided by YouTube, such as the content categories. Thus the burst occurrences in relation to these categories have been analyzed. The results are visualized in Figs. 8–10 for varying values of MB. There is an obvious accumulation of bursts in the categories Comedy, Entertainment and Music. The significance of this effect varies with MB, but the general trend remains. This knowledge can and should be utilized when estimating the future burst probability of videos. A comprehensive analysis of the quality of such predictions is subject to future work. What we can already see is that within certain groups of videos a high probability of bursts exists and that future view-count prediction can be expected to be possible there.

Table 6
Bursts by category (MB = 100,000).

Category              Videos with bursts   # Bursts   ∅ Increase per burst   ∅ Bursts per video
MostPopular (today)   25                   73         196,879                0.0916
MostPopular (week)    2                    3          138,266                0.0224
MostRecent            0                    0          0                      0
MostResponded         0                    0          0                      0
Most shared           34                   161        182,823                0.1465
Top Rated             4                    87         262,078                1.5263
Trending videos       36                   121        202,556                0.1255
RayWilliamJohnson     3                    36         191,801                3.600
BarelyPolitical       0                    0          0                      0

Fig. 8. MB = 10,000.

Fig. 9. MB = 50,000.

Fig. 10. MB = 100,000.

5. Conclusions
This paper discussed the problem of increasing web traffic, using multimedia content as an example. It analyzed current trends such as VoD and related problems such as the unsuitability of broadcasts for individual on-demand content and the extraordinarily high bandwidth demands of a unicast solution. It then proposed a content distribution model based on a hybrid network, a combination of unicast and broadcast networks, as a solution for these problems. This solution further incorporates social networks in order to enable the prediction of user behavior and to provide a new quality of experience. The content distribution model also incorporates P2P distribution techniques for further traffic and cost reduction. The applicability of the proposed model has been studied by exploring the time correlation of user requests for YouTube videos. The study delivered good results for the broadcast-ability of specific groups of videos, for example those from channels with high subscriber counts. For other videos, a deeper analysis of the early recognition of future popularity and of the time correlation of requests is necessary. The further investigation of P2P techniques in combination with socially aware hybrid networks and a broader long-term observation of YouTube video request patterns are subject to future research.
Acknowledgement
This research is supported by the National Research Fund (FNR) of Luxembourg.
References
[1] G. Maier, A. Feldmann, V. Paxson, M. Allman, On dominant characteristics of residential broadband Internet traffic, in: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, ACM, New York, NY, USA, 2009, pp. 90–102.
[2] M. Saxena, U. Sharan, S. Fahmy, Analyzing video services in Web 2.0: a global perspective, in: Proceedings of the 18th International Workshop on Network and Operating Systems Support for Digital Audio and Video, ACM, 2008, pp. 39–44.
[3] T. Leighton, Improving performance on the Internet, Communications of the ACM 52 (2) (2009) 44–51.
[4] J. Biel, D. Gatica-Perez, Wearing a YouTube hat: directors, comedians, gurus, and user aggregated behavior, in: Proceedings of the Seventeen ACM International Conference on Multimedia, ACM, 2009, pp. 833–836.
[5] C. Gkantsidis, T. Karagiannis, M. Vojnović, Planet scale software updates, ACM SIGCOMM Computer Communication Review 36 (4) (2006) 423.
[6] D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD ’03, 2003, p. 137.
[7] A. Anagnostopoulos, R. Kumar, M. Mahdian, Influence and correlation in social networks, in: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD ’08, ACM Press, New York, NY, USA, 2008, p. 7.
[8] M.E. Crovella, A. Bestavros, Self-similarity in world wide web traffic: evidence and possible causes, IEEE/ACM Transactions on Networking 5 (6) (1997) 835–846.
[9] M.F. Arlitt, C.L. Williamson, Internet web servers: workload characterization and performance implications, IEEE/ACM Transactions on Networking 5 (5) (1997) 631–645.
[10] A.-L. Barabási, E. Bonabeau, Scale-free networks, Scientific American 288 (5) (2003) 60–69.
[11] R. Pastor-Satorras, A. Vespignani, Epidemic spreading in scale-free networks, Physics (2000) 13.
[12] A. Apolloni, K. Channakeshava, L. Durbeck, M. Khan, C. Kuhlman, B. Lewis, S. Swarup, A study of information diffusion over a realistic social network model, in: 2009 International Conference on Computational Science and Engineering, 2009, pp. 675–682.
[13] D. Agrawal, A.E. Abbadi, Information diffusion in social networks: observing and influencing societal interests, Social Networks 4 (12) (2011) 8–9.
[14] D. Gruhl, D. Liben-Nowell, R. Guha, A. Tomkins, Information diffusion through blogspace, ACM SIGKDD Explorations Newsletter 6 (2) (2004) 43–52.
[15] J. Yang, J. Leskovec, Modeling information diffusion in implicit networks, in: 2010 IEEE International Conference on Data Mining, IEEE, 2010, pp. 599–608.
[16] M. Berlingerio, M. Coscia, F. Giannotti, A. Monreale, D. Pedreschi, The pursuit of hubbiness: analysis of hubs in large multidimensional networks, Journal of Computational Science 2 (3) (2011) 223–237.
[17] A.S. Brahim, B.L. Grand, L. Tabourier, M. Latapy, Citations among blogs in a hierarchy of communities: method and case study, Journal of Computational Science 2 (3) (2011) 247–252.
[18] M. Youssef, R. Kooij, C. Scoglio, Viral conductance: quantifying the robustness of networks with respect to spread of epidemics, Journal of Computational Science 2 (3) (2011) 286–298.
[19] V. Buskens, K. Yamaguchi, A new model for information diffusion in heterogeneous social networks, Sociological Methodology 29 (1) (1999) 281–325.
[20] R. Toivonen, J. Onnela, J. Saramaki, J. Hyvonen, K. Kaski, A model for social networks, Physica A: Statistical and Theoretical Physics 371 (2) (2006) 851–860.
[21] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, S. Moon, I tube, you tube, everybody tubes: analyzing the world’s largest user generated content video system, in: IMC ’07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, ACM, New York, NY, USA, 2007, pp. 1–14.
[22] M. Grabisch, A. Rusinowska, A model of influence in a social network, Theory and Decision 69 (1) (2008) 69–96.
[23] P. Hui, S. Buchegger, Groupthink and Peer Pressure: Social Influence in Online Social Network Groups, IEEE, 2009.
[24] J. Bollen, H. Mao, X. Zeng, Twitter mood predicts the stock market, Journal of Computational Science 2 (1) (2011) 1–8.
[25] G. Szabo, B.A. Huberman, Predicting the popularity of online content, Communications of the ACM 53 (8) (2010) 80–88.
[26] P. Garbacki, D.H.J. Epema, J. Pouwelse, Offloading servers with collaborative video on demand, in: Proceedings of the 7th International Conference on Peer-to-Peer Systems, USENIX Association, Berkeley, CA, USA, 2008, pp. 1–6.
[27] X. Cheng, J. Liu, H. Wang, Accelerating YouTube with video correlation, in: WSM ’09: Proceedings of the First SIGMM Workshop on Social Media, ACM, New York, NY, USA, 2009, pp. 49–56.
[28] T. Karagiannis, P. Rodriguez, K. Papagiannaki, Should internet service providers fear peer-assisted content distribution? in: Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement – IMC ’05, 2005, pp. 63–76.
[29] P. Gill, M. Arlitt, Z. Li, A. Mahanti, YouTube traffic characterization: a view from the edge, in: IMC ’07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, ACM, New York, NY, USA, 2007, pp. 15–28.
[30] YouTube-Team, At five years, two billion views per day and counting (2010). http://youtube-global.blogspot.com/2010/05/at-five-years-two-billion-views-per-day.html.
[31] Google, YouTube facts and stats (2010). http://www.youtube.com/t/fact_sheet.
[32] G. Kruitbosch, F. Nack, Broadcast yourself on YouTube – really? in: Proceeding of the 3rd ACM International Workshop on Human-centered Computing, ACM, 2008, pp. 7–10.
[33] A. Abhari, M. Soraya, Workload generation for YouTube, Multimedia Tools and Applications 46 (1) (2009) 91–118.
[34] P. Rodriguez, Analyzing the video popularity characteristics of large-scale user generated content systems, IEEE/ACM Transactions on Networking 17 (5) (2009) 1357–1370.
[35] F. Figueiredo, F. Benevenuto, J. Almeida, The tube over time: characterizing popularity growth of YouTube videos, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 745–754.
Bernd Klasen studied computer science at the University of Trier, Germany. After receiving his diploma in 2008 he joined SES ASTRA TechCom, where he supported an international project. In 2009 he successfully applied for a grant from the Fonds National de la Recherche Luxembourg (FNR) and commenced work in a public private partnership between SES ASTRA TechCom and the University of Luxembourg. Since then Bernd Klasen has been working on his Ph.D. thesis on the integration of satellites into global scale content distribution. This research also concerns how broadcast networks can be used to support peer-to-peer networks and the special aspects of correlated user behavior. He is further interested in social networks and their influence on individual behavior as well as in massively multiuser virtual environments.