ARTICLE IN PRESS
Signal Processing: Image Communication 19 (2004) 479–497
Rate adaptive video streaming under lossy network conditions

Aylin Kantarci(a,*), Nukhet Ozbek(b), Turhan Tunali(b)

(a) Computer Engineering Department, Ege University, Izmir, Turkey
(b) International Computer Institute, Ege University, Izmir, Turkey
Received 7 July 2003; received in revised form 12 December 2003; accepted 22 March 2004
Abstract

In this study, the performance of a rate adaptive video streaming algorithm is examined under controlled packet loss rates and delays. The developed algorithm is a set of heuristics that consider the packet loss rate and the receiver buffer level during the adaptation decision. The algorithm is content aware in that it employs quality or temporal scaling, or both, in accordance with the amount of motion in a scene. Extensive periodic and non-periodic packet loss scenarios are implemented to examine the behavior of the algorithm. It has been observed that the algorithm reacts to congestion by reducing its data rate and maintains an interrupt-free display even if the continuous packet loss rate approaches 15%. The results of this study confirm the suitability of our algorithm for Internet video streaming, where congestion can occur unpredictably at any time.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Video streaming; Rate control; Buffer management; Rate adaptation; Lossy network
1. Introduction

Recently, multimedia has become an important concept that merges advances in communications, computing and information processing into a new interdisciplinary field. Many new applications and services have been emerging as this field gains popularity among industrial and academic institutions. An example application class is video streaming, in which video data are
*Corresponding author. Tel./fax: +90-232-3399-405. E-mail addresses:
[email protected] (A. Kantarci),
[email protected] (N. Ozbek),
[email protected] (T. Tunali).
played out while parts of the video content are being received and decoded in real time. In the Internet era, IP-based networks offer the most promising infrastructure, being both cost effective and ubiquitous, for video streaming applications. Due to its real-time nature, video streaming has bandwidth, loss and delay requirements. However, the Internet, being a best-effort and heterogeneous network, is far from providing the required quality of service (QoS) support for video streaming applications. New transport protocols and end system support are essential for better multimedia services over the Internet [1]. Since the Internet is a shared best-effort datagram network, proper modifications are required to the Internet transport protocols. First, the hardware
0923-5965/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.image.2004.03.002
has to provide the required bandwidth for multimedia applications. Second, unreliable transport protocols are required because lost and late packets are tolerated to some extent, and reliability mechanisms based on the retransmission of data increase the network traffic and the loss ratio, leading to congestion. Third, the network has to deliver packets in real time to guarantee continuity in the playback. Fourth, video packets are to be sent smoothly, because the bursty nature of multimedia data results in unpredictable network services and degrades performance [2]. These requirements must be specified by industry standards to be put into practice. The Internet Engineering Task Force (IETF) has developed protocols for the transmission of multimedia data over the Internet. The real-time transport protocol (RTP) is an application layer protocol developed by the IETF for data transmission. It is built on lower layer protocols. It provides time stamping, sequence numbering and multicasting, and offers no reliability mechanisms. RTP supports various types of payload such as MPEG-1, MPEG-2, MPEG-4, JPEG, CellB, H.261, H.263 and H.26L [3]. In addition to new protocols, application layer QoS control is essential for the performance of streaming applications. With no network-based QoS guarantee, packet loss is inevitable when transmission errors and congestion occur on the Internet. The objective of application layer QoS control is to avoid congestion and maximize video quality in the presence of packet losses. Application layer QoS control techniques are employed by the end systems and do not require any support from the underlying networks [1,4–6]. In a typical streaming system, a packetized compressed video stream is passed through the RTP/UDP/IP layers before entering the Internet at the sender. Packets may be lost in the Internet due to congestion or discarded at the destination due to excessive delay.
Successfully delivered packets are passed through the IP/UDP/RTP layers and then decoded at the destination. A video QoS monitor at the receiver sends feedback messages to the sender about the congestion status based on the behavior of the arriving packets. For this purpose, RTP is accompanied by another protocol, called RTCP. RTCP receiver reports containing loss and jitter statistics are issued regularly at the receiver side. Based on this feedback, the sender reduces its data rate to reduce losses and delays [5]. This process is called media scaling. Media scaling techniques are broadly categorized as follows [7]:

Temporal scaling: In temporal scaling, the frame rate is altered during encoding if live video is being streamed. For stored video, a frame-dropping filter may be used. Frame-dropping filters drop frames by taking into account the relative importance of the different frame types: B frames are dropped first, P frames next and, finally, I frames.

Quality scaling: In quality scaling, the encoder quantization parameters are altered: DCT coefficients are divided by a larger quantization parameter. This technique is also known as SNR scaling. Since altering the quantization parameters is not enough to achieve very low bit rates, these encoding schemes are not suitable for low bit-rate video. On the other hand, MPEG-4 and H.263 decompression schemes allow temporal scaling and, hence, are preferred for low bit-rate video applications [5]. The emerging video coding standard H.26L has tools for quality scaling; it defines a new picture type, called an SP frame, to allow switching between different versions of a stream [8,9]. Quality scaling is easier to implement in live video streaming because it interacts with the encoder. To use this technique for the streaming of stored video, a re-quantization filter extracts the DCT coefficients from the compressed video stream through techniques such as de-quantization, and the coefficients are re-quantized with a larger quantization step, resulting in rate reduction. Another way to use this technique is to encode the same video stream at different bit rates and switch between the different versions when required.
Quality scaling is also achieved by using frequency filters that include low-pass filters, color reduction filters and color-to-monochrome filters. Low-pass filters discard the DCT coefficients of the higher frequencies. Color reduction filters drop chrominance components in a similar way. Color-to-monochrome filters remove all color information from the video stream [5].
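As a concrete illustration of re-quantization combined with frequency filtering, the operation on a block of quantized DCT coefficients might be sketched as follows; the function name, the integer coefficient representation and the zig-zag cutoff parameter are illustrative assumptions, not part of the original system.

```c
/* Sketch of a re-quantization filter: coefficients quantized with the
   original step q1 are de-quantized and re-quantized with a larger step
   q2, reducing the bit rate. A low-pass (frequency) filter variant
   additionally zeroes coefficients at or beyond index `cutoff` in
   zig-zag order. Illustrative assumption, not the original code. */
void requantize(int *coef, int n, int q1, int q2, int cutoff)
{
    for (int i = 0; i < n; ++i) {
        if (i >= cutoff)
            coef[i] = 0;                    /* low-pass filtering */
        else
            coef[i] = (coef[i] * q1) / q2;  /* coarser quantization */
    }
}
```

Because q2 > q1, the re-quantized coefficients are smaller in magnitude and entropy-code into fewer bits.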
Spatial scaling: In spatial scaling, the number of pixels in a frame is reduced or the pixel size is increased, thereby reducing the detail in the frame. This approach is useful in layered coding systems. The base layer contains the lowest resolution and can be decoded independently. The other layers, called enhancement layers, are coded with respect to the base layer and provide better visual quality. The combination of all layers provides the highest quality [5,6].

For better perceptual quality, all of these techniques should be used together. The content of the video is the most important factor in determining when to apply each scaling method throughout the streaming process. For example, if a scene has a lot of motion and the data rate is to be reduced, quality scaling is the most appropriate method. On the other hand, temporal scaling is useful for scenes with low motion content when an adaptation is required. Therefore, a content-aware scaling system is needed for better QoS control of streaming applications. To develop such a system, a method for measuring motion is essential. A practical technique is to count the number of skipped, interpolated and intra-coded blocks in a frame. When the number of skipped macroblocks is high, the picture is very similar to the previous one and there is little motion. If the number of interpolated macroblocks is high, the level of motion is higher. In high motion video segments, there are rapid scene changes, new objects join the scene and existing objects may move out of the search area during motion estimation. In such frames, the numbers of skipped and interpolated blocks are low, whereas the number of intra-coded blocks is high. The ratio between the numbers of different macroblock types in a frame may be an efficient determinant of the level of motion in a scene [7,9]. In this paper, we introduce an adaptive video streaming algorithm. A simpler version of the algorithm is given in [18].
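The macroblock-count heuristic described above can be sketched as a simple classifier; the decision rule (an intra-coded share above 50%) is an illustrative assumption, not a threshold given in the paper.

```c
/* Sketch of the macroblock-based motion measure: many skipped or
   interpolated macroblocks indicate low motion, many intra-coded
   macroblocks indicate high motion. The 50% intra-coded share used as
   the decision threshold is an illustrative assumption. */
int is_high_motion(int skipped, int interpolated, int intra)
{
    int total = skipped + interpolated + intra;
    if (total == 0)
        return 0;                 /* no macroblock statistics: assume low motion */
    return intra * 2 > total;     /* intra-coded share above 50% => high motion */
}
```

In the actual system this decision is made off-line per GOP when a video is added to the database, and the resulting high-motion GOP numbers are stored in the metafile.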
This paper makes two contributions beyond [18]. First, the algorithm is optimized by considering the motion dynamics of the video: a prioritization heuristic determines whether temporal or quality scaling should be performed. Second, the performance of the
algorithm is extensively tested under simulated loss and delay conditions. The results reveal that the algorithm is very robust under these conditions. This paper is organized as follows: In Section 2, our rate control algorithm is examined in detail. Performance results are given in Section 3. Finally, Section 4 concludes the paper.
2. The rate control algorithm

In this section, we review the properties of the rate control algorithm that we have developed for our streaming system. In our system, RTP is used for data transfer, RTCP is employed to convey congestion statistics and UDP is used to exchange protocol messages. Stored videos are transmitted in a unicast manner. The protocol stack is given in Fig. 1. Further details on our system can be found in [10,18–20]. The rate control algorithm both regulates the transmission rate based on the loss rate reported in the RTCP reports sent by the receivers and tries to maintain the receiver buffer occupancy within a desired interval. Based on the feedback information from the receiver, the sender assumes that the receiver is in one of four states, namely uncongested, mildly congested, congested and severely congested, where the loss percentages fall into the ranges [0,5), [5,15), [15,30) and [30,∞), respectively. In the uncongested state, the sender probes for the available bandwidth by progressively increasing its transmission rate in an additive manner.
Fig. 1. The protocol stack for the system (from top to bottom: rate control protocol; RTP and RTCP; UDP; IP).
Similarly, in the mildly congested state, it decreases its transmission rate in an additive manner. In the congested state, the sender progressively reduces its transmission rate with a larger additive step than in the mildly congested state. In the severely congested state, the rate is reduced multiplicatively to alleviate the congestion in the network. To provide better quality of service at the application, buffer management is integrated with the loss-based adaptation algorithm. The goal is to keep the buffer occupancy within a desired interval. For the continuity of the display, frames should be read from the buffer at the same rate as they are written into the buffer. A decrease in the number of frames in the buffer may indicate a problem in the network, at the server or at the client. When the problem is solved, the buffer occupancy starts to rise again. To detect the condition of the network and the participants, buffer occupancy is checked regularly. Network congestion is the main reason for sudden decreases in buffer occupancy. Congestion in the network causes delays during delivery. Delays decrease the input rate to the buffer, so that consumption intervals become greater than arrival intervals. In this case, the sender is asked to decrease the number of frames to be sent per second. The player thread decreases the display rate too. Consequently, the congestion condition is relieved and buffer occupancy starts to increase again. Another reason for fluctuations in buffer space is a slow sender. When the CPU load increases at the server side, the server becomes slower and the actual sending interval may be greater than the theoretic sending interval. Additionally, the theoretic sending interval is only an approximation to the optimum one. For these reasons, buffer occupancy at the receiver may start to decrease. Conversely, the receiver may not have sufficient CPU power to keep up with the correct display rate, or the server may be too fast.
This condition is detected when the buffer occupancy exceeds buffer capacity. A traditional approach for buffer space management is to use a two-threshold policy [11]. In this approach, required actions are taken only when
the high threshold is exceeded or the buffer occupancy falls below the low threshold. The drawback of this approach is that the system may be late in responding to buffer underflows and overflows. For example, in cases where there is a persistent decrease in buffer occupancy, an action is taken only when the buffer occupancy falls below the low threshold, which may not stop the decrease in buffer space, and an interruption of the playback becomes inevitable. Among the few studies that take the buffer status into account in quality adaptation, [12] and [13] use an approach that adapts the rate according to a hysteresis loop applied to the receiver buffer level. The shape of the loop determines the conservativeness of the algorithm. Another approach, developed in [14,15], adjusts the frame rate at the display stage: the client adjusts its data consumption rate by varying the speed at which frames are played out. Popular player software such as RealPlayer and Windows Media Player detects congestion in the network from a rapid reduction in bandwidth. In case of congestion, they keep playing from the buffered data. If the buffer empties completely, the playback is halted for 20 s while the buffer is filled again [16]. We devised a new method that is robust and simple enough not to overload the receiver CPU. Our method detects the change in buffer occupancy before the low and high thresholds are crossed, and it notifies the server via the control channel about the state of its buffers. The client does not determine the rate at which the server transmits the video; it only keeps track of the change in the buffer level. As shown in Fig. 2, we set four points for our receiver buffer. Byte-overflow (byte-ovfl) is the topmost level, above which the receiver buffer enters a physical overflow region. Next comes the time-to-play overflow (ttp-ovfl) point, set in terms of seconds. Note that byte-ovfl must be higher than the highest-quality video data rate multiplied by ttp-ovfl.
Next comes the adjust threshold (adj-thr), a point somewhere in the adjustable region. Next comes ttp-unfl, which determines the underflow region. We try to keep the buffer level within the adjustable zone.
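The four loss-driven congestion states and the additive/multiplicative rate adjustments described earlier in this section can be sketched as follows; the step sizes (in kbps) are illustrative assumptions, since the actual adjustments move along adaptation paths over discrete encoding rates and frame discard levels.

```c
/* Sketch of the loss-driven state machine: loss percentages in
   [0,5), [5,15), [15,30) and [30,inf) map to the four congestion
   states, and the sender's rate (kbps) is adjusted additively or
   multiplicatively. Step sizes are illustrative assumptions. */
enum state { UNCONGESTED, MILD, CONGESTED, SEVERE };

enum state classify(double loss_pct)
{
    if (loss_pct < 5.0)  return UNCONGESTED;
    if (loss_pct < 15.0) return MILD;
    if (loss_pct < 30.0) return CONGESTED;
    return SEVERE;
}

double adjust_rate(double rate, enum state s)
{
    switch (s) {
    case UNCONGESTED: return rate + 50.0;   /* additive probe            */
    case MILD:        return rate - 50.0;   /* additive decrease         */
    case CONGESTED:   return rate - 100.0;  /* larger additive decrease  */
    default:          return rate * 0.5;    /* multiplicative decrease   */
    }
}
```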
Fig. 2. Buffer regions (top to bottom): overflow zone (above byte-ovfl/ttp-ovfl), upper adjustment zone (between ttp-ovfl and adj-thr), lower adjustment zone (between adj-thr and ttp-unfl) and underflow zone (below ttp-unfl); the upper and lower adjustment zones together form the adjustable zone.
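The buffer regions of Fig. 2 can be expressed as a small classifier; the paper fixes adj-thr = 45 s and ttp-unfl = 25 s, while the ttp-ovfl and byte-ovfl values used in the usage example below are illustrative.

```c
/* Sketch of the receiver buffer zones of Fig. 2, driven by the current
   buffered playout time ttp (seconds) and the physical occupancy in
   bytes. Thresholds are passed in; the zone names follow the figure. */
enum zone { OVERFLOW_Z, UPPER_ADJ, LOWER_ADJ, UNDERFLOW_Z };

enum zone buffer_zone(double ttp, long bytes,
                      double ttp_unfl, double adj_thr,
                      double ttp_ovfl, long byte_ovfl)
{
    if (bytes >= byte_ovfl || ttp >= ttp_ovfl)
        return OVERFLOW_Z;          /* physical or time-to-play overflow */
    if (ttp >= adj_thr)
        return UPPER_ADJ;           /* above the adjust threshold        */
    if (ttp >= ttp_unfl)
        return LOWER_ADJ;           /* below adj-thr, above underflow    */
    return UNDERFLOW_Z;             /* display interruption imminent     */
}
```

For example, with the paper's thresholds and illustrative ttp-ovfl = 60 s and byte-ovfl = 1 MB, a ttp of 50 s falls in the upper adjustment zone and a ttp of 30 s in the lower adjustment zone.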
Table 1
Observed and controlled variables of the QoS management module

Observed variable    Source                     Controlled variable    Source
loss                 RTCP RR feedback           encoding rate          Server
ttp                  UDP receiver feedback      GOP pattern            Server
dttp                 UDP receiver feedback      packet interval        Server
The value of adj-thr and the pre-buffering duration are closely related. During pre-buffering, it is customary to fill the receiver buffer slightly above the adj-thr value so that the buffer does not drain immediately. The value of adj-thr depends on the network conditions. Although low jitter may suggest lower adj-thr values, in practice, due to other factors such as a burst error in the network, the adj-thr value should be chosen conservatively so that the receiver buffer does not drain quickly. On the other hand, if the video is short, a long pre-buffering period is disturbing. With all these limitations in mind, adj-thr should be selected carefully. Throughout our experiments, a value of 45 s, found to work well, is used. Table 1 shows the variables observed and controlled by the rate control algorithm. The loss rate is sent from the clients to the server over the RTCP channel. At the client side, two parameters are taken into account to determine the buffer
status and are sent regularly to the server via the UDP channel. The first one, ttp, is the duration of video currently present in the buffer. The second one, dttp, is the rate of change in ttp. We determine three dttp regions. The first region comprises negative dttp values, which correspond to a decrease in the buffer level. The second region corresponds to dttp values between 0 and 1. In the third region, dttp values are greater than 1, which we take to mean that the buffer level has increased unexpectedly. The rate control module at the server reacts to the feedback messages by scaling the video in accordance with the loss, ttp and dttp parameters given in those reports. Video is scaled in a seamless manner by switching to a video encoded at a lower/higher rate (quality scaling), by dropping/adding frames from the current GOP pattern, and by increasing/decreasing the packet interval (temporal scaling). Each video in the system is stored at multiple encoding rates, namely 100, 200, 500 and 1000 kbps. Each video file is associated with a metadata file that describes the structure of the video. The metadata files are used to facilitate application layer framing and to adjust the data rate. Metadata files store information about frame types, frame sizes, GOP statistics, sequence statistics and packet interval values for different frame discard levels. Frame dropping/adding is performed in the levels given in Table 2 for the GOP pattern IBBPBBPBBPBBPBB. To determine these levels, we took into consideration the dependencies among different frame types and the smoothness of the video streams. Frame type dependencies dictate that B frames be dropped first, P frames second and I frames last. When all B and P frames are dropped, the quality of the video reduces significantly. The rules for adding frames follow the reverse order.
The second factor in choosing the frames to be dropped is to distribute them evenly in the video sequence to maintain the smoothness of the stream. It should be noted that, due to the variability in the size of different frame types, each frame discard level (fdl) does not result in an equal amount of rate reduction in the network.
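The client-side ttp/dttp feedback described earlier can be sketched as follows; the report period and function names are illustrative assumptions.

```c
/* Sketch of the client feedback computation: ttp is the seconds of
   video currently buffered, dttp its rate of change between two
   consecutive report periods. dttp < 0, 0 <= dttp <= 1 and dttp > 1
   select the three regions used by the adaptation module. */
enum dttp_region { NEG, LOW, HIGH };

enum dttp_region classify_dttp(double ttp_prev, double ttp_now,
                               double period_s, double *dttp)
{
    *dttp = (ttp_now - ttp_prev) / period_s;
    if (*dttp < 0.0)  return NEG;   /* buffer level decreasing          */
    if (*dttp <= 1.0) return LOW;   /* normal increase                  */
    return HIGH;                    /* unexpectedly fast increase       */
}
```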
Table 2
Frame discard levels

fdl    Frame pattern in one GOP    Frames dropped
0      IBBPBBPBBPBBPBB             0
1      IBPBPBPBPB                  5 B
2      IPPPP                       10 B
3      IPP                         10 B + 2 P
4      I                           10 B + 4 P
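A sketch of applying the discard levels of Table 2 to the GOP pattern, with dropped B and P frames spread evenly as described above; the even-spacing rule (keeping every second B or P frame at the intermediate levels) is one plausible reading of the paper, not a quoted algorithm.

```c
/* Sketch: apply a frame discard level (Table 2) to the GOP pattern
   IBBPBBPBBPBBPBB. Dropped frames are spread evenly: at fdl 1 every
   second B frame is dropped, at fdl 3 every second P frame is dropped.
   Illustrative reading of the paper's even-distribution rule. */
#include <string.h>

void apply_fdl(const char *gop, int fdl, char *out)
{
    int nb = 0, np = 0, j = 0;
    for (const char *c = gop; *c; ++c) {
        int keep = 1;
        if (*c == 'B') {
            ++nb;
            if (fdl >= 2) keep = 0;            /* drop all B frames   */
            else if (fdl == 1) keep = nb % 2;  /* drop every second B */
        } else if (*c == 'P') {
            ++np;
            if (fdl >= 4) keep = 0;            /* drop all P frames   */
            else if (fdl == 3) keep = np % 2;  /* drop every second P */
        }                                      /* I frames always kept */
        if (keep) out[j++] = *c;
    }
    out[j] = '\0';
}
```

Applied to the full GOP, the filter reproduces the patterns of Table 2 level by level.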
An important property of the adaptation module is that it considers the content of the video during adaptation. When a new video is added to the database, it is analyzed at the macroblock level. If a scene has little motion, the number of skipped and interpolated blocks is high. On the other hand, a high number of intra-coded macroblocks together with a low number of skipped and interpolated macroblocks indicates high motion in the video sequence. At the analysis stage, graphs of the numbers of skipped, interpolated and intra-coded macroblocks for each frame of the video are plotted. Portions of the video that include high motion are determined and the corresponding GOP numbers are stored in the metafile. If adaptation is required, quality scaling is used, by switching to a version at a lower/higher encoding rate, during the transmission of segments that contain high motion, whereas temporal scaling is applied during the transmission of portions with low motion content.

Each encoding rate and frame discard level pair corresponds to a different video rate in the network. With these pairs, a grid may be formed whose y- and x-axes correspond to the encoding bit rate and the GOP pattern adjustment levels, respectively. Moving down along the y-axis and right along the x-axis reduces the video rate in the network. Various paths can be followed to change the rate of the video in the network. Among all possible paths, we chose three to use in different congestion and buffering conditions. The selected paths are given in Fig. 3. The paths given in Fig. 3a are strictly followed as long as the frames have low motion. The first path, the linear path, is followed when an additive decrease is needed in the transmission rate. We observed that the encoding rate contributes more to visual quality than the frame rate does. Therefore, in the linear path, we chose to drop frames in the GOP pattern first. We also observed that frame discard levels 3 and 4 have adverse effects on the continuity of the presentation. Therefore, after adjustment level 2, we switch to a lower encoding bit rate. If a more aggressive additive decrease is needed, the square path is followed. For a multiplicative decrease in the transmission rate, the diagonal path is selected. For scenes with high motion, alternative paths are given in Fig. 3b. The most appropriate path is followed in accordance with the current ER and FDL settings and the severity of the buffer status and congestion. These paths were chosen such that temporal quality is preserved or increased while compression quality is decreased.

Fig. 3. (a) Linear, square and diagonal paths for low-motion scenes. (b) Linear, square and diagonal paths for high-motion scenes.

When Fig. 3a and b are compared, it is seen that the diagonal paths are the same. This is because the diagonal path is selected when the system is severely congested and a very high reduction in data rate is required. In addition, our experimental findings show that ER and FDL reach their extreme values after a few adaptation steps, which makes modifying the original diagonal path unnecessary. Furthermore, modifications on the diagonal path might delay reaching the worst video quality settings when they are needed, which could also delay the alleviation of severe congestion. When ttp values become high enough, or RTCP reports show that congestion in the network has been alleviated, video quality is increased in an additive manner following the linear path. This is because there is no means to determine the amount of available bandwidth in the network. If a multiplicative policy were followed, loss rates could increase in cases where there is not much increase in the available bandwidth. Therefore, it is more appropriate to apply a conservative policy when video quality is to be scaled up. In initial experiments, it was observed that the adaptation module quickly responds to changes in the available bandwidth and in buffer occupancy. This may result in frequent quality oscillations, disturbing the viewer.
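One additive step down the linear path of Fig. 3a might be sketched as follows, under the assumption that the frame discard level is reset when the sender switches to a lower encoding rate; the indices and the reset rule are illustrative readings of the figure.

```c
/* Sketch of one additive step down the "linear" path of Fig. 3a.
   er indexes the stored encoding rates (0 = 1000 kbps ... 3 = 100 kbps)
   and fdl is the frame discard level. Frame discard levels 3 and 4 are
   skipped because they hurt presentation continuity; after fdl 2 the
   sender switches to the next lower encoding rate. The reset of fdl to
   0 on a rate switch is an illustrative assumption. */
typedef struct { int er; int fdl; } grid_pos;

grid_pos linear_step_down(grid_pos p)
{
    if (p.fdl < 2) {
        p.fdl++;                    /* first drop frames in the GOP      */
    } else if (p.er < 3) {
        p.er++;                     /* then switch to a lower bit rate   */
        p.fdl = 0;
    }                               /* already at lowest quality: stay   */
    return p;
}
```

The square path would take such steps in larger strides, and the diagonal path would move er and fdl together toward the lowest-quality corner.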
Additionally, temporal fluctuations in the loss fraction and buffer level may invoke unnecessary scaling operations. Therefore, the adaptation algorithm has been modified to be more conservative in responses to adaptation requests.
The conservative approach, which is based on a hysteresis model, allows quality degradations after two adaptation requests (10 s) and quality increases after five adaptation requests (25 s). It has been observed that this approach preserves the prevailing quality until an indicator of persistent behavior in congestion and buffer status is available. Consequently, frequent rate adjustments that may disturb the viewer are eliminated. The algorithm can be tuned for less or more conservative policies by changing the number of adaptation requests required to invoke the scaling module. Fig. 4a and b show our adaptation module, which consists of simple nested switch statements. In the algorithm, path is used to select one of the paths shown in Fig. 3, and dir and pi denote the direction of movement along the selected path and the new packet interval, respectively. Highmotion is set to 1 for scenes with high motion content; otherwise it is 0. In some cases, temporal and quality scaling are not required and only the packet interval is modified. In these situations, the flag variable is set to zero to prevent unnecessary invocations of the SCALE_VIDEO routine. The proper block of the outermost switch statement is selected according to the buffer occupancy. In the UPPER-ADJ and LOWER-ADJ blocks, the dttp value is taken into account. NEG stands for a decrease in ttp, whereas LOW and HIGH refer to rates of increase smaller than 1 and larger than 1, respectively. The innermost switch blocks distinguish between the four congestion states. When the metadata files are being prepared, the packetization module is executed for each encoding rate and frame discard level to determine the number of packets to be transmitted per video file. For each encoding rate and frame discard level, the video duration is divided by the number of packets to calculate the transmission interval between successive packets during delivery.
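The hysteresis described above (degrade after two consecutive requests, upgrade after five) can be sketched as a pair of counters; the 5 s feedback period is implied by the 10 s and 25 s figures, and the counter structure below is an illustrative reading, not the original code.

```c
/* Sketch of the hysteresis damping quality oscillations: a degradation
   is applied only after 2 consecutive requests (10 s at a 5 s feedback
   period), an upgrade only after 5 (25 s). A request in the opposite
   direction resets the other counter. Illustrative assumption. */
typedef struct { int down; int up; } hysteresis;

int should_scale(hysteresis *h, int want_down)
{
    if (want_down) {
        h->up = 0;
        if (++h->down >= 2) { h->down = 0; return 1; }
    } else {
        h->down = 0;
        if (++h->up >= 5) { h->up = 0; return 1; }
    }
    return 0;                       /* not persistent enough yet */
}
```

Raising the two counts makes the policy more conservative, as the paper notes.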
For flexibility, we used five packet interval values for each ER and FDL pair. Packet intervals determined by using actual packet counts correspond to MEDIUM value of the packet interval. The other four values are obtained by dividing the interval between MEDIUM values of subsequent frame discard levels into four sub-intervals. The first and
QOS_ADAPT(highmotion) {
    flag = 1; NEG = 1; LOW = 2; HIGH = 3;
    switch (buf_level) {
    case OVERFLOW:
        if (buffer > buf_ovfl) pause;
        else switch (con) {
            case NOCON:     path = LINEAR;   dir = INC; pi = medium;
            case MILD:      path = LINEAR;   dir = DEC; pi = high;
            case CONGESTED: path = SQUARE;   dir = DEC; pi = high;
            case SEVERE:    path = DIAGONAL; dir = DEC; pi = high;
        }
    case UPPER_ADJ:
        switch (dttp) {
        case NEG:
            switch (con) {
            case NOCON:     decrease pi; flag = 0;
            case MILD:      path = LINEAR;   dir = DEC; pi = medium;
            case CONGESTED: path = SQUARE;   dir = DEC; pi = medium;
            case SEVERE:    path = DIAGONAL; dir = DEC; pi = medium;
            }
        case LOW:
            switch (con) {
            case NOCON:     path = LINEAR;   dir = INC; pi = medium;
            case MILD:      path = LINEAR;   dir = DEC; pi = medium-high;
            case CONGESTED: path = SQUARE;   dir = DEC; pi = medium-high;
            case SEVERE:    path = DIAGONAL; dir = DEC; pi = medium-high;
            }
        case HIGH:
            switch (con) {
            case NOCON:     path = LINEAR;   dir = INC; pi = medium;
            case MILD:      path = LINEAR;   dir = DEC; pi = medium-high;
            case CONGESTED: path = SQUARE;   dir = DEC; pi = medium-high;
            case SEVERE:    path = DIAGONAL; dir = DEC; pi = medium-high;
            }
        }
    case LOWER_ADJ:
        switch (dttp) {
        case NEG:
            switch (con) {
            case NOCON:     path = LINEAR;   dir = DEC; pi = medium of fdl-0;
            case MILD:      path = LINEAR;   dir = DEC; pi = medium-low;
            case CONGESTED: path = SQUARE;   dir = DEC; pi = medium-low;
            case SEVERE:    path = DIAGONAL; dir = DEC; pi = medium-low;
            }
        case LOW:
            switch (con) {
            case NOCON:     no_op; flag = 0;
            case MILD:      path = LINEAR;   dir = DEC; pi = medium;
            case CONGESTED: path = SQUARE;   dir = DEC; pi = medium;
            case SEVERE:    path = DIAGONAL; dir = DEC; pi = medium;
            }
        case HIGH:
            switch (con) {
            case NOCON:     increase pi; flag = 0;
            case MILD:      path = LINEAR;   dir = DEC; pi = medium-high;
            case CONGESTED: path = SQUARE;   dir = DEC; pi = medium-high;
            case SEVERE:    path = DIAGONAL; dir = DEC; pi = medium-high;
            }
        }
    case UNDERFLOW:
        switch (con) {
        case NOCON:     path = LINEAR;   dir = DEC; pi = medium of fdl-0;
        case MILD:      path = SQUARE;   dir = DEC; pi = medium;
        case CONGESTED: path = DIAGONAL; dir = DEC; pi = medium;
        case SEVERE:    path = DIAGONAL; dir = DEC; pi = medium;
        }
    }
    if (flag) SCALE_VIDEO(path, dir, pi, highmotion);
}

SCALE_VIDEO(path, dir, pi, highmotion) {
    if (highmotion) QUALITY_SCALING(path, dir, pi);
    else            TEMPORAL_SCALING(path, dir, pi);
}

Fig. 4. (a) Algorithm for the adaptation module (QOS_ADAPT). (b) The SCALE_VIDEO routine.
second values after the MEDIUM value of frame discard level k are called MEDIUM-HIGH and HIGH values of frame discard level k, respectively.
The third and fourth values are LOW and MEDIUM-LOW values for frame discard level k+1, respectively.
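The five packet-interval values per (ER, FDL) pair might be computed as below; the paper does not give the exact positions of the four intermediate values, so equal spacing between the MEDIUM intervals of levels k and k+1 is an assumption.

```c
/* Sketch of the five packet-interval values per (ER, FDL) pair. The
   MEDIUM interval comes from the actual packet count (video duration
   divided by the number of packets); the gap between the MEDIUM
   intervals of discard levels k and k+1 supplies four more points,
   read here as equally spaced (an assumption).
   m_k and m_k1 are the MEDIUM intervals of levels k and k+1 (seconds). */
void interval_values(double m_k, double m_k1, double out[5])
{
    double step = (m_k1 - m_k) / 5.0;
    out[0] = m_k;              /* MEDIUM of level k        */
    out[1] = m_k + step;       /* MEDIUM-HIGH of level k   */
    out[2] = m_k + 2 * step;   /* HIGH of level k          */
    out[3] = m_k + 3 * step;   /* LOW of level k+1         */
    out[4] = m_k + 4 * step;   /* MEDIUM-LOW of level k+1  */
}
```

Because each discard level removes frames, m_k1 > m_k, so every value in the sequence corresponds to a slightly lower network rate than the one before it.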
ARTICLE IN PRESS A. Kantarci et al. / Signal Processing: Image Communication 19 (2004) 479–497
Our test bed consists of Sun Ultra 5 workstations running Solaris 2.8 operating system. The video database at the server side consists of MPEG-1 videos encoded at 4 bit rates. Higher bit rates produce better quality at the cost of heavy load on the network and more processing power at the end stations. Videos are encoded at 30 fps with the GOP pattern IBBPBBPBBPBBPBBPBB. The server software has been implemented with C++ and the client software has been implemented with C language. RTP library has been obtained from Lucent Technologies. We have done the packetization of MPEG-1 videos ourselves according to the rules given in the related internet draft [17]. XIL library of Solaris Operating system has been used to decode and display the videos. POSIX 4 library and pthread library have been used to provide real-time programming support and multithreading facilities, respectively. The system has been developed under client/ server paradigm. The server accepts new connection requests, streams videos to clients, schedules packet delivery, performs bandwidth management and adapts to dynamic network conditions. The client software receives video packets from the network, combines them into frames, decodes and displays them. Both the server and clients are multithreaded. A separate thread serves the requests of each client. At the client side, two separate threads carry out I/O and computing tasks. The first thread is responsible for receiving packets from the network and placing them into a buffer. The second thread retrieves frames from the buffer, decodes and renders them. Buffer space has been regarded as a critical section. Two threads are synchronized to allow only one thread to access
5 4 3
80 60 40
2 1 0
20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
t (sec) ttp
dttp
Fig. 5. ttp statistics when buffer level is not considered.
dttp
3. Experimental results
buffer space by using mutual exclusion mechanisms of pthread library. Client software works in a pipelined manner, minimizing the number of lost packets at the network interface by allowing concurrent execution of CPU and I/O bound threads. To justify our hypothesis that buffer status should be integrated into rate control decisions, we collected performance results without any buffer constraints under no loss scenario. In this experiment, video quality increased to 1000 kbps during pre-buffering period. Since effective bandwidth is smaller than this rate, packets experienced delay during transmission. Since the input rate into the buffer became greater than the output rate from the buffer, we observed that buffer occupancy continuously decreased after the prebuffering period, leading to display interruption as given in Fig. 5. The server and client workstations are on the same 100 Mbps Ethernet LAN. We measured the effective throughput as 400 kbps. We placed a PC running CLOUDt software between the server and the client. CLOUDt software simulates actual WAN environment by dropping and delaying packets received from the server in accordance with a given loss and delay pattern. We collected performance results with no loss and no delay scenarios, with constant loss rate and delay scenarios demonstrating regular loss and delay patterns and with congested network scenarios having variable loss rate and delay. Note that loss rates are smoothed by using a low-pass filter to avoid unnecessary rate regulations. In the following subsections, we will present the performance results, separately.
The adaptation module does not introduce a high CPU load to the system: one of the 33 branches is chosen with at most three switch statements, and movements along the linear, square and diagonal paths, as well as the selection of packet interval values, are simple table look-up operations.
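Such a look-up might take the following form. The table dimensions and all interval values below are hypothetical; the paper states only that four encoding rates (ER) and several frame-drop levels (FDL) are used and that the selection is a table look-up:

```c
#include <assert.h>

/* Illustrative look-up of a nominal inter-packet interval (ms) for a
 * given (ER, FDL) pair.  Values are invented for the sketch, not
 * measured figures from the paper. */
enum { NUM_ER = 4, NUM_FDL = 3 };

static const int packet_interval_ms[NUM_ER][NUM_FDL] = {
    /* FDL:  30 fps, 20 fps, 10 fps */
    {   8,  12,  24 },   /* ER = 0: 1000 kbps */
    {  16,  24,  48 },   /* ER = 1:  500 kbps */
    {  40,  60, 120 },   /* ER = 2:  200 kbps */
    {  80, 120, 240 },   /* ER = 3:  100 kbps */
};

int lookup_packet_interval(int er, int fdl) {
    assert(er >= 0 && er < NUM_ER && fdl >= 0 && fdl < NUM_FDL);
    return packet_interval_ms[er][fdl];
}
```

A constant-time look-up like this is consistent with the low CPU cost reported for the adaptation module.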
ARTICLE IN PRESS 488
A. Kantarci et al. / Signal Processing: Image Communication 19 (2004) 479–497
3.1. System performance under no loss and no delay scenario

Fig. 6 presents the states of the system variables when there is no packet loss and no delay in the network. Since there is no loss and no delay, only buffer status is taken into account in rate control decisions. Transmission starts at the encoding rate of 200 kbps with a frame rate of 10 fps. The system switches to better qualities after the probing experiments indicate that sufficient network capacity exists. The encoding rate and frame rate are increased to 1000 kbps and 30 fps, respectively, which represent the best video quality.

We observed that the transmission of the highest video quality results in a large number of packets to be transmitted, increasing the end-to-end delay. As a result, after the pre-buffering period, the client starts to consume video at a greater rate than the incoming packet rate and the ttp value falls below adj_thr (45 s), triggering the adaptation module, which decreases the encoding rate to 500 kbps with a frame rate of 30 fps (ER=1, FDL=0). Fig. 6a shows that the prevailing encoding rate and frame rate are 500 kbps and 30 fps, respectively, throughout the transmission. According to Fig. 6b and c, ttp and physical buffer occupancy fall slowly within the boundaries of the lower adjustment zone. At t = 728 s, ttp falls below the underflow threshold (25 s) and the adaptation module is triggered to scale down the video along the linear path. This point corresponds to a scene with high motion content, so quality scaling is applied (ER=2 (200 kbps) and FDL=0 (30 fps)) instead of temporal scaling. There are other segments with high motion content in the video, such as the time intervals 100–195, 266–296 and 476–536 s; during the transmission of these scenes, the buffer level stays above the underflow threshold, so no rate adaptation is required. Fig. 6d indicates that the average throughput is around 400 kbps and that packet intervals are almost constant, in accordance with the prevailing encoding rate.
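The content-aware scale-down step described above can be sketched as follows. This reflects our reading of the behavior reported in the experiments (quality scaling in high-motion scenes to keep the full frame rate, temporal scaling first in low-motion scenes); the ER/FDL index ranges are assumptions:

```c
#include <assert.h>

/* Assumed indices: ER 0..3 maps 1000/500/200/100 kbps,
 * FDL 0..2 maps 30/20/10 fps. */
#define MAX_ER  3   /* worst encoding rate index */
#define MAX_FDL 2   /* lowest frame rate index   */

/* One scale-down step when ttp falls below the underflow threshold. */
void scale_down(int high_motion, int *er, int *fdl) {
    if (high_motion) {
        /* Quality scaling: reduce encoding rate, keep full frame rate. */
        if (*er < MAX_ER) (*er)++;
        *fdl = 0;
    } else if (*fdl < MAX_FDL) {
        /* Temporal scaling first in low-motion scenes. */
        (*fdl)++;
    } else if (*er < MAX_ER) {
        /* Frame rate already lowest: fall back to quality scaling. */
        (*er)++;
    }
}
```

For example, the step at t = 728 s above (high motion, ER 1 to 2 with FDL kept at 0) matches the first branch of this sketch.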
3.2. System performance under regular loss and delay patterns

We conducted experiments under average loss rates of 1%, 3%, 5% and 10%, as well as higher rates. The corresponding network delays have been chosen such that the higher the loss rate, the higher the transmission delay. Loss rates of 1% and 3% reflect the behavior of the system in uncongested states, in which only buffer status governs the rate control decisions. The case with an average loss rate of 5% corresponds to the boundary between the uncongested and mildly congested states; at this loss rate, the network status starts to contribute to the rate control decisions. The network is mostly in a mildly congested state throughout the playout when the average loss rate is 10%. It should be noted that in all experiments the cause of losses and delays is external, due to the restrictions of our simulation software. Hence, our system tries to achieve the highest possible video quality for a given network configuration.

When the loss rate is 1% and the transmission delay varies uniformly between 0 and 10 ms, the performance results given in Fig. 7 have been obtained. Similar to the previous experiment, the system is in the uncongested state and only buffer status governs the rate control decisions. Compared to the previous experiment, ttp falls below the underflow threshold earlier (t = 400 s) due to packet losses and network delays (Fig. 7c). This point falls into a low motion scene; therefore, the adaptation module follows the linear path by employing temporal scaling (frame rate = 20 fps). At t = 476 s, a high motion scene starts and quality scaling is applied by decreasing the encoding rate to 200 kbps (ER=2) and setting the frame rate to 30 fps (FDL=0). At t = 536 s, the high motion scene ends and the system switches back to the previous (ER, FDL) pair. The rise of the buffer level above adj_thr invokes the adaptation module to increase video quality. Until t = 636 s, where the last high motion scene starts, the adaptation module follows the linear path; after this point, quality scaling is applied and video is transmitted at the full frame rate. Fig. 7c shows that the ttp value increases as the algorithm switches to lower qualities.
The reason for this increase is a trick with packet intervals. When the network is in the uncongested state and quality is degraded, either as a result of five successive decreases in ttp or at the start of a dynamic segment, packet interval values are preserved or decreased, not increased. Therefore, ttp values increase slowly
Fig. 6. (a)–(d) System variables under no loss scenario.
between t = 476 and t = 536 s. After this point, ttp starts to fall again as a result of switching back to the previous quality setting. Fig. 7d shows that the buffer level and ttp values are consistent with each other: the buffer level graph is a shifted version of the ttp graph, where the amount of shift is the pre-buffering
Fig. 7. (a)–(e) System variables when the loss rate is 1%.
period, during which no video is consumed from the buffer. Finally, Fig. 7e indicates that the throughput falls to around 200 kbps at t = 476 s, where the encoding rate is decreased to 200 kbps as shown in Fig. 7b. The prevailing throughput is around 400 kbps except during the period between t = 476 and 536 s. It is also observed that packet interval values increase where the adaptation module is invoked to decrease video quality. This is in compliance with our expectations, because lower quality results in a smaller number of packets and larger intervals between consecutive packets.

When the average loss rate is 3% and the delay is uniformly distributed between 20 and 30 ms (Fig. 8a), more packets get lost in the network and the buffer level decreases at a higher rate than in the previous examples, falling below the underflow threshold earlier. As seen in Fig. 8c, ttp falls into the underflow region twice. Our trick with packet intervals increases the ttp value in both cases. In parallel, the required ER and FDL adjustments are performed by following the proper path in accordance with the content of the video. As seen in Fig. 8d and e, the ttp, buffer level, throughput and packet interval statistics are in compliance with the corresponding ER and FDL settings shown in Fig. 8b.

The case with an average loss rate of 5% demonstrates the behavior of our algorithm at the boundary between the uncongested and mildly congested states (Fig. 9); in this experiment, the transmission delay is uniformly distributed between 50 and 100 ms. Loss statistics start to contribute to the rate control decisions when the instantaneous loss rate exceeds 5%. Therefore, more frequent rate adjustments are required than in the previous cases, as shown in Fig. 9b. Due to persistent packet losses, the encoding rate switches between 100 and 200 kbps
Fig. 8. (a)–(e) System variables when the loss rate is 3%.
throughout the transmission. When Fig. 9c is examined, it is seen that the ttp values increase sharply during the intervals 171–211, 296–336, 386–496 and 546–576 s. This is due to the trick with packet interval values; when Fig. 9a is examined, these intervals are seen to correspond to periods when the network is uncongested. We think this trick is very useful, especially when the buffer level is continuously decreasing. In our experiments, we observed that when ttp falls below the underflow region, this trick prevented the buffer from draining, provided that the system was in the uncongested state. Between t = 376 and 421 s, where the worst quality is used, packet intervals are the highest compared to the other quality levels (Fig. 9e). Therefore, the increase in ttp values stops and the consumption rate from the buffer becomes greater than the input rate into the buffer. After t = 580 s, the system enters the no loss and no delay mode; subsequently, video quality increases, resulting in a high number of video packets with larger end-to-end delays.

As shown in Fig. 9d, buffer occupancy is small due to the lowest video qualities. Although it remains roughly constant while the ttp values increase, this is in compliance with the prevailing encoding and frame rates. After t = 580 s, buffer occupancy increases due to the transmission of higher quality video. Finally, throughput is high during the transmission of high-quality video; it decreases due to losses and switches to worse video qualities, and increases again after a switch from a worse video quality to a better one. It is smallest between t = 376 and 476 s, when streaming proceeds at the worst video quality.

Fig. 10 demonstrates the behavior of our system when the loss rate is around 10% and the average delay is uniformly distributed between 100 and 150 ms. In this case, the system is in the mildly
Fig. 9. (a)–(e) System variables when the loss percentage is 5%.
congested state throughout the transmission. A persistent loss rate of 10% corresponds to a much higher number of lost packets, which results in much smaller incoming packet rates to the buffer than in the previous loss scenarios. Due to the persistent loss in the network, video is scaled down to the worst quality and remains at that quality throughout the transmission. The ttp value is around 60 s until t = 321 s and starts to fall slowly after this point, as given in Fig. 10c. ttp and buffer level are in compliance with each other. Until t = 156 s, physical buffer occupancy is high due to the high video quality during the initial probing phase of the experiment. Fig. 10e demonstrates that throughput is very low and packet interval values are very high, in connection with the prevailing ER and FDL values. At t = 585 s the simulation ends, and the ttp, buffer level and throughput statistics start to rise again.

We also conducted experiments with higher constant loss rates and higher delays. The behavior of the system was similar to that depicted in Fig. 10. When the loss rate was 15%, the boundary between the mildly congested and congested states, we observed that the buffer drained near the end of the simulation (t = 10 min) and the display was interrupted. As we increased the loss rate in subsequent experiments, the buffer drained more quickly, leading to earlier interruption of the display.

3.3. System performance under variable loss and delay patterns

In addition to the experiments with no loss and with constant loss, we also carried out simulations under congested network scenarios. In this section, we present the results of two congestion scenarios. CLOUD has been configured with different loss and delay settings for each experiment. In both examples, there are two periods of congestion, and the first
Fig. 10. (a)–(e) System variables when the loss percentage is 10%.
congestion occurs during a scene with high motion content, while the second falls into a segment of video with low motion content. In each experiment, the same loss and delay settings have been used during both periods of congestion. However, due to the randomized time interval between consecutive RTCP reports, the loss fractions contained in receiver reports have different values; therefore, the smoothed loss fractions do not follow exactly the same pattern during the two congestion periods.

In the first experiment, loss values are selected in such a manner that the system enters the congested or severely congested states during the congestion periods. In the second experiment, loss values follow a pattern similar to a Gaussian distribution: loss values increase, driving the system into the mildly congested, congested and severely congested states, and then decrease such that the system recovers from congestion by passing back through the severely congested, congested and mildly congested states.

The status of the system variables during the first experiment is shown in Fig. 11. Congestion starts just after the start of the first dynamic scene. Following the square and diagonal paths given in Fig. 3b, video quality is quickly reduced to the worst value. Congestion ends just before the end of the high motion scene, and video quality starts to increase by following the linear path given in Fig. 3a (t = 176 s). When Fig. 11b is examined, it is seen that there are frequent changes in the ER and FDL values between t = 201 and 241 s. This is due to the high buffer occupancy during this period. At t = 181 s, the system is in the uncongested state and the buffer level has started to decrease. Our trick with the packet interval values increases ttp above ttp_overflow (100 s) within a short time period. When ttp is above ttp_overflow, in order to switch to the best possible quality as soon as possible, the adaptation module does not follow
Fig. 11. (a)–(e) System variables under variable loss scenario 1.
our conservative policy when increasing video quality. At t = 266 s, the ER and FDL values settle at 0 and 1, respectively. ttp is almost constant until t = 266 s, where the encoding rate is switched to the highest value. At t = 326 s, the second congestion period starts and video is scaled down along the square and diagonal paths given in Fig. 3b. The second congestion ends at t = 416 s, where a scene with high motion starts; therefore, frames are sent at the full frame rate for better perceptual quality. At t = 491 s, a scene with low motion starts and video quality is increased along the linear path given in Fig. 3a. At t = 536 s, video is transmitted at the best quality. Consequently, the end-to-end delay increases and ttp starts to fall, as shown in Fig. 11c. It should be noticed that video quality is not increased as aggressively as it was after the first congestion: since the ttp value is below ttp_overflow during that period, quality is increased in a conservative manner. It should also be noted that ttp values are constant during congestion periods. This is because the packet interval values are equal to the actual interval values of the corresponding ER and FDL pairs, which leads to neither an increase nor a decrease in ttp values.

The pattern of the congestion and adaptation graphs shows itself in the fluctuations of the buffer level, as given in Fig. 11d. During congestion periods, physical buffer occupancy falls as video quality decreases; as congestion alleviates and video quality increases, physical buffer occupancy also increases. Finally, the packet interval and throughput statistics are in compliance with each other and with the other statistics, as shown in Fig. 11e. During periods of congestion, throughput is low as a result of quality degradations; after congestion periods, it increases as a result of quality improvements. Packet intervals
Fig. 12. (a)–(e) System variables under variable loss scenario 2.
are high when there is congestion and low when there is no congestion.

Fig. 12 shows the system status during the second congestion scenario. As shown in Fig. 12a, congestion starts at t = 70 s, a point in the first low motion segment of the video. The system enters the mildly congested, congested and severely congested states in turn, resulting in the corresponding quality degradations along the linear, square and diagonal paths. At t = 105 s, a high motion scene starts. As seen in Fig. 12a, the smoothed loss fraction rises as high as 50%. The high motion scene ends at t = 175 s. During the first congestion period, the ttp value is around 45 s, as shown in Fig. 12c. Unlike in the first experiment, ttp is below the ttp_overflow threshold; therefore, video quality is increased in a conservative manner along the linear path as congestion diminishes. In the first experiment, the corresponding ttp values were above the ttp_overflow threshold and video quality had been increased in a non-conservative manner; consequently, the adaptation module had a chance to increase the video quality to the best level before the start of the second congestion period. In this experiment, our conservative policy spreads quality improvements over larger steps in time, and the best video quality achieved before the second congestion is 500 kbps at 20 fps.

The second congestion starts at t = 296 s. Since this point falls into a low motion scene, video quality is decreased along the paths given in Fig. 3a, following the linear, square and diagonal paths in turn. The system enters the uncongested state at t = 386 s. Due to our conservative policy, quality is not increased as soon as congestion is over. At t = 401 s, a scene with high motion content starts and quality is improved by switching to higher encoding rates and by transmitting the video at the full frame rate.
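The sequence of states traversed in these scenarios can be illustrated with a simple classifier over the smoothed loss fraction. The 5% and 15% boundaries follow the text; the severe-congestion boundary (25%) and the low-pass filter weight are our own assumptions:

```c
#include <assert.h>

typedef enum {
    UNCONGESTED, MILDLY_CONGESTED, CONGESTED, SEVERELY_CONGESTED
} net_state;

/* Low-pass (EWMA) smoothing of the RTCP-reported loss fraction.
 * The weight alpha is an assumption, not a value from the paper. */
double smooth_loss(double prev, double reported, double alpha) {
    return alpha * prev + (1.0 - alpha) * reported;
}

/* Map the smoothed loss fraction to a congestion state.  The 5% and
 * 15% boundaries are stated in the text; 25% is illustrative. */
net_state classify(double loss) {
    if (loss < 0.05) return UNCONGESTED;
    if (loss < 0.15) return MILDLY_CONGESTED;
    if (loss < 0.25) return CONGESTED;
    return SEVERELY_CONGESTED;
}
```

As the smoothed loss fraction rises and falls (e.g. the Gaussian-like pattern of the second scenario), this mapping yields exactly the forward and backward traversal of states described above.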
After t = 466 s, video quality reaches the highest level. Consequently, the end-to-end delay increases and the ttp value starts to fall. At t = 551 s, ttp is below adj_thr; in response, video quality is decreased along the linear path given in Fig. 3a. After the encoding rate is set to 500 kbps, the ttp value starts to increase, and at the end of the simulation ttp rises above 45 s again. Finally, the graphs in Fig. 12d and e are similar to the corresponding graphs given in Fig. 11d and e: the buffer level, throughput and packet interval statistics are in compliance with the corresponding loss, ER and FDL statistics.
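The two scale-up regimes contrasted in these experiments can be summarized in a short sketch. The 100 s overflow threshold is read from the first scenario; the step ordering on the linear path and the ER/FDL index ranges are illustrative assumptions:

```c
#include <assert.h>

/* Assumed indices: ER 0 is the best encoding rate, FDL 0 the full
 * frame rate.  TTP_OVERFLOW (s) is taken from the experiments. */
#define TTP_OVERFLOW 100.0

void scale_up(double ttp, int *er, int *fdl) {
    if (ttp > TTP_OVERFLOW) {
        /* Non-conservative: jump to the best possible quality. */
        *er = 0;
        *fdl = 0;
    } else {
        /* Conservative: one step along the linear path. */
        if (*fdl > 0)     (*fdl)--;  /* restore frame rate first */
        else if (*er > 0) (*er)--;   /* then improve encoding rate */
    }
}
```

This captures why the first scenario reaches the best quality before its second congestion period while the second scenario, with ttp below the overflow threshold, climbs back only step by step.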
4. Conclusions

In this study, we reported the controlled loss rate and delay simulation results of a rate adaptation algorithm. The algorithm modifies the packet interval, frame rate and encoding rate to adapt to network congestion and to keep buffer occupancy within a given interval. An important property of our algorithm is that it is content aware: it takes into account the level of motion in the video sequence to provide better perceptual quality.

We conducted simulation experiments with various loss and delay scenarios. Under no loss and trivial loss scenarios, the algorithm considers only buffer occupancy in rate control decisions. We observed that our system is able to provide uninterrupted display even as the fixed continuous loss rate approaches 15%. Finally, in the variable loss rate scenarios, we observed that the algorithm successfully performs adaptation in accordance with the level of congestion and the change in buffer level.

Currently, motion measurement is performed off-line. As future work, we plan to measure the level of motion in real time while the transmission is proceeding.
Acknowledgements The authors are indebted to Dr. Reha Civanlar for helpful suggestions.
References

[1] J. Chung, Y. Zhu, M. Claypool, FairPlayer or FoulPlayer?—head to head performance of RealPlayer streaming video over UDP versus TCP, Technical Report WPI-CS-TR-02-17, Computer Science Department, Worcester Polytechnic Institute, 2002.
[2] G. Davini, D. Quaglia, J.C. De Martin, C. Casetti, Perceptually-evaluated loss-delay controlled adaptive transmission of MPEG video over IP, Proceedings of the IEEE International Conference on Communications, Vol. 1, Anchorage, Alaska, 2003, pp. 577–581.
[3] J.C. Guerri, M. Esteve, C. Palau, V. Casares, Feedback flow control with hysteresial techniques for multimedia retrievals, Multimedia Tools Appl. 13 (2001) 307–332.
[4] C.K. Hess, Media streaming protocol: an adaptive protocol for the delivery of audio and video over the internet, M.Sc. Thesis, University of Illinois at Urbana-Champaign, 1998.
[5] D. Hoffman, G. Fernando, V. Goyal, M. Civanlar, RFC 2250: RTP payload format for MPEG1/MPEG2 video, January 1998.
[6] M. Kalman, B. Girod, E. Steinbach, Adaptive playout for real-time media streams, Proceedings of ISCAS 2002, invited paper, 2002.
[7] M. Kalman, E. Steinbach, B. Girod, Rate-distortion optimized video streaming with adaptive playout, Proceedings of ICIP 2002, 2002, pp. 189–192.
[8] A. Kantarcı, T. Tunalı, A video streaming application over the internet, Advances in Information Systems, ADVIS 2000, Lecture Notes in Computer Science, Vol. 1909, Springer, Berlin, 2000, pp. 275–284.
[9] A. Kantarci, T. Tunali, Design and implementation of a streaming system for MPEG-1 videos, Multimedia Tools Appl. 21 (2003) 265–284.
[10] A. Kantarci, T. Tunali, Lossy network performance of a rate-control algorithm for video streaming applications, Proceedings of the 18th International Symposium on Computer and Information Sciences, Lecture Notes in Computer Science, Vol. 2869, Springer, Berlin, 2003, pp. 635–642.
[11] K.R. Rao, Z.S. Bojkovic, D.A. Milovanovic, Multimedia Communication Systems: Techniques, Standards, and Networks, 1st Edition, Prentice-Hall PTR, Englewood Cliffs, NJ, 2002.
[12] I.E.G. Richardson, Video Codec Design: Developing Image and Video Compression Systems, Wiley, Chichester, 2002.
[13] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RFC 1889: RTP: a transport protocol for real-time applications, January 1996.
[14] N. Seelam, P. Sethi, W. Feng, A hysteresis based approach for quality, frame rate, and buffer management for video streaming using TCP, MMNS 2001, Lecture Notes in Computer Science, Vol. 2216, Springer, Berlin, 2001, pp. 1–15.
[15] T. Stockhammer, M.M. Hannuksela, T. Wiegand, H.264/AVC in wireless environments, IEEE Trans. Circuits Syst. Video Technol. (Special Issue on H.26L/JVT Coding) 13 (7) (2003) 657–673.
[16] T. Stockhammer, M.M. Hannuksela, S. Wenger, H.26L/JVT coding network abstraction layer and IP-based transport, Proceedings of ICIP 2002, 2002, pp. 485–488.
[17] A. Tripathi, M. Claypool, Adaptive content-aware scaling for improved video streaming, Proceedings of the Second International Workshop on Intelligent Multimedia Computing and Networking (IMMCN), Durham, North Carolina, USA, 2002, pp. 1021–1024.
[18] T. Tunalı, A. Kantarcı, N. Ozbek, Robust quality adaptation for internet video streaming, Multimedia Tools Appl., accepted for publication.
[19] D. Wu, Y.T. Hou, Y. Zhang, Transporting real-time video over the internet: challenges and approaches, Proc. IEEE 88 (12) (2000) 1855–1875.
[20] D. Wu, Y.T. Hou, W. Zhu, Y. Zhang, J.M. Peha, Streaming video over the internet: approaches and directions, IEEE Trans. Circuits Syst. Video Technol. 11 (1) (2001) 1–20.