Non-intrusive speech quality prediction in VoIP networks using a neural network approach

ARTICLE IN PRESS Neurocomputing 72 (2009) 2595–2608 Contents lists available at ScienceDirect Neurocomputing journal homepage: www.elsevier.com/loca...


ARTICLE IN PRESS Neurocomputing 72 (2009) 2595–2608

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Non-intrusive speech quality prediction in VoIP networks using a neural network approach

M. AL-Akhras a,*, H. Zedan b, R. John b, I. ALMomani a

a King Abduallah II School for Information Technology, The University of Jordan, Amman 11942, Jordan
b School of Computing, De Montfort University, Gateway House, The Gateway, Leicester LE1 9BH, UK

Article info

Article history: Received 7 August 2007. Received in revised form 18 December 2007. Accepted 27 October 2008. Communicated by X. Wu. Available online 30 November 2008.

Keywords: Artificial neural network; Speech quality; VoIP; Non-intrusive; E-Model; Perceptual evaluation of speech quality

Abstract

Measuring speech quality in Voice over Internet Protocol (VoIP) networks is an increasingly important application for legal, commercial and technical reasons. Any proposed solution for measuring the quality should be applicable in monitoring live traffic non-intrusively. The E-Model proposed by the International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T) achieves this, but it requires subjective tests to calibrate its parameters. In this paper a solution is proposed to extend the E-Model to any new network conditions and to newly emerging speech codecs without the need for time-consuming, expensive and hard-to-conduct subjective tests. The proposed solution is based on an artificial neural network model and is compared against the E-Model to check its prediction accuracy. © 2008 Elsevier B.V. All rights reserved.

1. Introduction

Transmitting voice over Internet Protocol (VoIP) networks is an increasingly important application in the world of communication due to many advantages, including [1,2]:

- Lower bandwidth requirements.
- Integration of voice and data applications into one network, which makes the creation of new and innovative applications possible.
- Reduction of cost for long-distance calls.
- Lower operating and management expenses, as one network needs maintenance rather than two separate networks for voice and data.
- Live broadcasting of radio/TV channels.

Many emerging systems and applications adopt the new technology to achieve one or more of the above advantages, and mostly to capture some of the revenue earned by the traditional approach of transmitting voice over public switched telephone networks (PSTN). However, to be able to compete with the highly reputable PSTN, VoIP networks should be able to achieve quality comparable to that achieved by telephony networks. Customers who are used to high-quality telephony networks expect comparable quality from any potential competitor; otherwise they will not be attracted to the new technology.

To this end, it is important to measure the quality of VoIP networks for legal, commercial and technical reasons. Measurement of the quality is a necessity because customers and companies are bound by a contract that usually requires the company to achieve acceptable quality; otherwise customers may sue the company for poor quality. Measuring the quality also gives network administrators the chance to overcome temporary problems that could affect the quality of ongoing voice calls.

There are many possible ways of measuring the quality of a voice call, and the selection of a method for this task should take the characteristics of IP networks and voice calls into consideration. One such characteristic is the requirement to measure the quality while the network is running in a real environment during a voice call. This requires an automated solution that measures the quality without human intervention and that depends only on the signal received at the receiver side, without the need for the original speech signal at the sender side, i.e. non-intrusively.

In this paper an extension to one of the well-known methods for measuring speech quality, known as the E-Model, is proposed utilising an artificial neural network (ANN) approach.

* Corresponding author. E-mail addresses: [email protected] (M. AL-Akhras), [email protected] (H. Zedan), [email protected] (R. John), [email protected] (I. ALMomani).
0925-2312/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2008.10.019


The rest of this paper is structured as follows: In Section 2 the main methods used for measuring the speech quality are reviewed. Section 3 discusses the E-Model and the main problems associated with it. Section 4 discusses the proposed technique to extend the E-Model while in Section 5 the detailed derivation steps of the proposed technique are outlined. In Section 6 the results of the extension are discussed and in Section 7 conclusions are drawn and possible avenues for future work are highlighted.

2. Assessment technologies for measuring VoIP quality

Quality in non-managed IP networks such as the Internet is not guaranteed; it is therefore important to monitor the speech quality in telecommunication systems and take appropriate action when necessary. It is also important to measure the quality even in managed networks. In addition to its importance for legal, commercial and technical reasons, this allows service providers to evaluate their own and their competitors' services using a standard scale. It is also a strong indicator of users' satisfaction with the service provided. To do so, a specialised mechanism is required for measuring the quality [3].

Speech quality assessment techniques can be grouped into three main categories: subjective assessment techniques, intrusive objective assessment techniques and non-intrusive objective assessment techniques.

2.1. Subjective assessment of speech quality

The primary criterion for voice and video communication is subjective quality, the user's perception of service quality. Several factors affect the subjective quality of VoIP, among them: packet loss, delay, jitter (the difference in inter-arrival time between successive voice packets), loudness, echo, and codec distortion. To measure the subjective quality, a subjective quality assessment method is used. The most widely used subjective quality assessment methodology is opinion rating, defined in International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T) Recommendation P.800. Such subjective tests can be conversational or listening-only. In conversational tests two subjects share a conversation, while in listening tests one subject listens to pre-recorded sentences [4]. The most common metric in opinion rating is the mean opinion score (MOS), with a five-point scale: (5) excellent, (4) good, (3) fair, (2) poor, (1) bad [4].

A MOS value is obtained as the arithmetic mean of the scores given by a set of subjects. MOS is an internationally accepted metric as it provides a direct link to the quality as perceived by the user. When the subjective test is listening-only, the results are expressed as listening subjective quality, i.e. MOS listening quality subjective (MOS-LQS). When the subjective test is conversational, the results are expressed as conversational subjective quality, i.e. MOS conversational quality subjective (MOS-CQS) [3–6].

The problem with MOS measurement, and with subjective tests in general, is the difficulty of performing such tests. To conduct a subjective experiment according to ITU-T Recommendation P.800, strict lab conditions should be in place. Such conditions concern the room size, the noise level, and the use of a sound-proof cabinet in a room with a volume of not less than 20 m³. The sound pressure level should also be measured from a vertical position above the subject's seat while the furniture is in place. In the case of recording, the microphone is positioned between 140 and 200 mm from the talker's lips. Recommendation P.800 also specifies other conditions related to the subjects participating in a subjective test.

It is apparent from the strict conditions mentioned above that subjective MOS measurement is inherently time-consuming and expensive, lacks repeatability, and is inapplicable to monitoring live traffic as commonly needed for VoIP applications; if not addressed appropriately, this may result in legal disputes and commercial and technical problems. This has made objective methods very attractive for estimating the subjective quality, to meet the demand for voice quality measurement in communication networks. However, as subjective methods are the most accurate methods for measuring speech quality, they are used to calibrate objective methods.

2.2. Intrusive objective assessment of speech quality

Intrusive methods for measuring speech quality require the original speech signal. The most prominent intrusive method is perceptual evaluation of speech quality (PESQ), which is the latest ITU-T standard for objective evaluation of speech quality, defined in ITU-T Recommendation P.862. It was the result of a collaboration project between KPN, Netherlands and BT, UK [7–9]. It was specifically developed to be applicable to end-to-end voice quality testing under real network conditions, such as VoIP, ISDN, etc. The results obtained by PESQ were found to be highly correlated with subjective tests, with a correlation factor of 0.935 on 22 ITU benchmark experiments covering nine different languages. Upon its standardisation, the previous perceptual speech quality measure (PSQM) method defined in Recommendation P.861 was withdrawn by the ITU-T, because PSQM was designed to work in error-free conditions, whereas real systems may include filtering and variable delay, as well as distortions due to channel errors and low-bit-rate (LBR) codecs [7,10,11].
In PESQ the original and the degraded signals are time-aligned, then both signals are transformed to an internal representation that is analogous to the psychophysical representation of audio signals in the human auditory system, taking account of perceptual frequency (Bark) and loudness. After this transformation to the internal representation, the original signal is compared with the degraded signal using a perceptual model. This is achieved in several stages: level alignment to a calibrated listening level, compressive loudness scaling, and averaging distortions over time, as illustrated in Fig. 1 [7,9].

Fig. 1. Conceptual diagram of PESQ philosophy [7]. (Diagram blocks: Reference Signal, System Under Test, Degraded Signal, Time-Alignment, Perceptual Model and Internal Representation for each of the two signals, Difference in Internal Representation, Cognitive Model, Quality.)

The PESQ score lies in the range -0.5 to 4.5; a function is provided in ITU-T Recommendation P.862.1 to map these values to the range 1–5, representing MOS listening quality objective (MOS-LQO). Recommendation P.862.1 [10] also provides the formula to move back to a PESQ score from an available MOS-LQO score.

As PESQ and similar techniques are full-reference and intrusive, they provide an accurate method for measuring speech quality: they use the original or reference speech signal as input and produce an estimate of the listening MOS by comparing the post-transmission signal with the original one. However, such methods are inapplicable in monitoring live traffic because the reference signal is not available at the receiver.

2.3. Non-intrusive objective assessment of speech quality

As neither subjective methods nor intrusive methods can be used in monitoring live traffic, non-intrusive methods are attractive as the only available solution for monitoring the quality of live traffic in VoIP networks. One of the most widely used methods for objectively evaluating speech quality non-intrusively is opinion modelling. The most famous standard for opinion modelling is the E-Model, which is defined in ITU-T Recommendation G.107 [3,12]. The E-Model (the name is an abbreviation derived from the European Telecommunications Standards Institute, ETSI) is a computational tool originally developed for network planning, but it is now being used for objectively estimating voice quality for VoIP applications using network and terminal quality parameters. In the E-Model the original or reference signal is not used to estimate the quality, as the estimation is based purely on the terminal and network parameters. As such, the E-Model is a non-intrusive method of measuring the quality, as it does not require the injection of the reference signal [3,12,13]. The E-Model has been used in a large number of studies for network planning, to help network operators in designing the network, or to perform live monitoring of the network. The details of the E-Model are discussed in the next section.

3. The E-Model

In the non-intrusive objective method of the E-Model, the subjective quality factors are mapped into manageable network and terminal quality parameters to automatically produce an estimate of the subjective quality. Among the network quality parameters are network delay and packet loss. Among the terminal quality parameters are jitter buffer overflow, coding distortion, jitter buffer delay, and echo cancellation. An example of such a mapping is the mapping of the delay subjective quality factor into network delay and jitter buffer delay. Network parameters such as the packet-loss rate can be estimated from information contained in the headers of the real-time transport protocol (RTP) and the real-time transport control protocol (RTCP). The fundamental principle of the E-Model is based on a concept established by Allnatt more than 20 years ago [14]:

Psychological factors on the psychological scale are additive.

In the E-Model all factors responsible for quality degradation are summed on the same psychological scale. Due to this additive principle, the E-Model is able to describe the effect of several impairments occurring simultaneously. The E-Model is a function of 20 input parameters that represent the terminal, network, and environmental quality factors (quality degradation introduced by speech coding, bit errors, and packet loss is treated collectively as an equipment impairment factor). The E-Model starts by calculating the degree of quality degradation due to the individual quality factors on the same psychological scale. The sum of these values is then subtracted from a reference value to produce the output of the E-Model, a single scalar value called the R-Rating Factor. The R-Rating Factor lies in the range 0 to 100 and indicates the level of estimated quality, where R = 0 represents extremely bad quality and R = 100 represents very high quality. The R-Rating Factor can be mapped into a MOS score based on ITU-T Recommendation G.107 [12,15]. The R-Rating Factor is calculated according to the following formula, which follows the summation principle above:

$$R = R_0 - I_s - I_d - I_{e\text{-eff}} + A \tag{1}$$

where
R0: basic signal-to-noise ratio (groups the effects of noise)
Is: impairments which occur more or less simultaneously with the voice signal (e.g. quantisation noise, sidetone level)
Id: impairments due to delay, echo
Ie-eff: impairments due to codec distortion, packet loss and jitter
A: advantage factor or expectation factor (e.g. 10 for GSM)

The advantage factor captures the fact that users might be willing to accept some degradation in quality in return for ease of access, e.g. users may find that speech quality is acceptable in cellular networks because of their access advantages. The same quality would be considered poor in the public circuit-switched telephone network. In the former case A could be assigned the value 10, while in the latter case A would take the value 0 [16]. Each of the parameters in Eq. (1) except the advantage factor (A) is further decomposed into a series of equations as defined in ITU-T Recommendation G.107 [12]. When all parameters are set to their default values, the R-Rating Factor as defined in Eq. (1) has the value 93.2, which is mapped to a MOS value of 4.41. When the effect of delay is considered, the quality estimated by the E-Model is conversational, i.e. MOS conversational quality estimated (MOS-CQE). When the effect of delay is ignored and Id is set to its default value, the estimation is listening-only, i.e. MOS listening quality estimated (MOS-LQE).

The R-Rating Factor computed from Eq. (1) can be mapped into a MOS value. Eq. (2) [12] gives the mapping function between the computed R-Rating Factor and the MOS value:

$$\mathrm{MOS} = \begin{cases} 1, & R < 0 \\ 1 + 0.035R + R(R - 60)(100 - R)\cdot 7\times 10^{-6}, & 0 < R < 100 \\ 4.5, & R > 100 \end{cases} \tag{2}$$

ITU-T Recommendation G.107 [12] also provides the formula to move back to the R-Rating Factor from an available MOS score. The equation is

$$R = \frac{20}{3}\left(8 - \sqrt{226}\,\cos\left(h + \frac{\pi}{3}\right)\right) \tag{3}$$

with

$$h = \frac{1}{3}\,\operatorname{atan2}\left(18566 - 6750\,\mathrm{MOS},\; 15\sqrt{-903522 + 1113960\,\mathrm{MOS} - 202500\,\mathrm{MOS}^2}\right) \tag{4}$$

where

$$\operatorname{atan2}(x, y) = \begin{cases} \operatorname{atan}\left(\dfrac{y}{x}\right), & x \ge 0 \\ \pi - \operatorname{atan}\left(\dfrac{y}{-x}\right), & x < 0 \end{cases} \tag{5}$$
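The two mappings in Eqs. (2)–(5) can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names `r_to_mos` and `mos_to_r` are hypothetical, and the inverse is restricted here to MOS values in [1, 4.5], which is an assumption made so that the square root in Eq. (4) stays real.

```python
import math


def r_to_mos(r: float) -> float:
    """Eq. (2): map an R-Rating Factor to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def mos_to_r(mos: float) -> float:
    """Eqs. (3)-(5): map a MOS value back to an R-Rating Factor."""
    if not 1.0 <= mos <= 4.5:
        raise ValueError("this sketch only inverts MOS values in [1, 4.5]")
    x = 18566.0 - 6750.0 * mos
    y = 15.0 * math.sqrt(-903522.0 + 1113960.0 * mos - 202500.0 * mos * mos)
    # Eq. (5) writes atan2(x, y); because y >= 0 here, Python's
    # math.atan2(y, x) returns the same angle for both branches.
    h = math.atan2(y, x) / 3.0
    return (20.0 / 3.0) * (8.0 - math.sqrt(226.0) * math.cos(h + math.pi / 3.0))


if __name__ == "__main__":
    # Default E-Model rating quoted in the text and its MOS counterpart.
    print(r_to_mos(93.2))
    print(mos_to_r(4.41))
```

Run as a script, the two printed values should land close to the 4.41 and 93.2 pair quoted in the text, which is a quick consistency check on the reconstructed formulas.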

Quality degradation due to packet loss, as defined in Eq. (1), is characterised by the packet-loss dependent effective equipment impairment factor (Ie-eff). In this paper the effect of other parameters will not be considered, and as such the default values will be used for all parameters except the Ie-eff-related ones. For example, Id will be set to zero. Ie-eff is calculated according to the following formula [12]:

$$I_{e\text{-eff}} = I_e + (95 - I_e)\cdot\frac{P_{pl}}{\dfrac{P_{pl}}{\mathit{BurstR}} + B_{pl}} \tag{6}$$

where
Ie: codec-specific equipment impairment factor
Bpl: codec-specific packet-loss robustness factor
Ppl: packet-loss probability
BurstR: burst ratio (to account for burstiness in packet loss)

The packet-loss dependent effective equipment impairment factor (Ie-eff), as defined in Eq. (6), is derived using codec-specific values for the equipment impairment factor at zero packet loss (Ie) and the packet-loss robustness factor (Bpl). The values of Ie and Bpl for several codecs are listed in ITU-T Recommendation G.113 Appendix I [17], and they are derived using subjective MOS test results. For example, for the speech coder defined in ITU-T Recommendation G.729 [18], the corresponding Ie and Bpl values are 11 and 19, respectively. On the other hand, the packet-loss probability (Ppl) and the burst ratio (BurstR) depend on the packet-loss properties present in the system. BurstR is defined in the latest version of the E-Model [12] as

$$\mathit{BurstR} = \frac{\text{Average length of observed bursts in an arrival sequence}}{\text{Average length of bursts expected for the network under ``random'' loss}} \tag{7}$$

Based on Eq. (7), when packet loss is random (i.e. independent) BurstR = 1, and when packet loss is bursty (i.e. dependent) BurstR > 1. The impact of packet loss in previous versions of the E-Model (prior to the current version, 2005) was characterised by the equipment impairment factor (Ie). Specific impairment factor values for codecs operating under random packet loss were previously tabulated as packet-loss dependent. In the current version of the E-Model (2005), the packet-loss robustness factor (Bpl) is defined as a codec-specific value and Ie is replaced by the packet-loss dependent effective equipment impairment factor (Ie-eff).

A simplified version of the R-Rating Factor defined in Eq. (1) will be used here to consider the degradation due to packet loss only. The simplified version is

$$R = R_0 - I_{e\text{-eff}} \tag{8}$$
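As a worked illustration of Eqs. (6) and (8), the following Python sketch evaluates Ie-eff and the simplified R-Rating Factor using the G.729 values quoted above (Ie = 11, Bpl = 19) and the default R0 = 93.2. The function names `ie_eff` and `simplified_r` are hypothetical, and Ppl is assumed to be expressed in percent.

```python
def ie_eff(ppl: float, burst_r: float = 1.0,
           ie: float = 11.0, bpl: float = 19.0) -> float:
    """Eq. (6): packet-loss dependent effective equipment impairment factor.

    ppl is the packet-loss probability in percent; the defaults for ie and
    bpl are the G.729 values quoted in the text.
    """
    if ppl < 0 or burst_r <= 0:
        raise ValueError("ppl must be >= 0 and burst_r must be > 0")
    return ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)


def simplified_r(ppl: float, burst_r: float = 1.0, r0: float = 93.2) -> float:
    """Eq. (8): R-Rating Factor when only packet loss is considered."""
    return r0 - ie_eff(ppl, burst_r)


if __name__ == "__main__":
    # 5% random loss (BurstR = 1): Ie-eff = 11 + 84 * 5 / 24 = 28.5
    print(ie_eff(5.0, 1.0), simplified_r(5.0, 1.0))
    # Same loss rate, burstier loss: the impairment grows.
    print(ie_eff(5.0, 1.5), simplified_r(5.0, 1.5))
```

For 5% random loss the arithmetic is 11 + (95 - 11) × 5 / (5 + 19) = 28.5, so the simplified rating is 93.2 - 28.5 = 64.7. Increasing BurstR shrinks the denominator and therefore raises Ie-eff.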

The E-Model is a good choice for non-intrusive estimation of voice quality, but it has an obvious problem in that it relies on subjective tests to calibrate the model parameters (Ie and Bpl) [12]. The inherent problems of subjective tests are that they are hard to conduct (as they require strict lab conditions), time-consuming and expensive, and that they lack repeatability. This makes the E-Model applicable to only a limited number of cases, specifically those for which the corresponding subjective tests have already been performed. Additionally, as the E-Model is defined over a specific range of parameters, it cannot be applied to a new codec, or even to new network conditions, before a series of subjective tests corresponding to the new codec and the new network conditions has been conducted to derive the model parameters. This hinders its use in new and emerging applications in an evolving world of communication. To overcome this restriction, an attempt is made in this paper to extend the E-Model without the need for subjective tests.

4. Extension to the E-Model using PESQ

Because the applicability range of the E-Model depends on performing the corresponding subjective tests, it would be useful and practical to have a methodology that extends the E-Model's applicability range without the time-consuming and expensive subjective tests needed to calibrate the E-Model's parameters. This would remove one of the major obstacles to applying the E-Model to new coders and new network conditions. The extension proposed in this paper uses the latest intrusive speech quality prediction methodology, PESQ, as the reference criterion for the accuracy of the prediction of the E-Model parameters, instead of performing the expensive subjective tests. An ANN model acting as a function approximator is also utilised to perform the proposed extension. The setup for the system is depicted in Fig. 2.

Fig. 2. System setup for the E-Model extension. (Diagram blocks: Sender, Reference Signal, Speech Encoder, IP Network Packet Loss Simulator with inputs Packet Loss % and Burst Ratio, Speech Decoder, Degraded Signal, Receiver, PESQ, MOS, R, Ie-eff, Neural Network.)

In the system setup shown in Fig. 2, the reference speech signal is encoded and then packet loss is simulated using a 2-state Gilbert model. In the Gilbert model the system moves between the "found" and "loss" states with transition probabilities p and q, respectively. The system suffers from bursty packet loss if it remains in the "loss" state. The probabilities of the Gilbert model are calculated from the values of Ppl and BurstR [12]. The degraded stream received after simulating packet loss is decoded to retrieve the degraded speech signal. Using PESQ, the speech quality is measured by comparing the reference signal with the degraded signal as defined in ITU-T Recommendation P.862 and discussed earlier in Section 2.2. From the PESQ score obtained in this measurement, the MOS representing the speech quality as measured by PESQ is calculated.

As the purpose here is to extend the E-Model's applicability range, an assumption is made that the quality measured by PESQ is the same as the quality predicted by the E-Model. The question of whether the two models give the same estimate of the quality is investigated in a separate paper. The MOS is then mapped into an R-Rating Factor value using Eq. (3). By rearranging the simplified version of the R-Rating Factor defined in Eq. (8), Ie-eff can be calculated from an R-Rating Factor:

$$I_{e\text{-eff}} = R_0 - R \tag{9}$$
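The chain from a PESQ score to Ie-eff (PESQ score to MOS, MOS to R via Eqs. (3)–(5), then Eq. (9)) could be sketched as follows. Two points are assumptions rather than statements from the paper: the logistic constants in `pesq_to_mos_lqo` are quoted from memory of ITU-T Recommendation P.862.1 and should be checked against the recommendation, and MOS values are clamped to [1, 4.5] before inversion because Eq. (4) is only real-valued on roughly that interval. All function names are hypothetical.

```python
import math

R0_DEFAULT = 93.2  # default R0 quoted in the text


def pesq_to_mos_lqo(pesq: float) -> float:
    """PESQ score -> MOS-LQO (logistic mapping, constants assumed from P.862.1)."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * pesq + 4.6607))


def mos_to_r(mos: float) -> float:
    """Eqs. (3)-(5), with MOS clamped to [1, 4.5] (an assumption of this sketch)."""
    mos = min(max(mos, 1.0), 4.5)
    x = 18566.0 - 6750.0 * mos
    y = 15.0 * math.sqrt(-903522.0 + 1113960.0 * mos - 202500.0 * mos * mos)
    h = math.atan2(y, x) / 3.0  # y >= 0, so this matches Eq. (5)
    return (20.0 / 3.0) * (8.0 - math.sqrt(226.0) * math.cos(h + math.pi / 3.0))


def ie_eff_from_pesq(pesq: float, r0: float = R0_DEFAULT) -> float:
    """Eq. (9): Ie-eff = R0 - R, with R derived from the PESQ-based MOS."""
    return r0 - mos_to_r(pesq_to_mos_lqo(pesq))


if __name__ == "__main__":
    for score in (2.0, 3.0, 4.0):
        print(score, round(ie_eff_from_pesq(score), 2))
```

Note that a very high PESQ score can map to an R value above R0, in which case this sketch returns a negative Ie-eff; how such values should be treated is not specified in this section.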

Recall that R0 has the value 93.2 when all the parameters of the E-Model are set to their default values. The above sequence of derivations can be summarised as follows: under specific values of Ppl and BurstR, packet loss is simulated and speech quality is measured using the PESQ algorithm to obtain the PESQ score, which is then mapped into a MOS score. From the MOS score the R-Rating Factor is derived, and then Ie-eff is calculated.

From the above sequence, a relation between (Ppl and BurstR) and Ie-eff is constructed by simulating packet loss for several values of Ppl and BurstR and performing the above derivation. This three-dimensional relation is then used to derive a model relating (Ppl and BurstR) to Ie-eff. Many methods can be used to derive such a relation; in this paper the use of an ANN is investigated. By deriving such a relation, Ie-eff is defined in terms of Ppl and BurstR in the absence of Ie and Bpl, which are absorbed in the form of weights in the ANN. Both Ie and Bpl depend on subjective tests, while originally (Eq. (6)) Ie-eff was defined in terms of the four quantities Ie, Bpl, Ppl, and BurstR. In this way the E-Model becomes extendable to new network conditions and to new coders as soon as the relation is re-derived for these new coders and new network conditions utilising PESQ. This by-passes the time-consuming, expensive subjective tests required to calculate Ie and Bpl, which are a major obstacle to the generalisation of the E-Model.

It should be noted, however, that the derived model is applicable only to the speech coder in use. If a new speech coder is to be used, a new derivation is required. Therefore, before applying the derived model, the speech coder in use should be determined and the model derived for that coder should be used. Nevertheless, requiring objective tests to derive such a model is much simpler, cheaper, and faster than requiring subjective tests, which is the aim here.

The derived model can be integrated with the E-Model in monitoring live traffic non-intrusively, as depicted in Fig. 3. IP packets can be captured at some appropriate point in the IP network (possibly at an ingress gateway), and the information about packet loss, delay, and the codec type is extracted from the RTP header. Once the speech codec has been identified and the packet-loss characteristics of the received stream (Ppl and BurstR) have been determined, the ANN derived for that speech codec is used: the packet-loss information is fed into the ANN model to calculate the corresponding Ie-eff value. Similarly, the information about delay and the other parameters in the E-Model is used to calculate Id and Is for use in the R-Rating Factor of Eq. (1). The value of the advantage factor A is also added according to the characteristics of the system in hand (wired, wireless, etc.). The Ie-eff value computed from the model and the results of the computation of Id and Is are combined with A to calculate an overall R-Rating Factor, which can then be mapped into a MOS score to give an estimate of the overall conversational quality.
Fig. 3. Conceptual diagram of E-Model extension used to monitor live traffic. (Diagram blocks: Sender, Reference Signal, Speech Encoder, IP Network, Speech Decoder, Degraded Signal, Receiver; Ppl and BurstR feed the ANN, which outputs Ie-eff; Is, Id and A are combined with Ie-eff to give R and then MOS.)

Using the proposed technique, a newly emerging speech coder becomes readily applicable to the E-Model as soon as the required objective tests are performed to derive a relation between Ie-eff and (Ppl and BurstR).

Previous efforts have been made to extend the E-Model based on the intrusive PESQ speech quality prediction methodology [13,19–23]. Ding and Goubran [19,20] attempted to relate the E-Model to PESQ, but as their study relied on a previous version of the E-Model (before the 2005 version) [24], they did not consider the burstiness in packet loss, which they pointed out in their study as a proposal for future work. Also, when they performed experiments using different packet-loss concealment algorithms, including silence insertion, repetition and the G.729 built-in algorithm, they incorrectly assumed that each of these has its own equipment impairment factor in ITU-T Recommendation G.113 [17], although that recommendation gives the equipment impairment factor only for the built-in algorithm; this makes their results unreliable. The work proposed by Sun [13,21–23] is limited in that the old E-Model was considered, in which the effect of burstiness on speech quality was not included. Additionally, the results of that work were not checked against the original E-Model to prove their validity. Also, the choice of the fitting curve used was not justified, and other fitting curves could provide a better extension to the E-Model.

These limitations are avoided in this study: the latest E-Model (2005 version) [12] is considered in order to study the effect of burstiness, and the accuracy of the prediction is evaluated by comparing the quality predicted by the proposed technique against the quality predicted by the current E-Model.

5. Derivation of the model

Using a speech signal encoded according to the speech coder defined in ITU-T Recommendation G.729 [18], this section details the steps toward deriving a relation between (Ppl and BurstR) and Ie-eff using an ANN model. This provides an alternative formula to Eq. (6) and thus by-passes the subjective tests required to calculate the Ie and Bpl values for the speech coder. The basic steps of the derivation are:

(i) Simulate packet loss using the 2-state Gilbert model for each combination of Ppl and BurstR, then compare the resulting degraded signal with the original signal to compute the PESQ score.
(ii) Convert the PESQ scores into MOS values.
(iii) Map the MOS values to R-Rating Factor values.
(iv) Calculate Ie-eff from the R-Rating Factor using the simplified E-Model.
(v) Derive a model relating Ie-eff to Ppl and BurstR using an ANN model.
(vi) Use the derived model to calculate the R-Rating Factor.
(vii) Map the R-Rating Factor values into MOS scores.

The above steps are described in more detail next.

Step 1 - Calculate PESQ score: Using the speech coder defined in ITU-T Recommendation G.729 [18], packet loss is simulated using the 2-state Gilbert model for each combination of Ppl and BurstR. The reference signal is then compared with the degraded signal to calculate the PESQ score. When packet loss is simulated using the 2-state Gilbert model with specific values of Ppl and BurstR, the exact locations of the lost packets are unknown in advance. The speech quality can therefore differ between two runs of the simulator even with the same combination of Ppl and BurstR, because loss during voiced periods of the signal has a different effect on the quality than loss during unvoiced periods [13].
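A 2-state Gilbert loss simulator of the kind used in step (i) might look like the sketch below. The paper does not spell out how p and q are obtained from Ppl and BurstR; the relations used here (BurstR = 1/(p + q) and Ppl/100 = p/(p + q)) are an assumption based on the 2-state Markov loss model associated with the E-Model, and the function names are hypothetical.

```python
import random


def gilbert_params(ppl: float, burst_r: float) -> tuple:
    """Return (p, q): p = P(found -> loss), q = P(loss -> found).

    Assumed relations: BurstR = 1 / (p + q) and Ppl / 100 = p / (p + q).
    """
    p = (ppl / 100.0) / burst_r
    q = (1.0 - ppl / 100.0) / burst_r
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("Ppl/BurstR combination gives invalid probabilities")
    return p, q


def simulate_loss(n_packets: int, ppl: float, burst_r: float,
                  rng: random.Random) -> list:
    """Return a list of booleans, True where the packet is lost."""
    p, q = gilbert_params(ppl, burst_r)
    lost = False  # start in the "found" state
    pattern = []
    for _ in range(n_packets):
        if lost:
            lost = rng.random() >= q  # stay in "loss" with probability 1 - q
        else:
            lost = rng.random() < p   # move to "loss" with probability p
        pattern.append(lost)
    return pattern


if __name__ == "__main__":
    pattern = simulate_loss(100_000, ppl=5.0, burst_r=1.5, rng=random.Random(1))
    print("observed loss rate:", sum(pattern) / len(pattern))
```

With BurstR = 1 the two relations give q = 1 - p, so the probability of losing a packet no longer depends on the previous state, which matches the description of random (independent) loss. Different seeds place the losses at different positions, which is exactly the run-to-run variation discussed above.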
To remove the effect of randomisation on the resultant quality, a pilot study was conducted to determine how many times the simulation needs to be repeated for each combination of Ppl and BurstR to have enough confidence in the results.

Step 1.1 - Pilot study: The purpose of this pilot study is to determine the number of iterations for which the simulation should be repeated in order to accurately predict the quality in the presence of packet loss in random locations. The goal is to estimate the quality to within ±0.01 MOS, because any difference of 0.01 or less is indistinguishable to the ear. The equation used to calculate the error in the estimation is

$$\mathrm{Error} = \frac{z\sigma}{\sqrt{N}} \tag{10}$$

where
Error: the error in estimating the quality between the actual and the prediction
z: a number indicating how confident we want to be
σ: the standard deviation of the population
N: the size of the sample

As the size of the sample increases, the accuracy of the prediction increases and the error decreases. As an infinite sample size cannot be used, a sample of limited size is used and calculations are made with a margin of error. In this case the standard deviation of the population σ is replaced with the standard deviation of the sample S, and a modified version of Eq. (10) is used as follows:

$$\mathrm{Error} = \frac{zS}{\sqrt{N}} \tag{11}$$

Manipulation of Eq. (11) yields

$$N = \left(\frac{zS}{\mathrm{Error}}\right)^{2} \tag{12}$$

In Eq. (12) the required level of error is 0.01 MOS, as MOS scores are normally given to at most two decimal digits, and the required confidence level is 95% (a confidence level as high as 99% would require a much larger number of experiments). The corresponding value of z for the 95% confidence level is 1.96. To determine the number of iterations (N) needed to achieve this confidence level with this error margin, the sample's standard deviation, S, must still be determined; it is computed from the pilot study.

In the pilot study, one value of Ppl is used, specifically 5, which lies in the range 0–20 permitted for Ppl by the E-Model. The selected value for BurstR is 1.5 (the midpoint of the permitted range 1–2). For this combination of Ppl and BurstR (5, 1.5), the p and q parameters of the 2-state Gilbert model are computed and the model is used to simulate the packet-loss behavior. The degraded signal is compared against the reference signal (without loss) to compute the resulting PESQ score. This was repeated 10 000 times to obtain a sample large enough to represent the population. The standard deviation of this sample (S) was 0.0249. When this value was fed into Eq. (12), the required number of iterations was found to be 23.755. The speech material used in this study is the speech file named b_eng_f1 from the speech data set defined in ITU-T Recommendation P.50. The pseudo-code for this pilot study is listed in Fig. 4.

As a result it was concluded that, to be 95% confident of an estimate within 0.01 MOS, at least 24 iterations should be performed for each combination of Ppl and BurstR. So that the sample mean can also be treated as normally distributed (according to the Central Limit Theorem), 30 iterations are used to satisfy both requirements.
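The sample-size calculation of Eq. (12) can be reproduced with a few lines of Python. This is only a sketch: the function name is illustrative, and it uses the rounded value S = 0.0249 quoted above, so the result is about 23.8 rather than exactly the reported 23.755 (presumably obtained from the unrounded standard deviation). Either value rounds up to 24 iterations.

```python
import math

def required_iterations(sample_std, error, z=1.96):
    """Eq. (12): N = (z * S / Error)^2, before rounding up."""
    return (z * sample_std / error) ** 2

# Values quoted in the pilot study: S = 0.0249, Error = 0.01 MOS, z = 1.96 (95%).
n_exact = required_iterations(sample_std=0.0249, error=0.01)
n_runs = math.ceil(n_exact)  # smallest whole number of iterations that suffices
print(f"N = {n_exact:.3f}, rounded up to {n_runs}")
```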
Step 1.2—Calculate PESQ score: From the previous step it was determined that, for each combination of Ppl and BurstR, the simulation needs to be repeated 30 times to have 95% confidence that the results are within 0.01 MOS.


Fig. 4. Pseudo-code for the pilot study.

The permitted range according to the E-Model is 0–20 for Ppl and 1–2 for BurstR. To compare the results with those of the E-Model within its permitted range, and to examine the expected quality outside that range, packet loss was simulated for Ppl in the range 0–30 and BurstR in the range 1–5. The first step in the derivation of the model is the calculation of the PESQ score for each combination of Ppl and BurstR. The selected values are {0, 0.5, 1, …, 30} for Ppl (32 values) and {1, …, 5} for BurstR (5 values); consequently there are 160 possible combinations. For each combination, the parameters of the 2-state Gilbert model (p and q) are calculated. Then 30 iterations are performed (as discussed in the previous step); in each iteration packet loss is simulated using the values of p and q, and the original speech file is compared against the degraded speech file using the PESQ algorithm to compute the PESQ score. The overall PESQ score for each combination of Ppl and BurstR is the average over the 30 iterations. The pseudo-code for this process is listed in Fig. 5.

As a result of this experiment, the PESQ score is calculated for each possible combination of Ppl and BurstR. This forms a three-dimensional relation between Ppl, BurstR and the PESQ score, as shown in Fig. 6. As expected, the quality of the signal expressed as a PESQ score decreases with any increase in either Ppl or BurstR. This can be seen in Fig. 6, where the maximum PESQ is achieved with the minimum values of Ppl and BurstR, namely 0 and 1, respectively. For any given value of BurstR, such as 1, the PESQ score decreases monotonically as Ppl increases.

Similarly, for any given value of Ppl, such as 10, the PESQ score decreases as BurstR increases. It should be noted that with Ppl = 0 the PESQ score remains the same, 3.7535, which is the maximum possible value of PESQ for the speech file used. This is because, when packet loss equals 0, there is no degradation in quality due to packet loss; the only degradation in this case is due to the distortion of the coder. This is consistent with the expected quality under the E-Model. According to the E-Model the impairment due to packet loss is characterised by the packet-loss dependent Ie-eff as defined in Eq. (6). When Ppl equals 0, this equation reduces to

\[ I_{e\text{-eff}} = I_e \tag{13} \]

where Ie is the impairment due to the coder in the absence of packet loss.

Step 2—Convert PESQ score to MOS score: As shown in the system setup in Fig. 2, the next step of the derivation is to convert the PESQ score into an MOS score. This can be achieved using the following formula [10]:

\[ \text{MOS} = 0.999 + \frac{4.999 - 0.999}{1 + e^{-1.4945\,\text{PESQ} + 4.6607}} \tag{14} \]

Each PESQ value shown in Fig. 6 can be mapped into a corresponding MOS value for each combination of Ppl and BurstR. As the MOS score increases monotonically with the PESQ score, the same observations apply here: the quality of the signal expressed as an MOS score decreases with any increase in either Ppl or BurstR.
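The mapping of Eq. (14) is straightforward to evaluate. The sketch below assumes that Eq. (14) is the ITU-T P.862.1 mapping cited as [10], with a negative coefficient on the PESQ score in the exponent; the function name is illustrative.

```python
import math

def pesq_to_mos(pesq):
    """Map a raw PESQ score to an MOS value using the P.862.1 logistic curve."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * pesq + 4.6607))

# The curve rises monotonically, so a higher PESQ score never maps to a lower MOS.
mos_values = [pesq_to_mos(score) for score in (1.0, 2.0, 3.0, 3.7535, 4.5)]
```

Because the curve is S-shaped rather than a straight line, equal steps in PESQ do not correspond to equal steps in MOS.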


Fig. 5. Pseudo-code for the calculation of the PESQ score.
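The sweep described in Step 1.2 can be sketched in Python as follows. This is not the authors' pseudo-code: the `score_once` callable is a hypothetical stand-in for one full run (encode with G.729, drop packets with the Gilbert model, decode, and score against the reference with PESQ), and all names are illustrative.

```python
import itertools
import numpy as np

# Grid described in the text: 32 Ppl values and 5 BurstR values, 30 runs each.
PPL_VALUES = [0.0, 0.5] + [float(v) for v in range(1, 31)]
BURST_VALUES = [1, 2, 3, 4, 5]
N_RUNS = 30

def sweep(score_once, seed=0):
    """Average `score_once(ppl, burst_r, rng)` over N_RUNS for each grid point.

    Returns a dict mapping (ppl, burst_r) to the mean score.
    """
    rng = np.random.default_rng(seed)
    table = {}
    for ppl, burst_r in itertools.product(PPL_VALUES, BURST_VALUES):
        scores = [score_once(ppl, burst_r, rng) for _ in range(N_RUNS)]
        table[(ppl, burst_r)] = float(np.mean(scores))
    return table
```

Passing the random generator into each run keeps the whole sweep reproducible from a single seed while still letting every iteration place its losses differently.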

Fig. 6. PESQ (experimental) vs. packet-loss probability and burst ratio.

Step 3—Mapping MOS score to R-Rating Factor: The next step of the derivation (Fig. 2) is to map the MOS scores retrieved in the previous step into their corresponding R-Rating Factor values. Eq. (3) is used to convert the MOS scores into an R-Rating Factor value.

Step 4—Calculate Ie-eff from R-Rating Factor: From the R-Rating Factor derived in the previous step, the value of Ie-eff can be computed using Eq. (9). As a result Fig. 7 is produced to show the relation of Ie-eff against Ppl and BurstR.
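Steps 3 and 4 can be illustrated with a short sketch. It assumes that the R-to-MOS relation is the standard ITU-T G.107 mapping and that the simplified E-Model reduces to R = 93.2 − Ie-eff, which is consistent with the values 80.6834 and 12.5166 quoted later in this section; Eqs. (3) and (9) are not reproduced here, so both forms are assumptions. Eq. (3) may well be a closed-form expression, whereas the sketch inverts the R-to-MOS curve numerically by bisection. The function names are illustrative.

```python
def r_to_mos(r):
    """ITU-T G.107 mapping from R-Rating Factor to MOS."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def mos_to_r(mos, lo=6.5, hi=100.0, tol=1e-6):
    """Invert r_to_mos by bisection on [lo, hi], where the curve is increasing."""
    mos = min(max(mos, r_to_mos(lo)), r_to_mos(hi))  # clamp to the invertible range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r_to_mos(mid) < mos:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ie_eff_from_r(r, r_max=93.2):
    """Assumed simplified E-Model with no other impairments: Ie-eff = 93.2 - R."""
    return r_max - r

# Example chain for one grid point: MOS -> R -> Ie-eff.
r_value = mos_to_r(3.5)
ie_eff = ie_eff_from_r(r_value)
```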


Fig. 7. Ie-eff (experimental) vs. packet-loss probability and burst ratio.

Since Ie-eff falls as the R-Rating Factor rises, Ie-eff increases with any increase in either Ppl or BurstR, as can be seen in Fig. 7. The maximum Ie-eff value is reached at the maximum values of Ppl and BurstR. Ie-eff reaches its minimum when Ppl = 0, as a result of applying Eq. (13). This is because, when packet loss equals 0, there is no degradation in quality due to packet loss; the only degradation in this case is due to the distortion of the coder. For the speech signal used (b_eng_f1), the basic quality (in terms of the R-Rating Factor) due to the coder's distortion is 80.6834. When the R-Rating Factor reaches this maximum, applying Eq. (8) gives the basic degradation (in terms of Ie-eff), which equals 12.5166 in this case.

Step 5—Derive an ANN model to relate Ie-eff with Ppl and BurstR: In the previous step a three-dimensional figure relating Ie-eff to Ppl and BurstR was derived. In this step an attempt is made to derive, from the available data, a relation between Ie-eff and the packet-loss conditions of the network (Ppl and BurstR). With such a relation, the E-Model becomes extendable to new network conditions and to new coders, provided the relation is re-derived for these new coders. This by-passes the time-consuming, expensive subjective tests, which are a major obstacle to the generalisation of the E-Model. The derivation in this step is done using an ANN model that acts as a function approximator. Multi-layer feedforward ANNs, trained using the backpropagation algorithm, are used to approximate a function relating Ie-eff to Ppl and BurstR. Multiple-layer networks can be powerful.
For instance, two-layer networks with biases, a sigmoid transfer function in the first layer and a linear transfer function in the output layer can be used as general function approximators, as they are capable of approximating any function with a finite number of discontinuities arbitrarily well, given sufficient neurons in the hidden layer. Having neurons with nonlinear transfer functions allows the network to learn both nonlinear and linear relationships between input and output vectors. While the sigmoid functions squash the output into a limited range, between 0 and 1 or between −1 and +1, the linear output layer lets the network produce values outside the range −1 to +1 [25].

Several algorithms can be used to train multi-layer backpropagation networks. For instance, the Levenberg–Marquardt (LM) algorithm performs well for function approximation problems using moderate-sized feedforward ANNs (up to several hundred weights) where the approximation must be very accurate. It also has a very efficient MATLAB implementation, since the solution of the matrix equation is a built-in function, so its advantage becomes more pronounced in a MATLAB setting [26]. These characteristics fit the problem at hand: finding a function approximation model that characterises the relation between (Ppl, BurstR) and Ie-eff.

For this problem a two-layer neural network with a sigmoid transfer function in the first layer and a linear transfer function in the output layer is used, and the network is trained using the LM algorithm. The number of input units is fixed at two, corresponding to Ppl and BurstR, and the number of output units is one, corresponding to Ie-eff. The number of neurons in the hidden layer is a free parameter and is determined empirically.

Fig. 8 shows an example feedforward neural network with two inputs, one output and four neurons in the hidden layer; a sigmoid function is used in the first layer and a linear transfer function in the output layer. In the input layer of Fig. 8, a vector of size 2 represents the input parameters of the ANN, specifically Ppl and BurstR. These two inputs are connected to the four neurons in the hidden layer through weights, and each of the four hidden neurons is connected to a bias. For each hidden neuron, every input is multiplied by the corresponding weight, the products are summed, and the sum is added to the neuron's bias to form the neuron's net input. Each neuron in the hidden layer then applies a sigmoid transfer function, which squashes its input to an output in the range −1 to +1. The output of the hidden layer is then fed to the output layer, which consists of a single neuron representing the required output of the ANN, Ie-eff. The four outputs of the hidden layer, which lie between −1 and +1, are multiplied by the four weights connecting the hidden neurons to the single neuron in the output layer. The results of the four multiplications are summed and added to the bias connected to the output neuron. This net input is then passed to the linear transfer function of the output neuron to produce the final output.

Fig. 8. Multi-layer feedforward neural network.

Before using an ANN model similar to the one shown in Fig. 8, the Ie-eff data shown in Fig. 7 are divided into training, validation and test subsets. The validation subset is used to improve generalisation accuracy and to avoid overfitting the trained network to the training data. From the combinations of Ppl in the range 0–30 and BurstR in the range 1–5, there are 160 input vectors available: 100 vectors are used for training, 30 for validation and 30 for testing. Among the 160 vectors there are five possible burst ratios, with 32 possible packet-loss probabilities for each burst ratio. The training, validation and test sets were selected as roughly equally spaced points throughout the original data to avoid bias in the training set. Table 1 illustrates the division of the data into training, validation and testing sets for any one burst ratio. The same pattern is repeated for the other burst ratios.

Table 1 Division of data set into training, validation, and testing.

| Ppl | Assigned data set | Ppl | Assigned data set |
|-----|-------------------|-----|-------------------|
| 0   | Training          | 15  | Training          |
| 0.5 | Training          | 16  | Training          |
| 1   | Training          | 17  | Training          |
| 2   | Validation        | 18  | Validation        |
| 3   | Testing           | 19  | Testing           |
| 4   | Training          | 20  | Training          |
| 5   | Training          | 21  | Training          |
| 6   | Training          | 22  | Training          |
| 7   | Validation        | 23  | Validation        |
| 8   | Testing           | 24  | Testing           |
| 9   | Training          | 25  | Training          |
| 10  | Training          | 26  | Training          |
| 11  | Training          | 27  | Training          |
| 12  | Training          | 28  | Training          |
| 13  | Validation        | 29  | Validation        |
| 14  | Testing           | 30  | Testing           |

Different numbers of neurons in the hidden layer were tried, ranging from 1 to 40. With one neuron the total number of weights and biases in the network equals 5, whereas with 40 neurons the network has the capacity to memorise the training set completely; this range was chosen to study the effect of the number of neurons on test-set performance. For each setting the experiment was repeated for 30 different trials, with different random initial weights generated in each trial. This amounts to 1200 runs in total (40 × 30). During the experiment the network was allowed to train for up to 10 000 epochs, although in all cases the training stopped before reaching this number because the error on the validation set exceeded the error on the training set. The best network in terms of test-set performance was found to be a network with five neurons in the hidden layer. This network is used for the subsequent derivations in this section.

Fig. 9. Ie-eff (neural network derived) vs. packet-loss probability and burst ratio.

Fig. 10. Ie-eff (experimental and neural network predicted) vs. packet-loss probability and burst ratio.

The inputs to this network are Ppl and BurstR. Comparing this with the original Ie-eff in Eq. (6), it can be noticed that the speech coder dependent parameters (Ie and Bpl) have been absorbed into the weights and biases of the ANN model. These parameters were previously obtained from subjective tests. As such, Ie-eff as derived from the ANN model does not depend on time-consuming subjective tests to calibrate its equations, which is the goal of this derivation.
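The data division of Table 1 and the forward computation of the two-layer network can be sketched in Python with NumPy. This is an illustration only: the weights are random rather than trained (the Levenberg–Marquardt training used in the paper is not shown), `tanh` is assumed as the sigmoid that squashes to the range −1 to +1, and all names are hypothetical.

```python
import numpy as np

PPL_VALUES = [0, 0.5] + list(range(1, 31))   # 32 packet-loss values
VALIDATION_PPL = {2, 7, 13, 18, 23, 29}      # rows marked Validation in Table 1
TESTING_PPL = {3, 8, 14, 19, 24, 30}         # rows marked Testing in Table 1

def assign_subset(ppl):
    """Return the Table 1 subset for a Ppl value (same for every BurstR)."""
    if ppl in VALIDATION_PPL:
        return "validation"
    if ppl in TESTING_PPL:
        return "testing"
    return "training"

def n_parameters(n_hidden):
    """Weights and biases of a 2-input, n_hidden, 1-output network."""
    return 2 * n_hidden + n_hidden + n_hidden + 1

def init_network(n_hidden, rng):
    """Random (untrained) parameters for illustration."""
    return {
        "W1": rng.normal(size=(n_hidden, 2)),  # input -> hidden weights
        "b1": rng.normal(size=n_hidden),       # hidden biases
        "W2": rng.normal(size=n_hidden),       # hidden -> output weights
        "b2": rng.normal(),                    # output bias
    }

def forward(net, ppl, burst_r):
    """Weighted sum plus bias, tanh in the hidden layer, linear output."""
    x = np.array([ppl, burst_r], dtype=float)
    hidden = np.tanh(net["W1"] @ x + net["b1"])
    return float(net["W2"] @ hidden + net["b2"])

net = init_network(5, np.random.default_rng(0))
ie_eff_untrained = forward(net, 5.0, 1.5)
```

A network with one hidden neuron has 5 parameters under this count, matching the figure given above, while 40 hidden neurons give more parameters than there are training vectors.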

Fig. 11. R-Rating Factor (neural network derived) vs. packet-loss probability and burst ratio.

Fig. 12. MOS (neural network derived) vs. packet-loss probability and burst ratio.


When different combinations of Ppl and BurstR are fed into the best network obtained from the experiment, which has five neurons in the hidden layer, a new three-dimensional graph can be drawn relating Ie-eff to Ppl and BurstR. This graph is shown in Fig. 9. Fig. 10 visually compares Ie-eff from the experiments (Fig. 7) with Ie-eff from the ANN derivation (Fig. 9) by drawing both surfaces in the same figure. The fit is good enough that it is hard to distinguish between the experimental values and the ANN-predicted values. The multiple correlation coefficient (R) between the observed and the ANN-predicted values of the dependent variable (Ie-eff) is 0.998, which indicates a strong positive correlation and a good fit. R², the coefficient of determination, is 0.996, which indicates that 99.6% of the variation in the dependent variable is explained by the model.

Step 6—Use the derived model to calculate R-Rating Factor from Ie-eff: From the values of Ie-eff produced in the previous step by the best network, with five neurons in the hidden layer, R-Rating Factor values can be derived using Eq. (8). Based on this equation, Fig. 11 is produced to show the relation between the R-Rating Factor and Ppl and BurstR.

Step 7—Map the R-Rating Factor values into MOS score: MOS values can be used to test the performance of the ANN against that of the original E-Model in measuring speech quality. Using the relation between the R-Rating Factor and MOS in Eq. (2), Fig. 12 is produced to show the values of MOS against Ppl and BurstR. Similarly, PESQ scores can be derived from the MOS values to obtain a figure relating PESQ to Ppl and BurstR. These scores could be used for comparison with PESQ scores obtained empirically.

Table 2 E-Model MOS and ANN MOS for different possible values of Ppl for speech coder G.729.

| Ppl | E-Model MOS (BurstR = 1) | ANN MOS (BurstR = 1) | Absolute difference (BurstR = 1) | E-Model MOS (BurstR = 2) | ANN MOS (BurstR = 2) | Absolute difference (BurstR = 2) |
|-----|------|------|------|------|------|------|
| 0   | 4.1  | 4.09 | 0.01 | 4.1  | 4.03 | 0.07 |
| 0.5 | 4.03 | 4.02 | 0.01 | 4.02 | 3.94 | 0.08 |
| 1   | 3.95 | 3.95 | 0.00 | 3.94 | 3.84 | 0.10 |
| 2   | 3.79 | 3.81 | 0.02 | 3.77 | 3.65 | 0.12 |
| 3   | 3.63 | 3.67 | 0.04 | 3.59 | 3.45 | 0.14 |
| 4   | 3.48 | 3.54 | 0.06 | 3.41 | 3.26 | 0.15 |
| 5   | 3.34 | 3.41 | 0.07 | 3.24 | 3.08 | 0.16 |
| 6   | 3.21 | 3.29 | 0.08 | 3.06 | 2.91 | 0.15 |
| 7   | 3.08 | 3.18 | 0.10 | 2.89 | 2.76 | 0.13 |
| 8   | 2.96 | 3.07 | 0.11 | 2.73 | 2.62 | 0.11 |
| 9   | 2.85 | 2.97 | 0.12 | 2.58 | 2.49 | 0.09 |
| 10  | 2.75 | 2.88 | 0.13 | 2.43 | 2.37 | 0.06 |
| 11  | 2.65 | 2.78 | 0.13 | 2.29 | 2.26 | 0.03 |
| 12  | 2.56 | 2.69 | 0.13 | 2.16 | 2.15 | 0.01 |
| 13  | 2.47 | 2.59 | 0.12 | 2.03 | 2.05 | 0.02 |
| 14  | 2.4  | 2.49 | 0.09 | 1.92 | 1.94 | 0.02 |
| 15  | 2.32 | 2.38 | 0.06 | 1.81 | 1.83 | 0.02 |
| 16  | 2.25 | 2.27 | 0.02 | 1.71 | 1.73 | 0.02 |
| 17  | 2.19 | 2.16 | 0.03 | 1.62 | 1.62 | 0.00 |
| 18  | 2.13 | 2.07 | 0.06 | 1.54 | 1.52 | 0.02 |
| 19  | 2.07 | 1.98 | 0.09 | 1.46 | 1.43 | 0.03 |
| 20  | 2.02 | 1.91 | 0.11 | 1.39 | 1.36 | 0.03 |

Fig. 13. MOS (E-Model and neural network) vs. packet-loss probability and burst ratio.

Fig. 14. Scatter diagram of quality prediction.

From the above derivation, the best network can be used to compute the value of Ie-eff from Ppl and BurstR. The computed Ie-eff can then be used to calculate the R-Rating Factor, which can in turn be mapped into an MOS value and then a PESQ score. In this way the E-Model by-passes the time-consuming, expensive subjective tests, which was the purpose of this derivation.
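The prediction chain summarised above (Ppl and BurstR to Ie-eff, then to the R-Rating Factor, then to MOS) can be sketched as a small Python function. The sketch assumes that Eq. (8) reduces to R = 93.2 − Ie-eff, which matches the pair of values 80.6834 and 12.5166 quoted earlier, and that Eq. (2) is the standard ITU-T G.107 R-to-MOS mapping. The `ie_eff_model` argument is a hypothetical stand-in for the trained five-neuron network.

```python
def r_from_ie_eff(ie_eff, r_max=93.2):
    """Assumed simplified E-Model with no other impairments: R = 93.2 - Ie-eff."""
    return r_max - ie_eff

def r_to_mos(r):
    """ITU-T G.107 mapping from R-Rating Factor to MOS."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def predict_mos(ppl, burst_r, ie_eff_model):
    """Chain: (Ppl, BurstR) -> Ie-eff -> R -> MOS."""
    ie_eff = ie_eff_model(ppl, burst_r)
    return r_to_mos(r_from_ie_eff(ie_eff))

# Usage with a placeholder model that always returns the coder-only impairment.
mos_no_loss = predict_mos(0, 1, lambda ppl, burst_r: 12.5166)
```

Keeping the Ie-eff model as a parameter means the same chain works for a network re-derived for another codec.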

6. Results of the extension

To evaluate the proposed extension correctly, comparisons should be made in terms of MOS values, as they reflect the quality as perceived by the user. The derived MOS values in Fig. 12 represent the ANN prediction of the quality. These values should be compared against the E-Model's predicted values to determine how successfully the ANN model extends the E-Model to new network conditions and new speech coders.

To study the effectiveness of the ANN model in modelling Ie-eff and ultimately predicting speech quality in terms of MOS score, a comparison is made between the ANN-predicted MOS (as shown in Fig. 12) and the E-Model-predicted MOS. The comparison is restricted to the MOS values corresponding to Ppl in the range 0–20 and BurstR in the range 1–2, as defined by the E-Model. Table 2 lists the MOS values calculated by the E-Model series of equations and by the ANN model in these ranges.

The multiple correlation coefficient (R) between the E-Model's MOS values and the ANN's MOS values (both listed in Table 2) is found to be 0.9942, which indicates a strong positive correlation between the two. The average absolute difference is 0.0716 MOS, which indicates a good estimation of the quality, while the maximum absolute difference is 0.1600 MOS and the standard deviation is 0.0486.

A visual comparison between the MOS values from the E-Model and the MOS values from the ANN model (in the specified range) is shown in Fig. 13. The figure shows that the two MOS surfaces have very similar characteristics, with differences so small that they are hardly noticeable. Fig. 14 shows a scatter diagram of the E-Model-based prediction against the ANN prediction to visualise the correlation between the two. Most of the points are concentrated near the perfect-fit line because of the very high correlation.

Fig. 15 shows the box plot of the difference in quality prediction between the E-Model and the ANN model. The figure shows that the prediction errors are clustered in the lower range, i.e. toward zero. The first quartile (the first 25% of the data) lies in the range 0–0.02 MOS, the first two quartiles (50%) lie in the range 0–0.07 MOS (the median value), the third quartile lies between 0.07 and 0.12 MOS, and the last quartile lies between 0.12 and 0.16 MOS. Fig. 16 shows the cumulative distribution function (CDF) of the difference in quality prediction between the E-Model and the ANN model. The upper bound of the error was 0.16 MOS, with 38.64% of the differences at or below 0.05 MOS, 68.18% at or below 0.10 MOS, 97.73% at or below 0.15 MOS and 100.00% at or below 0.16 MOS.

Although the ANN derivation takes time for training and preparation, the estimation accuracy justifies its use in predicting the quality degradation due to Ie-eff. The derived ANN model can be integrated with the E-Model for monitoring live traffic non-intrusively and predicting conversational speech quality, as depicted in Fig. 3.
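The summary statistics quoted in this section can be recomputed from the absolute-difference columns of Table 2. The sketch below uses the 44 tabulated differences directly; treating the reported standard deviation as the sample standard deviation (`ddof=1`) is an assumption, and the names are illustrative.

```python
import numpy as np

# Absolute differences from Table 2 for Ppl = 0, 0.5, 1, 2, ..., 20.
ABS_DIFF_BURST_1 = [0.01, 0.01, 0.00, 0.02, 0.04, 0.06, 0.07, 0.08, 0.10, 0.11, 0.12,
                    0.13, 0.13, 0.13, 0.12, 0.09, 0.06, 0.02, 0.03, 0.06, 0.09, 0.11]
ABS_DIFF_BURST_2 = [0.07, 0.08, 0.10, 0.12, 0.14, 0.15, 0.16, 0.15, 0.13, 0.11, 0.09,
                    0.06, 0.03, 0.01, 0.02, 0.02, 0.02, 0.02, 0.00, 0.02, 0.03, 0.03]

def summarise(diffs):
    """Mean, maximum, sample standard deviation and two CDF points."""
    d = np.asarray(diffs, dtype=float)
    return {
        "mean": float(d.mean()),
        "max": float(d.max()),
        "std": float(d.std(ddof=1)),
        "share_le_0.05": float((d <= 0.05).mean()),
        "share_le_0.10": float((d <= 0.10).mean()),
    }

summary = summarise(ABS_DIFF_BURST_1 + ABS_DIFF_BURST_2)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```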

Fig. 15. Box plot of the error in neural network prediction.

Fig. 16. Cumulative distribution function (CDF) of quality prediction error.

7. Conclusions and future work

In this paper a method to extend the E-Model for predicting the effective equipment impairment factor (Ie-eff) from packet-loss probability (Ppl) and burst ratio (BurstR) was derived based on an ANN model, and the accuracy of the method in comparison with the E-Model was evaluated.

The proposed model avoids the hard-to-conduct, time-consuming, expensive and poorly repeatable subjective tests required to calibrate the E-Model's parameters, relying instead on objective tests using the PESQ method. Using the proposed method, the E-Model becomes readily extensible to new network conditions and to new speech codecs. If a new speech codec emerges, it can be used with the E-Model as soon as the required objective tests have been performed to derive a relation between Ie-eff and Ppl and BurstR. From Ie-eff, the R-Rating Factor and MOS values can be computed.

The proposed model has wide applicability in estimating speech quality for voice applications over IP networks, which makes it significantly important for this widely interesting, growing application in the continuously changing world of communication.

In this paper an ANN was used to find a relation between (Ppl, BurstR) and Ie-eff; a separate paper is being prepared to attempt other methods of finding such a relation. Possible methods include linear regression and nonlinear regression. The performance of such methods should be compared against the performance of the E-Model and of the ANN presented in this paper in terms of the accuracy of the quality prediction.

References

[1] D. Collins, Carrier Grade Voice over IP, second ed., McGraw-Hill Companies, New York, 2003.
[2] B. Khasnabish, Implementing Voice over IP, second ed., Wiley-Interscience, New York, 2003.
[3] A. Takahashi, H. Yoshino, N. Kitawaki, Perceptual QoS assessment technologies for VoIP, IEEE Communications Magazine 42 (7) (2004) 28–34.
[4] ITU-T, Recommendation P.800—methods for subjective determination of transmission quality, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), August 1996.
[5] ITU-T, Recommendation P.800.1—mean opinion score (MOS) terminology, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), March 2003.
[6] A. Takahashi, Opinion model for estimating conversational quality of VoIP, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 17–21 May 2004, pp. iii-1072–1075.
[7] ITU-T, Recommendation P.862—perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), February 2001.
[8] ITU-T, Recommendation P.862—perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, Amendment 2: revised Annex A: Reference implementations and conformance testing for Recommendations P.862, P.862.1 and P.862.2, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), November 2003.
[9] A. Rix, J. Beerends, M. Hollier, A. Hekstra, Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 7–11 May 2001, pp. 749–752.
[10] ITU-T, Recommendation P.862.1—mapping function for transforming P.862 raw result scores to MOS-LQO, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), March 2005.
[11] E.E. Zurek, J. Leffew, W.A. Moreno, Objective evaluation of voice clarity measurements for VoIP compression algorithms, in: Proceedings of the Fourth IEEE International Caracas Conference on Devices, Circuits and Systems, 17–19 April 2002, pp. T033-1–T033-6.
[12] ITU-T, Recommendation G.107—the E-model, a computational model for use in transmission planning, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), March 2005.
[13] L. Sun, Speech quality prediction for voice over Internet protocol networks, Ph.D. Thesis, School of Computing, Communications and Electronics, University of Plymouth, UK, January 2004.
[14] J. Allnatt, Subjective rating and apparent magnitude, International Journal of Man–Machine Studies 7 (1975) 801–816.
[15] J.H. James, B. Chen, L. Garrison, Implementing VoIP: a voice transmission performance progress report, IEEE Communications Magazine 42 (2004) 36–41.
[16] A.P. Markopoulou, F.A. Tobagi, M.J. Karam, Assessment of VoIP quality over Internet backbones, in: IEEE Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Proceedings, INFOCOM 2002, vol. 1, 23–27 June 2002, pp. 150–159.

[17] ITU-T, Recommendation G.113 Appendix I—provisional planning values for the equipment impairment factor Ie and packet-loss robustness factor Bpl, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), May 2002.
[18] ITU-T, Recommendation G.729—coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP), International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), March 1996.
[19] L. Ding, R.A. Goubran, Assessment of effects of packet loss on speech quality in VoIP, in: Proceedings of the 2nd IEEE International Workshop on Haptic, Audio and Visual Environments and Their Applications, 20–21 September 2003, pp. 49–54.
[20] L. Ding, R.A. Goubran, Speech quality prediction in VoIP using the extended E-model, in: IEEE Global Telecommunications Conference, GLOBECOM ’03, vol. 7, 2003, pp. 3974–3978.
[21] L. Sun, E.C. Ifeachor, Prediction of perceived conversational speech quality and effects of playout buffer algorithms, in: IEEE International Conference on Communications, ICC ’03, vol. 1, 11–15 May 2003, pp. 1–6.
[22] L. Sun, E.C. Ifeachor, Voice quality prediction models and their application in VoIP networks, IEEE Transactions on Multimedia 8 (2006) 809–820.
[23] L. Sun, E. Ifeachor, New models for perceived voice quality prediction and their applications in playout buffer optimization for VoIP networks, in: IEEE International Conference on Communications, vol. 3, 20–24 June 2004, pp. 1478–1483.
[24] ITU-T, Recommendation G.107—the E-model, a computational model for use in transmission planning, International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T), March 2000.
[25] S. Haykin, Neural Networks: A Comprehensive Foundation, second ed., Prentice-Hall, Englewood Cliffs, NJ, 1998.
[26] MathWorks, Levenberg–Marquardt (trainlm)—Speed and Memory Comparison.

Mousa Tawfiq AL-Akhras received his B.Sc. and M.Sc. degrees in Computer Science from the University of Jordan, Jordan, in 2000 and 2003, respectively. He received his Ph.D. degree from De Montfort University, UK, in 2007. He is currently working in the Computer Information Systems Department in King Abdullah II School for Information Technology at the University of Jordan. His main area of interest is artificial intelligence, mainly artificial neural networks. His research interests include genetic algorithms, fuzzy logic, statistics, and multimedia communications.

Professor Hussein Zedan is the Technical Director of the Software Technology Research Laboratory (STRL) at De Montfort University, UK. His research interests include formal methods, veriﬁcation, semantics, critical systems, re-engineering, computer security, CBD, and IS development.

Professor Robert John heads the Division of Computer Engineering at De Montfort University, UK. He is also Director of the Centre for Computational Intelligence. He has published extensively in the area of type-2 fuzzy logic and his research interests are in the general ﬁeld of modelling human decision making using type2 fuzzy logic.

Iman Musa Almomani received her B.Sc. degree in Computer Science from UAE University (UAE), M.Sc. degree in Computer Science from the University of Jordan (Jordan). Iman then worked at the University of Jordan as a Lecturer. After that she got her Ph.D. degree from De Montfort University, UK, in 2007. Her research interests include wireless mobile ad hoc networks (WMANETs) and security issues in wireless networks. She is currently working in the Computer Science Department in King Abdullah II School for Information Technology at the University of Jordan.
