Partial likelihood for online order selection


Signal Processing 85 (2005) 917–926 www.elsevier.com/locate/sigpro

Tülay Adalı, Hongmei Ni
Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
Received 25 June 2004; received in revised form 17 October 2004
Supported in part by the National Science Foundation Career Award, NSF NCR-9703161. Corresponding author: T. Adalı; tel.: +410 455 3521; fax: +410 455 3969; e-mail: [email protected].

Abstract

Partial likelihood (PL) is a flexible framework for adaptive nonlinear signal processing allowing the use of a wide class of nonlinear structures—probability models—as filters. PL maximization has been shown to be equivalent to relative entropy minimization for the general case of time-dependent observations, and its large sample properties have been established. In this paper, we use these properties to derive an information-theoretic criterion for order selection—the penalized partial likelihood (PPL) criterion—for the general case of dependent observations. We then consider nonlinear signal processing by conditional finite normal mixtures as an example, a problem for which true order selection is particularly important. For this case, in which the PL coincides with the usual likelihood formulation, we present a formulation for online order selection by eliminating the need to store all data samples up to the current time. We demonstrate the successful application of the PPL criterion and its online implementation for the equalization problem by simulation examples. © 2005 Elsevier B.V. All rights reserved.

Keywords: Information-theoretic criterion; Partial likelihood; Online order selection; Parametric modeling

1. Introduction

Online order selection for signal processing applications is an important but difficult problem that has not been well studied. A truly online order estimation scheme that updates the order estimate as new samples are processed is highly desirable for real-time communications for a number of reasons. Such a procedure can potentially provide savings in training time, increase the information transmission rate, and reduce the storage requirement and computational cost significantly. The importance of order selection for signal processing has been noted (see, e.g., [7,9,10,15]); however, the literature on online order selection has been limited. In [5], a sequential Bayes learning and model selection approach is proposed and applied to radial basis function (RBF) networks. The procedure is computationally intensive


and its performance depends on the selection of diffusion parameters. In [8], a minimal resource allocation network algorithm is used to grow and prune the RBF network's hidden neurons online. Again, the selection of training parameters and thresholds is critical for the performance of the scheme.

Information-theoretic criteria determine an optimal model order for a parameterized model such that a suitable metric is minimized (or maximized), and they have been widely used in a number of signal processing applications. The two most widely used information criteria are Akaike's information criterion (AIC) [4] and the minimum description length (MDL)—or the Bayesian information criterion (BIC)—[12,13]. Both criteria can be regarded as penalized maximum likelihood criteria and assume independent and identically distributed (i.i.d.) samples in their derivations, which is not a realistic assumption for most practical applications, as correlations among samples typically do exist. Also, the formulations imply batch processing, i.e., computations where all data samples are available. In this paper, we address these problems and show how the PL formulation can be used to derive an information-theoretic criterion for order selection for the general case of dependent observations, and how it can be adapted to online processing.

Partial likelihood (PL) [6,16] is a recent extension of maximum likelihood theory, and is particularly attractive for application to problems in which time-ordering is essential or can be conveniently defined. Three aspects of maximum PL estimation are unique compared to other extensions of maximum likelihood: (i) PL can easily be characterized for dependent observations, (ii) it can tolerate missing data, and (iii) in its characterization, it allows for sequential processing, and hence is a suitable formulation for real-time applications. In [2], we show that PL is an effective framework for developing nonlinear techniques for signal processing, and in [1], we use the PL theory to establish the fundamental information-theoretic connection, i.e., to show the equivalence of likelihood maximization and relative entropy minimization without making the assumption of independent observations, which is an unrealistic assumption for most signal processing applications. This equivalence is later shown in [2] to hold for the basic class of probability models, the exponential family, which includes a large class of structures for nonlinear signal processing.

An important observation made in [11] is that PL learning is efficient and the overall performance is enhanced when the order of the parametric model—the probability density function (pdf)—is correct. In this paper, we address the order selection problem for real-time signal processing and derive the penalized partial likelihood (PPL) criterion for model order selection for the exponential family. Section 2 presents the PL function and derives the PPL. In Section 3, we consider practical issues in the implementation of the PPL for determining the true order and derive a formulation suitable for online implementation for the conditional finite normal mixtures (FNM) model. Section 4 presents simulation results, and a summary is given in Section 5.

2. Penalized partial likelihood

2.1. Partial likelihood formulation

Partial likelihood is defined as follows [2,14]: Let $\mathcal{F}_k$, $k = 0, 1, \ldots$, be an increasing sequence of $\sigma$-fields and $\{\nu_k\}$, $k = 1, 2, \ldots$, a sequence of random variables on some common probability space such that $\nu_k$ is $\mathcal{F}_k$ measurable. If the density of $\nu_k$ given $\mathcal{F}_{k-1}$ is written as $p_\theta(\nu_k \mid \mathcal{F}_{k-1})$, where $\theta$ is the parameter, the PL function relative to $\theta$, $\{\mathcal{F}_k\}$, and the data $\{\nu_k\}$ is given by

$$L_n(\theta) = \prod_{k=1}^{n} p_\theta(\nu_k \mid \mathcal{F}_{k-1}). \qquad (1)$$

Given a sequence of random variables $\{\nu_k\}$, $k = 1, 2, \ldots$, $\mathcal{F}_{k-1}$ represents the collection of all relevant events up to the discrete (time) instant $k$, i.e., it represents the history at $k$. In a signal processing problem where $k$ refers to the time index, $\mathcal{F}_k$ typically contains the observations and any additional information available about the process $\{\nu_k\}$, which can be either scalar or vector valued.

Hence, the condition $\mathcal{F}_{k-1} \subseteq \mathcal{F}_k$ for $\{\mathcal{F}_k\}$ to be an increasing sequence of $\sigma$-fields is naturally satisfied. The nested conditioning requirement is key in the development of the asymptotic theory of PL, i.e., in showing that PL possesses the desirable large sample properties of maximum likelihood [1,16]. Hence, one can use asymptotic properties of likelihood to develop adaptive-structure and robust designs by using modified likelihood functions and information-theoretic criteria.

It is important to note that under certain conditions, the PL formulation coincides with the conditional likelihood—and hence the traditional maximum likelihood—formulation. In [2], we give examples of specific cases that are relevant for most signal processing problems. For the finite normal mixtures model we consider in Section 3, the two formulations of likelihood coincide. For multilayer perceptron or nonlinear regression type modeling, however, the two formulations are different, with the conditional likelihood formulation including more information than the PL [2].

Without loss of generality, consider the following case, which is a common scenario for most signal processing applications. Given a time series $\{x_k\}$, $k = 1, 2, \ldots$, and its time-dependent covariates $\{y_k\}$, the objective is the estimation of the distribution of $y_k$ given all the available information up to time $k$. Thus, we define $\mathcal{F}_{k-1} \equiv \sigma\{1, [x_k, \ldots, x_1], [y_{k-1}, \ldots, y_1]\}$ and write the conditional distribution of $y_k$ given $\mathcal{F}_{k-1}$ as $p_\theta(y_k \mid \mathcal{F}_{k-1})$. Here $\theta$ is the parameter vector, and the PL function is now written as

$$L_n(\theta) = \prod_{k=1}^{n} p_\theta(y_k \mid \mathcal{F}_{k-1}). \qquad (2)$$
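The product form of (2) is what makes PL natural for sequential processing: each new observation contributes one factor, so the log PL can be accumulated as samples arrive. As a minimal illustration (ours, not from the paper), consider the following Python sketch, where `cond_densities` is a hypothetical iterable yielding the conditional density values $p_\theta(y_k \mid \mathcal{F}_{k-1})$ produced by whatever probability model is used as the filter:

```python
import numpy as np

def log_partial_likelihood(cond_densities):
    """Accumulate the log PL of Eq. (2) one observation at a time.

    cond_densities yields p_theta(y_k | F_{k-1}) for k = 1, ..., n;
    how each factor is computed depends on the chosen probability model.
    """
    log_pl = 0.0
    for p_k in cond_densities:
        log_pl += np.log(p_k)  # the product in (2) becomes a running sum of logs
    return log_pl
```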

We are going to use the formulation in (2) to derive the PPL in the next section.

2.2. Penalized partial likelihood for order selection

The order selection problem can be formulated as follows: Given the time series $X = \{x_1, x_2, \ldots, x_n\}$ and its time-dependent covariates $Y = \{y_1, y_2, \ldots, y_n\}$, choose the model order $m$ that best matches the observed data. The model order that is best supported by the data is given by $\max_m P(m \mid X, Y)$, or equivalently by $\max_m p(m, X, Y)$ by the Bayes rule. Suppose that the model parameter $\theta$ ranges over the parameter space $\Omega$ and the model of order $m$ has the parameter space $\omega_m$, which is a linear submanifold of $\Omega$. We have

$$p(m, X, Y) = \int_{\omega_m} p(\theta, m, X, Y)\, d\theta = \int_{\omega_m} p(X, Y \mid \theta, m)\, p(\theta, m)\, d\theta. \qquad (3)$$

Note that information about the order $m$ is contained in the parameter $\theta$. We define $p_\theta(X, Y) \equiv p(X, Y \mid \theta, m)$ to write

$$p_\theta(X, Y) = \prod_{k=1}^{n} p_\theta(y_k \mid x_k, y_{k-1}, \ldots, y_1, x_1) \cdot \prod_{k=1}^{n} p_\theta(x_k \mid y_{k-1}, x_{k-1}, \ldots, y_1, x_1). \qquad (4)$$

The first term of the product in (4) is the partial likelihood $L_n(\theta)$ in (2). For a causal system where $y_k$ is completely characterized by $\{x_k, x_{k-1}, \ldots, x_1\}$, the second term of the product in (4) becomes

$$\prod_{k=1}^{n} p_\theta(x_k \mid x_{k-1}, y_{k-1}, \ldots, y_1, x_1) = \prod_{k=1}^{n} p_\theta(x_k \mid x_{k-1}, \ldots, x_1) = p(X). \qquad (5)$$

Since $p(X)$ is independent of the model parameter $\theta$, the best model order is given by $\max_m \int_{\omega_m} L_n(\theta)\, p(\theta, m)\, d\theta$. If the family of the conditional distributions of $y_k$ belongs to an exponential family, it can be written in the form

$$p_\theta(y_k \mid \mathcal{F}_{k-1}) = \exp\bigl(a^T(\theta)\, c(y_k) - c(a(\theta))\bigr). \qquad (6)$$

In [2], we establish the fundamental information-theoretic relationship and show the equivalence of likelihood maximization and relative entropy minimization without making the assumption of independent observations. We also show that this result holds for the exponential family of distributions, which includes many important structures that can be used as nonlinear filters. The PL function in (2) can be rewritten for the exponential family as

$$L_n(\theta) = \exp\bigl(n(a^T(\theta)\,\bar{c} - c(a(\theta)))\bigr), \qquad (7)$$

where $\bar{c} = \frac{1}{n}\sum_{k=1}^{n} c(y_k)$. Let $b_m$ be the prior probability that the $m$th model is the true one and $\mu_m(\theta)$ the conditional prior distribution of $\theta$ on $\omega_m$ given the $m$th model, such that its probability density function is bounded and locally bounded away from 0 on $\omega_m$. Then, as given in [13], the model with the largest posterior probability is the one that maximizes

$$S(n, m) = \ln \int_{\omega_m} b_m \exp\bigl(n(a^T(\theta)\,\bar{c} - c(a(\theta)))\bigr)\, d\mu_m(\theta). \qquad (8)$$

Since $a$ is a function of $\theta$, we can map the submanifold $\omega_m$ in the $\theta$-coordinate space to $\hat{\omega}_m$ in the $a$-coordinate space to write

$$S(n, m) = \ln \int_{\hat{\omega}_m} b_m \exp\bigl(n(a^T(\theta(a))\,\bar{c} - c(a(\theta(a))))\bigr)\, d\mu_m(\theta(a)) = \ln \int_{\hat{\omega}_m} b_m \exp\bigl(n(a^T \bar{c} - c(a))\bigr)\, d\hat{\mu}_m(a). \qquad (9)$$

Here, $\hat{\mu}_m(a)$ is the conditional prior distribution of $a$ given the $m$th model. Assume that $\hat{\mu}_m(a)$ has a $K_m$-dimensional probability density function; from [13] we know that, for any $m$, as $n \to \infty$,

$$S(n, m) = n \max_a \bigl(a^T \bar{c} - c(a)\bigr) - \frac{K_m}{2} \ln n + R, \qquad (10)$$

where $R$ is bounded in $n$. Ignoring the last term in (10), we can write the penalized partial likelihood criterion for order selection as

$$\mathrm{PPL}(m) = \ln L_n(\theta^*) - \frac{K_m}{2} \ln n, \qquad (11)$$

where $\theta^*$ is the maximum PL estimate of the exponential model parameter and $K_m$ is the number of independently adjusted parameters for the $m$th model. The optimal model order is the one that maximizes the PPL criterion.

Note that we followed Schwarz's approach [13] in the derivation of the PPL criterion, while Rissanen [12] arrived at the same expression, the MDL criterion, from a totally different viewpoint, reformulating the problem as an information coding problem. The main difference of the PPL criterion is that, in its derivation, partial likelihood is used, and hence it provides a more general formulation that allows for dependent observations. In this context, it is also important to remember that the PL formulation can, in certain cases, coincide with the traditional conditional likelihood formulation [2]. For the cases where the two are different, it is conjectured that the PL estimator will tend to the conditional likelihood estimator when the correlation span of the observations is finite [2]. Since the derivations of MDL (or BIC) and PPL all assume large sample statistics, it is natural to observe that they coincide in the final form that they take.
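To make the use of (11) concrete, the following minimal sketch (illustrative, not from the paper) scores a set of candidate orders; `fit_log_pl` and `num_params` are hypothetical routines returning $\ln L_n(\theta^*)$ and $K_m$ for a given order:

```python
import numpy as np

def ppl(log_pl_max, K_m, n):
    """Penalized partial likelihood criterion of Eq. (11)."""
    return log_pl_max - 0.5 * K_m * np.log(n)

def select_order(candidates, fit_log_pl, num_params, n):
    """Return the order m maximizing PPL(m) over the candidate set.

    fit_log_pl(m) : ln L_n(theta*) at the maximum PL estimate for order m
    num_params(m) : K_m, the number of independently adjusted parameters
    """
    return max(candidates, key=lambda m: ppl(fit_log_pl(m), num_params(m), n))
```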

3. Online order selection with the FNM model

3.1. The FNM model for nonlinear signal processing

The observation vector $y_k$ defined in Section 2 is usually finite dimensional and includes only the last $d$ observations, such that $y_k^d = [y_k, y_{k-1}, \ldots, y_{k-d+1}]$, where the superscript is used to denote the dimensionality of the vector. Consider the general model for a causal system of memory $L$ where $y_k = u(x_k, x_{k-1}, \ldots, x_{k-L}) + \epsilon_k$. Here, $u(\cdot)$ is a linear or nonlinear function, $\epsilon_k$ is an additive zero-mean noise component, and $E\{y_k\} \equiv \mu_l = u(x_k, x_{k-1}, \ldots, x_{k-L})$. With this formulation, the distribution of $y_k^d$ depends only on $x_k^{L+d} = [x_k, x_{k-1}, \ldots, x_{k-L-d+1}]$. We can thus write $f_\theta(y_k^d \mid \mathcal{F}_{k-1}) = f_\theta(y_k^d \mid x_k^{L+d})$ for the conditional pdf in the PL formulation of (2). We assume that $x_k$ takes a value from a finite alphabet $S = \{a_1, a_2, \ldots, a_M\}$, where all symbols are equally likely, and thus map $x_k^{L+d}$ to a discrete variable $z_k$ that takes values in $\{1, \ldots, K\}$, where $K = M^{L+d}$. If we let the additive noise $\epsilon_k$ be normally distributed, we can model the observations $y_k^d$ as normally distributed with mean $\mu_l$ and covariance matrix $\Sigma_l$ when $z_k = l$, to write

$$f_\theta(y_k^d \mid z_k) = \sum_{l=1}^{K} \frac{\delta_l(z_k)}{(\sqrt{2\pi})^d\, |\Sigma_l|^{1/2}} \exp\Bigl(-\frac{1}{2}(y_k^d - \mu_l)^T \Sigma_l^{-1} (y_k^d - \mu_l)\Bigr), \qquad (12)$$

where $\delta_l(z_k) = 1$ when $z_k = l$ and 0 otherwise. This is a conditional finite normal mixtures model, since the distribution of $y_k^d$ takes the form

$$f_\theta(y_k^d) = \sum_{l=1}^{K} \frac{q_l}{(\sqrt{2\pi})^d\, |\Sigma_l|^{1/2}} \exp\Bigl(-\frac{1}{2}(y_k^d - \mu_l)^T \Sigma_l^{-1} (y_k^d - \mu_l)\Bigr), \qquad (13)$$

where $q_l$ is the prior probability that $z_k = l$.
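As an illustration (a sketch under the stated Gaussian assumptions, not code from the paper), the conditional density (12) reduces to a single Gaussian evaluation once $z_k$ selects the active component:

```python
import numpy as np

def fnm_conditional_pdf(y, z, means, covs):
    """Evaluate Eq. (12): given z_k = l, y_k^d is N(mu_l, Sigma_l).

    y     : length-d observation vector y_k^d
    z     : active component index l (determined by the transmitted symbols)
    means : list of component means mu_l
    covs  : list of component covariance matrices Sigma_l
    """
    mu, cov = means[z], covs[z]
    d = y.shape[0]
    diff = y - mu
    quad = diff @ np.linalg.solve(cov, diff)   # (y - mu)^T Sigma^{-1} (y - mu)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm
```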

3.2. Practical considerations

In a practical implementation, it is desirable to keep the sample size $N$ as small as possible while making sure that it is large enough to ensure that the estimates come close to satisfying the desirable large sample properties of maximum PL estimation. Thus, when seeking the maximum of $\mathrm{PPL}(m)$, we can start from a small order $m$ and keep increasing the order until the maximum value of PPL is recorded. While doing so, it is reasonable to let the number of samples $N$ increase proportionally with the number of model parameters. For the FNM model, the number of parameters for channel order $m$ is $K_m = dM^{m+d} + 1$ when the noise is i.i.d., so that the covariance matrix $\Sigma_l$ is given by $\sigma^2 I$ for all $l$. Thus, the sample size should increase exponentially with the order estimate, and the practical rule of having 10–20 times as many samples as free parameters—$K_m$ in this case—should provide satisfactory performance. This observation has been verified by our simulation results. Since the sample size $N$ changes with the order estimate, to discount the dependence on the number of samples we modify the PPL and introduce the modified PPL (MPPL) criterion:

$$\mathrm{MPPL}(m) = \frac{\ln L_{N_m}(\theta^*)}{N_m} - K_m \frac{\ln N_m}{2 N_m}, \qquad (14)$$
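In code, the criterion and the parameter-count rule above might look as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def fnm_num_params(m, d, M):
    """K_m = d * M**(m + d) + 1 for the FNM model with i.i.d. noise (Sigma_l = sigma^2 I)."""
    return d * M ** (m + d) + 1

def mppl(log_pl_max, K_m, N_m):
    """Modified PPL of Eq. (14): PPL normalized by the order-dependent sample size N_m."""
    return log_pl_max / N_m - K_m * np.log(N_m) / (2.0 * N_m)

# Practical rule from the text: pick N_m roughly 10-20 times K_m,
# e.g. N_m = 15 * fnm_num_params(m, d, M).
```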

where we use the indexed $N_m$ to show that the sample size varies with the model order $m$. The implementation of Eq. (14), however, requires the use of all past data samples to calculate the maximum PL value, as $\ln L_N(\theta) = \ln \prod_{k=1}^{N} f_\theta(y_k^d \mid x_k^{L+d})$, which is undesirable for real-time implementation. In the next subsection, we show that the FNM formulation for the conditional pdf leads to a convenient cancellation in the expression and a simple form for the PPL that uses the maximum PL estimates and eliminates the need to store all samples up to time $N$.

3.3. Online order selection for the FNM model

When the noise $\epsilon_k$ is i.i.d., the pdf given in (12) takes the simpler form

$$f_\theta(y_k^d \mid z_k) = \sum_{l=1}^{K} \frac{\delta_l(z_k)}{\sqrt{(2\pi\sigma^2)^d}} \exp\Bigl(-\frac{\|y_k^d - \mu_l\|^2}{2\sigma^2}\Bigr), \qquad (15)$$

where $\delta_l(z_k) = 1$ when $z_k = l$ and 0 otherwise. The log PL can be evaluated using this model as

$$\ln L_n(\theta) = \ln \prod_{k=1}^{n} f_\theta(y_k^d \mid x_k^{L+d}) = \sum_{k=1}^{n} \ln \Biggl( \sum_{l=1}^{K} \frac{\delta_l(z_k)}{\sqrt{(2\pi\sigma^2)^d}} \exp\Bigl(-\frac{\|y_k^d - \mu_l\|^2}{2\sigma^2}\Bigr) \Biggr) = \sum_{k=1}^{n} \sum_{l=1}^{K} \delta_l(z_k) \Bigl( -\frac{d}{2}\ln(2\pi\sigma^2) - \frac{\|y_k^d - \mu_l\|^2}{2\sigma^2} \Bigr) = -\frac{nd}{2}\ln(2\pi\sigma^2) - \frac{\sum_{k=1}^{n}\sum_{l=1}^{K} \delta_l(z_k)\, \|y_k^d - \mu_l\|^2}{2\sigma^2}. \qquad (16)$$

Maximizing (16) with respect to $\theta$, i.e., solving

$$\frac{\partial \ln L_n(\theta)}{\partial \mu_l} = 0, \quad l = 1, \ldots, K, \qquad \text{and} \qquad \frac{\partial \ln L_n(\theta)}{\partial \sigma^2} = 0,$$

yields

$$\mu_l^n = \frac{\sum_{k=1}^{n} \delta_l(z_k)\, y_k^d}{\sum_{k=1}^{n} \delta_l(z_k)}, \quad l = 1, \ldots, K, \qquad \text{and} \qquad \sigma_n^2 = \frac{1}{nd} \sum_{k=1}^{n} \sum_{l=1}^{K} \delta_l(z_k)\, \|y_k^d - \mu_l^n\|^2. \qquad (17)$$
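A direct batch evaluation of the closed-form estimates in (17) can be sketched as follows (illustrative code, with zero-based component indices and the assumption that every state $l$ is visited at least once):

```python
import numpy as np

def max_pl_estimates(Y, Z, K):
    """Closed-form maximum PL estimates of Eq. (17) for i.i.d. Gaussian noise.

    Y : (n, d) array whose rows are the observation vectors y_k^d
    Z : (n,)  integer array of states z_k in {0, ..., K-1}
    """
    n, d = Y.shape
    means = np.array([Y[Z == l].mean(axis=0) for l in range(K)])  # mu_l^n
    resid = Y - means[Z]                     # y_k^d - mu_{z_k}^n for every k
    sigma2 = np.sum(resid ** 2) / (n * d)    # sigma_n^2
    return means, sigma2
```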

When we insert the maximum likelihood estimates into (16), the log PL reduces to

$$\ln L_n(\theta_n) = -\frac{nd}{2}\ln(2\pi\sigma_n^2) - \frac{nd}{2}. \qquad (18)$$

If we replace $n$ in (18) by the order-indexed number of samples $N_m$, and ignore the last term in (18), which is independent of the model order $m$, we can define the online PPL (OPPL) criterion as

$$\mathrm{OPPL}(m) = -\frac{d}{2}\ln(2\pi\sigma_n^2) - \frac{K_m \ln N_m}{2 N_m}. \qquad (19)$$

In [11], we use the information-theoretic relationship of PL estimation to construct the information geometry of PL and derive an online algorithm to estimate the FNM parameters such that the PL is maximized—or the relative entropy is minimized—through information-geometric alternating projections. The large sample properties of PL guarantee that the estimates for $\mu_l$ and $\sigma^2$ converge to the maximum PL estimates $\mu_l^*$ and $\sigma^{*2}$. Note that we only need these two parameter estimates to calculate the order selection criterion given in (19), and do not need the samples up to the current time instant $n$.

For the FNM model, the number of mixture components is determined by the system memory length $L$, for a given observation vector dimension $d$. As discussed earlier, for an online implementation, it is preferable to start with the smallest order for $L_e$ and continue increasing the order until the maximum of OPPL or MPPL is detected. Again, for efficiency considerations, the value of $d$ can be set to 1 to calculate $L_e$, the estimate of the system memory. The online channel order detection scheme can be summarized as follows (a code sketch of this loop appears at the end of this subsection):

1. Initialize $L_e = 1$ and the sample size $N = N_0$.
2. Adaptively estimate the parameters $\theta$ to approximate $\theta^* = \arg\max_\theta L_N(\theta)$ for $L_e$ (and $L_e - 1$ if this step is executed for the first time), using $N$ samples.
3. Calculate $\mathrm{OPPL}(L_e)$.
4. If $\mathrm{OPPL}(L_e) < \mathrm{OPPL}(L_e - 1)$, the channel order is estimated as $L_e - 1$; stop. Else, set $L_e \leftarrow L_e + 1$ and $N \leftarrow 2N$, and go to step 2.

The steps above can be repeated for the MPPL by replacing the OPPL computations with MPPL. The difference is that in step 3, with MPPL all $N$ samples need to be reused, whereas for OPPL only $\sigma_n^2$ is needed for the computation. Also, in step 4, with OPPL $2N$ new samples are used for the next order estimate, whereas MPPL can reuse the original $N$ samples and take in $N$ new samples for the next order estimate.
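The following Python sketch mirrors steps 1–4 above (ours, not the paper's implementation); `estimate_noise_var(L_e, N)` is a hypothetical stand-in for the adaptive PL estimation of [11], returning $\sigma_n^2$ after processing $N$ samples under order hypothesis $L_e$:

```python
import numpy as np

def oppl(sigma2, K_m, N_m, d=1):
    """Online PPL of Eq. (19)."""
    return -0.5 * d * np.log(2.0 * np.pi * sigma2) - K_m * np.log(N_m) / (2.0 * N_m)

def detect_order(estimate_noise_var, num_params, N0=100, d=1):
    """Online channel order detection, steps 1-4 of Section 3.3.

    estimate_noise_var(L_e, N) : hypothetical adaptive estimator returning sigma_n^2
    num_params(L_e)            : K_m for order hypothesis L_e, e.g. d * M**(L_e + d) + 1
    """
    L_e, N = 1, N0
    # The first pass also scores L_e - 1, as required by step 2.
    prev = oppl(estimate_noise_var(L_e - 1, N), num_params(L_e - 1), N, d)
    while True:
        curr = oppl(estimate_noise_var(L_e, N), num_params(L_e), N, d)
        if curr < prev:          # step 4: maximum passed, order is L_e - 1
            return L_e - 1
        prev, L_e, N = curr, L_e + 1, 2 * N   # double the sample size and continue
```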

4. Simulation results

In this section, we present simulation results based on a channel equalization example to demonstrate the properties of the PPL criterion and its variants. The simulation results are in three groups: the first subsection demonstrates the importance of true order selection for the FNM model; the second presents results for the PPL; and the last, for the MPPL and OPPL.

4.1. Simulation results: importance of order selection for the FNM model

In [2], we show that the conditional FNM model given in (12) belongs to the exponential family and thus satisfies the fundamental information-theoretic connection for learning on the PL cost. An important property of learning on this cost function—as is the case for almost all parametric models—is that learning is very efficient and the performance is optimal when the data generation mechanism provides a perfect match to the model.

We consider the problem of channel equalization, i.e., given the noisy observations $y_k$ of the output of a channel—linear or nonlinear—with finite memory, determine the transmitted sequence. We consider the supervised case where the information on the transmitted symbols is available. The correct number of mixtures in the FNM model is determined by the channel memory length $L$ and the observation vector dimension $d$: for a given dimension $d$, the channel memory length $L$ determines the true model order.

In Figs. 1 and 2, we show the bit error rate (BER) curves for the channel characterized by $y_k = y_k^l - 0.1 (y_k^l)^2 + \epsilon_k$, where $y_k^l = x_k + 0.8x_{k-1} + 0.7x_{k-2} + 0.6x_{k-3} + 0.5x_{k-4} + 0.4x_{k-5}$. The channel order estimate $L_e$ varies from 0 to 8, while the true order is $L = 5$. To reflect the effect of the training sample size $N$, we fix $N = 5000$ for Fig. 1 and let $N$ increase exponentially with $L_e$ for Fig. 2. We can see that when the channel order is underestimated, the BER performance degrades significantly, especially at high signal-to-noise-ratio (SNR) levels. At low SNR values, the BER is not very sensitive to the order mismatch, as the observations are too noisy. When the order is overestimated, the BER performance does not change much if there are enough data samples, as is clear from Fig. 2. However, with a given sample size, performance degrades considerably because of the "curse of dimensionality" caused by overparametrization, as is evident from the BER performance for $L_e = 8$ in Fig. 1. Note that our metric for the performance evaluation is the BER, which is the ultimate performance index for many classification problems of this type.
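For reference, data for this experiment can be generated along the following lines (our sketch; the paper gives no code, and we assume a binary $\pm 1$ alphabet and an average-power SNR definition):

```python
import numpy as np

def simulate_channel(N, snr_db, rng=np.random.default_rng(0)):
    """Generate N outputs of the Section 4.1 test channel
    y_k = y_k^l - 0.1 (y_k^l)^2 + e_k."""
    h = np.array([1.0, 0.8, 0.7, 0.6, 0.5, 0.4])      # linear part, memory L = 5
    x = rng.choice([-1.0, 1.0], size=N + len(h) - 1)  # equally likely symbols (binary assumed)
    y_lin = np.convolve(x, h, mode="valid")           # y_k^l
    y_clean = y_lin - 0.1 * y_lin ** 2                # memoryless nonlinearity
    noise_var = np.mean(y_clean ** 2) / 10 ** (snr_db / 10)  # assumed SNR definition
    y = y_clean + rng.normal(scale=np.sqrt(noise_var), size=N)
    return x, y
```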

Fig. 1. BER curves when $d = 2$ and $N = 5000$.

Fig. 2. BER curves when $d = 2$ and $N$ increases exponentially with $L_e$.

4.2. Simulation results: properties of the PPL criterion

To study the properties of the PPL criterion, we again consider the channel equalization problem described in Section 4.1. We have studied a number of channels and levels of SNR; here we present the PPL curves for the linear channel $H(z) = 1 + 0.5z^{-1} + 0.3z^{-2}$ with different $d$ and $L_e$ values at SNRs of 6, 13, and 20 dB in Figs. 3, 4, and 5, respectively. For these examples, the sample size $N$ is chosen sufficiently large to show the asymptotic performance. How to further incorporate the finite sample size into consideration is another important problem. From Figs. 3–5 we can see that the PPL criterion gives the correct channel order, i.e., $L = 2$, for all values of $d$ at low and high SNRs for this example; hence we can use $d = 1$ for order selection initially to reduce the number of samples needed for estimation, as noted in Section 3.3. It is worth mentioning that the PPL curves for nonminimum phase and nonlinear channels exhibit characteristics very similar to those shown in Figs. 3–5.

Fig. 3. PPL curves for channel $H(z) = 1 + 0.5z^{-1} + 0.3z^{-2}$ when SNR = 6 dB and sample size $N = 100{,}000$.

Fig. 4. PPL curves for channel $H(z) = 1 + 0.5z^{-1} + 0.3z^{-2}$ when SNR = 13 dB and sample size $N = 100{,}000$.

Fig. 5. PPL curves for channel $H(z) = 1 + 0.5z^{-1} + 0.3z^{-2}$ when SNR = 20 dB and sample size $N = 100{,}000$.

4.3. Simulation results: properties of MPPL and OPPL criteria

We study the properties of the MPPL and OPPL criteria at different SNRs for several channels in the channel equalization problem. The MPPL and OPPL criteria demonstrate very similar performance. For the channel $H(z) = 1 + 0.5z^{-1} + 0.3z^{-2}$, as shown in Fig. 6, the OPPL/MPPL criterion gives the correct channel order $L = 2$ at all SNR levels. For the channel $H(z) = 1 + 0.5z^{-1} + 0.4z^{-2} + 0.3z^{-3} + 0.2z^{-4} + 0.1z^{-5}$ shown in Fig. 7, the correct order $L = 5$ is determined at high SNRs, and there is slight underestimation at low SNRs.

Fig. 6. OPPL/MPPL curves for channel $H(z) = 1 + 0.5z^{-1} + 0.3z^{-2}$.

Fig. 7. OPPL/MPPL curves for channel $H(z) = 1 + 0.5z^{-1} + 0.4z^{-2} + 0.3z^{-3} + 0.2z^{-4} + 0.1z^{-5}$.

It is interesting to note the effect of noise as a function of the multipath component with the lowest power. When the noise is high relative to the lowest multipath component, ignoring this component does not degrade the BER performance much. In this case, the selection of an "effective channel order" as noted in [10] may be more meaningful.

We tested the online order selection scheme for a number of linear and nonlinear channels, as well as channels that are minimum phase and nonminimum phase. The correct channel order is obtained for most of the tested channels. For example, at all SNR levels, our scheme gives $L_e = 1$ for the linear channel $y_k = x_k + 0.5x_{k-1} + \epsilon_k$ within 100 samples with MPPL and 150 samples with OPPL; $L_e = 2$ for the nonlinear channel $y_k = y_k^l - 0.2(y_k^l)^2 + \epsilon_k$, where $y_k^l = x_k + 0.5x_{k-1} + 0.3x_{k-2}$, within 200 samples with MPPL and 350 samples with OPPL; and $L_e = 2$ for the nonminimum phase nonlinear channel $y_k = y_k^l - 0.2(y_k^l)^2 + \epsilon_k$, where $y_k^l = 0.3482x_k + 0.8704x_{k-1} + 0.3482x_{k-2}$, within 200 samples with MPPL and 350 samples with OPPL. For a channel with longer memory, $y_k = x_k + 0.5x_{k-1} + 0.4x_{k-2} + 0.3x_{k-3} + 0.2x_{k-4} + 0.1x_{k-5} + \epsilon_k$, the criterion yields the order estimate $L_e = 5$ within 1600 samples with MPPL and 3150 samples with OPPL when SNR $\geq 10$ dB, and $L_e = 4$ within 800 samples with MPPL and 1550 samples with OPPL when SNR $< 10$ dB.

5. Discussions

We address the problem of order selection for nonlinear adaptive signal processing, and derive a penalized likelihood criterion for order selection by drawing on the PL theory. We consider the FNM model as the nonlinear filter, which is intimately related to the RBF network, and note the importance of order selection for this problem when the model is used to describe the output of a system with memory. For the FNM model, we show that an online version of the derived PPL criterion can be obtained. We demonstrate the successful application of the criterion by using examples in channel equalization.

We study the performance of the PPL and note the tradeoffs involved in the two versions of the criterion introduced for the FNM model, the MPPL and the OPPL. The order selection scheme with the MPPL criterion reuses all the previous data samples to calculate the maximum PPL value for the new order estimate. In that case, the number of total data samples used to detect the channel order is smaller than that needed by the OPPL criterion. However, the MPPL scheme requires the storage of all the data samples, while OPPL uses all samples only once.

The problem we consider in this paper is adaptive signal processing with nonlinear filters. Kernel-type filters such as those based on the FNM model are attractive for this task, and provide satisfactory performance for a wide class of signal processing problems described by the general finite memory system introduced in this paper. However, it is worth noting that with the increase of the system memory $L$, the complexity of these types of models increases substantially. For an online order selection scheme, this implies a substantial increase in the number of samples required to effectively estimate the model order. In such cases, the use of posterior-type modeling with logistic regression type models, rather than conditional density modeling—which typically implies kernel-type filters—can be more attractive [1–3].

The sensitivity of information-theoretic criteria such as the AIC and MDL has been studied for order selection of a linear model, and their limitations have been noted [7,9,17]. This is primarily the case when the noise is colored, causing the noise eigenvalues to be dispersed. It is important to study the properties of information-theoretic criteria for nonlinear models such as the FNM, especially when implemented for signal processing applications. It is also a promising direction to investigate other approaches to order selection, such as the use of numerical analysis arguments to derive more robust criteria, as demonstrated in [10].

References

[1] T. Adalı, X. Liu, M.K. Sönmez, Conditional distribution learning with neural networks and its application to channel equalization, IEEE Trans. Signal Process. 45 (4) (April 1997) 1051–1064.
[2] T. Adalı, H. Ni, Partial likelihood for real-time signal processing, IEEE Trans. Signal Process. 51 (1) (January 2003) 204–212.
[3] T. Adalı, H. Ni, B. Wang, Partial likelihood for estimation of multi-class posterior probabilities, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, Phoenix, AZ, March 1999, pp. 1053–1056.
[4] H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control AC-19 (December 1974) 716–723.
[5] C. Andrieu, N.D. Freitas, Sequential Monte Carlo for model selection and estimation of neural networks, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.
[6] D.R. Cox, Partial likelihood, Biometrika 62 (1975) 69–72.
[7] P.M. Djuric, A model selection rule for sinusoids in white Gaussian noise, IEEE Trans. Signal Process. 44 (7) (July 1996) 1744–1751.
[8] P.C. Kumar, P. Saratchandran, N. Sundararajan, Communication channel equalization using minimal radial basis function neural networks, in: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (NNSP), Cambridge, England, September 1998, pp. 477–485.
[9] A.P. Liavas, P.A. Regalia, On the behavior of information-theoretic criteria for model order selection, IEEE Trans. Signal Process. 49 (8) (August 2001) 1689–1695.
[10] A.P. Liavas, P.A. Regalia, J.P. Delmas, Blind channel approximation: effective channel order determination, IEEE Trans. Signal Process. 47 (12) (December 1999) 3336–3344.
[11] H. Ni, T. Adalı, B. Wang, X. Liu, A general probabilistic formulation for supervised neural classifiers, J. VLSI Signal Process. Systems 26 (1/2) (August 2000) 141–153.
[12] J. Rissanen, Modeling by shortest data description, Automatica 14 (1978) 465–471.
[13] G. Schwarz, Estimating the dimension of a model, Ann. Statist. 6 (2) (1978) 461–464.
[14] E. Slud, Partial likelihood for continuous-time stochastic processes, Scand. J. Statist. 19 (1992) 97–109.
[15] M. Wax, T. Kailath, Detection of signals by information theoretic criteria, IEEE Trans. Acoust., Speech, Signal Process. ASSP-33 (April 1985) 387–392.
[16] W.H. Wong, Theory of partial likelihood, Ann. Statist. 14 (1986) 88–123.
[17] W. Xu, M. Kaveh, Analysis of the performance and sensitivity of eigendecomposition-based detectors, IEEE Trans. Signal Process. 43 (June 1995) 1413–1426.