Statistical Analysis of Functional Data

Statistical Analysis of Functional Data

Statistical Analysis of Functional Data F Ferraty, P Sarda, and P Vieu, Toulouse University, Toulouse, France ã 2010 Elsevier Ltd. All rights reserved...

806KB Sizes 0 Downloads 216 Views

Statistical Analysis of Functional Data F Ferraty, P Sarda, and P Vieu, Toulouse University, Toulouse, France ã 2010 Elsevier Ltd. All rights reserved.

In many applications, the object of interest is a function X ¼ X(t) depending continuously on a variable t which may be time, a vector of spatial position, etc. In practice, for each individual i, one observes the function X on a grid t1, t2, . . ., tp, and the statistical sample contains the values of Xi (t1), . . ., Xi (tp) for i ¼ 1, . . ., n (i.e., n individuals). One encounters such data, for instance, in longitudinal studies (when the measurements t1, t2, . . . are dates), in chemistry (i.e., t1, t2, . . . could be wavelengths), in image analysis (i.e., t1, t2, . . . are a two or three-dimensional vector), etc. Finally, the collected data consist in either a population of discretized curves (spectrometric curves, radar waveforms, electricity consumptions, growth curves, digitized voice recordings, etc.), a collection of surfaces, multidimensional images (digitized pictures, satellite images, multispectral images, hyper-spectral images, times series of images), or a set of any other higher dimensional (discretized) functions. With the development of modern technology, the data are recorded on finer and finer grids, and so one can consider the observed data in a continuous manner. Such data are called functional data, and a new branch of statistics, called functional data analysis (FDA), has emerged with the aim of developing models and methodologies that are able to study such high dimensional data and integrate specific information. The terminology functional variable is also used to indicate a statistical variable whose observations lead to functional data. This recent statistical topic is motivated by numerous domains of applications: biology, biomechanics, chemometrics, computer sciences, ecology, econometrics, medical sciences, industry, etc. Problems encountered in FDA are of the same nature as in general statistics: analyze variability and find structures (factorial analysis, classification, etc.) in a set of functional data, predict a (functional or scalar) variable depending on the variations of a functional predictor (regression), depicting a distribution of functional data, identifying outliers, etc. The reader can find a good overview on FDA methodologies in Ramsay and Silverman (2005) and several case studies in Ramsay and Silverman (2002). A more recent book on FDA from Ferraty and Vieu (2006) deals with general statistical models and discusses the estimation of a distribution of a population of curves and some related features. Functional data can also be studied in the setting of time series (see Bosq, 2000). Why developing specific statistical strategies for functional data? The starting point of FDA comes from the inability of standard multivariate technics to extract all

the functional richness of such high dimensional data. For instance, a standard multivariate linear regression generally fails when one aims to predict a scalar response from a functional explanatory variable. Thus, our goal is to present the usefulness of FDA as well as its potentialities. This contribution is organized as follows. We start with a general description of functional data by means of several real examples coming from a wide scope of applied scientific fields (chemometrics, computer sciences, geophysics). Then we discuss why the standard multivariate linear regression (through a spectrometric functional data set) fails when one considers an explanatory functional variable. In a third section, one proposes to give an idea on what one can do with FDA by giving a sample of functional statistical studies (unsupervised classification, regression on functional variable, curves discrimination, and functional prediction). Our main goal is to stress situations where FDA is potentially pertinent. We conclude by opening some motivating perspectives for educational purposes.

A Few Functional Data Sets and Problematics Let us first introduce some illustrating examples depicting what is called functional data. The first one comes from the food industry and will be the unifying thread throughout this article. It concerns a quality control problem. One has at hand 215 small pieces of meat (i.e., 215 individuals). Each piece of meat is analyzed through a Tecator Infratec Food and Feed Analyzer. This apparatus allows to obtain, for each unit, one spectrometric curve (i.e., near-infrared spectrum in the wavelength range 850– 1050 nm). Finally, one collects 215 spectrometric curves (i.e., {Xi (t); t 2 [850, 1050]}i=1, . . ., 215), one per piece of meat. Of course, in reality, one has at hand 215 discretized curves at 100 measurements t1, . . ., t100 (i.e., for the ith unit one gets the discretized curves Xi (t1), . . ., Xi (t100)). However, the difference between two consecutive grid points (i.e., |tj  tj+1|) is so small that the discretized curves (see Figure 1) can be considered as continuous ones as in Figure 2. At this stage, several questions could interest the practitioners: Can we explore this population of curves? Can we extract pertinent informations? Is it possible to build several groups in order to classify such spectrometric curves? Can we depict the distribution of such a population of curves? (median curve?, modal curve?, etc.).

417

418

Statistics

Other interesting questions arise if there are additional variables. Indeed, throughout an analytical chemical process, it is possible to obtain, for each piece of meat, the fat content (i.e., a scalar reponse Y ), and one has to study the sample {Xi (.), Yi}i=1,. . .,215 where Xi (.) refers to the curve {Xi (t); t 2 [850, 1050]}. A natural question could be: can we explain the Yi’s from the Xi (.)’s? Or, in other words, can we predict the unknown fat content for an incoming new curve X216(.)? In the chemometrician community, this type of problem is called calibration problem (see Martens and Naes, 1989, for a good overview on this topic). The practical interest of finding a solution to this regression problem resides in the fact that the analytical chemical process to find Yi ’s is much more expensive than the use

Absorbances

5

4

3

2 850

900

950 Wavelenghts

1000

Figure 1 Sample of 15 discretized curves.

1050

of a spectrometer to find Xi(.)’s. From a statistical point of view, this is clearly a regression problem where the explanatory variable is the functional one X(.) and the response is the scalar one Y. Another problem could emerge if one splits our population of curves into several groups. For instance, assume that a first group contains the curves with a fat content smaller than 20% and the other with a fat content greater than 20%. So, the class membership for each curve is known. A usual question in such a setting could be the following one: for a new curve X216(.), are we able to assign this curve to a group? Typically, this is a supervised curve classification (or curve discrimination) problem. One can find numerous studies dealing with functional data sets and devoted to similar or new problematics. For instance, a curve discrimination arises with the following speech recognition data. This data set contains the digitized recordings of speakers of 32 ms duration and concerns five speech frames corresponding to five phonemes transcribed as follows: ‘sh’ as in ‘SHe’ (group 1), ‘iy’ as in ‘shE’ (group 2), ‘dcl’ which is the symbol for coding the sound D as in ‘Dark’ (group 3), ‘aa’ as the vowel in ‘dArk’ (group 4) and ‘ao’ as the first vowel in ‘wAter’ (group 5). Finally, for each unit, one observes a log-periodogram which is a discretized curve sampled on a fine grid (150 measurements). Figure 3 displays a sample of 10 logperiodograms for each phoneme (i.e., group). The question of interest is how to determine the class membership of an incoming logperiodogram, thus leading to a problem of (supervised) classification of a set of curves. The third example focuses on the monthly time series of sea surface temperature (El Nin˜o) from June 1950 to

Absorbances

5

4

3

2 850 Figure 2 215 spectrometric curves.

900

950 Wavelengths

1000

1050

Statistical Analysis of Functional Data

419

iy

sh 20

20

15

15

10

10 5

5 0

50

100

0

150

50

Frequencies dcl

100

150

100

150

Frequencies aa 25

15

20 10

15

5

10

0

5 0

50

100

150

Frequencies ao

0

50 Frequencies

25 20 15 10 5 0

50

100

150

Frequencies Figure 3 Sample of log-periodograms.

May 2004. The temperatures correspond to an average over the area defined by the coordinates 0–10 South and 90–80 West. Figure 4 displays this time series, whereas Figure 5 plots the same data but year by year. In this way, the whole time series is cut into 54 dependent curves (one per year), which can be considered as functional data. Some interesting questions arising in such a situation are the following: Can we forecast future values? Can we predict the maximum temperature for the next year from the previous ones?, etc. This forecasting setting can be viewed as a regression problem with scalar response and functional predictor, but with nonindependent statistical sample. In all cases, solutions of the presented statistical problems can be found in the recent literature on Food and Drug Administration (FDA).

Why Do Multivariate Methods Fail with Functional Data? We explain at this point why a standard multivariate method for studying functional data could fail. Let us focus on the spectrometric data presented before. A standard mutlivariate

point of view would treat each discretized spectrometric curve as one observation of the 100 variables X(t1), . . ., X(t100). Consequently, a standard multivariate linear model for explaining P the response Y from the 100 variables is given by Y ¼ 100 j ¼1 aj X ðtj Þ þ error. In such a statistical situation, one has ‘to’ estimate the 100 unknown parameters a1, . . ., a100. Clearly, two crucial problems arise. First high correlation could appear between variables X(tj) and X ðtj 0 ) for j and j 0 close enough (i.e., corrðX ðtj Þ; X ðtj 0 ÞÞ ’ 1Þ. To see that, one considers a zoom on the discretized spectrometric curves (Figure 1) by focusing on the first 10 measurements (see Figure 6). Indeed, Figure 6 shows that the observations obtained for consecutive measurements are very similar. This results in collinearity in the regression, and hence, to high standard error of estimates. The second important weakness is that the number of variables can be large in comparison with the sample size. Once again, this leads to poor estimates. Although some works have been proposed to adapt multivariate methods to this situation, a successful way to overcome the problems described before is to consider the functional data as a functional mathematical object (i.e., Xi (t1), . . ., Xi (tp) is the discretization of the random

420

Statistics

Sea surface temperature

28

26

24

22

20

1950

1960

1970

1980 Dates

1990

2000

Figure 4 Monthly sea surface temperature.

Sea surface temperature

28

26

24

22

20

Jun

Jul

Aug

Sep

Oct

Nov

Dec

Jan

Feb

Mar

Apr

May

Figure 5 Temperatures year by year.

function {Xi (t); t 2 T }). Unlike multivariate approaches, it is possible to operate on such a functional object some standard mathematical calculus such as differentiation and integration, whereas more sophisticated techniques such as approximation and smoothing can be useful.

developing statistical methods that are able to take into account the richness of the functional features of such data by using functional mathematical tools. For instance, if one considers the sample {Xi (.), Yi}i=1, . . ., n, the functional linear regression model is defined by Z

Yi ¼

rðt ÞXi ðt Þ þ errori ; i ¼ 1; . . . ; n

Some Modeling Aspects The main goal of this section is to give some basic ideas on modeling functional data. FDA techniques consist in

(instead of its multivariate version Pp Yi ¼ j ¼1 aj Xi ðtj Þ þ errori ). A useful mathematical tool consists in expanding the functional parameter r(.)

Statistical Analysis of Functional Data

421

10rst measurements

4.0

Absorbances

3.5

3.0

2.5

X(t1)

X(t2)

X(t3)

X(t4)

X(t5)

X(t6)

X(t7)

X(t8)

X(t9) X(t10)

Wavelengths Figure 6 Zoom on spectrometric curves.

P onto some fixed function basis (i.e., rð:Þ ¼ Kk¼1 bk Bk ð:Þ) and also in introducing some smoothness constraints on r(.) through a penalized least-squares criterion: Cr ¼

X

2

Z

Yi 

rðt ÞXi ðt Þdt

þ lPðrÞ

i

where P(r) measures the smoothness of the function r (e.g., the L2-norm of its second derivative) and l controls the regularity (the larger the l, more smooth is the functional parameter). An estimator r^ð:Þ is such that r^ð:Þ ¼ minr Cr . If one considers the expansion of r(.) in terms of the basis set, minimizing over r amounts to minimizing over the vector of coefficients b ¼ (b1, P . . ., bK). Let b^ be the minimizer of Cb over b: r^ð:Þ ¼ k b^k Bk ð:Þ. From a statistical modeling point of view, the functional linear

regression model assumes a linear relationship between the functional variable and the scalar response. Contrary to the multivariate case, one has no graphical tool that is able to display the relationship between a functional variable and a scalar one. In such a context, assuming linearity becomes critical and an alternative consists in proposing more general models, called nonparametric functional models, allowing us to investigate nonlinear relationship. So, a nonparametric functional model is defined by Yi ¼ RðXi Þ þ errori ; i ¼ 1; . . . ; n

where the regression operator R(.) (which has to be estimated) is assumed to satisfy some mathematical smoothness properties (continuous, lipschitzian, etc.). The linear

422

Statistics

regression model R can be viewed as a particular case by taking RðuÞ ¼ rðt Þuðt Þdt. Concerning the estimation of R(.), kernel-type estimator ideas have been extended to the functional setting: ^ ¼ RðxÞ

n X

Yi wi;h ðxÞ

i¼1

with " 1

wi;h ðxÞ ¼ K ðh dðXi ; xÞÞ=

n X

# 1

K ðh dðXj ; xÞÞ

j ¼1

This P estimator is just a weighted average of the responses (i.e., i wi;h ðxÞ ¼ 1). K is a positive decreasing function, d(. , .) measures the proximity between two curves (distance, semi-metric, etc.), the smoothing parameter h, ^ called band-width, controls the regularity of Rð:Þ. So, the principle of such a kernel estimator is very simple: closer x is to Xi, larger is the weight assigned to Yi (i.e., wi,h(x)). Both the bandwidth and the index of proximity d(. , .) have to be selected (generally through a criterion based on the quality of the predictions). It is clear that the main feature of the kernel-type estimator is to focus on local properties of the distribution of the functional variable. From a probabilistic point of view, one studies this local behavior through the small balls probabilities defined by Prob (d(Xi, x)  h). It is worth noting that kernel estimator ideas can be applied to settings other than the regression one. For example, the simple kernel-type estimator f^ðxÞ ¼ const 

n X

K ðh 1 dðXi ; xÞÞ

i¼1

is a useful statistical tool for deriving what we call the modal curve (i.e., xmod ¼ supx f^ðxÞ). One can also define the Pn median curve as the one which minimizes i¼1 dðXi ; xÞ over x. These distribution features are good tools for depicting samples of curves and an interesting by-product concerns the unsupervised classification. Several other statistical methodologies such as generalized functional linear model, conditional distribution given a functional variable (conditional mode, conditional quantiles, etc.), curves discrimination, survival analysis with functional explanatory variable, testing procedures, etc., have been investigated.

FDA in Action Spectrometric Data It is time to give an idea on what one can do with FDA. At first, let us consider the functional data concerning the population of spectrometric curves. We investigate the

local behavior (i.e., small ball probabilities evaluated at each curve Xi(.)) of the sample of curves in order to split it into several subgroups. The homogeneity of a subgroup is computed by means of features like median curve, modal curves, etc., and a general homogeneity index is computed for the obtained partition. Finally, an automatic unsupervised classification method for curves is built (the algorithm stops as soon as the gain in terms of homogeneity between two successive partitions is smaller than a threshold). For the spectrometric curves, the proximity index d is based on their second derivatives and defined by qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi R 00 00 dðXi ; Xj Þ ¼ ðXi ðt Þ  Xj ðt ÞÞ2 dt ; the derivatives are calculated by using a numerical approximation. This choice of d is based on the experience of chemometricians and also on a data-driven procedure. The result of this functional classification based on this index leads to three groups. Figure 7 plots for each obtained group, the original curves, their second derivatives, and the corresponding modal curve (the curve of the sample which maximizes a kernel-type estimator of the distribution function of the population of curves). This example gives a good idea on the usefulness of FDA methods. In particular, modal curves are very interesting features for depicting a distribution of curves. Now, if one takes into account the responses (i.e., the fat content), the problematic changes as one can try to predict the fat content from the spectrometric curve. This is typically a regression with scalar response and explanatory functional variable. Precursor works as regression on principal components, ridge regression, or partial least-squares extended standard multivariate methods. Recent advances in FDA extended these methodologies by exploiting systematically mathematical background of function spaces. In particular, a nonparametric FDA (with kernel estimator) allows to make fat content predictions as the one plotted in Figure 8. The solid line displays a perfect prediction: the proximity of the points with this line indicates a good accuracy of the predictions obtained with FDA. Instead of collecting the accurate fat content, one can have at hand only two groups (G1 and G2). For instance, the first one contains the curves corresponding to a fat content smaller than 20% and the second one the other curves. This is typically a curve discrimination (or supervised classification) problem: for an incoming curve Xn+1, one has to assign it to one group. Once again, one can find in FDA several methods for discriminating these spectrometric curves into these two groups. Such supervised curves classification methodologies can give very low rates of misclassification. For instance, if one implements the kernel estimator of the posterior probabilities Prob(Yn+1 2 Gk|Xn+1), one assigns the new curves Xn+1 to the class of highest estimated posterior probability. This curve discrimination method leads us to a rate of around 2%; one affects (on average) to the right group 98 spectrometric curves over 100.

Statistical Analysis of Functional Data

Group 1: second derivatives

Group 1

150

5.0

100

100

50

50

4.0 3.5

Absorbances

150

Absorbances

Absorbances

Group 1: modal curve

5.5

4.5

0 −50

423

0 −50

3.0 −100

−100

−150

−150

2.5 2.0 850

850

900 950 1000 1050 Wavelengths Group 2

850

Group 2: second derivatives 150

5.0

100

100

50

50

Absorbances

3.5

Absorbances

150

4.0

0 −50

900 950 1000 1050 Wavelengths Group 2: modal curve

5.5

4.5 Absorbances

900 950 1000 1050 Wavelengths

0 −50

3.0 −100

−100

−150

−150

2.5 2.0 850

900 950 1000 1050 Wavelengths

850

Group 3

150

5.0

100

100

50

50

Absorbances

3.5

Absorbances

150

4.0

50 −50

900 950 1000 1050 Wavelengths Group 3: modal curve

Group 3: second derivatives

5.5

4.5 Absorbances

850

900 950 1000 1050 Wavelengths

0 −50

3.0 −100

−100

−150

−150

2.5 2.0 850

900 950 1000 1050 Wavelengths

850

900 950 1000 1050 Wavelengths

850

900 950 1000 1050 Wavelengths

Figure 7 Unsupervised curves classification.

Phoneme Data

El Nin˜o Time Series

If one focuses on the phoneme data set (i.e., logperiodograms) which is also a curves discrimination problem, one can get a rate of misclassification around 8% (according to the functional methodology implemented).

Let us consider now the time series El Nin˜o. In this forecasting situation, FDA allows one to predict future values in an accurate way. Figure 9 gives an idea of the pertinence of using such statistical methods. The solid

424

Statistics

26 40

Predicted values

25

30

24

23 20

22

10

21

10

20 30 Observed values

40

Figure 8 Observed responses vs predicted ones.

20

Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May 54th year Figure 9 Forecasted/observed values.

line corresponds to the observed values for the last year available in the used El Nin˜o sample and the dashed one displays the forecasted temperatures. Of course, the previous examples are not an exhaustive presentation on what one can do with FDA. There are several other settings not considered here. An important one occurs when a functional data set proposes a population of curves with measurements changing from one unit to another one (i.e., one has at hand the values Xi ðti1 Þ; . . . ; Xi ðtipi Þ, where the measurements ti1 ; ti2 ; . . . ; tipi depend on the individual i ). This means that, from a standard multivariate point of view, the variables Xi ðti1 Þ; . . . ; Xi ðtipi Þ (even their number) are not the same from one individual to another one, which prevents the use of any standard multivariate statistical analysis. Such data sets are called unbalanced functional data. For instance, this situation can occur when the starting point of the measurements is not the same for each unit. In order to analyze such data sets, preprocessing procedures like curves registration methods or time-warping techniques have been developed. Another interesting case of functional data appears when, for each curve, one gets a few measurements. In such a situation, one says that the functional data are sparse. Such functional data sets require specific methodologies. Other useful developments covering a large

number of statistical situations and extending the generalized linear model to the functional data setting have been investigated. Good recent overviews on FDA methodologies as well as potentialities in terms of applications can be found in special issues of various statistical journals entirely devoted to this area: Davidian et al. (2004), Gonza´lez-Manteiga and Vieu (2007), and Valderrama (2007) (see also the recent proceedings edited by Dabo-Niang and Ferraty, 2008).

Conclusion FDA covers a wide area of statistics. Indeed, according to recent works on this topic, one can imagine that several standard statistical methods will be extended in order to take into account functional data. One can also imagine the huge potential of such new statistical methodologies in terms of applications. Indeed, the newness of FDA implies that numerous fields of potential applications are underdeveloped or not investigated up to now. This is the case about educational data. However, phoneme

Statistical Analysis of Functional Data

functional data could be a good example in relation with educational domain. It is clear that one can obtain measurements on children and with FDA techniques, identify some of them with problems such as speech defects. Any study which allows one to get a continuous trajectory for each individual can be analyzed through functional data methods. It is clear that we have given here a concise overview on FDA in a pedagogical manner. Therefore, only functional data as curves have been presented. In order to complete this curve-based point of view on functional data, it is worth mentioning that FDA can deal with populations of any more high dimensional objects. For instance, with the progress of technologies, it is now possible to collect samples of images (2D-images, multispectral images, hyper-spectral images, etc.). These result in spatiofunctional data. This promising recent area in terms of applications is underdeveloped, but certainly future investigations will enlarge our knowledge in this domain.

Bibliography Bosq, D. (2000). Linear Processes in Function Spaces. Theory and Applications. Lecture Notes in Statistics, 149. New York: Springer. Dabo-Niang, S. and Ferraty, F. (2008). Functional and Operatorial Statistics. Heidelberg: Physica. Davidian, M., Lin, X., and Wang, J. L. (2004). Introduction to the emerging issues in longitudinal and functional data analysis (with discussion). Statistca Sinica 14(3), 613–629. Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Theory and Practice. Springer Series in Statistics. New York: Springer. Gonzale´z-Manteiga, W. and Vieu, P. (2007). Introduction to the special issue on statistics for functional data. Computational Statistics and Data Analysis 51(10), 4788–4792. Martens, H. and Naes, T. (1989). Multivariate Calibration. New York: Wiley. Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis. Methods and Case Studies. Springer Series in Statistics. New York: Springer. Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer Series in Statistics, 2nd edn. New York: Springer.

425

Valderrama, M. (2007). Introduction to the special issue on modelling functional data in practice. Computational Statistics 22(3), 331–334.

Further Reading Crambes, C., Kneip, A., and Sarda, P. (forth). Smoothing spline estimators for functional linear regression. Annals of statistics. Ferraty, F. and Vieu, P. (2006). NPFDA R/S+ library. http://www.lsp. ups-tlse.fr/staph/npfda (accessed August 2009). Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35(2), 109–148. Gasser, T. and Kneip, A. (1995). Searching for structure in curve samples. Journal of the American Statistical Association 31, 1179–1188. Hastie, T., Buja, A., and Tibshirani, R. (1995). Penalized discriminant analysis. Annals of statistics 23(1), 73–102. James, G. M., Hastie, T. J., and Sugar, C. A. (2000). Principal component models for sparse functional data. Biometrika 87(3), 587–602. Ramsay, J. O. (2005). FDA Matlab/R/S+ library. http://www.psych. mcgill.ca/misc/fda (accessed August 2009). Ramsay, J. O. and Li, X. (1998). Curve registration. Journal of the Royal Statistical Society: Series B 60(2), 351–363. Yao, F., Mu¨ller, H. G., and Wang, J. L. (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100(470), 577–590.

Relevant Websites http://www.lsp.ups-tlse.fr – Activities of the STAPH Working Group on Functional and Operatorial Statistics. http://ego.psych.mcgill.ca – McGill University, Functional Data analysis, a useful, great and pioneer website for studying functional data with lots of complementary methods (see the excellent monographs of J. Ramsay and B. Silverman). http://www.r-project.org – The R Project for Statistical Computing, website devoted to R (useful language and environment for statistical computing and graphics; available as free software). http://www.lsp.ups-tlse.fr – website oriented towards practitioners for implementing nonparametric functional (companion website of the book Nonparametric Functional Data Analysis of F. Ferraty and P. Vieu).