Regression imputation in the functional linear model with missing values in the response


Journal of Statistical Planning and Inference xxx (xxxx) xxxx


Christophe Crambes (a), Yousri Henchiri (b,c,∗)

(a) Institut Montpelliérain Alexander Grothendieck (IMAG), Université de Montpellier, France
(b) Université de la Manouba, Institut Supérieur des Arts Multimédia de la Manouba (ISAMM), Tunisia
(c) Université de Tunis El Manar, École Nationale d'Ingénieurs de Tunis, LR99ES20, Laboratoire de Modélisation Mathématique et Numérique dans les Sciences de l'Ingénieur (ENIT-LAMSIN), B.P. 37, 1002 Tunis, Tunisia

Article info

Article history: received 10 May 2018; received in revised form 8 October 2018; accepted 13 December 2018; available online xxxx.

MSC: 62G20; 62J05; 62F12; 62P12

Abstract

We are interested in functional linear regression when some observations of the real response are missing, while the functional covariate is completely observed. A complete-case regression imputation method for the missing data is presented, using functional principal component regression to estimate the functional coefficient of the model. We study the asymptotic behavior of the error when the missing data are replaced by the regression imputed values, in a 'missing at random' framework. The completed database can be used to estimate the functional coefficient of the model and to predict new values of the response. The practical behavior of the method is also studied on simulated datasets. A real dataset illustration is performed in the environmental context of air quality.

© 2018 Elsevier B.V. All rights reserved.

Keywords: Functional linear model; Missing data; Missing at random; Principal components regression; Mean square prediction error

1. Introduction


The literature on functional data is vast, as attested by the numerous books devoted to this subject in recent years. The estimation and forecasting theory of linear processes in function spaces is developed in Bosq (2000). A comprehensive introduction to functional data analysis can be found in Ramsay and Silverman (2005). Ferraty and Vieu (2006) focus on nonparametric approaches. Computational issues are explained in Ramsay et al. (2009). Nonparametric statistical methods for functional regression analysis, specifically methods based on a Gaussian process prior in a functional space, are discussed in Shi and Choi (2011). In Horváth and Kokoszka (2012), inferential procedures based on functional principal components are considered. Zhang (2014) mainly focuses on hypothesis testing problems for functional data. Within this literature, the functional linear model has received special attention (see Ramsay and Dalzell, 1991; Cardot et al., 1999, 2003; Cai and Hall, 2006; Hall and Horowitz, 2007; Crambes et al., 2009; Cardot and Johannes, 2010; Yuan and Cai, 2010 for the main references). In this paper, we are interested in the functional linear model

Y = ⟨θ, X⟩ + ε,   (1)

∗ Corresponding author at: Université de la Manouba, Institut Supérieur des Arts Multimédia de la Manouba (ISAMM), Tunisie. E-mail addresses: [email protected] (C. Crambes), [email protected] (Y. Henchiri). https://doi.org/10.1016/j.jspi.2018.12.004 0378-3758/© 2018 Elsevier B.V. All rights reserved.

Please cite this article as: C. Crambes and Y. Henchiri, Regression imputation in the functional linear model with missing values in the response. Journal of Statistical Planning and Inference (2018), https://doi.org/10.1016/j.jspi.2018.12.004.


where θ is the unknown functional coefficient of the model, Y is a real variable of interest, ε is a centered real random variable representing the error of the model, with finite variance E(ε²) = σ_ε², and X is a functional covariate belonging to some functional space H endowed with an inner product ⟨·,·⟩ and its associated norm ∥·∥. Usually, H is the space L²([a,b]) of square integrable functions defined on some real compact interval [a,b], and the corresponding inner product is defined by ⟨f, g⟩ = ∫_a^b f(t)g(t) dt for functions f, g ∈ L²([a,b]). Without loss of generality, we work on [0,1]. Moreover, we assume that X and ε are independent.

All the previously cited works are devoted to the analysis of complete data; however, complete data are not available in many interesting applications, including for example survival data analysis. For this reason, we focus in this work on the problem of missing data (see Little and Rubin, 2002; Graham, 2012 for a wide introduction in the multivariate framework). This subject has been widely studied, in particular the way to impute missing data and the accuracy of this imputation according to the type of missingness: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). Although this problem has received a lot of attention in the multivariate framework, this is not the case in the functional data framework. Our objective is to study the problem of combining regression imputation, missing data mechanisms and functional data analysis. As far as we know, few results are available at the moment. In the MAR setting, He et al. (2011) explored this area by developing a functional multiple imputation approach modeling a missing longitudinal response under a functional mixed effects model; they developed a Gibbs sampling algorithm to draw model parameters and imputations for the missing values. Besides, Ferraty et al. (2013) considered two kinds of mean estimates of a scalar outcome, based on a sample in which an explanatory variable is observed for every subject while responses are missing (which is the closest setting to our context); a weak convergence result was proved. In the MCAR setting, Preda et al. (2010) adapted a methodology based on the NIPALS (Nonlinear Iterative Partial Least Squares) algorithm, which provides an imputation method for missing data affecting the functional covariates. In the MNAR setting, Bugni (2012) adapted a specification test for functional data in the presence of missing observations; his method is able to extract the information available in the observed portion of the data while being agnostic about the nature of the missing observations. In the MAR and MCAR settings, Chiou et al. (2014) recently proposed a nonparametric approach to missing value imputation and outlier detection for functional data. To our knowledge, there is no existing theoretical result for the functional linear model with missingness operating on the response variable, this problem having until now been studied only in the multivariate framework (see for instance Manski, 1995, 2003).

We carefully distinguish the missing data problem from a simple prediction problem. Indeed, the missing data mechanism involves a random variable (indicating whether the response is missing or not) which plays a central role in obtaining our asymptotic results; this random variable and the covariate X are dependent in the MAR case. This is also highlighted in Ferraty et al. (2013). In this paper, we first propose an imputation method, based on the completely observed cases, to replace missing values in the response of the functional linear model, and we obtain mean square error rates for these imputed values. Secondly, once the database is completed, we are able to estimate the unknown function θ of the model with the whole sample. This estimator can then be used for predicting other values of the response on a test set.

Combining missing data and functional variables offers a very large field of applications. Among all possible applications, the environment is a core issue for the future of our planet, in particular the study of pollution indexes. The dataset we study here deals with temperature curves in some French cities, used to predict a specific atmospheric pollution index. The atmospheric index is missing in some cities in the northwest of France, for which the corresponding temperature curves (the explanatory variable) are mild, which leads us to consider MAR data. The main objective is to get a map of the atmospheric index over the whole French territory.

The rest of the paper is organized as follows. Section 2 introduces the functional linear model with missingness operating on the response variable and formulates our main results on the imputation method and on the mean square error for the prediction of a new observation using the completed dataset. A simulation study is performed in Section 3. An environmental data illustration is presented in Section 4. Some preliminary lemmas, which are used in the proofs of the main results, are collected in Section 5.


2. Imputation of a missing value of the response


2.1. Functional principal components regression


Let us consider a sample (X_i, Y_i)_{i=1,…,n}, independent and identically distributed with the same distribution as (X, Y). The estimation of θ based on the principal components analysis of the curves X_1, …, X_n has been studied in many papers; see for instance Cardot et al. (1999). We recall below the construction of this estimator. Consider the covariance operator of X, defined under the condition E(∥X∥²) < +∞ (which is supposed to be satisfied in the following) by

Γu = E(⟨X, u⟩X), for all u ∈ H,

and its empirical version

Γ̂_n u = (1/n) ∑_{i=1}^{n} ⟨X_i, u⟩ X_i.
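On a discretized grid, the empirical covariance operator Γ̂_n becomes a p × p matrix. The following is a minimal NumPy sketch, not the paper's implementation: it assumes curves observed on a common equispaced grid with a rectangle-rule inner product, and all function names are ours.

```python
import numpy as np

def inner(f, g, w):
    """Discrete L2([0,1]) inner product, <f,g> ~ sum_k w_k f(t_k) g(t_k)."""
    return np.sum(w * f * g, axis=-1)

def empirical_cov_operator(X, w):
    """Matrix G with G @ u ~ Gamma_n u = (1/n) sum_i <X_i, u> X_i."""
    n = X.shape[0]
    return (X.T @ X) * w / n   # column k of X^T X scaled by the weight w_k

# toy check against the definition, on Brownian-motion-like paths
rng = np.random.default_rng(0)
p = 50
t = np.linspace(0.0, 1.0, p)
w = np.full(p, 1.0 / p)                               # rectangle-rule weights
X = rng.standard_normal((20, p)).cumsum(axis=1) * np.sqrt(1.0 / p)
u = np.sin(2.0 * np.pi * t)
G = empirical_cov_operator(X, w)
direct = np.mean(inner(X, u, w)[:, None] * X, axis=0)  # (1/n) sum_i <X_i,u> X_i
```

Applying `G` to a grid function agrees with computing (1/n) ∑ ⟨X_i, u⟩ X_i term by term, which is the only property the sketch relies on.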


we call (λ_j)_{j≥1} (resp. (λ̂_j)_{j≥1}) the sequence of eigenvalues of Γ (resp. Γ̂_n), and (v_j)_{j≥1} (resp. (v̂_j)_{j≥1}) the sequence of eigenfunctions of Γ (resp. Γ̂_n). The identifiability of model (1) is ensured as long as we assume that λ₁ > λ₂ > ⋯ > 0 (see Cardot et al., 1999). Moreover, assuming that λ̂₁ > ⋯ > λ̂_{k_n} > 0 for some integer k_n depending on n, the estimator of θ is defined by

θ̂ = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{k_n} (⟨X_i, v̂_j⟩ Y_i / λ̂_j) v̂_j.   (2)
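The estimator (2) can be sketched numerically: diagonalize the (weighted) empirical covariance matrix, compute the scores ⟨X_i, v̂_j⟩, and regress. This is a NumPy sketch under the same grid assumptions as above (equispaced grid, rectangle-rule inner product); the function names and the toy data are ours, not the paper's.

```python
import numpy as np

def fpca_estimator(X, Y, w, k):
    """theta_hat of eq. (2): PCA of the curves, then regression on the scores.

    X: (n, p) curves on a common grid, Y: (n,) responses, w: quadrature
    weights of the discrete inner product <f,g> ~ sum_k w_k f_k g_k,
    k: number of principal components k_n.
    """
    n = X.shape[0]
    s = np.sqrt(w)
    # symmetrized version of Gamma_n composed with the weights: same
    # spectrum, but suitable for eigh
    Gs = s[:, None] * (X.T @ X / n) * s
    vals, vecs = np.linalg.eigh(Gs)
    order = np.argsort(vals)[::-1][:k]
    lam = vals[order]                       # lambda_hat_1 >= ... >= lambda_hat_k
    V = vecs[:, order] / s[:, None]         # eigenfunctions v_hat_j on the grid
    scores = (X * w) @ V                    # <X_i, v_hat_j>
    coef = (scores * Y[:, None]).mean(axis=0) / lam
    return V @ coef                         # theta_hat = sum_j coef_j v_hat_j

# toy check: a coefficient lying in the span of 3 components is recovered
rng = np.random.default_rng(1)
p = 40
w = np.full(p, 1.0 / p)
s = np.sqrt(w)
Q, _ = np.linalg.qr(rng.standard_normal((p, 3)))
Phi = (Q / s[:, None]).T                    # 3 functions, orthonormal for <.,.>_w
X = rng.standard_normal((200, 3)) @ Phi
theta = Phi[0] + 0.5 * Phi[1]
Y = (X * w) @ theta                         # noiseless responses <theta, X_i>
theta_hat = fpca_estimator(X, Y, w, k=3)
```

In this noiseless, finite-rank toy case, θ̂ coincides with the projection of θ onto the span of the empirical eigenfunctions, hence with θ itself.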

A consistency result for this estimator is given in Cardot et al. (1999), while more recent results can be found in Cai and Hall (2006) and Hall and Horowitz (2007). In particular, Cardot et al. (1999) give technical conditions on the rate of decrease to zero of the eigenvalues λ_j in order to ensure the consistency of the estimator.

2.2. Operatorial point of view


We notice in this subsection that model (1) can be seen from an operatorial point of view. Indeed, we can write the model

Y = ΘX + ε,   (3)

where Θ : H → R is a linear continuous operator defined by Θu = ⟨θ, u⟩ for any function u ∈ H. Let us consider the cross-covariance operator ∆̂_n defined by ∆̂_n u = (1/n) ∑_{i=1}^{n} ⟨X_i, u⟩ Y_i, for all u ∈ H. Then it is easily seen that an estimator Θ̂ of Θ, satisfying Θ̂ = ⟨θ̂, ·⟩, is given by

Θ̂ = ⟨θ̂, ·⟩ = ∆̂_n Π̂_{k_n} (Π̂_{k_n} Γ̂_n Π̂_{k_n})^{−1},   (4)

where Π̂_{k_n} is the projection operator onto the subspace Span(v̂₁, …, v̂_{k_n}) and the inverse is taken on that subspace.


2.3. Imputation principle


Now, we present the context of missing data. There can be many reasons why missing data appear: a breakdown in a measurement process, a person unwilling to answer some question of a questionnaire, and so on. We consider that some of the observations Y₁, …, Y_n are not available. We define the real variable δ and consider the sample (δ_i)_{i=1,…,n} such that δ_i = 1 if the value Y_i is available and δ_i = 0 if the value Y_i is missing, for all i = 1, …, n. The data we observe are {(Y_i, δ_i, X_i)}_{i=1}^{n}.

We consider that the missing values are MAR. The MAR assumption implies that δ and Y are conditionally independent given X, that is,

P(δ = 1 | X, Y) = P(δ = 1 | X).   (5)

Note that the MAR assumption is much weaker than MCAR (for which P(δ = 1 | X, Y) = P(δ = 1)), as it allows the missingness to depend on the observed data, and may be reasonable for many practical problems. As a consequence of the MAR assumption, the variable δ (indicating whether an observation is missing) is independent of the error ε of the model, conditionally on X. In the following, the number of missing values in the sample is denoted

m_n = ∑_{i=1}^{n} 1_{{δ_i = 0}}.   (6)
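The MAR mechanism (5) is easy to simulate: make the missingness probability depend on X only, never on Y. The sketch below uses a logistic model in ⟨α₀, X⟩ (the same device used later in Section 3.3.1); all constants and names here are ours, for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 100
t = np.linspace(0.0, 1.0, p)
w = np.full(p, 1.0 / p)

# curves: Brownian-motion-like trajectories on the grid
X = rng.standard_normal((n, p)).cumsum(axis=1) * np.sqrt(1.0 / p)

# MAR: P(delta = 1 | X, Y) depends on X only (through <alpha0, X>),
# which is exactly condition (5)
alpha0 = np.sin(2.0 * np.pi * t)
eta = (X * w) @ alpha0 + 1.0
p_obs = 1.0 / (1.0 + np.exp(-eta))       # P(delta = 1 | X)
delta = rng.random(n) < p_obs            # delta_i = 1 <=> Y_i observed
m_n = int(np.sum(~delta))                # number of missing responses, eq. (6)
```

Because `p_obs` is a function of X alone, δ is independent of the model error conditionally on X, as noted after (5).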

Then, to impute a missing value, say Y_ℓ (where ℓ is a given integer between 1 and n), a simple way is to consider complete case analysis (see for instance Little and Rubin, 2002; Cheng, 1994; Wang et al., 2004; Mojirsheibani, 2007; Van Buuren, 2012). This regression imputation method uses the pairs of observed data to define the estimator of the model coefficient. More precisely, we define

Y_{ℓ,imp} = (1/(n−m_n)) ∑_{i=1, i≠ℓ}^{n} ∑_{j=1}^{k_n} ⟨X_i, v̂_j⟩ ⟨X_ℓ, v̂_j⟩ δ_i Y_i / λ̂_j.   (7)
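A grid-level sketch of the complete-case imputation (7): fit the functional PCR on the observed pairs only, then fill every missing response with the fitted value. The i ≠ ℓ restriction in (7) is automatic, since a missing Y_ℓ has δ_ℓ = 0 and never enters the fit. Same grid assumptions and made-up names as in the previous sketches.

```python
import numpy as np

def impute_complete_case(X, Y, delta, w, k):
    """Regression imputation of eq. (7): functional PCR on the complete
    cases (delta_i = 1), then Y_l,imp = <theta_hat_obs, X_l> for missing l."""
    obs = delta.astype(bool)
    Xo, Yo = X[obs], Y[obs]
    n_obs = Xo.shape[0]
    s = np.sqrt(w)
    Gs = s[:, None] * (Xo.T @ Xo / n_obs) * s     # symmetrized Gamma_n,obs
    vals, vecs = np.linalg.eigh(Gs)
    order = np.argsort(vals)[::-1][:k]
    lam = vals[order]
    V = vecs[:, order] / s[:, None]               # v_hat_{j,obs} on the grid
    beta = (((Xo * w) @ V) * Yo[:, None]).mean(axis=0) / lam
    Y_imp = ((X * w) @ V) @ beta                  # <theta_hat_obs, X_l>
    return np.where(obs, Y, Y_imp)                # completed responses Y*

# noiseless toy check: curves in a 3-dimensional span, MCAR missingness,
# so the imputed values must equal the true <theta, X_l>
rng = np.random.default_rng(5)
p = 40
w = np.full(p, 1.0 / p)
s = np.sqrt(w)
Q, _ = np.linalg.qr(rng.standard_normal((p, 3)))
Phi = (Q / s[:, None]).T
X = rng.standard_normal((150, 3)) @ Phi
theta = Phi[0] - Phi[2]
Y = (X * w) @ theta
delta = (rng.random(150) < 0.7).astype(int)
Y_star = impute_complete_case(X, Y, delta, w, k=3)
```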

From the operatorial point of view, the imputation of the missing value Y_ℓ comes back to

Y_{ℓ,imp} = ∆̂_{n,obs} Π̂_{k_n,obs} (Π̂_{k_n,obs} Γ̂_{n,obs} Π̂_{k_n,obs})^{−1} X_ℓ,   (8)

where Γ̂_{n,obs} = (1/(n−m_n)) ∑_{i=1}^{n} ⟨X_i, ·⟩ δ_i X_i, ∆̂_{n,obs} = (1/(n−m_n)) ∑_{i=1}^{n} ⟨X_i, ·⟩ δ_i Y_i, and Π̂_{k_n,obs} is the projection operator onto the subspace Span(v̂_{1,obs}, …, v̂_{k_n,obs}), where v̂_{1,obs}, …, v̂_{k_n,obs} are the first k_n eigenfunctions of the covariance operator Γ̂_{n,obs}.

Now we give our main results. We consider the following assumptions.


(A.1) We assume that there exists a convex function λ, with λ(j) = λ_j for all j ≥ 1, which continuously interpolates the λ_j's between j and j+1.
(A.2) There exists a positive constant C such that E(∥X∥⁴) ≤ C.

Our assumptions are quite classical in this context. Assumption (A.1) is similar to an assumption from Crambes and Mas (2013). It is a mild condition that allows a large class of decreasing rates for the eigenvalues of the covariance operator Γ, for example polynomial decay or exponential decay (see Example 1 for more details). Assumption (A.2) holds for many processes X (Gaussian processes, bounded processes) and can also be found, for example, in Cardot et al. (1999).

Remark 1. Notice that assumptions (A.1) and (A.2) are only needed to obtain a convergence rate, whether there are missing data on the response or not. The only assumption actually needed on the missing data is the MAR model.

Theorem 2.1. Assume that (A.1) and (A.2) are satisfied. If, moreover, λ_{k_n} k_n goes to zero as n goes to infinity, we have the mean square error

E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² = ∑_{j=k_n+1}^{+∞} (ΘΓ^{1/2} v_j)² + σ_ε² k_n/(n−m_n) + o(k_n/(n−m_n)).

Moreover, for the aggregate mean square error of all the imputed values, we have

∑_{ℓ=1}^{n} (1−δ_ℓ) E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² = m_n ∑_{j=k_n+1}^{+∞} (ΘΓ^{1/2} v_j)² + σ_ε² k_n m_n/(n−m_n) + o(k_n m_n/(n−m_n)).

In order to make precise the convergence rate of the imputed value Y_{ℓ,imp} to the true value ⟨θ, X_ℓ⟩, we need an additional notation. For a function ϕ : R*₊ → R*₊ and a positive real number L, we define

C(ϕ, L) = { T : H → R such that, for all j ≥ 1, |T v_j| ≤ L √ϕ(j) }.

Note that simple cases satisfy the fact that ΘΓ^{1/2} belongs to C(ϕ, L). For example, consider the operator Θ expressed in the eigenfunctions basis (v_j)_{j≥1}, namely Θu = ∑_{j=1}^{+∞} θ_j ⟨v_j, u⟩ for any u ∈ H, with θ_j going to zero as j goes to infinity. Hence there exists a bound L such that |θ_j| ≤ L for any j ≥ 1, and |ΘΓ^{1/2} v_j| = |θ_j| √λ_j ≤ L √λ_j, so that ΘΓ^{1/2} ∈ C(λ, L).

Remark 2. We introduce two notations to compare the magnitudes of two functions ũ(x) and ṽ(x) as the argument x tends to a limit ℓ̃ (not necessarily finite). The notation ũ(x) ∼_{x→ℓ̃} ṽ(x) stands for lim_{x→ℓ̃} ũ(x)/ṽ(x) = 1, and the notation ũ(x) ≲_{x→ℓ̃} ṽ(x) means that |ũ(x)/ṽ(x)| remains bounded as x → ℓ̃.

Theorem 2.2. Let L = ∥ΘΓ^{1/2}∥_∞ and let ϕ be the function defined by ϕ(j) = (ΘΓ^{1/2} v_j)²/L² for all j ≥ 1, which continuously interpolates the ϕ(j)'s between j and j+1. Under assumptions (A.1)–(A.2), the operator ΘΓ^{1/2} belongs to C(ϕ, L) and

E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² ∼_{n→+∞} 2σ_ε² k*_n/(n−m_n),

where k*_n is the solution of the equation in x

∫_x^{+∞} ϕ(t) dt = σ_ε² x/(L²(n−m_n)).   (9)

Notice that this equation is nothing but the equality between the square bias ∑_{j=k_n+1}^{+∞} (ΘΓ^{1/2} v_j)² (approximated by an integral) and the variance σ_ε² k_n/(n−m_n) obtained in Theorem 2.1. Hence, k*_n is the optimal number of principal components with respect to the mean square error criterion. Again, for the aggregate mean square error of all the imputed values, we have

∑_{ℓ=1}^{n} (1−δ_ℓ) E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² ∼_{n→+∞} 2σ_ε² k*_n m_n/(n−m_n).


Remark 3. Notice that Eq. (9) has a unique solution (the left and right hand sides are respectively decreasing and increasing in x). The practical resolution of Eq. (9) to get k*_n seems quite complicated, due to the computation of L. In order to circumvent this problem, we will use other ways to select the optimal number of principal components (see Section 3).
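Although L is unknown in practice, Eq. (9) is easy to solve numerically once constants are fixed, precisely because its left side decreases and its right side increases in x. A sketch with made-up constants (C_α, L², σ_ε², n, m_n below are arbitrary, not taken from the paper):

```python
import numpy as np

# Equation (9) for phi_pol(j) = C_a * j**(-(2 + a)): the tail integral has
# the closed form C_a * x**(-(1 + a)) / (1 + a), so k_n_star solves
# C_a * x**(-(1+a)) / (1+a) = sig2 * x / (L2 * (n - m)).
a, C_a, L2, sig2 = 1.0, 1.0, 1.0, 0.04      # illustrative constants
n, m = 1000, 100

def gap(x):
    bias2 = C_a * x ** (-(1.0 + a)) / (1.0 + a)   # integrated square bias
    var = sig2 * x / (L2 * (n - m))               # variance term
    return bias2 - var

lo, hi = 1.0, 1e6
for _ in range(200):                              # bisection: gap is decreasing
    mid = 0.5 * (lo + hi)
    if gap(mid) > 0.0:
        lo = mid
    else:
        hi = mid
k_star = 0.5 * (lo + hi)

# closed-form solution for comparison:
# x = (C_a * L2 * (n - m) / ((1 + a) * sig2)) ** (1 / (2 + a))
closed = (C_a * L2 * (n - m) / ((1.0 + a) * sig2)) ** (1.0 / (2.0 + a))
```

The bisection and the closed form agree; in a real application L would have to be estimated or, as in Section 3, the dimension selected by cross-validation instead.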


This last result, giving the convergence rate of the imputed value Y_{ℓ,imp}, is similar to the convergence rate obtained in Crambes and Mas (2013), who considered the case of a completely observed functional response. The rate is simply affected by the number m_n of missing values. We make the resulting rates of convergence precise in the following examples.


Example 1. We consider two different functions ϕ, namely ϕ_pol(j) = C_α j^{−(2+α)} and ϕ_exp(j) = D_α exp(−αj), where C_α and D_α are positive constants and α > 0. Then the solution of Eq. (9) satisfies

k*_{n,pol} ∼_{n→+∞} (C_α L² (n−m_n) / ((1+α) σ_ε²))^{1/(2+α)},   if ϕ = ϕ_pol,
k*_{n,exp} ≲_{n→+∞} (log n)/α,   if ϕ = ϕ_exp.

For ϕ = ϕ_pol, the result of Theorem 2.2 becomes

E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² ∼_{n→+∞} 2 (σ_ε²)^{(1+α)/(2+α)} (C_α L²/(1+α))^{1/(2+α)} (n−m_n)^{−(1+α)/(2+α)},

for a single imputation, and

∑_{ℓ=1}^{n} (1−δ_ℓ) E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² ∼_{n→+∞} 2 (σ_ε²)^{(1+α)/(2+α)} (C_α L²/(1+α))^{1/(2+α)} m_n (n−m_n)^{−(1+α)/(2+α)},

for the aggregate error of all the imputed values. For ϕ = ϕ_exp, the result of Theorem 2.2 becomes

E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² ≲_{n→+∞} 2σ_ε² log n/(α(n−m_n)),

for a single imputation, and

∑_{ℓ=1}^{n} (1−δ_ℓ) E(Y_{ℓ,imp} − ⟨θ, X_ℓ⟩)² ≲_{n→+∞} 2σ_ε² m_n log n/(α(n−m_n)),

for the aggregate error of all the imputed values.


Remark 4. Let us notice that the choice of such functions ϕ in the previous example is compatible with Theorem 2.2. Indeed, consider for example the choice ϕ = ϕ_pol and the operator Θ expressed in the eigenfunctions basis (v_j)_{j≥1}, namely Θu = ∑_{j=1}^{+∞} θ_j ⟨v_j, u⟩ for any u ∈ H, with θ_j = C_α^{1/2} L λ_j^{−1/2} j^{−1−α/2}. Then the assumptions of Theorem 2.2 are satisfied, and we can notice that (ΘΓ^{1/2} v_j)²/L² = θ_j² λ_j/L² = C_α/j^{2+α}.

Example 2. To make our convergence rates precise in more specific cases, we consider three different levels of missing data: (i) the number of missing data m_n is negligible compared to the sample size, that is m_n = a_n n with a_n going to zero as n goes to infinity; (ii) the number of missing values is proportional to the sample size, that is m_n = ρn with 0 < ρ < 1; and (iii) the number of observed values is negligible compared to the sample size, that is u_n := n − m_n = o(n). We sum up all the rates of convergence for the single imputation mean square error (Table 1) and for the aggregate mean square error (Table 2). We can see that missing data do not affect the convergence rate for a single imputed value when there are not too many missing values (m_n = o(n) or m_n = ρn). The rate 1/n^{(1+α)/(2+α)} matches the usual optimal rates in this context. The rate log n/(αn) is not exact but clearly sharp, since it is parametric up to a logarithmic factor. This is no longer the case when the number of missing values is high (m_n ∼ n): the convergence rate is then affected. For the aggregate error of several imputed values, when there are not too many missing values (m_n = o(n)), the number of missing values plays a crucial role, since the convergence depends on whether a_n n^{1/(2+α)} or a_n log n goes to zero as n goes to infinity. In the other cases (m_n = ρn or m_n ∼ n), missing data affect the convergence of the aggregate error term, since it cannot converge to zero.


Table 1 — Single imputation mean square error convergence rates, where K_α := 2 (σ_ε²)^{(1+α)/(2+α)} (C_α L²/(1+α))^{1/(2+α)}.

m_n := a_n n = o(n):   ∼ K_α n^{−(1+α)/(2+α)} (ϕ = ϕ_pol);   ≲ 2σ_ε² log n/(αn) (ϕ = ϕ_exp)
m_n = ρn:              ∼ K_α (1−ρ)^{−(1+α)/(2+α)} n^{−(1+α)/(2+α)} (ϕ = ϕ_pol);   ≲ 2σ_ε² log n/(α(1−ρ)n) (ϕ = ϕ_exp)
u_n := n−m_n = o(n):   ∼ K_α u_n^{−(1+α)/(2+α)} (ϕ = ϕ_pol);   ≲ 2σ_ε² log n/(α u_n) (ϕ = ϕ_exp)

Table 2 — Aggregate imputation mean square error convergence rates, with the same constant K_α.

m_n := a_n n = o(n):   ∼ K_α a_n n^{1/(2+α)} (ϕ = ϕ_pol);   ≲ 2σ_ε² a_n log n/α (ϕ = ϕ_exp)
m_n = ρn:              ∼ K_α ρ (1−ρ)^{−(1+α)/(2+α)} n^{1/(2+α)} (ϕ = ϕ_pol);   ≲ 2σ_ε² ρ log n/(α(1−ρ)) (ϕ = ϕ_exp)
u_n := n−m_n = o(n):   ∼ K_α n u_n^{−(1+α)/(2+α)} (ϕ = ϕ_pol);   ≲ 2σ_ε² n log n/(α u_n) (ϕ = ϕ_exp)

2.4. Estimation of θ and prediction of future values

Once the database has been reconstructed, we can use the full database to estimate the functional coefficient θ of the model (directly inspired from (2); see also Chu and Cheng, 1995), namely

θ̃ = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{k_n} (⟨X_i, v̂_j⟩ Y*_i / λ̂_j) v̂_j,   (10)

where Y*_i = Y_i δ_i + Y_{i,imp} (1−δ_i) for all i = 1, …, n. This estimator of θ can then be used to predict new values of the response Y on a test sample. Indeed, if X_new is a new curve, the corresponding predicted response value is

Ŷ_new = ⟨X_new, θ̃⟩ = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{k_n} ⟨X_i, v̂_j⟩ ⟨X_new, v̂_j⟩ Y*_i / λ̂_j.   (11)

We give below a result allowing us to control the mean square prediction error of Ŷ_new.

Theorem 2.3. Under the assumptions of Theorem 2.1, if we additionally assume that m_n = o(n) (that is, m_n = a_n n with a_n going to zero as n goes to infinity) and a_n² n = o(1), then

E(Ŷ_new − ⟨θ, X_new⟩)² = ∑_{j=k_n+1}^{+∞} (ΘΓ^{1/2} v_j)² + O(k_n/n).

Remark 5. This result shows that, provided there are not too many missing values, the rate of the mean square error for the prediction at a new value of the covariate remains the same as in the case without missing values.
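Equations (10)–(11) translate into a few lines of NumPy under the same grid conventions as the earlier sketches; the function name and the toy data are ours.

```python
import numpy as np

def predict_new(X, Y_star, w, k, X_new):
    """theta_tilde of eq. (10) from the completed responses Y*, then the
    predictions Y_hat_new = <X_new, theta_tilde> of eq. (11)."""
    n = X.shape[0]
    s = np.sqrt(w)
    Gs = s[:, None] * (X.T @ X / n) * s
    vals, vecs = np.linalg.eigh(Gs)
    order = np.argsort(vals)[::-1][:k]
    lam = vals[order]
    V = vecs[:, order] / s[:, None]
    coef = (((X * w) @ V) * Y_star[:, None]).mean(axis=0) / lam
    theta_tilde = V @ coef
    return (X_new * w) @ theta_tilde

# noiseless sanity check on curves spanned by 3 components
rng = np.random.default_rng(6)
p = 40
w = np.full(p, 1.0 / p)
s = np.sqrt(w)
Q, _ = np.linalg.qr(rng.standard_normal((p, 3)))
Phi = (Q / s[:, None]).T
X = rng.standard_normal((120, 3)) @ Phi
theta = 2.0 * Phi[1]
Y_star = (X * w) @ theta                    # completed database, no noise
X_new = rng.standard_normal((10, 3)) @ Phi
Y_hat = predict_new(X, Y_star, w, 3, X_new)
```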


3. Simulations





To observe the behavior of our estimator in practice, this section presents a simulation study.

3.1. Models

Two models are considered:

Model 1:  Y = ∫₀¹ sin(4πt) X_t dt + ε,   (12)

Model 2:  Y = ∫₀¹ (log(15t² + 10) + cos(4πt)) X_t dt + ε,   (13)

where the error ε is a Gaussian noise, ε ∼ N(0, σ_ε), and


• in Eq. (12), X := {X_t}_{t∈[0,1]} is the standard Brownian motion;
• in Eq. (13), X := {X_t}_{t∈[0,1]} is a Gaussian process whose covariance function is cov(X_t, X_{t′}) = exp(−|t − t′|²/0.2).

The simulation aims at considering processes X with different regularities (the standard Brownian motion being the least smooth case) in order to see whether this has an impact on the results. All the procedures described below were implemented using the R software:

⋆ the trajectories of Xi , 1 ≤ i ≤ n, in the two models are discretized in p = 100 equidistant points, ⋆ values of Y are computed using integration by rectangular interpolation, (∫ ) 1 ⋆ the variability of noise is such that σϵ = τ ∗ Var 0 θ (t)X (t)dt ≈ 0.2. Note that some Monte Carlo experiments are achieved to determine the values of τ : τ ≈ 21.726 for the model1 (low level of noise) and τ ≈ 0.048 for the model2 (high level of noise),

1 2

3

4 5 6 7 8 9 10 11

⋆ the sample sizes are respectively n = 100, 300 and 1200 for the training sets (X₁, Y₁), …, (X_n, Y_n), and n₁ = 50, 150 and 600 for the test sets (X_{n+1}, Y_{n+1}), …, (X_{n+n₁}, Y_{n+n₁}).

3.2. Criteria



The criteria we used are the following. Criteria 1 and 2 are related to the imputation step with the training samples, Criteria 3 and 4 to the prediction step with the test samples, and Criterion 5 to the estimation step with the reconstructed database.


• Criterion 1: the mean square error (MSE) averaged over S samples,

MSE = (1/S) ∑_{j=1}^{S} MSE(j),   where MSE(j) = (1/m_n) ∑_{ℓ=1}^{n} (Y^j_{ℓ,imp} − ⟨θ, X^j_ℓ⟩)² (1−δ_ℓ)

is the mean square error computed on the jth simulated sample, j ∈ {1, …, S}.

• Criterion 2: the ratio, with respect to the truth, between the mean square prediction error and the mean square prediction error when the true mean is known, averaged over S samples,

RT = (1/S) ∑_{j=1}^{S} RT(j),   where RT(j) = [∑_{ℓ=1}^{n} (Y^j_{ℓ,imp} − ⟨θ, X^j_ℓ⟩)² (1−δ_ℓ)] / [∑_{ℓ=1}^{n} (ε^j_ℓ)² (1−δ_ℓ)]

is the ratio between the mean square prediction error and the mean square prediction error when the true mean is known, computed on the jth simulated sample.

• Criterion 3: the mean square error (MSE′) averaged over S samples,

MSE′ = (1/S) ∑_{j=1}^{S} MSE′(j),   where MSE′(j) = (1/n₁) ∑_{ℓ′=n+1}^{n+n₁} (Ŷ^j_{ℓ′} − ⟨θ, X^j_{ℓ′}⟩)²

is the mean square error computed on the jth simulated sample, j ∈ {1, …, S}.

• Criterion 4: the ratio, with respect to the truth, between the mean square prediction error and the mean square prediction error when the true mean is known, averaged over S samples,

RT′ = (1/S) ∑_{j=1}^{S} RT′(j),   where RT′(j) = [∑_{ℓ′=n+1}^{n+n₁} (Ŷ^j_{ℓ′} − ⟨θ, X^j_{ℓ′}⟩)²] / [∑_{ℓ′=n+1}^{n+n₁} (ε^j_{ℓ′})²]

is the ratio between the mean square prediction error and the mean square prediction error when the true mean is known, computed on the jth simulated sample.

• Criterion 5: the mean square error (MSE″) averaged over S samples,

MSE″ = (1/S) ∑_{j=1}^{S} MSE″(j),   where MSE″(j) = ∥θ̃^j − θ∥²

is the square estimation error computed on the jth simulated sample. The MSE″ criterion is decomposed into variance and square bias in our results.
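Criteria 1 and 2 can be exercised as follows. This is only a sketch of the formulas: the "imputed" values below are stand-ins (the true mean plus independent noise), and all constants are ours.

```python
import numpy as np

def criterion_1_2(Y_imp, true_mean, eps, delta):
    """MSE(j) and RT(j) for one simulated sample, as in Section 3.2.

    Y_imp: imputed values; true_mean: <theta, X_l>; eps: model errors;
    delta: availability indicators (1 = observed, 0 = missing).
    """
    miss = delta == 0
    m_n = miss.sum()
    sq = (Y_imp[miss] - true_mean[miss]) ** 2
    mse_j = np.sum(sq) / m_n
    rt_j = np.sum(sq) / np.sum(eps[miss] ** 2)
    return mse_j, rt_j

# averaged over S simulated samples
rng = np.random.default_rng(3)
S, n = 50, 200
mse, rt = 0.0, 0.0
for _ in range(S):
    mean = rng.standard_normal(n)
    eps = 0.2 * rng.standard_normal(n)
    delta = (rng.random(n) < 0.75).astype(int)
    Y_imp = mean + 0.2 * rng.standard_normal(n)   # stand-in for the imputations
    mj, rj = criterion_1_2(Y_imp, mean, eps, delta)
    mse += mj / S
    rt += rj / S
```

With these stand-ins the imputation error and the model error have the same variance, so MSE ≈ σ_ε² and RT ≈ 1, illustrating the rescaling relation between the two criteria.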





Notice that all the criteria tend to zero when the sample size tends to infinity. RT and RT ′ are rescaled versions of MSE and MSE ′ if we substitute the denominator by its limit (specifically, MSE(j) = RT (j)σϵ2 ). 3.3. Methodology We use a smoothed version of the estimator (2) based on the Smooth Principal Components Regression (SPCR) (Cardot et al., 2003). We use a regression spline basis with parameters: the number κ of knots of the spline functions, the degree q of spline functions and the number m of derivatives. Let us remark that, with appropriate conditions, all the theoretical results obtained in Section 2 will also apply to the SPCR estimation. For example, we θ has r ′ derivatives ⏐ assume that the estimator ˜ ⏐

θ (r ) (t1 ) − ˜ θ (r ) (t2 )⏐ ≤ C |t1 − t2 |ν , for all t1 , t2 ∈ [0, 1]. If we denote for some integer r ′ and ˜ θ (r ) satisfies, for some ν ∈]0, 1], ⏐˜ ′

12 13 14 15 16 17 18 19 20

21 22 23 24

26 27 28 29

30 31 32 33 34

35

36

37







r = r ′ + ν and if we assume that the degree q of the splines is such that q ≥ r, then supt ∈[0,1] ⏐˜ θ (t) − Sκ,q (˜ θ )(t)⏐ = O κ −r , where Sκ,q (˜ θ ) is the spline approximation of ˜ θ (see De Boor, 1978). In other words, any of the convergence results obtained in Section 2 can be transposed to the smoothed version of the estimators. Here, we have fixed the number of knots to be 20, the degree has been chosen to be 3 and the number of derivatives was fixed to the moderate value of 2. The choice of these parameters is not the most important in our study, especially in comparison with the choice of the number of principal components. In this subsection, we show firstly how to determine the number of missing data. Secondly, we present a procedure to choose the optimal tuning parameter (the best dimension k∗n of the projection space for the SPCR).





(

)

3.3.1. Missing data simulation scenario To determine the number of missing data in our simulations, we have adopted the following scenario. In the MAR case, we simulate δ according to the logistic functional regression. The variable δ follows the Bernoulli law with parameter p(X ) such that

( 25



log

p(X )

)

1 − p(X )

= ⟨α0 , X ⟩ + ct ,

where $\alpha_0(t) = \sin(2\pi t)$ for all $t \in [0,1]$ and $c_t$ is a constant allowing different levels of missing data. We take $c_t = 2$ for around 12.5% of missing data, $c_t = 1$ for around 27.4%, and $c_t = 0.2$ for around 44.9%. Notice that, in the MCAR case, we simulate δ with a Bernoulli law of parameter $p(X) := p = 0.9$ (10% of missing data), $p(X) := p = 0.75$ (25% of missing data) or $p(X) := p = 0.6$ (40% of missing data).

3.3.2. Criteria for optimal parameter selection

We now focus on the procedure for selecting the optimal tuning parameter. We compare a Generalized Cross Validation (GCV) criterion with a Cross Validation (CV) criterion and a K-fold Cross Validation (K-fold CV) criterion, and we select the optimal tuning parameter $k_n^*$ by minimizing these criteria. The GCV procedure is known to be computationally fast. For imputation, the CV, K-fold CV and GCV criteria are respectively given by

$$\mathrm{CV}(k_n) = \frac{1}{n - m_n} \sum_{i=1}^{n} \big(\hat{Y}_i^{[-i]} - \langle \theta, X_i \rangle\big)^2 \delta_i,$$

$$\text{K-fold CV}(k_n) = \frac{1}{K} \sum_{k=1}^{K} |B_k|^{-1} \sum_{i \in B_k} \big(\hat{Y}_i^{[-B_k]} - \langle \theta, X_i \rangle\big)^2 \delta_i,$$

$$\mathrm{GCV}(k_n) = \frac{(n - m_n) \sum_{i=1}^{n} \big(\hat{Y}_i - \langle \theta, X_i \rangle\big)^2 \delta_i}{\big((n - m_n) - k_n\big)^2}.$$
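In code, the MAR scenario of Section 3.3.1 can be sketched as follows. The Brownian-like covariate paths and all numerical values except $\alpha_0$ and $c_t$ are assumptions of this illustration, not the paper's exact design.

```python
# Sketch of the MAR missingness mechanism of Section 3.3.1:
# delta_i ~ Bernoulli(p(X_i)) with logit p(X) = <alpha0, X> + c_t and
# alpha0(t) = sin(2*pi*t). The covariate paths are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 100
t = np.linspace(0.0, 1.0, p)
X = np.cumsum(rng.normal(0.0, 1.0 / np.sqrt(p), (n, p)), axis=1)  # toy paths

alpha0 = np.sin(2 * np.pi * t)
rates = {}
for c_t in (2.0, 1.0, 0.2):
    inner = X @ alpha0 / p                         # Riemann sum for <alpha0, X>
    prob = 1.0 / (1.0 + np.exp(-(inner + c_t)))    # p(X)
    delta = rng.binomial(1, prob)                  # delta_i = 1 iff Y_i observed
    rates[c_t] = 1.0 - delta.mean()                # empirical missing-data rate
```

Smaller values of c_t inflate the missing-data rate, reproducing the qualitative levels quoted in the text.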

Please cite this article as: C. Crambes and Y. Henchiri, Regression imputation in the functional linear model with missing values in the response. Journal of Statistical Planning and Inference (2018), https://doi.org/10.1016/j.jspi.2018.12.004.


Fig. 1. GCV, CV and K-fold criteria for different values of the dimension $k_n$ in Model 1: best dimension $k_n^* = 8$ and MSE′ (×10⁴) = 1.6640 for the GCV criterion, best dimension $k_n^* = 6$ and MSE′ (×10⁴) = 2.3081 for the 5-fold CV criterion, best dimension $k_n^* = 8$ and MSE′ (×10⁴) = 1.9584 for the 10-fold CV criterion, and best dimension $k_n^* = 8$ and MSE′ (×10⁴) = 1.6598 for the CV criterion.

The analogous criteria are given as follows for prediction:

$$\mathrm{CV}(k_n) = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{Y}_i^{*[-i]} - \langle \theta, X_i \rangle\big)^2,$$

$$\text{K-fold CV}(k_n) = \frac{1}{K} \sum_{k=1}^{K} |B_k|^{-1} \sum_{i \in B_k} \big(\hat{Y}_i^{*[-B_k]} - \langle \theta, X_i \rangle\big)^2,$$

$$\mathrm{GCV}(k_n) = \frac{n \sum_{i=1}^{n} \big(\hat{Y}_i^{*} - \langle \theta, X_i \rangle\big)^2}{(n - k_n)^2},$$

where $\hat{Y}_i^{[-i]}$ and $\hat{Y}_i^{[-B_k]}$ respectively mean that the value of $Y_i$ is predicted using the whole sample except the $i$th observation, or except the set of observations indexed in $B_k$. In the same way, $\hat{Y}_i^{*[-i]}$ and $\hat{Y}_i^{*[-B_k]}$ respectively mean that the value of $Y_i$ is predicted using the whole sample except the $i$th observation, or except the set of observations indexed in $B_k$. The dataset is randomly partitioned into $K$ subsets $\cup_{k=1}^{K} B_k$ of (as nearly as possible) equal size, such that $B_j \cap B_k = \emptyset$ for $j \neq k$. In practice, $K = 5$ or $K = 10$ is often used. In our case, the K-fold CV splits are chosen in a special deterministic way. For imputation, we consider

$$\text{K-fold CV}(k_n) = \frac{1}{K} \sum_{k=1}^{K} \big((n - m_n)/K\big)^{-1} \sum_{i = n(k-1)/K + 1}^{nk/K} \big(\hat{Y}_i^{[-k]} - \langle \theta, X_i \rangle\big)^2 \delta_i.$$

The analogous criterion for prediction is

$$\text{K-fold CV}(k_n) = \frac{1}{K} \sum_{k=1}^{K} (n/K)^{-1} \sum_{i = n(k-1)/K + 1}^{nk/K} \big(\hat{Y}_i^{*[-k]} - \langle \theta, X_i \rangle\big)^2.$$
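For concreteness, here is a minimal sketch (not the authors' implementation) of evaluating the imputation GCV criterion over candidate dimensions and selecting $k_n^*$ by functional principal component regression on the observed pairs. The data-generating model, noise level and missingness rate below are hypothetical.

```python
# Illustrative GCV selection of the projection dimension k_n for functional
# principal component regression on the observed pairs:
#   GCV(k) = n_obs * RSS(k) / (n_obs - k)^2.
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 100
t = np.linspace(0.0, 1.0, p)
X = np.cumsum(rng.normal(0.0, 1.0 / np.sqrt(p), (n, p)), axis=1)  # toy curves
theta = np.sin(2 * np.pi * t)                      # hypothetical slope function
Y = X @ theta / p + rng.normal(0.0, 0.1, n)
delta = rng.binomial(1, 0.75, n)                   # ~25% missing responses

Xo, Yo = X[delta == 1], Y[delta == 1]
n_obs = Xo.shape[0]
Xc = Xo - Xo.mean(axis=0)                          # centre before FPCA
cov = Xc.T @ Xc / (n_obs * p)                      # discretized covariance
eigvec = np.linalg.eigh(cov)[1][:, ::-1] * np.sqrt(p)  # descending, L2-normed

def gcv(k):
    scores = Xc @ eigvec[:, :k] / p                # first k FPCA scores
    coef, *_ = np.linalg.lstsq(scores, Yo - Yo.mean(), rcond=None)
    rss = float(np.sum((Yo - Yo.mean() - scores @ coef) ** 2))
    return n_obs * rss / (n_obs - k) ** 2

k_star = min(range(2, 23), key=gcv)                # candidate grid k_n = 2..22
```

Only the fitted residual sum of squares is needed, which is what makes GCV much cheaper than refitting the model n times as leave-one-out CV does.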

In order to illustrate the advantage of the GCV criterion, we compared the computational times needed to obtain the tuning parameter with the three criteria over the growing sequence of dimensions $k_n = 2, \ldots, 22$. The computations were performed on a MacBook Pro (2.66 GHz Intel Core 2 Duo processor, 4 GB 1067 MHz DDR3 memory). The computational times are displayed in Table A.7 of the supplementary material. The GCV criterion shows a clear advantage in computational time compared with the CV and K-fold criteria. In addition, the three criteria behave in the same way and select the same optimal projection dimension for both models (with n = 1000 and p = 100); see Fig. 1 for Model 1, a similar behavior being observed for Model 2 (given in the supplementary material). For this reason, the GCV criterion (faster to compute) has been used in the different simulations. Fig. 2 shows different estimates of the slope function of Model 1 (with n = 1000 and p = 100) for the dimensions $k_n = 4, 6, 8, 12, 16$ (and $k_n = 2, 3, 5, 7, 8$ for Model 2), using the GCV criterion. We have chosen a percentage of missing values equal to 45.8518% (obtained with $c_t = 1$). A similar plot for Model 2, with 46.8888% of missing values (also obtained with $c_t = 1$), is given in the supplementary material.



Fig. 2. Plots of the true slope function (solid black) and estimates for different values of the dimension $k_n$ in Model 1: best dimension $k_n^* = 8$ (solid red), $k_n = 4$ (dotted), $k_n = 6$ (dashed), $k_n = 12$ (dot-dashed), $k_n = 16$ (two-dash). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


3.4. Analysis of results

3.4.1. Analysis of MSE

We first analyze the results of the criteria presented in the previous subsection. Both the MAR and MCAR contexts were considered; we only show the results for MAR, the results for MCAR being available on demand. The different results are given in the supplementary material. Tables A.1 and A.2 give the mean and standard deviation of the errors for the imputed values on training samples for both models. Tables A.3 and A.4 give the mean and standard deviation of the errors for the predicted values on test samples for both models. Tables A.5 and A.6 give the mean and standard deviation of the errors for the estimation of θ using the completed database with imputed values for both models. We can see that the errors increase when the rate of missing data increases. Similarly, the errors decrease as the sample size increases. Comparing the MAR and MCAR cases, the error under MAR is slightly higher than under MCAR. Moreover, the regularity of the process X does not have a crucial impact on the results, at least on these simulated examples. In the following paragraph, we check on a simulated example that the behavior of the MSE is in accordance with the theoretical results.

3.4.2. Comparison with theory

In order to get explicit rates of convergence for the imputed values, we use random functions $X_i$ expressed in the same eigenfunction basis as the unknown functional parameter of the model. More precisely, we consider the new model

$$\text{Model 3}: \quad Y = \int_0^1 \theta(t)\, X(t)\, dt + \epsilon. \tag{14}$$


In Eq. (14), we adopt a design similar to that of Hall and Horowitz (2007). The functional covariate X(t) is generated from a set of cosine basis functions,

$$\varphi_1 \equiv 1 \quad \text{and} \quad \varphi_{j+1}(t) = \sqrt{2}\,\cos(j\pi t), \quad \text{for } 1 \le j \le 50,$$

such that

$$X(t) = \sum_{j=1}^{50} \gamma_j \zeta_j \varphi_j(t),$$



where the $\zeta_j$ are independently sampled from the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ and the $\gamma_j$ are deterministic, given by

$$\gamma_j = (-1)^{j+1} j^{-\eta/2}, \quad \text{with } \eta = 2.$$

For these coefficients, the eigenvalues of the covariance operator are $\lambda_j = j^{-\eta}$ and are distinct. The covariance function is

$$\mathrm{cov}(X_s, X_t) = \sum_{j=1}^{50} \gamma_j^2\, \varphi_j(s)\,\varphi_j(t) = \sum_{j=1}^{50} j^{-\eta}\, \varphi_j(s)\,\varphi_j(t).$$
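This design can be checked numerically. The following sketch generates the $X_i$ under the stated distributions and verifies that the empirical score variances match $\lambda_j = j^{-\eta}$; the grid size and sample size are arbitrary choices of this illustration.

```python
# Sketch of the Model 3 covariate design: cosine basis, scores gamma_j * zeta_j
# with zeta_j ~ Uniform[-sqrt(3), sqrt(3)] (mean 0, variance 1) and
# gamma_j = (-1)^{j+1} j^{-eta/2}, eta = 2, so Var(gamma_j * zeta_j) = j^{-2}.
import numpy as np

rng = np.random.default_rng(3)
n, p, J, eta = 2000, 200, 50, 2.0
t = np.linspace(0.0, 1.0, p)
j = np.arange(1, J + 1)
phi = np.vstack([np.ones(p)] +
                [np.sqrt(2.0) * np.cos(m * np.pi * t) for m in range(1, J)])
gamma = (-1.0) ** (j + 1) * j ** (-eta / 2)
zeta = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (n, J))
X = (zeta * gamma) @ phi                # X_i(t) = sum_j gamma_j zeta_ij phi_j(t)

emp_eigval = np.var(zeta * gamma, axis=0)   # empirical score variances
```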

The true slope function θ(t) is given by

$$\theta(t) = \sum_{j=1}^{50} b_j \varphi_j(t),$$


Fig. 3. Plot of MSE with respect to the sample size n with 28% missing data.

where $b_1 = 0.3$ and $b_j = 4(-1)^{j+1} j^{-2}$ for $2 \le j \le 50$. In this context, we computed the mean of the imputed-values MSE (denoted MSE) over 300 Monte Carlo simulated samples, for sample sizes $n = 150, 300, 450, \ldots, 1650, 1800$, and for 5% and 28% of missing data. Fig. 3 shows the behavior of the MSE with respect to the sample size in the case of 28% missing values. A decreasing power behavior appears. To go further, we adjusted a linear model between the logarithms of n and MSE. The adjustment is almost perfect ($R^2 \simeq 0.998$ in both cases). From the linear adjustments ($\log(\mathrm{MSE}) = -2.4100 - 0.8276\,\log n$ in the case of 28% missing data and $\log(\mathrm{MSE}) = -2.6178 - 0.8308\,\log n$ in the case of 5% missing data), we get a power expression of the MSE with respect to the sample size: $\mathrm{MSE} = 0.0898\, n^{-0.8276}$ with 28% missing data and $\mathrm{MSE} = 0.0730\, n^{-0.8308}$ with 5% missing data. Comparing this rate with the theoretical rate $n^{-(1+\alpha)/(2+\alpha)}$ obtained in Section 3, we get $\alpha \simeq 3.80$ for 28% missing data and $\alpha \simeq 3.91$ for 5% missing data. Coming back to the result of Theorem 2.2, we are in the case where the terms $\Theta\Gamma^{1/2}v_j$ decrease at a polynomial rate $1/j^3$, hence the function $\phi_{pol}$ has a polynomial decreasing rate $1/j^6$. As we have written $\phi_{pol}(j) \sim j^{-(2+\alpha)}$, so that $1/j^6$ corresponds to $\alpha = 4$, this confirms the value of α found in the simulations, with a slight difference due to the percentage of missing values.

4. Illustration

In order to illustrate the contribution of our approach to functional prediction when the covariates are functions and some observations of the real response are missing, we present in this section an application to an environmental dataset. We start by describing the dataset. The functional covariate X is the daily temperature curve of a number of French cities (from May 7, 2015 at 4 pm to May 8, 2015 at 3 pm), obtained from www.meteociel.fr.
Each daily continuous curve is observed at 24 discretization points (every hour). These daily temperature curves are displayed in Fig. 4. The response variable Y is an atmospheric air quality index called ATMO (for a detailed description of this index, see www.atmo-france.org). Its values range from 1 (very good air quality) to 10 (very bad air quality). Though these values are discrete, we treat Y as continuous. We obtained the values of the atmospheric index on May 8, 2015, for these same cities, from www2.prevair.org. Furthermore, we added some cities for which the temperature curve is available but the atmospheric index is missing. Notice that the response is missing for cities with mild temperature curves: the fact that the value of the response variable Y is missing for these cities depends on the temperature curve X, and thus we consider the MAR case. We also refer the reader to Junninen et al. (2004) for more discussion of missing-data mechanisms when dealing with air quality data; in particular, that paper highlights the fact that missing air quality data can be considered as MAR. Fig. 5 shows the selected cities of our study: blue cities are those for which the response variable Y is missing, and red cities those for which Y is observed. It is of primary importance to get a map of the atmospheric index over the whole French territory, and thus to impute the missing data. We have built a sample of 78 pairs $\{(Y_i, X_i)\}_{i=1}^{78}$, with 8 missing values of the variable Y (the $Y_i$, $i = 71, \ldots, 78$, are missing). Our goal is to impute these missing values $\{Y_i\}_{i=71}^{78}$. We have fixed the number of knots at 20, the degree of the splines at 3, and the number of derivatives at the moderate value of 2.
Then, we use the GCV criterion to find the best projection dimension $k_n$ over the growing sequence $k_n = 2, 3, \ldots, 22$. In order to assess the impact of missing data on this dataset, we have randomly


Fig. 4. Plot of the 78 daily temperature curves (blue curves correspond to cities for which the response variable Y is missing). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Map of France locating the selected cities of our study: red cities are those for which the variable Y is observed, blue cities those for which Y is missing. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


drawn 700 test samples from the initial sample and computed prediction errors on these test samples, using the remainder of the sample as training sample. Results are given in Table 3. Here again, the more missing data there are in the training set, the larger the prediction error on the test sample.

Now, we come back to the initial goal, imputing the missing data. The minimum value of the GCV criterion is reached for $k_n^* = 5$, with MSE′ (×10²) = 20.791. Table 4 gives the imputed values of the missing data. We see imputed values mainly around 4, a moderate value of the atmospheric index corresponding to good air quality. This is in accordance with the fact that these cities have moderate temperature curves. We can mention two particular cases. The highest imputed value (4.161) corresponds to the city of Angers, and in parallel we can see that the temperature curve of this city becomes high at


Table 3
Real dataset: prediction errors over 700 drawn samples (n = 78; 8 missing data, 70 observed data).

Test set size              n/4                n/3                n/2
Rate of missing data (%)   13                 15                 20
MSE′ × 10²                 24.5650 (8.4750)   25.5172 (8.1444)   29.7827 (15.0889)

Table 4
Imputed values of the missing response variable.

Missing values of Y   Y71     Y72     Y73     Y74     Y75     Y76     Y77     Y78
Imputed values        4.161   3.496   3.850   3.758   3.590   3.491   3.990   3.821

the end of May 8. On the contrary, the lowest imputed value (3.491) corresponds to the city of Quimper, whose temperature curve shows few variations over the 24 hours.
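To make the imputation step concrete, here is a hedged sketch of regression imputation by functional principal component regression on data of the same shape (78 curves at 24 hourly points, responses 71 to 78 missing). The synthetic curves, slope function and noise level are stand-ins, not the real temperature/ATMO data, and the fit below is a minimal illustration of the method rather than the exact pipeline.

```python
# Minimal sketch of regression imputation by functional principal component
# regression: fit on the 70 complete cases, impute the 8 missing responses.
# Synthetic stand-ins replace the temperature / ATMO data.
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 78, 24, 5                     # 24 hourly points, k_n* = 5 as in the text
t = np.linspace(0.0, 1.0, p)
X = 15.0 + 5.0 * np.sin(2 * np.pi * t) + 0.3 * rng.normal(0, 1, (n, p)).cumsum(axis=1)
theta = np.cos(np.pi * t)               # hypothetical slope function
Y = 4.0 + X @ theta / p + rng.normal(0.0, 0.3, n)
observed = np.ones(n, dtype=bool)
observed[70:] = False                   # Y_71, ..., Y_78 are missing

Xc = X - X[observed].mean(axis=0)       # centre using the observed cases
cov = Xc[observed].T @ Xc[observed] / (observed.sum() * p)
eigvec = np.linalg.eigh(cov)[1][:, ::-1] * np.sqrt(p)
design = np.column_stack([np.ones(n), Xc @ eigvec[:, :k] / p])
coef, *_ = np.linalg.lstsq(design[observed], Y[observed], rcond=None)

Y_imp = Y.copy()
Y_imp[~observed] = design[~observed] @ coef   # regression-imputed responses
```

The completed vector `Y_imp` can then feed the re-estimation of θ or the prediction of new responses, as in Section 2.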

5. Proof of the results

5.1. Proof of Theorem 2.1

We begin with the following decomposition:

$$\hat{\Delta}_{n,obs} = \frac{1}{n - m_n} \sum_{i=1}^{n} \langle X_i, \cdot\rangle\, \delta_i\, \Theta X_i + \frac{1}{n - m_n} \sum_{i=1}^{n} \langle X_i, \cdot\rangle\, \delta_i\, \varepsilon_i = \Theta \hat{\Gamma}_{n,obs} + U_{n,obs},$$

with $U_{n,obs} = \frac{1}{n-m_n} \sum_{i=1}^{n} \langle X_i, \cdot\rangle\, \delta_i\, \varepsilon_i$. Then, ε being independent of X and δ (MAR assumption), we deduce

$$\mathbb{E}\big(Y_{\ell,imp} - \langle \theta, X_\ell\rangle\big)^2 = \mathbb{E}\big(\Theta \hat{\Pi}_{k_n,obs} X_\ell - \Theta X_\ell\big)^2 + \mathbb{E}\left(\frac{1}{n - m_n} \sum_{i=1}^{n} \langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1} X_\ell\rangle\, \delta_i \varepsilon_i\right)^2$$
$$\le 2\,\mathbb{E}\big(\Theta \hat{\Pi}_{k_n,obs} X_\ell - \Theta \Pi_{k_n,obs} X_\ell\big)^2 + 2\,\mathbb{E}\big(\Theta \Pi_{k_n,obs} X_\ell - \Theta X_\ell\big)^2 + \mathbb{E}\left(\frac{1}{n - m_n} \sum_{i=1}^{n} \langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1} X_\ell\rangle\, \delta_i \varepsilon_i\right)^2,$$

where $\Pi_{k_n,obs}$ is the projection onto the subspace $\mathrm{span}(v_{1,obs}, \ldots, v_{k_n,obs})$, where $v_{1,obs}, \ldots, v_{k_n,obs}$ are the first $k_n$ eigenfunctions of the covariance operator $\Gamma_{n,obs}$. For a single imputation, the end of the proof of Theorem 2.1 is based on the following lemmas. For the aggregate error term of the $m_n$ imputed values, it is just a sum of $m_n$ terms that behave like the term for a single imputation.

Lemma 5.1. We have

$$\mathbb{E}\big(\Theta \hat{\Pi}_{k_n,obs} X_\ell - \Theta \Pi_{k_n,obs} X_\ell\big)^2 = o\left(\frac{\lambda_{k_n} k_n^2}{n - m_n} + \frac{k_n}{n - m_n}\right).$$

Lemma 5.2. We have

$$\mathbb{E}\left(\frac{1}{n - m_n}\sum_{i=1}^{n} \langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1} X_\ell\rangle\, \delta_i \varepsilon_i\right)^2 = \frac{\sigma_\varepsilon^2 k_n}{n - m_n} + o\left(\frac{k_n}{n - m_n}\right).$$

Lemma 5.3. We have

$$\mathbb{E}\big(\Theta \Pi_{k_n,obs} X_\ell - \Theta X_\ell\big)^2 = \sum_{j=k_n+1}^{+\infty} \big(\Theta \Gamma^{1/2} v_j\big)^2.$$


5.2. Proof of Lemma 5.1

Writing $X_\ell$ in the basis $(v_j)_{j \ge 1}$, we obtain

$$\mathbb{E}\big(\Theta\hat{\Pi}_{k_n,obs}X_\ell - \Theta\Pi_{k_n,obs}X_\ell\big)^2 = \sum_{j=1}^{+\infty}\sum_{j'=1}^{+\infty} \mathbb{E}\Big[\langle X_\ell, v_j\rangle\langle X_\ell, v_{j'}\rangle\, \Theta\big(\hat{\Pi}_{k_n,obs}-\Pi_{k_n,obs}\big)v_j\, \Theta\big(\hat{\Pi}_{k_n,obs}-\Pi_{k_n,obs}\big)v_{j'}\Big].$$

Noticing that the variable $X_\ell$ corresponds to the missing data $Y_\ell$, hence is independent of $\hat{\Pi}_{k_n,obs}$, we get

$$\mathbb{E}\big(\Theta\hat{\Pi}_{k_n,obs}X_\ell - \Theta\Pi_{k_n,obs}X_\ell\big)^2 = \sum_{j=1}^{+\infty}\sum_{j'=1}^{+\infty} \langle\Gamma v_j, v_{j'}\rangle\,\mathbb{E}\Big[\Theta\big(\hat{\Pi}_{k_n,obs}-\Pi_{k_n,obs}\big)v_j\, \Theta\big(\hat{\Pi}_{k_n,obs}-\Pi_{k_n,obs}\big)v_{j'}\Big] = \sum_{j=1}^{+\infty} \lambda_j\, \mathbb{E}\Big[\big(\Theta\big(\hat{\Pi}_{k_n,obs}-\Pi_{k_n,obs}\big)v_j\big)^2\Big].$$

13

14

15

Now, following the proof of Proposition 15 in Crambes and Mas (2013), for any m ≥ 1 we denote Bm the oriented circle of the complex plane with center λm and radius ρm /2 where ρm = min (λm − λm+1 , λm−1 − λm ) for m ≥ 2 and ρ1 = λ2 − λ1 . With the convexity assumption (A.1), we actually have ρm = λm − λm+1 for all m ≥ 1. With these notations, denoting by ι ˆkn ,obs and Πkn ,obs can be written the complex number such that ι2 = −1, the difference between the projection operators Π

ˆkn ,obs − Πkn ,obs = Π

kn ∫ 1 ∑

2πι

m=1

) ( ˆn,obs Λ(z)dz , Λ(z) Γ − Γ Bm

where Λ(z) = (zI − Γ )−1 . Noticing that Λ(z)vj =

( ) ˆkn ,obs − Πkn ,obs vj = Θ Π

kn 1 ∑

2πι

=

17

18

20

m=1

vj , we deduce

( ) ˆn,obs Λ(z) Γ − Γ Bm

) +∞ ( ∑ ˆn,obs vj , vj′ ⟩vj′ ⟨ Γ −Γ Θ dz . (z − λj′ )(z − λj ) Bm ′ ∫

j =1

kn ∫ ∑

Bm

⎧ 0, ⎪ ⎨ dz 0, = −1 ⎪ (z − λj′ )(z − λj ) ⎩ (λj − λj′ )−1 , (λj′ − λj ) ,

if j, j′ > kn , if j, j′ ≤ kn , if j ≤ kn < j′ , if j′ ≤ kn < j.

hence we deduce

⎡ ⎛ ⎞2 ⎤ ( ) kn +∞ ∑ ∑ ˆ ′ ( )2 ⟨ Γ − Γn,obs vj , vj ⟩ 1 ⎥ ˆkn ,obs Xℓ − ΘΠkn ,obs Xℓ = E ⎢ E ΘΠ λj ⎝ Θ vj′ ⎠ ⎦ ⎣ 2 ′ 4π λ − λ j j ′ j=1

j =kn +1





j=kn +1

22

24

25

⎞2 ⎤

kn +∞ ∑ ˆn,obs vj , vj′ ⟩ ⟨ Γ −Γ ⎥ ⎢ 1 ∑ +E⎣ 2 λj ⎝ Θ vj′ ⎠ ⎦ . ′ − λj 4π λ j ′

21

23

dz z − λj

Still using the results from Crambes and Mas (2013), we have

m=1

19

1 ∑ 2πι



m=1 kn

16

Θ

1 z −λj

(

)

j =1

In the following, C denotes a generic constant. We denote by E(A) and E(B) the two terms above. We start with the computation of E(A). Using the same technique as in Crambes and Mas (2013), we get the following bound:

$$\mathbb{E}\Big(\langle\big(\Gamma - \hat{\Gamma}_{n,obs}\big)v_j, v_{j'}\rangle\,\langle\big(\Gamma - \hat{\Gamma}_{n,obs}\big)v_j, v_r\rangle\Big) \le \frac{C}{n-m_n}\,\lambda_j\sqrt{\lambda_{j'}}\sqrt{\lambda_r},$$

noticing that the rate of convergence in $n$ given in Crambes and Mas (2013) is here transformed into a rate in $n - m_n$, because $\hat{\Gamma}_{n,obs}$ is built from the $n - m_n$ observed data. Hence we deduce


$$\mathbb{E}\left(\sum_{j'=k_n+1}^{+\infty} \frac{\langle(\Gamma-\hat{\Gamma}_{n,obs})v_j, v_{j'}\rangle}{\lambda_j - \lambda_{j'}}\,\Theta v_{j'}\right)^2 = \sum_{j'=k_n+1}^{+\infty}\sum_{r=k_n+1}^{+\infty} \frac{\mathbb{E}\Big(\langle(\Gamma-\hat{\Gamma}_{n,obs})v_j, v_{j'}\rangle\,\langle(\Gamma-\hat{\Gamma}_{n,obs})v_j, v_r\rangle\Big)\,\Theta v_{j'}\,\Theta v_r}{(\lambda_j-\lambda_{j'})(\lambda_j-\lambda_r)} \le \frac{C\,\lambda_j}{n-m_n}\left(\sum_{j'=k_n+1}^{+\infty} \frac{\sqrt{\lambda_{j'}}}{\lambda_j-\lambda_{j'}}\,\Theta v_{j'}\right)^2.$$

Coming back to the computation of E(A), we can write (using Lemma 12 in Crambes and Mas, 2013)

$$E(A) \le \frac{C}{n-m_n}\sum_{j=1}^{k_n} \frac{\lambda_j^2\,\lambda_{k_n+1}}{(\lambda_j-\lambda_{k_n+1})^2}\left(\sum_{j'=k_n+1}^{+\infty}\Theta v_{j'}\right)^2 \le \frac{C\,\lambda_{k_n+1}}{n-m_n}\sum_{j=1}^{k_n}\frac{(k_n+1)^2}{(k_n+1-j)^2}\left(\sum_{j'=k_n+1}^{+\infty}\Theta v_{j'}\right)^2 \le \frac{C\,\lambda_{k_n+1}(k_n+1)^2}{n-m_n}\sum_{j=1}^{k_n}\frac{1}{j^2}\left(\sum_{j'=k_n+1}^{+\infty}\Theta v_{j'}\right)^2.$$

As $\theta \in L^2([0,1])$ (hence θ is integrable), we finally get

$$E(A) \le \frac{C\,\lambda_{k_n} k_n^2}{n-m_n}\, a_n,$$

where $(a_n)_{n\ge1}$ is a sequence of real numbers going to zero as n goes to infinity. We are now interested in the computation of E(B). Beginning in the same way as for E(A), and still using Lemma 12 in Crambes and Mas (2013), we get

$$E(B) \le \frac{C}{n-m_n}\sum_{j=k_n+1}^{+\infty}\lambda_j^2\left(\sum_{j'=1}^{k_n}\frac{\sqrt{\lambda_{j'}}}{\lambda_{j'}-\lambda_j}\,\Theta v_{j'}\right)^2 \le \frac{C}{n-m_n}\sum_{j=k_n+1}^{+\infty}\lambda_j\left(\sum_{j'=1}^{k_n}\frac{\lambda_{j'}}{\lambda_{j'}-\lambda_j}\,\Theta v_{j'}\right)^2 \le \frac{C}{n-m_n}\sum_{j=k_n+1}^{+\infty}\lambda_j\left(\frac{j}{j-k_n}\right)^2\left(\sum_{j'=1}^{k_n}\Theta v_{j'}\right)^2.$$

Now, again with the integrability of θ and the fact that

$$\sum_{j=k_n+1}^{+\infty}\lambda_j\left(\frac{j}{j-k_n}\right)^2 \le C\, k_n\, b_n,$$

with $(b_n)_{n\ge1}$ going to zero as n goes to infinity (see Crambes and Mas, 2013, p. 19, in the proof of Proposition 15), we conclude

$$E(B) \le \frac{C\, k_n}{n-m_n}\, b_n,$$

and this achieves the proof of Lemma 5.1.

5.3. Proof of Lemma 5.2

Let us denote

$$T_n = \frac{1}{n-m_n}\sum_{i=1}^{n}\langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle\,\delta_i\varepsilon_i.$$

We can write

$$T_n^2 = \frac{1}{(n-m_n)^2}\sum_{i=1}^{n}\langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle^2\,\delta_i^2\varepsilon_i^2 + \frac{1}{(n-m_n)^2}\sum_{i=1}^{n}\sum_{\substack{i'=1\\ i'\neq i}}^{n}\langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle\,\langle X_{i'}, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle\,\delta_i\delta_{i'}\varepsilon_i\varepsilon_{i'}.$$


From the independence between ε and X and the MAR assumption, the expectation of the second term above is zero, hence

$$\mathbb{E}\big(T_n^2\big) = \frac{1}{(n-m_n)^2}\sum_{i=1}^{n}\mathbb{E}\Big[\langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle^2\,\delta_i^2\varepsilon_i^2\Big] = \frac{\sigma_\varepsilon^2}{n-m_n}\,\mathbb{E}\Big[\langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle^2\,\delta_i^2\Big],$$

the index i corresponding to an observed data point in the sample (and consequently $\delta_i = 1$ for this observation). We finally get

$$\mathbb{E}\big(T_n^2\big) = \frac{\sigma_\varepsilon^2}{n-m_n}\,\mathbb{E}\Big[\langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle^2\Big].$$

Following the same lines as the proofs of Proposition 17 and Lemma 19 in Crambes and Mas (2013), we obtain

$$\mathbb{E}\Big[\langle X_i, \hat{\Pi}_{k_n,obs}\big(\hat{\Gamma}_{n,obs}\big)^{-1}X_\ell\rangle^2\Big] = k_n + o(k_n),$$

which achieves the proof of the lemma.


5.4. Proof of Lemma 5.3

The proof of this lemma is quite immediate, noticing that

$$\mathbb{E}\big(\Theta\Pi_{k_n,obs}X_\ell - \Theta X_\ell\big)^2 = \mathbb{E}\Big(\big\langle\big(\Pi_{k_n,obs} - I\big)X_\ell, \theta\big\rangle^2\Big) = \big\langle\big(I - \Pi_{k_n,obs}\big)\Gamma\theta, \theta\big\rangle = \sum_{j=k_n+1}^{+\infty}\big(\Theta\Gamma^{1/2}v_j\big)^2.$$


5.5. Proof of Theorem 2.2

From Theorem 2.1, the last term in the asymptotic development is negligible, so we just have to achieve the usual trade-off between the squared bias and the variance. Given that

$$\sum_{j=k_n+1}^{+\infty}\big(\Theta\Gamma^{1/2}v_j\big)^2 = \sum_{j=k_n+1}^{+\infty} L^2\,\phi(j),$$

we approximate this sum with the integral $\int_x^{+\infty} L^2\,\phi(t)\,dt$, which gives the desired result.

5.6. Proof of Theorem 2.3

First, if we follow the same lines as the proofs of Lemmas 5.1 and 5.3 in Theorem 2.1, but with the whole sample $X_1, \ldots, X_n$, we get

$$\mathbb{E}\big(\Theta\hat{\Pi}_{k_n}X_{new} - \Theta\Pi_{k_n}X_{new}\big)^2 = o\left(\frac{\lambda_{k_n}k_n^2}{n} + \frac{k_n}{n}\right), \tag{15}$$

and

$$\mathbb{E}\big(\Theta\Pi_{k_n}X_{new} - \Theta X_{new}\big)^2 = \sum_{j=k_n+1}^{+\infty}\big(\Theta\Gamma^{1/2}v_j\big)^2. \tag{16}$$

(16)

j=kn +1 22 23

24

25

26

27 28

Now, let us denote, for i = 1, . . . , n, εi,imp = Yi,imp − ⟨θ, Xi ⟩, and εi⋆ = δi εi + (1 − δi )εi,imp . We immediately can write εi,imp = εi + Yi,imp − Yi , and εi⋆ = εi + (1 − δi )(Yi,imp − Yi ). Then, following the proof of Lemma 5.2 in Theorem 2.1, we denote Sn =

n 1∑

( ) ˆkn Γ ˆn −1 Xnew ⟩εi⋆ , ⟨Xi , Π

n

i=1

whence, Sn2 =

n 1 ∑

n2

n n ( ) ( ) ( ) ( ) 1 ∑∑ ˆkn Γ ˆn −1 Xnew ⟩2 εi⋆ 2 + ˆkn Γ ˆn −1 Xnew ⟩⟨Xi′ , Π ˆkn Γ ˆn −1 Xnew ⟩εi⋆ εi⋆′ . ⟨X i , Π ⟨Xi , Π 2

n

i=1

i=1 i′ =1 i′ ̸ =i

We notice that, for i ̸ = i′ , we have

E εi⋆ εi⋆′ ≤ 4E(Yi,imp − Yi )2 ≤ 8 E(Yi,imp − ⟨θ, Xi ⟩)2 + σε2 .

(

)

[

]

Please cite this article as: C. Crambes and Y. Henchiri, Regression imputation in the functional linear model with missing values in the response. Journal of Statistical Planning and Inference (2018), https://doi.org/10.1016/j.jspi.2018.12.004.

JSPI: 5694

C. Crambes and Y. Henchiri / Journal of Statistical Planning and Inference xxx (xxxx) xxxx

17

This bound and the lines of the proof of Lemma 5.2 give

$$\mathbb{E}\left(\frac{1}{n}\sum_{i=1}^{n}\langle X_i, \hat{\Pi}_{k_n}\hat{\Gamma}_n^{-1}X_{new}\rangle\,\varepsilon_i^\star\right)^2 = O\left(\frac{(n-m_n)\,k_n}{n^2} + \frac{m_n^2\, k_n}{n^2}\right). \tag{17}$$

Now, combining relations (15)–(17) with the facts that $m_n = o(n)$ and $m_n^2 k_n = O(n)$ (due to $a_n^2 n = o(1)$), we get the desired result.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jspi.2018.12.004.

References

Bosq, D., 2000. Linear Processes in Function Spaces: Theory and Applications, first ed. Springer, New York.
Bugni, F.A., 2012. Specification test for missing functional data. Econ. Theory 28, 959–1002.
Cai, T.T., Hall, P., 2006. Prediction in functional linear regression. Ann. Statist. 34, 2159–2179.
Cardot, H., Ferraty, F., Sarda, P., 1999. Functional linear model. Statist. Probab. Lett. 45, 11–22.
Cardot, H., Ferraty, F., Sarda, P., 2003. Spline estimators for the functional linear model. Statist. Sinica 13, 571–591.
Cardot, H., Johannes, J., 2010. Thresholding projection estimators in functional linear models. J. Multivariate Anal. 101, 395–408.
Cheng, P.E., 1994. Nonparametric estimation of mean functionals with data missing at random. J. Amer. Statist. Assoc. 89, 81–87.
Chiou, J.-M., Zhang, Y.-C., Chen, W.-H., Chang, C.-W., 2014. A functional data approach to missing value imputation and outlier detection for traffic flow data. Transportmetrica B: Transp. Dyn. 2, 106–129.
Chu, C.K., Cheng, P.E., 1995. Nonparametric regression estimation with missing data. J. Statist. Plann. Inference 48, 85–99.
Crambes, C., Kneip, A., Sarda, P., 2009. Smoothing splines estimators for functional linear regression. Ann. Statist. 37, 35–72.
Crambes, C., Mas, A., 2013. Asymptotics of prediction in functional linear regression with functional outputs. Bernoulli 19, 2627–2651.
De Boor, C., 1978. A Practical Guide to Splines. Applied Mathematical Sciences, Springer, New York.
Ferraty, F., Sued, M., Vieu, P., 2013. Mean estimation with data missing at random for functional covariables. Statistics 47, 688–706.
Ferraty, F., Vieu, P., 2006. Nonparametric Functional Data Analysis: Theory and Practice. Springer-Verlag, New York.
Graham, J.W., 2012. Missing Data: Analysis and Design. Springer, New York.
Hall, P., Horowitz, J.L., 2007. Methodology and convergence rates for functional linear regression. Ann. Statist. 35, 70–91.
He, Y., Yucel, R., Raghunathan, T.E., 2011. A functional multiple imputation approach to incomplete longitudinal data. Stat. Med. 30, 1137–1156.
Horváth, L., Kokoszka, P., 2012. Inference for Functional Data with Applications. Springer-Verlag, New York.
Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., Kolehmainen, M., 2004. Methods for imputation of missing values in air quality datasets. Atmos. Environ. 38, 2895–2907.
Little, R.J.A., Rubin, D.B., 2002. Statistical Analysis with Missing Data, second ed. John Wiley, New York.
Manski, C.F., 1995. Identification Problems in the Social Sciences. Harvard University Press.
Manski, C.F., 2003. Partial Identification of Probability Distributions. Springer-Verlag, New York.
Mojirsheibani, M., 2007. Nonparametric curve estimation with missing data: a general empirical process approach. J. Statist. Plann. Inference 137, 2733–2758.
Preda, C., Saporta, G., Hadj, M.M.H., 2010. The NIPALS algorithm for functional data. Rev. Roumaine Math. Pures Appl. 55, 315–326.
Ramsay, J.O., Dalzell, C., 1991. Some tools for functional data analysis. J. R. Stat. Soc. Ser. B 53, 539–572.
Ramsay, J.O., Hooker, G., Graves, S., 2009. Functional Data Analysis with R and MATLAB, first ed. Springer, New York.
Ramsay, J.O., Silverman, B.W., 2005. Functional Data Analysis, second ed. Springer-Verlag, New York.
Shi, J.Q., Choi, T., 2011. Gaussian Process Regression Analysis for Functional Data. Chapman and Hall/CRC, London.
Van Buuren, S., 2012. Flexible Imputation of Missing Data. Chapman and Hall/CRC, Hoboken.
Wang, Q., Linton, O., Härdle, W., 2004. Semiparametric regression analysis with missing response at random. J. Amer. Statist. Assoc. 99, 334–345.
Yuan, M., Cai, T.T., 2010. A reproducing kernel Hilbert space approach to functional linear regression. Ann. Statist. 38, 3412–3444.
Zhang, J.-T., 2014. Analysis of Variance for Functional Data. Chapman and Hall/CRC, New York.
