Probing the geometry of data with diffusion Fréchet functions

Appl. Comput. Harmon. Anal. 47 (2019) 935–947 Contents lists available at ScienceDirect Applied and Computational Harmonic Analysis www.elsevier.com...

Diego H. Díaz Martínez a, Christine H. Lee b, Peter T. Kim b,c, Washington Mio a,∗

a Department of Mathematics, Florida State University, Tallahassee, FL 32306-4510, USA
b Department of Pathology and Molecular Medicine, McMaster University, St Joseph's Healthcare, 50 Charlton Avenue E, 424 Luke Wing, Hamilton, ON L8N4A6, Canada
c Department of Mathematics and Statistics, University of Guelph, Guelph, ON N1G 2W1, Canada

Article info

Article history: Received 10 May 2016; received in revised form 24 November 2017; accepted 31 January 2018; available online 13 February 2018. Communicated by Dominique Picard.

MSC: 62-07, 92C50

Keywords: Distributions on networks; Fréchet functions; Diffusion distances; Community detection

Abstract

The state of many complex systems, such as ecosystems formed by multiple microbial taxa that interact in intricate ways, is often summarized as a probability distribution on the nodes of a weighted network. This paper develops methods for modeling the organization of such data, as well as their Euclidean counterparts, across spatial scales. Using the notion of diffusion distance, we introduce diffusion Fréchet functions and diffusion Fréchet vectors associated with probability distributions on Euclidean space and the vertex set of a weighted network, respectively. We prove that these functional statistics are stable with respect to the Wasserstein distance between probability measures, thus yielding robust descriptors of their shapes. We provide several examples that illustrate the geometric characteristics of a distribution that are captured by multi-scale Fréchet functions and vectors. © 2018 Elsevier Inc. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (W. Mio).
https://doi.org/10.1016/j.acha.2018.01.003
1063-5203/© 2018 Elsevier Inc. All rights reserved.

1. Introduction

The primary goal of this paper is to study the shape of probability distributions on the nodes of weighted networks, such as those describing the evolving states of complex ecosystems. For example, a microbial community may be modeled on a network in which each node represents a taxon and the main interactions between pairs of taxa are represented by weighted edges, the weights reflecting levels of interaction. In this formulation, the relative abundance of microbial taxa in a sample may be viewed as a probability distribution on the vertex set of the network. Modeling and quantifying the organization and variation of communities across samples and scales are core problems. Similar questions arise in other contexts, such as probability measures defined on Euclidean spaces or other metric spaces. We develop methods that (i) capture the geometry of data and probability measures across a continuum of spatial scales and (ii) integrate information obtained at all scales. In this paper, we focus on the Euclidean and network cases.

We analyze distributions with the aid of a 1-parameter family of diffusion distances $d_t$, $t > 0$, where $t$ is treated as the scale parameter [1]. The definition of $d_t$ is recalled in Section 2. For (Borel) probability measures $\alpha$ on $\mathbb{R}^d$, our approach is based on a functional statistic $V_{\alpha,t} : \mathbb{R}^d \to \mathbb{R}$ derived from $d_t$ and termed the diffusion Fréchet function of $\alpha$ at scale $t$. Similarly, we define diffusion Fréchet vectors $F_{\xi,t}$ associated with a distribution $\xi$ on the vertex set of a network.

The (classical) Fréchet function of a random variable $y \in \mathbb{R}^d$ with finite second moment and distributed according to a probability measure $\alpha$ is defined as

$$V_\alpha(x) = E_\alpha\left[\|y - x\|^2\right] = \int_{\mathbb{R}^d} \|y - x\|^2 \, d\alpha(y). \tag{1}$$

It is well known that the mean (or expected value) of $y$ is the unique minimizer of $V_\alpha$. This is a useful statistic; however, it fails to capture the most basic characteristics of the shape of multimodal or more complex distributions. Aiming at more effective descriptors, we introduce diffusion Fréchet functions and show that they capture richer information about the shape of $\alpha$. At scale $t > 0$, we define the diffusion Fréchet function $V_{\alpha,t} : \mathbb{R}^d \to \mathbb{R}$ by

$$V_{\alpha,t}(x) = E_\alpha\left[d_t^2(y, x)\right] = \int_{\mathbb{R}^d} d_t^2(y, x) \, d\alpha(y), \tag{2}$$

simply replacing the Euclidean distance in (1) with the diffusion distance $d_t$. $V_{\alpha,t}(x)$ may be viewed as a localized second moment of $\alpha$ about $x$. In the network setting, Fréchet vectors are defined in a similar manner through the diffusion kernel associated with the graph Laplacian. Unlike $V_\alpha$, diffusion Fréchet functions and vectors typically exhibit rich profiles that reflect the shape of the distribution at different scales. This point is illustrated with simulated data in Section 3. We prove stability theorems for diffusion Fréchet functions and vectors with respect to the Wasserstein distance between probability measures [2]. The results ensure that both $V_{\alpha,t}$ and $F_{\xi,t}$ yield robust descriptors, useful for data analysis.

Community detection in networks has been studied with a variety of methods (cf. [3–5]), including the problem of selecting particular scales that seem to be most relevant to a problem. Over the years, there has been a growing trend in data analysis that takes the somewhat dual view (cf. [6–8]) that multiscale data summaries are especially informative, as important features of a distribution may be revealed at different scales. This is the point of view adopted in this paper, which is not to say that identifying important scales is relegated to a secondary plane. Rather, the view is that machine learning and other techniques may be applied to a multiscale summary to identify combinations of scales that are most informative. Another important angle in a multiscale formulation is that persistence of features across scales is a powerful indicator of their relevance. We provide a simple illustration of this point in Section 4. We also point out that the present method is tailored to probability measures on weighted networks, not just structures solely characterized by the underlying network. This is a key difference that allows us to use the method, for example, to model variation in the organization of ecosystems such as the microbiome of different individuals. As a matter of fact, this paper was largely motivated by the problem of understanding changes in the composition of the human gut microbiome related to Clostridium difficile infection and treatment with fecal microbiota transplantation [9].

The rest of the paper is organized as follows. Section 2 reviews the concept of diffusion distance in the Euclidean and network settings. We define multiscale diffusion Fréchet functions in Section 3 and prove that they are stable with respect to the Wasserstein distance between probability measures in Euclidean spaces. The network counterpart is developed in Section 4. We conclude with a summary and some discussion in Section 5.

2. Diffusion distances

Our multiscale approach to probability measures uses a formulation derived from the notion of diffusion distance [1]. In this section, we review diffusion distances associated with the heat kernel on $\mathbb{R}^d$, followed by a discrete analogue for weighted networks. For $t > 0$, let $G_t : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the diffusion (heat) kernel

$$G_t(x, y) = \frac{1}{C_d(t)} \exp\left(-\frac{\|y - x\|^2}{4t}\right), \tag{3}$$

where $C_d(t) = (4\pi t)^{d/2}$. If an initial distribution of mass on $\mathbb{R}^d$ is described by a Borel probability measure $\alpha$ and its time evolution is governed by the heat equation $\partial_t u = \Delta u$, then the distribution at time $t$ has a smooth density function $\alpha_t(y) = u(y, t)$ given by the convolution

$$u(y, t) = \int_{\mathbb{R}^d} G_t(x, y) \, d\alpha(x). \tag{4}$$

For an initial point mass located at $x \in \mathbb{R}^d$, $\alpha$ is the Dirac measure $\delta_x$ and $u(y, t) = G_t(x, y)$. The diffusion map $\Psi_t : \mathbb{R}^d \to L^2(\mathbb{R}^d)$, given by $\Psi_t(x) = G_t(x, \cdot)$, embeds $\mathbb{R}^d$ into $L^2(\mathbb{R}^d)$. The diffusion distance $d_t$ on $\mathbb{R}^d$ is defined as

$$d_t(x_1, x_2) = \|\Psi_t(x_1) - \Psi_t(x_2)\|_2, \tag{5}$$

the metric that $\mathbb{R}^d$ inherits from $L^2(\mathbb{R}^d)$ via the embedding $\Psi_t$. A calculation shows that

$$d_t^2(x_1, x_2) = \|\Psi_t(x_1)\|_2^2 + \|\Psi_t(x_2)\|_2^2 - 2 G_{2t}(x_1, x_2) = 2\left(\frac{1}{C_d(2t)} - G_{2t}(x_1, x_2)\right), \tag{6}$$

where we used the fact that $\|\Psi_t(x)\|_2^2 = 1/C_d(2t)$, for every $x \in \mathbb{R}^d$. Note that this implies that the metric space $(\mathbb{R}^d, d_t)$ has finite diameter.

Diffusion distances on weighted networks may be defined in a similar way by invoking the graph Laplacian and the associated diffusion kernel. Let $v_1, \ldots, v_n$ be the nodes of a weighted network $K$. The weight of the edge between $v_i$ and $v_j$ is denoted $w_{ij}$, with the convention that $w_{ij} = 0$ if there is no edge between the nodes. We let $W$ be the $n \times n$ matrix whose $(i, j)$-entry is $w_{ij}$. The graph Laplacian is the $n \times n$ matrix $\Delta = D - W$, where $D$ is the diagonal matrix with $d_{ii} = \sum_{k=1}^n w_{ik}$, the sum of the weights of all edges incident with the node $v_i$ (cf. [10]). This definition is based on a finite difference discretization of the Laplacian, except for a sign that makes $\Delta$ a positive semi-definite, symmetric matrix. The diffusion (or heat) kernel at $t > 0$ is the matrix $e^{-t\Delta}$.

Let $\xi$ be a probability distribution on the vertex set $V$ of $K$. If $\xi_i \geq 0$, $1 \leq i \leq n$, is the probability of the vertex $v_i$, we write $\xi$ as the vector $\xi = [\xi_1 \ldots \xi_n]^T \in \mathbb{R}^n$, where $T$ denotes transposition and $\xi_1 + \ldots + \xi_n = 1$. Note that $u = e^{-t\Delta}\xi$ solves the heat equation $\partial_t u = -\Delta u$ with initial condition $\xi$. (The negative sign is due to the convention made in the definition of $\Delta$.) Mass initially distributed according to $\xi$, whose diffusion is governed by the heat equation, has time $t$ distribution $u_t = e^{-t\Delta}\xi$. A point mass at the $i$th vertex is described by the vector $e_i \in \mathbb{R}^n$, whose $j$th entry is $\delta_{ij}$. Thus, $e^{-t\Delta} e_i$ may be viewed as a network analogue of the Gaussian $G_t(x, \cdot)$. For $t > 0$, the diffusion mapping $\psi_t : V \to \mathbb{R}^n$ is defined by $\psi_t(v_i) = e^{-t\Delta} e_i$ and the time $t$ diffusion distance between the vertices $v_i$ and $v_j$ by

$$d_t(i, j) = \left\|e^{-t\Delta} e_i - e^{-t\Delta} e_j\right\|, \tag{7}$$

where $\|\cdot\|$ denotes Euclidean norm. If $0 = \lambda_1 \leq \ldots \leq \lambda_n$ are the eigenvalues of $\Delta$ with orthonormal eigenvectors $\phi_1, \ldots, \phi_n$, then

$$d_t^2(i, j) = \sum_{k=1}^n e^{-2\lambda_k t} \left(\phi_k(i) - \phi_k(j)\right)^2, \tag{8}$$

where $\phi_k(i)$ denotes the $i$th component of $\phi_k$ [1].

3. Diffusion Fréchet functions

The classical Fréchet function $V_\alpha$ of a probability measure $\alpha$ on $\mathbb{R}^d$ with finite second moment is a useful statistic. However, for complex distributions, such as those with a multimodal profile or more intricate geometry, their Fréchet functions usually fail to provide a good description of their shape. Aiming at more effective descriptors, we introduce diffusion Fréchet functions and show that they encode a wealth of information across a full range of spatial scales. At scale $t > 0$, define the diffusion Fréchet function $V_{\alpha,t} : \mathbb{R}^d \to \mathbb{R}$ by

$$V_{\alpha,t}(x) = \int_{\mathbb{R}^d} d_t^2(y, x) \, d\alpha(y). \tag{9}$$

Note that $V_{\alpha,t}$ is defined for any Borel probability measure $\alpha$, not just those with finite second moment, because $(\mathbb{R}^d, d_t)$ has finite diameter. Moreover, $V_{\alpha,t}$ is uniformly bounded. We also point out that this construction is distinct from the standard kernel trick that pushes $\alpha$ forward to a probability measure $(\Psi_t)_*(\alpha)$ on $L^2(\mathbb{R}^d)$, where the usual Fréchet function of $(\Psi_t)_*(\alpha)$ can be used to define such notions as an extrinsic mean of $\alpha$. $V_{\alpha,t}$ is an intrinsic statistic and typically a function on $\mathbb{R}^d$ with a complex profile, as we shall see in the examples below. From (2) and (6), it follows that

$$V_{\alpha,t/2}(x) = \frac{2}{C_d(t)} - 2\int_{\mathbb{R}^d} G_t(x, y) \, d\alpha(y) = \frac{2}{C_d(t)} - 2\alpha_t(x), \tag{10}$$

where $\alpha_t(x) = u(x, t)$, the solution of the heat equation $\partial_t u = \Delta u$ with initial condition $\alpha$. If $p_1, \ldots, p_n \in \mathbb{R}^d$ are data points sampled from $\alpha$, then the diffusion Fréchet function of the empirical measure $\alpha_n = \frac{1}{n}\sum_{i=1}^n \delta_{p_i}$ is closely related to the Gaussian density estimator $\hat{\alpha}_{n,t}(x) = \frac{1}{n}\sum_{i=1}^n G_t(x, p_i)$ derived from the sample points. Indeed, it follows from (10) that

$$V_{\alpha_n,t/2}(x) = \frac{2}{C_d(t)} - 2\int_{\mathbb{R}^d} G_t(x, y) \, d\alpha_n(y) = \frac{2}{C_d(t)} - 2\hat{\alpha}_{n,t}(x). \tag{11}$$
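The relation in (11) between the diffusion Fréchet function and the Gaussian density estimator can be checked numerically. The following is a minimal NumPy sketch and not code from the paper: the function names, the two-cluster sample on the real line, and the evaluation grid are illustrative assumptions. It evaluates $V_{\alpha_n,t}$ directly from the closed form (6) and compares it with the density estimator at time $2t$.

```python
import numpy as np

def heat_kernel(x, y, t):
    """Heat kernel G_t(x, y) of Eq. (3); x and y broadcast to shape (..., d)."""
    d = np.shape(x)[-1]
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    return np.exp(-sq_dist / (4.0 * t)) / (4.0 * np.pi * t) ** (d / 2.0)

def diffusion_dist_sq(x, y, t):
    """Squared diffusion distance d_t^2(x, y) in closed form, Eq. (6)."""
    d = np.shape(x)[-1]
    c_2t = (8.0 * np.pi * t) ** (d / 2.0)          # C_d(2t)
    return 2.0 * (1.0 / c_2t - heat_kernel(x, y, 2.0 * t))

def diffusion_frechet(x, points, t):
    """V_{alpha_n,t} at the rows of x for the empirical measure of `points`, Eq. (9)."""
    return diffusion_dist_sq(x[:, None, :], points[None, :, :], t).mean(axis=1)

# Hypothetical sample: two evenly spaced clusters on the real line (d = 1).
points = np.concatenate([np.linspace(-2.5, -1.5, 200),
                         np.linspace(1.5, 2.5, 200)])[:, None]
grid = np.linspace(-5.0, 5.0, 501)[:, None]
t = 0.1

V = diffusion_frechet(grid, points, t)

# Gaussian density estimator at time 2t, so that Eq. (11) applies at scale t.
kde = heat_kernel(grid[:, None, :], points[None, :, :], 2.0 * t).mean(axis=1)
V_from_kde = 2.0 / np.sqrt(8.0 * np.pi * t) - 2.0 * kde

# Interior grid points where V has a strict local minimum.
is_min = (V[1:-1] < V[:-2]) & (V[1:-1] < V[2:])
minima = grid[1:-1, 0][is_min]
```

In this sketch `V` and `V_from_kde` agree up to rounding, and `minima` holds the grid locations of the valleys of the Fréchet function, one per cluster at this scale.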

Thus, diffusion Fréchet functions provide a new interpretation of Gaussian density estimators, essentially as second moments with respect to diffusion distances. This opens up interesting new perspectives. For example, the classical Fréchet function $V_\alpha : \mathbb{R}^d \to \mathbb{R}$ (see (1)) may be viewed as the trace of the covariance tensor field $\Sigma_\alpha : \mathbb{R}^d \to \mathbb{R}^d \otimes \mathbb{R}^d$ given by

$$\Sigma_\alpha(x) = E_\alpha\left[(y - x) \otimes (y - x)\right] = \int_{\mathbb{R}^d} (y - x) \otimes (y - x) \, d\alpha(y). \tag{12}$$

Thus, it is natural to ask whether it is possible to define diffusion covariance tensor fields $\Sigma_{\alpha,t}$ that capture the modes of variation of $\alpha$, about each $x \in \mathbb{R}^d$ at all scales, bearing a close relationship to $V_{\alpha,t}$. A multiscale approach to covariance along related lines has been developed in [11,12].

Fig. 1. Evolution of the diffusion Fréchet function across scales.

The following examples illustrate that Fréchet functions encode valuable structural and dynamical information about datasets.

Example 1. Here we consider the dataset highlighted in blue in Fig. 1, comprising $n = 400$ points $p_1, \ldots, p_n$ on the real line, grouped into two clusters. The figure shows the evolution of the diffusion Fréchet function for the empirical measure $\alpha_n = \sum_{i=1}^n \delta_{p_i}/n$ for increasing values of the scale parameter. At scale $t = 0.1$, the Fréchet function has two well defined "valleys", each corresponding to a data cluster. At smaller scales, $V_{\alpha_n,t}$ captures finer properties of the organization of the data, whereas at larger scales $V_{\alpha_n,t}$ essentially views the data as comprising a single cluster. For probability distributions on the real line, the Fréchet function levels off to $1/\sqrt{2\pi t}$, as $|x| \to \infty$. The local minima of $V_{\alpha,t}$ provide a generalization of the mean of the distribution and a summary of the organization of the data into sub-communities across multiple scales. Elucidating such organization in data in an easily interpretable manner is one of our principal aims.

Example 2. In this example, we consider the dataset formed by $n = 1000$ points in $\mathbb{R}^2$, shown on the first panel of Fig. 2. The other panels show the diffusion Fréchet functions for the corresponding empirical measure $\alpha_n$, calculated at increasing scales, and the gradient field $-\nabla V_{\alpha_n,t}$ at $t = 2$, whose behavior reflects the organization of the data at that scale.

Example 3. This example illustrates the evolution of a dataset consisting of $n = 3158$ points under the (negative) gradient flow of the Fréchet function of the associated empirical measure at scale $t = 0.2$. Panel (a) in Fig. 3 shows the original data and the other panels show various stages of the evolution towards the attractors of the system. This may be viewed as a technique for simplifying a dataset at a given scale.
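The gradient flow of Example 3 can be sketched with explicit Euler steps. The code below is an illustration under stated assumptions, not the authors' implementation: the function names, the step size, the number of steps and the small two-cluster sample in $\mathbb{R}^2$ are all hypothetical, and the Fréchet function is computed once from the original sample and held fixed during the flow. The gradient formula follows from differentiating (6) under the integral in (9): $\nabla V_{\alpha_n,t}(x) = \frac{1}{2t}\,\frac{1}{n}\sum_i (x - p_i)\, G_{2t}(p_i, x)$.

```python
import numpy as np

def grad_frechet(x, points, t):
    """Gradient of V_{alpha_n,t} at each row of x (shape (m, d)).

    Uses grad V(x) = (1 / (2t)) * mean_i (x - p_i) * G_{2t}(p_i, x).
    """
    d = points.shape[1]
    diff = x[:, None, :] - points[None, :, :]                  # (m, n, d)
    sq_dist = np.sum(diff ** 2, axis=-1)                       # (m, n)
    g_2t = np.exp(-sq_dist / (8.0 * t)) / (8.0 * np.pi * t) ** (d / 2.0)
    return (diff * g_2t[:, :, None]).mean(axis=1) / (2.0 * t)

def gradient_flow(points, t, step, n_steps):
    """Explicit Euler steps of x' = -grad V_{alpha_n,t}(x), with alpha_n held fixed."""
    x = points.copy()
    for _ in range(n_steps):
        x = x - step * grad_frechet(x, points, t)
    return x

# Hypothetical data: two 3 x 3 patches of points centered at (-2, 0) and (2, 0).
offsets = np.array([[dx, dy] for dx in (-0.3, 0.0, 0.3) for dy in (-0.3, 0.0, 0.3)])
points = np.concatenate([offsets + [-2.0, 0.0], offsets + [2.0, 0.0]])

flowed = gradient_flow(points, t=0.2, step=1.0, n_steps=300)
```

With these settings each patch collapses onto a single attractor close to its center, which is the simplification effect described in Example 3. The step size is a tuning choice; a smaller step with more iterations follows the continuous flow more closely.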
We now prove stability results for $V_{\alpha,t}$ and its gradient field $\nabla V_{\alpha,t}$, which provide a basis for their use as robust functional statistics in data analysis. We begin by reviewing the definition of the Wasserstein distance between two probability measures. A coupling between two probability measures $\alpha$ and $\beta$ is a probability measure $\mu$ on $\mathbb{R}^d \times \mathbb{R}^d$ such that $(p_1)_*(\mu) = \alpha$ and $(p_2)_*(\mu) = \beta$. Here, $p_1$ and $p_2$ denote the projections onto the first and second coordinates, respectively. The set of all such couplings is denoted $\Gamma(\alpha, \beta)$.

Fig. 2. Data points in $\mathbb{R}^2$, heat maps of their diffusion Fréchet functions at increasing scales, and the gradient field at $t = 2$.

Fig. 3. Panel (a) shows the original data and panels (b)–(f) display various stages of the evolution of the data towards the attractors of the gradient flow of the Fréchet function at scale $t = 0.2$.

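For each $p \in [1, \infty)$, let $P_p(\mathbb{R}^d)$ denote the collection of all Borel probability measures $\alpha$ on $\mathbb{R}^d$ whose $p$th moment $M_p(\alpha) = \int \|y\|^p \, d\alpha(y)$ is finite.

Definition 1 (cf. [2]). Let $\alpha, \beta \in P_p(\mathbb{R}^d)$. The $p$-Wasserstein distance $W_p(\alpha, \beta)$ is defined as

$$W_p(\alpha, \beta) = \left(\inf_\mu \iint_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\mu(x, y)\right)^{1/p}, \tag{13}$$

where the infimum is taken over all $\mu \in \Gamma(\alpha, \beta)$.

It is well known that the infimum in (13) is realized by some $\mu \in \Gamma(\alpha, \beta)$ [2]. Moreover, if $p > q \geq 1$, then $P_p(\mathbb{R}^d) \subset P_q(\mathbb{R}^d)$ and $W_q(\alpha, \beta) \leq W_p(\alpha, \beta)$.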

The following lemma will be useful in the proof of the stability results. Part (a) of the lemma is a special case of Lemma 1 of [12]. Let $K_t : \mathbb{R}^d \to \mathbb{R}$ be the heat kernel centered at zero. In the notation used in (3), $K_t(y) = G_t(0, y)$.

Lemma 1. Let $C_d(t) = (4\pi t)^{d/2}$. For any $y_1, y_2 \in \mathbb{R}^d$ and $t > 0$, we have:

(a) $|K_t(y_1) - K_t(y_2)| \leq \dfrac{\|y_1 - y_2\|}{C_d(t)\sqrt{2t \cdot e}}$;

(b) $\|y_1 K_t(y_1) - y_2 K_t(y_2)\| \leq \dfrac{e + 2}{e\, C_d(t)}\, \|y_1 - y_2\|$.

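Proof. (a) For each $\xi \in [0, 1]$, let $y(\xi) = \xi y_1 + (1 - \xi) y_2$. Then,

$$\begin{aligned} |K_t(y_1) - K_t(y_2)| &= \left|\int_0^1 \frac{d}{d\xi} K_t(y(\xi)) \, d\xi\right| \leq \int_0^1 \left|\frac{d}{d\xi} K_t(y(\xi))\right| d\xi \\ &= \int_0^1 \left|\nabla K_t(y(\xi)) \cdot (y_1 - y_2)\right| d\xi \\ &\leq \|y_1 - y_2\| \int_0^1 \|\nabla K_t(y(\xi))\| \, d\xi. \end{aligned} \tag{14}$$

Note that

$$\|\nabla K_t(y)\| = \frac{\|y\|}{2t} K_t(y) \leq \frac{\|y\|}{2t\, C_d(t)} \exp\left(-\frac{\|y\|^2}{4t}\right) \leq \frac{1}{C_d(t)\sqrt{2t \cdot e}}. \tag{15}$$

In the last inequality we used the fact that $\|y\| \exp\left(-\|y\|^2/4t\right) \leq \sqrt{2t/e}$. Hence,

$$|K_t(y_1) - K_t(y_2)| \leq \frac{\|y_1 - y_2\|}{C_d(t)\sqrt{2t \cdot e}}.$$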

(b) Using the same notation as in (a), with $y = y(\xi)$,

$$\frac{d}{d\xi}\left[y K_t(y)\right] = K_t(y)(y_1 - y_2) - y\, \frac{K_t(y)}{2t}\; y \cdot (y_1 - y_2). \tag{16}$$

Hence,

$$\left\|\frac{d}{d\xi}\left[y K_t(y)\right]\right\| \leq \left(K_t(y) + \frac{\|y\|^2 K_t(y)}{2t}\right) \|y_1 - y_2\| \leq \frac{1}{C_d(t)}\left(1 + \frac{2}{e}\right) \|y_1 - y_2\|. \tag{17}$$

In the last inequality we used the facts that $K_t(y) \leq 1/C_d(t)$ and $\|y\|^2 K_t(y) \leq 4t/(e\, C_d(t))$. Writing

$$y_1 K_t(y_1) - y_2 K_t(y_2) = \int_0^1 \frac{d}{d\xi}\left[y K_t(y)\right] d\xi, \tag{18}$$

it follows from (17) and (18) that

$$\|y_1 K_t(y_1) - y_2 K_t(y_2)\| \leq \frac{1}{C_d(t)}\left(1 + \frac{2}{e}\right) \|y_1 - y_2\|, \tag{19}$$

as claimed. □

Theorem 1 (Stability of Fréchet functions). Let $\alpha$ and $\beta$ be Borel probability measures on $\mathbb{R}^d$ with diffusion Fréchet functions $V_{\alpha,t}$ and $V_{\beta,t}$, $t > 0$, respectively. If $\alpha, \beta \in P_1(\mathbb{R}^d)$, then

$$\|V_{\alpha,t} - V_{\beta,t}\|_\infty \leq \frac{1}{C_d(2t)\sqrt{t \cdot e}}\, W_1(\alpha, \beta).$$
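The bound of Theorem 1 can be illustrated numerically on the real line. The sketch below is an assumption-laden example rather than part of the paper: the two samples, the grid and the helper names are hypothetical. It relies on the standard fact that, for two empirical measures on $\mathbb{R}$ with the same number of equally weighted atoms, $W_1$ equals the mean absolute difference of the sorted samples. The supremum norm is approximated by a maximum over a finite grid, which can only underestimate it.

```python
import numpy as np

def frechet_1d(x, sample, t):
    """V_{alpha_n,t}(x) on the real line via Eq. (6), with C_1(2t) = sqrt(8 pi t)."""
    c_2t = np.sqrt(8.0 * np.pi * t)
    g_2t = np.exp(-(x[:, None] - sample[None, :]) ** 2 / (8.0 * t)) / c_2t
    return 2.0 / c_2t - 2.0 * g_2t.mean(axis=1)

def w1_equal_size(a, b):
    """W_1 between two empirical measures on R with the same number of atoms."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

t = 0.5
a = np.linspace(-1.0, 1.0, 50)
b = a + 0.3 * np.sin(3.0 * a)          # deterministic perturbation of the sample
grid = np.linspace(-6.0, 6.0, 2001)

# Left-hand side of Theorem 1, approximated on the grid.
lhs = np.max(np.abs(frechet_1d(grid, a, t) - frechet_1d(grid, b, t)))

# Right-hand side: W_1 / (C_1(2t) * sqrt(t * e)).
rhs = w1_equal_size(a, b) / (np.sqrt(8.0 * np.pi * t) * np.sqrt(t * np.e))
```

Comparing `lhs` with `rhs` for different values of `t` gives a feel for how tight the stability constant is at each scale.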

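Proof. Fix $t > 0$ and $x \in \mathbb{R}^d$. Let $\mu \in \Gamma(\alpha, \beta)$ be a coupling such that

$$\iint_{\mathbb{R}^d \times \mathbb{R}^d} \|z_1 - z_2\| \, d\mu(z_1, z_2) = W_1(\alpha, \beta). \tag{20}$$

Then, we may write

$$V_{\alpha,t}(x) = \iint_{\mathbb{R}^d \times \mathbb{R}^d} d_t^2(z_1, x) \, d\mu(z_1, z_2) \tag{21}$$

and

$$V_{\beta,t}(x) = \iint_{\mathbb{R}^d \times \mathbb{R}^d} d_t^2(z_2, x) \, d\mu(z_1, z_2). \tag{22}$$

By (6), $d_t^2(z, x) = 2/C_d(2t) - 2G_{2t}(z, x)$. Hence,

$$V_{\alpha,t}(x) - V_{\beta,t}(x) = -2 \iint_{\mathbb{R}^d \times \mathbb{R}^d} \left(G_{2t}(z_1, x) - G_{2t}(z_2, x)\right) d\mu(z_1, z_2), \tag{23}$$

which implies that

$$|V_{\alpha,t}(x) - V_{\beta,t}(x)| \leq 2 \iint_{\mathbb{R}^d \times \mathbb{R}^d} \left|G_{2t}(z_1, x) - G_{2t}(z_2, x)\right| d\mu(z_1, z_2). \tag{24}$$

After translating $\alpha$ and $\beta$, we may assume that $x = 0$. Thus, by Lemma 1(a),

$$|V_{\alpha,t}(x) - V_{\beta,t}(x)| \leq \frac{1}{C_d(2t)\sqrt{t \cdot e}} \iint_{\mathbb{R}^d \times \mathbb{R}^d} \|z_1 - z_2\| \, d\mu(z_1, z_2) = \frac{1}{C_d(2t)\sqrt{t \cdot e}}\, W_1(\alpha, \beta), \tag{25}$$

as claimed. □

Theorem 2 (Stability of gradient fields). Let $\alpha$ and $\beta$ be Borel probability measures on $\mathbb{R}^d$ with diffusion Fréchet functions $V_{\alpha,t}$ and $V_{\beta,t}$, $t > 0$, respectively. If $\alpha, \beta \in P_1(\mathbb{R}^d)$, then

$$\sup_{x \in \mathbb{R}^d} \|\nabla V_{\alpha,t}(x) - \nabla V_{\beta,t}(x)\| \leq \frac{e + 2}{e\, C_d(2t)}\, W_1(\alpha, \beta).$$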

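Proof. Let $\mu \in \Gamma(\alpha, \beta)$ be a coupling that realizes $W_1(\alpha, \beta)$. From (23), it follows that

$$\begin{aligned} \|\nabla V_{\alpha,t}(x) - \nabla V_{\beta,t}(x)\| &\leq 2 \iint \|\nabla_x G_{2t}(z_1, x) - \nabla_x G_{2t}(z_2, x)\| \, d\mu(z_1, z_2) \\ &\leq \frac{1}{2t} \iint \|(x - z_1) G_{2t}(z_1, x) - (x - z_2) G_{2t}(z_2, x)\| \, d\mu(z_1, z_2). \end{aligned} \tag{26}$$

We may assume that $x = 0$, so we rewrite the previous inequality as

$$\|\nabla V_{\alpha,t}(x) - \nabla V_{\beta,t}(x)\| \leq \frac{1}{2t} \iint \|z_1 K_{2t}(z_1) - z_2 K_{2t}(z_2)\| \, d\mu(z_1, z_2). \tag{27}$$

Therefore, by Lemma 1(b),

$$\|\nabla V_{\alpha,t}(x) - \nabla V_{\beta,t}(x)\| \leq \frac{e + 2}{e\, C_d(2t)} \iint \|z_1 - z_2\| \, d\mu(z_1, z_2) = \frac{e + 2}{e\, C_d(2t)}\, W_1(\alpha, \beta), \tag{28}$$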

which proves the theorem. □

Remark. It is well known that the Wasserstein distance metrizes weak convergence of probability measures. This implies that if $y_1, \ldots, y_n$ are independent random variables distributed according to $\alpha \in P_1(\mathbb{R}^d)$ and $\alpha_n$ is the corresponding empirical measure, then $W_1(\alpha_n, \alpha) \to 0$ almost surely, as $n \to \infty$ [13]. Hence, the stability theorem ensures that the Fréchet function of $\alpha$ and its gradient field may be estimated reliably from sufficient, independently sampled data. Another basic question is how well one may estimate the Fréchet function of $\alpha$ at a given sample size $n$. Stability combined with convergence rate estimates by Fournier and Guillin [14] provides insights in this direction (cf. [12]).

4. Diffusion Fréchet vectors on networks

In this section, we define a network analogue of diffusion Fréchet functions and prove a stability theorem. Let $\xi = [\xi_1 \ldots \xi_n]^T \in \mathbb{R}^n$ represent a probability distribution on the vertex set $V = \{v_1, \ldots, v_n\}$ of a weighted network $K$. For $t > 0$, define the diffusion Fréchet vector (DFV) as the vector $F_{\xi,t} \in \mathbb{R}^n$ whose $i$th component is

$$F_{\xi,t}(i) = \sum_{j=1}^n d_t^2(i, j)\, \xi_j. \tag{29}$$
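The definition (29), together with the spectral formula (8) for the diffusion distance, translates directly into a few lines of linear algebra. The following NumPy sketch is illustrative only: the function names, the six-node network (two triangles joined by a weak edge) and the distribution `xi` are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def graph_laplacian(W):
    """Graph Laplacian D - W of a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def diffusion_dist_sq(W, t):
    """Matrix of squared diffusion distances d_t^2(i, j), via Eq. (8)."""
    lam, phi = np.linalg.eigh(graph_laplacian(W))    # columns of phi are eigenvectors
    psi = phi * np.exp(-lam * t)                     # psi[i, k] = exp(-lam_k t) * phi_k(i)
    diff = psi[:, None, :] - psi[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def frechet_vector(W, xi, t):
    """Diffusion Fréchet vector F_{xi,t}(i) = sum_j d_t^2(i, j) xi_j, Eq. (29)."""
    return diffusion_dist_sq(W, t) @ np.asarray(xi, dtype=float)

# Hypothetical network: triangles {0, 1, 2} and {3, 4, 5} joined by a weak edge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

xi = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0]) / 3.0   # mass on the first triangle
t = 0.5
D2 = diffusion_dist_sq(W, t)
F = frechet_vector(W, xi, t)

# DFVs across scales form a path in R^6 parameterized by t.
scales = np.linspace(0.1, 5.0, 50)
path = np.stack([frechet_vector(W, xi, s) for s in scales])
```

In this example the entries of `F` are small on the triangle that carries the mass and larger on the other one, so the vector reflects where the distribution sits relative to the network structure at scale `t`. The array `path` is the kind of multiscale summary to which a dimension reduction such as PCA could then be applied.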

Example 4. We illustrate the behavior of DFVs across scales for three distributions on a weighted network with six nodes, as shown in Figs. 4(a) and 4(b). The DFVs may be viewed as paths in $\mathbb{R}^6$ parameterized by the scale parameter $t > 0$. Fig. 4(c) shows a 3D reduction via principal component analysis that lets us visualize the paths. As may be observed, some of the main contrasts between the distributions appear at different scales, suggesting that they may be learned from training data in applications. It also becomes apparent that some differences persist over a whole range of scales and the level of persistence may provide valuable information.

Fig. 4. Diffusion Fréchet vectors: (a) a weighted network; (b) three distributions on the node set; (c) 3D PCA representation of the DFVs viewed as paths in $\mathbb{R}^3$ parameterized by the scale parameter.

To state a stability theorem for DFVs, we first define the Wasserstein distance between two probability distributions on the vertex set $V$ of a weighted network $K$. This requires a base metric on $V$ to play a role similar to that of the Euclidean metric in (13). We use the commute-time distance between the nodes of a weighted network (cf. [15]).

Definition 2. Let $K$ be a connected, weighted network with nodes $v_1, \ldots, v_n$, and let $\phi_1, \phi_2, \ldots, \phi_n$ be an orthonormal basis of $\mathbb{R}^n$ formed by eigenvectors of the graph Laplacian with eigenvalues $0 = \lambda_1 < \lambda_2 \leq \ldots \leq \lambda_n$. The commute-time distance between $v_i$ and $v_j$ is defined as

$$d_{CT}(i, j) = \left(\sum_{k=2}^n \frac{1}{\lambda_k} \left(\phi_k(i) - \phi_k(j)\right)^2\right)^{1/2}. \tag{30}$$

It is simple to verify that $d_{CT}^2(i, j) = 2\int_0^\infty d_t^2(i, j) \, dt$.
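The identity relating the commute-time distance to the integral of the squared diffusion distances over all scales can be checked numerically. The sketch below is a hedged illustration: the weighted path network, the pair of nodes, the integration range and the step of the trapezoid rule are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical connected network: a weighted path on five nodes.
edge_weights = [1.0, 2.0, 1.0, 0.5]
W = np.zeros((5, 5))
for k, w in enumerate(edge_weights):
    W[k, k + 1] = W[k + 1, k] = w
laplacian = np.diag(W.sum(axis=1)) - W
lam, phi = np.linalg.eigh(laplacian)        # ascending order; lam[0] is numerically zero

i, j = 0, 3
coef = (phi[i] - phi[j]) ** 2               # (phi_k(i) - phi_k(j))^2 for each k

# Squared commute-time distance, Eq. (30): the sum skips the zero eigenvalue.
d_ct_sq = np.sum(coef[1:] / lam[1:])

# d_t^2(i, j) on a grid of scales, Eq. (8). The k = 1 term is dropped because
# phi_1 is constant, so it does not contribute.
ts = np.linspace(0.0, 60.0, 60001)
d_t_sq = np.exp(-2.0 * np.outer(ts, lam[1:])) @ coef[1:]

# Trapezoid rule for the integral of d_t^2 over [0, 60]; the tail beyond is negligible here.
integral = np.sum(0.5 * (d_t_sq[1:] + d_t_sq[:-1]) * np.diff(ts))
```

Here `d_ct_sq` and `2.0 * integral` agree to within the accuracy of the quadrature, which illustrates that the commute-time distance aggregates diffusion distances over all scales.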

Definition 3. Let $\xi, \zeta \in \mathbb{R}^n$ represent probability distributions on $V$. The $p$-Wasserstein distance, $p \geq 1$, between $\xi$ and $\zeta$ (with respect to the commute-time distance on $V$) is defined as

$$W_p(\xi, \zeta) = \min_{\mu \in \Gamma(\xi, \zeta)} \left(\sum_{j=1}^n \sum_{i=1}^n d_{CT}^p(i, j)\, \mu_{ij}\right)^{1/p}, \tag{31}$$

where $\Gamma(\xi, \zeta)$ is the set of all probability measures $\mu$ on $V \times V$ satisfying $(p_1)_*(\mu) = \xi$ and $(p_2)_*(\mu) = \zeta$.

Theorem 3. Let $\xi, \zeta \in \mathbb{R}^n$ be probability distributions on the vertex set of a connected, weighted network. For each $t > 0$, their diffusion Fréchet vectors satisfy

$$\|F_{\xi,t} - F_{\zeta,t}\|_\infty \leq 4\sqrt{\frac{\operatorname{Tr} e^{-2t\Delta} - 1}{2et}}\; W_1(\xi, \zeta). \tag{32}$$
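A simple special case of Theorem 3 can be verified numerically without solving an optimal transport problem. If $\xi$ and $\zeta$ are point masses at $v_i$ and $v_j$, the only coupling is the point mass at $(i, j)$, so $W_1(\xi, \zeta) = d_{CT}(i, j)$, and by (29) the DFV of a point mass at $v_i$ has components $d_t^2(\ell, i)$. The sketch below checks (32) for all pairs of point masses; the network, the scales and the helper names are assumptions made for illustration.

```python
import numpy as np

def weighted_sq_diffs(phi, weights):
    """Matrix with entries sum_k weights[k] * (phi[i, k] - phi[j, k])^2."""
    diff = phi[:, None, :] - phi[None, :, :]
    return np.sum(weights * diff ** 2, axis=-1)

# Hypothetical network: triangles {0, 1, 2} and {3, 4, 5} joined by a weak edge.
W = np.zeros((6, 6))
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[a, b] = W[b, a] = w
lam, phi = np.linalg.eigh(np.diag(W.sum(axis=1)) - W)

# Commute-time distances, Eq. (30), skipping the zero eigenvalue.
d_ct = np.sqrt(weighted_sq_diffs(phi[:, 1:], 1.0 / lam[1:]))

def bound_gap(t):
    """Right-hand side minus left-hand side of (32) for every pair of point masses."""
    d2 = weighted_sq_diffs(phi, np.exp(-2.0 * lam * t))           # d_t^2, Eq. (8)
    # lhs[i, j] = max over nodes l of |d_t^2(l, i) - d_t^2(l, j)|
    lhs = np.max(np.abs(d2[:, :, None] - d2[:, None, :]), axis=0)
    trace = np.sum(np.exp(-2.0 * lam * t))                        # Tr exp(-2 t Laplacian)
    const = 4.0 * np.sqrt((trace - 1.0) / (2.0 * np.e * t))
    return const * d_ct - lhs

gaps = [bound_gap(t).min() for t in (0.05, 0.5, 5.0)]
```

Each entry of `gaps` is the smallest slack in the inequality over all pairs of nodes at the given scale; a nonnegative value (up to rounding) is consistent with the theorem. For general distributions one would need a linear programming solver to evaluate $W_1$, which this sketch deliberately avoids.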

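Proof. Fix $t > 0$ and a node $v_\ell$. Let $\mu \in \Gamma(\xi, \zeta)$ be such that

$$\sum_{j=1}^n \sum_{i=1}^n d_{CT}(i, j)\, \mu_{ij} = W_1(\xi, \zeta). \tag{33}$$

Since $\mu$ has marginals $\xi$ and $\zeta$, we may write

$$F_{\xi,t}(\ell) = \sum_{j=1}^n \sum_{i=1}^n d_t^2(i, \ell)\, \mu_{ij} \quad \text{and} \quad F_{\zeta,t}(\ell) = \sum_{j=1}^n \sum_{i=1}^n d_t^2(j, \ell)\, \mu_{ij}, \tag{34}$$

which implies that

$$|F_{\xi,t}(\ell) - F_{\zeta,t}(\ell)| \leq \sum_{j=1}^n \sum_{i=1}^n \left|d_t^2(i, \ell) - d_t^2(j, \ell)\right| \mu_{ij}. \tag{35}$$

By Lemma 2 below,

$$|F_{\xi,t}(\ell) - F_{\zeta,t}(\ell)| \leq 4\sqrt{\operatorname{Tr} e^{-2t\Delta} - 1}\; \sum_{j=1}^n \sum_{i=1}^n d_t(i, j)\, \mu_{ij}. \tag{36}$$

Observe that

$$d_t^2(i, j) = \sum_{k=1}^n e^{-2\lambda_k t} \left(\phi_k(i) - \phi_k(j)\right)^2 = \sum_{k=2}^n \lambda_k e^{-2\lambda_k t}\, \frac{1}{\lambda_k} \left(\phi_k(i) - \phi_k(j)\right)^2 \leq \frac{1}{2et}\, d_{CT}^2(i, j), \tag{37}$$

since the $k = 1$ term vanishes ($\phi_1$ is constant) and $\lambda_k e^{-2\lambda_k t} \leq \frac{1}{2et}$. The theorem follows from (36) and (37). □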

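Lemma 2. Let $v_i$, $v_j$, $v_\ell$ be nodes of a connected, weighted network. For any $t > 0$,

$$\left|d_t^2(i, \ell) - d_t^2(j, \ell)\right| \leq 4\sqrt{\operatorname{Tr} e^{-2t\Delta} - 1}\; d_t(i, j),$$

where Tr denotes the trace operator.

Proof. Since the eigenfunction $\phi_1$ is constant, we may write the diffusion distance as

$$d_t^2(i, \ell) = \sum_{k=2}^n e^{-2\lambda_k t} \left(\phi_k^2(i) - 2\phi_k(i)\phi_k(\ell) + \phi_k^2(\ell)\right). \tag{38}$$

Thus,

$$\begin{aligned} d_t^2(i, \ell) - d_t^2(j, \ell) &= \sum_{k=2}^n e^{-2\lambda_k t} \left(\phi_k^2(i) - \phi_k^2(j)\right) - 2\sum_{k=2}^n e^{-2\lambda_k t} \phi_k(\ell) \left(\phi_k(i) - \phi_k(j)\right) \\ &= \sum_{k=2}^n e^{-2\lambda_k t} \left(\phi_k(i) + \phi_k(j)\right)\left(\phi_k(i) - \phi_k(j)\right) - 2\sum_{k=2}^n e^{-2\lambda_k t} \phi_k(\ell) \left(\phi_k(i) - \phi_k(j)\right). \end{aligned} \tag{39}$$

Since each $\phi_k$ has unit norm, $|\phi_k(\ell)| \leq 1$ and $|\phi_k(i) + \phi_k(j)| \leq 2$. Therefore,

$$\left|d_t^2(i, \ell) - d_t^2(j, \ell)\right| \leq 4\sum_{k=2}^n e^{-2\lambda_k t} \left|\phi_k(i) - \phi_k(j)\right|. \tag{40}$$

The Cauchy–Schwarz inequality, applied to the vectors $a = \left(e^{-\lambda_2 t}, \ldots, e^{-\lambda_n t}\right)$ and $b = \left(e^{-\lambda_2 t}|\phi_2(i) - \phi_2(j)|, \ldots, e^{-\lambda_n t}|\phi_n(i) - \phi_n(j)|\right)$, yields

$$\sum_{k=2}^n e^{-2\lambda_k t} \left|\phi_k(i) - \phi_k(j)\right| \leq \sqrt{\sum_{k=2}^n e^{-2\lambda_k t}}\; \sqrt{\sum_{k=2}^n e^{-2\lambda_k t} \left(\phi_k(i) - \phi_k(j)\right)^2} = \sqrt{\operatorname{Tr} e^{-2t\Delta} - 1}\; d_t(i, j). \tag{41}$$

The lemma follows from (40) and (41). □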

5. Concluding remarks

We introduced diffusion Fréchet functions and diffusion Fréchet vectors associated with probability measures on Euclidean space and weighted networks that integrate geometric information about their distributions across spatial scales, yielding a pathway to the quantification of their shapes. We proved that these functional statistics are stable with respect to the Wasserstein distance, providing a theoretical basis for their use in data analysis. To demonstrate the usefulness of these concepts, we provided examples using simulated data that illustrate some of their key properties. Diffusion Fréchet functions may be defined in the broader context of metric measure spaces; however, their stability and other properties in this general setting remain to be investigated. As remarked in Section 3, the present work suggests the development of refinements of diffusion Fréchet functions such as diffusion covariance tensor fields for Borel measures on Riemannian manifolds to make their geometric properties more readily accessible.

Acknowledgments

This research was supported in part by NSF grants DBI-1262351 and DMS-1418007, CIHR-NSERC Collaborative Health Research Projects (413548-2012), and NSF grant DMS-1127914 to SAMSI.

References

[1] R.R. Coifman, S. Lafon, Diffusion maps, Appl. Comput. Harmon. Anal. 21 (1) (2006) 5–30, https://doi.org/10.1016/j.acha.2006.04.006.
[2] C. Villani, Optimal Transport, Old and New, Springer, 2009.
[3] M. Schaub, J.-C. Delvenne, S.S. Yaliraki, M. Barahona, Markov dynamics as a zooming lens for multiscale community detection: non clique-like communities and the field-of-view limit, PLoS ONE 7 (2012) e32210.
[4] R. Lambiotte, Multi-scale modularity and dynamics in complex networks, in: A. Mukherjee, M. Choudhury, F. Peruani, N. Ganguly, B. Mitra (Eds.), Dynamics On and Of Complex Networks, in: Modeling and Simulation in Science, Engineering and Technology, vol. 2, Birkhäuser, New York, NY, 2013, pp. 125–141.
[5] N. Tremblay, P. Borgnat, Graph wavelets for multiscale community mining, IEEE Trans. Signal Process. 62 (2014) 5227–5239.
[6] T. Lindeberg, Scale Space Theory in Computer Vision, Kluwer, Boston, 1994.
[7] P. Chaudhuri, J. Marron, Scale space view of curve estimation, Ann. Statist. 28 (2) (2000) 408–428.
[8] G. Carlsson, Topology and data, Bull. Amer. Math. Soc. 46 (2) (2009) 255–308.
[9] C. Lee, J. Belanger, Z. Kassam, M. Smieja, D. Higgins, G. Broukhanski, P. Kim, The outcome and long-term follow-up of 94 patients with recurrent and refractory Clostridium difficile infection using single to multiple fecal microbiota transplantation via retention enema, Eur. J. Clin. Microbiol. Infect. Dis. 33 (8) (2014) 1425–1428, https://doi.org/10.1007/s10096-014-2088-9.
[10] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416, https://doi.org/10.1007/s11222-007-9033-z.
[11] D. Díaz Martínez, F. Mémoli, W. Mio, Multiscale covariance fields, local scales, and shape transforms, in: F. Nielsen, F. Barbaresco (Eds.), Geometric Science of Information, in: Lecture Notes in Computer Science, vol. 8085, Springer, 2013, pp. 794–801.
[12] D. Díaz Martínez, F. Mémoli, W. Mio, The shape of data and probability measures, arXiv:1509.04632, 2015.
[13] R.M. Dudley, Real Analysis and Probability, Cambridge Studies in Advanced Mathematics, vol. 74, Cambridge University Press, Cambridge, England, 2002.
[14] N. Fournier, A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probab. Theory Related Fields 162 (2015) 707–738.
[15] L. Lovász, Random walks on graphs: a survey, in: Combinatorics, Paul Erdös is Eighty, vol. 2, 1996, pp. 353–398.
